MUG'15
Conference Location: Ohio Supercomputer Center Bale Theater
MUG'15 meeting attendees gather for a group photo.
Wednesday, August 19th
12:00 –
Registration
12:00 – 1:00
Lunch
Abstract
A large number of HPC users worldwide use the MVAPICH2 libraries on a daily basis to run their MPI, PGAS, and accelerator/co-processor enabled applications on InfiniBand, iWARP, RoCE, and virtualized clusters. If you are an end-user or system administrator using the MVAPICH libraries, have you ever asked yourself or your colleagues some of the following questions?
- How can I extract higher performance and scalability from the MVAPICH2 libraries? Which techniques can be used to run my applications faster so that I save on SUs?
- How do I use new features like MPI-3 RMA, MPI-T, non-blocking collectives, etc. in MVAPICH2? (A brief MPI-3 RMA sketch follows this list.)
- What are the trade-offs in using different transport protocols (RC, UD, XRC, and Hybrid)?
- How do I use MVAPICH2-GDR on my GPU cluster with GPUDirect RDMA (GDR) technology?
- How does MVAPICH2-X provide support for running multiple programming models with good performance and scalability?
- How do I use MVAPICH2-MIC on my Xeon Phi cluster under different modes?
- What are the benefits of using MVAPICH2-Virt and how do I use it to obtain bare-metal performance on my InfiniBand cluster?
- What is the standard installation and debugging support available in the MVAPICH2 libraries?
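For readers curious about the MPI-3 RMA item above, the following is a minimal, self-contained sketch (illustrative only, not taken from the tutorial materials) of one-sided communication with MPI_Win_allocate, MPI_Put, and fence synchronization; it should build with any MPI-3 compliant library, including MVAPICH2.

```c
/* Minimal MPI-3 RMA sketch: each rank puts its rank id into its
 * right neighbor's window using one-sided communication.
 * Illustrative only; consult the MVAPICH2 user guide for tuning. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    int *win_buf;            /* memory exposed through the window */
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Allocate window memory and create the RMA window in one call */
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &win_buf, &win);
    *win_buf = -1;

    int right = (rank + 1) % size;

    /* Fence epoch: every rank writes its id to its right neighbor */
    MPI_Win_fence(0, win);
    MPI_Put(&rank, 1, MPI_INT, right, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);

    printf("Rank %d received %d from its left neighbor\n", rank, *win_buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

Compiled with mpicc and launched in the usual way (e.g., mpirun_rsh or mpiexec), each rank ends up holding the rank id of its left neighbor; MVAPICH2-specific runtime tuning is controlled through the MV2_* environment variables described in the user guide.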
6:30 –
Reception Dinner at Miller's Columbus Ale House
1201 Olentangy River Rd
Columbus, OH 43212
Thursday, August 20th
7:45 –
Registration
7:45 – 8:20
Breakfast
8:20 – 8:30
Opening Remarks
Abstract
This talk examines why Moore's Law has started letting us down, what that means for future systems and architectures, and how MPI will continue to play a significant role in high performance computing through its third decade and beyond. MVAPICH has been a crucial part of MPI's expanding role in exposing new capabilities to developers and users, and it will continue to be an important system component well into the future.
Bio
Dale Southard is the Principal System Architect in the Office of the CTO for the Tesla computing business at NVIDIA. Having previously worked at SGI, Lawrence Livermore National Laboratory, and the University of Notre Dame, Dale has an extensive background as both a user and a designer of HPC environments.
Abstract
This talk will provide an overview of the MVAPICH project. Future roadmap and features for upcoming MVAPICH2 releases will be presented.
Abstract
Lawrence Livermore National Laboratory has a noted history of using leadership supercomputing to conduct science in the national interest. For the past eight years, MVAPICH has been key to this effort due to its top-notch performance, scalability, and stability. Future system deployments and applications will demand even more from MPI, but with more work, MVAPICH is in a great position to continue to fill this role. In this talk, I will discuss some of the recent science that MVAPICH has enabled, as well as the capabilities that future systems and applications will require from MVAPICH.
Bio
Adam Moody earned a Master's degree in Computer Science and a Master's degree in Electrical Engineering from The Ohio State University in 2003. Since then, he has worked for Livermore Computing at the Lawrence Livermore National Laboratory, where he supports MPI on large-scale Linux clusters. He leads the Scalable Checkpoint/Restart (SCR) library project, which provides fast checkpointing mechanisms for large-scale MPI applications. He also manages development of the mpiFileUtils project, which provides MPI-based tools for handling large file sets on parallel file systems. An avid Buckeye fan, he serves as President of the East Bay Buckeyes Alumni Club in the San Francisco area.
10:30 – 11:00
Break
Abstract
Power is the first-order design constraint for building Extreme Scale systems. In addition to power-efficient hardware, it is important to co-design system software that minimizes power consumption. Large-scale parallel applications --- which primarily rely on MPI --- exhibit non-deterministic characteristics due to a combination of algorithmic and system factors. Existing approaches, which are either application-dependent or assume repetitive application behavior, are insufficient for capturing execution behavior at large scale. In this talk, Dr. Vishnu will present Energy Aware MVAPICH2 (EAM), which requires no application-specific knowledge to automatically save power without performance loss. He will present an in-depth study of the communication protocols and collective communication primitives required to minimize false positives, such that application performance is never degraded. He will present an implementation in MVAPICH2 using InfiniBand features and demonstrate its effectiveness using ten applications/kernels on a large-scale InfiniBand cluster.
Bio
Abhinav Vishnu is a senior research scientist at Pacific Northwest National Laboratory. Dr. Vishnu's primary interests are in designing scalable, fault-tolerant, and energy-efficient programming models, with specific applications to machine learning and data mining algorithms. Dr. Vishnu has served as a co-editor for several journals --- Parallel Computing (ParCo), the International Journal of High Performance Computing Applications (IJHPCA), and the Journal of Supercomputing (JoS). He has served as a Program Co-chair for several workshops --- Programming Models and Systems Software (P2S2) and ParLearning. He has published over 50 journal and conference publications, and his research has been disseminated in several open-source software packages --- MVAPICH2 (High Performance MPI over InfiniBand), Communication Runtime on Extreme Scale (ComEx), and Machine Learning Toolkit for Extreme Scale (MaTEx). Dr. Vishnu completed his PhD at The Ohio State University in 2007, under Dr. Dhabaleswar (DK) Panda.
Abstract
Beacon, the Cray CS300-AC cluster supercomputer that debuted at #1 on the Green500 in November of 2012, relies heavily on Intel Xeon Phi coprocessors for energy-efficient performance. Equipped with four coprocessors per node, Beacon's heterogeneous architecture can be challenging to use efficiently, particularly when computing directly on a large set of coprocessors in native mode using MPI-based applications. MVAPICH2-MIC provides an optimized MPI implementation that leverages shared memory, Symmetric Communications InterFace (SCIF) transfers, and InfiniBand channels to efficiently support a variety of host-to-MIC communications. MVAPICH2-MIC also provides proxy-based communications to help overcome the bandwidth limitations for MIC-to-MIC communications imposed by the Sandy Bridge chipsets employed in Beacon. This presentation will discuss the need for and the deployment of MVAPICH2-MIC on Beacon. Its performance relative to the standard Intel MPI implementation for a variety of benchmarks and scientific applications will be examined, and recommendations for the use of MVAPICH2-MIC on Beacon will be provided.
Bio
Glenn Brook currently serves as the Chief Technology Officer at the Joint Institute for Computational Sciences (JICS) between the University of Tennessee and Oak Ridge National Laboratory. He directs the Application Acceleration Center of Excellence (AACE) and the Intel Parallel Computing Center (IPCC) within JICS, and he is the principal investigator for the Beacon Project, which is funded by the National Science Foundation and the University of Tennessee to explore the impact of emerging computing technologies on computational science and engineering. Over the last four years, Glenn has managed and contributed to the deployment of clusters equipped with Intel® Xeon Phi coprocessors (including Beacon, the energy-efficient supercomputer that tops the November 2012 Green500 list), the porting of numerous application codes to the Intel Xeon Phi coprocessor architecture, the investigation of related programming approaches, and the dissemination of related knowledge and best practices through publications, presentations, and training materials. Glenn also managed the user engagement team within NSF's eXtreme Science and Engineering Discovery Environment (XSEDE) during the first two years of the project, coordinating seamless support and engagement of the user community across XSEDE through oversight of the XSEDE consulting process and feedback mechanisms. Prior to his roles in XSEDE and JICS, Glenn served as a computational scientist at the National Institute for Computational Sciences, where he provided advanced user support, consulting services, and training for users of Kraken, the Cray XT5 supercomputer ranked third on the November 2009 Top500 list.
Glenn graduated magna cum laude with a B.S. in computer engineering with honors certification and minors in chemistry and mathematics from Mississippi State University (MSU) in 1998. He studied computational engineering and applied mathematics as a graduate research fellow at MSU before relocating to the University of Tennessee at Chattanooga (UTC), where he received his M.S. in computational engineering in 2004. In 2008, Glenn earned his Ph.D. in computational engineering from UTC for his work on a parallel, matrix-free Newton method for solving approximate Boltzmann equations on unstructured topologies.
12:00 – 1:00
Lunch
Abstract
The exponential growth in data and the ever-growing demand for higher performance to serve the requirements of leading scientific applications drive the need for higher-scale systems and the ability to connect tens of thousands of heterogeneous compute nodes in a very fast and efficient way. The interconnect has become the enabler of data and the enabler of efficient simulations. Beyond throughput and latency, the interconnect needs to be able to offload the processing units from the communication work in order to deliver the desired efficiency and scalability. 100Gb/s solutions have already been deployed in multiple large-scale environments, and future technologies are already being discussed. The session will review the need for speed, new usage models, and how the interconnect plays a major role in enabling the path to Exascale performance.
Bio
Gilad Shainer has served as vice president of marketing at Mellanox Technologies since March 2013. Previously, Mr. Shainer was Mellanox's vice president of marketing development from March 2012 to March 2013. Mr. Shainer joined Mellanox in 2001 as a design engineer and later served in senior marketing management roles between July 2005 and February 2012. Mr. Shainer holds several patents in the field of high-speed networking and contributed to the PCI-SIG PCI-X and PCIe specifications. Gilad Shainer holds an MSc degree (2001, cum laude) and a BSc degree (1998, cum laude) in Electrical Engineering from the Technion Institute of Technology in Israel.
Abstract
Martin will present some application tuning work that he has performed using the MVAPICH2 and MVAPICH2-X libraries. Additionally, he will present some results on the energy savings that can be obtained in collective MPI operations.
Bio
Martin Hilgeman (1973, Woerden, The Netherlands) holds an advanced master's degree in Physical and Organic Chemistry from the VU University of Amsterdam. He worked at SGI for 11 years as a consultant and member of the technical staff in the applications engineering group, where his main involvement was in the porting, optimization, and parallelization of computational chemistry and materials science applications for MIPS, Intel Itanium 2, and Intel EM64T platforms. Martin joined Dell in 2011, where he acts as technical lead for the HPC benchmarking group. His particular interests are application optimization, accelerators, architectural considerations for platform efficiency, MPI single-sided messaging, and optimization of collectives.
Abstract
Solving challenging problems in computational physics depends on the availability of high performance computing systems and software. Using the example of the particle-based Molecular Dynamics and Monte Carlo simulation algorithms implemented in the HOOMD-blue software [1], I will discuss how parallelism benefits productivity and enables large-scale simulations of polymers and anisotropic particles, including faceted colloids. I will show how current PCIe-based system architectures limit strong scaling in a way that is essentially independent of the underlying implementation of the algorithm. Moreover, I will discuss how one can take full advantage of current GPU-enabled system architectures using HOOMD-blue and MVAPICH2. I will give several examples of simulations in soft condensed matter physics which would have required extreme patience on the side of the researcher, or which would not have been solvable at all, using serial algorithms.
Bio
After graduating from Leipzig University, Germany in 2011 with a PhD in Theoretical Physics, Jens Glaser spent two years as a Postdoc at the University of Minnesota, demonstrating the universality of block copolymer melts using computer simulations. He is currently a Postdoc at the University of Michigan in the Department of Chemical Engineering, and is working on problems of protein crystallization, depletion interactions and high-performance particle-based simulation codes.
Abstract
This presentation will demonstrate our work on scalable, high-performance BFS on GPU clusters. Our implementation achieves over 30 billion traversed edges per second on a cluster of 64 GPUs. We discuss how MVAPICH2-GDR is used in our implementation and cover the techniques used to deal with the difficulties involved in implementing scalable graph algorithms on GPU clusters. We also cover some of our more recent work extending our framework to other graph algorithms.
Bio
James Lewis is a CUDA researcher with SYSTAP. He studied at the University of Utah Scientific Computing Institute (SCI), where he received Bachelor of Science degrees in both Computer Science and Applied Mathematics, as well as a Master's degree in Computing. In his research work, he developed graph topological metrics to evaluate the performance of aggregation methods in the context of multigrid coarsening. He implemented parallel aggregation techniques for multigrid coarsening in C++ and CUDA.
At SYSTAP, he is the lead developer for Blazegraph GPU. He wrote the initial version of the software that uses SpMV techniques to implement SPARQL query evaluation on the GPU. He was the lead CUDA developer for integrating Mapgraph technology with the Merlin Application to accelerate Electronic Warfare using GPU graph capabilities. In this role, he exposed the graph capabilities on the GPU via a Java Native Interface (JNI) to enable the integration without the application developer writing any CUDA, C++, or non-Java code.
3:00 – 3:30
Break
Abstract
Parallel computing is critical to achieving cost-effective, fast-turnaround for training models in deep learning. In this talk I will give a brief overview of algorithms for deep learning using neural networks, and describe parallelization of model training for speech recognition. Our work uses a High Performance Computing (HPC) approach: a cluster of multi-GPU servers, linked via an InfiniBand interconnect, and using CUDA aware Message Passing Interface (MPI) for communication. This design has allowed us to scale training to tens of GPUs while achieving greater than 50% of peak GPU performance. This capability allows us to train models with billions to tens of billions of parameters in no more than a few days.
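The key system-level idea here, CUDA-aware MPI, lets the application hand GPU device pointers directly to MPI calls and leaves staging, pipelining, and GPUDirect RDMA to the library. The sketch below is a hedged illustration (not Baidu's training code): it assumes a CUDA-aware MPI build such as MVAPICH2-GDR with one GPU per rank, and it uses a plain MPI_Allreduce on device buffers to stand in for the gradient exchange in data-parallel training.

```c
/* Sketch: exchanging gradients held in GPU memory through a
 * CUDA-aware MPI (e.g., MVAPICH2-GDR). The buffer size and the
 * use of MPI_Allreduce are illustrative assumptions. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int n = 1 << 20;      /* hypothetical gradient length */
    float *d_grad, *d_sum;      /* device buffers               */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&d_grad, n * sizeof(float));
    cudaMalloc((void **)&d_sum,  n * sizeof(float));
    cudaMemset(d_grad, 0, n * sizeof(float));
    /* ... a backpropagation kernel would fill d_grad here ... */

    /* With a CUDA-aware MPI, device pointers are passed directly;
     * the library handles GPUDirect RDMA or pipelined host staging. */
    MPI_Allreduce(d_grad, d_sum, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("all-reduce of %d gradient elements complete\n", n);

    cudaFree(d_grad);
    cudaFree(d_sum);
    MPI_Finalize();
    return 0;
}
```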
Bio
Patrick LeGresley is a research scientist in the Baidu Silicon Valley Artificial Intelligence Lab, where he works on hardware and software systems for deep learning. Previously he has worked at NVIDIA and Stanford University, and has done research and consulting in parallel and high performance computing for many application areas. He holds a PhD in Aeronautics and Astronautics from Stanford University.
Abstract
Machine Learning and Data Mining (MLDM) algorithms are important in analysing large volumes of data. MLDM libraries --- such as Mahout and MLlib --- are gaining traction, yet native execution using high performance system software is missing. The Machine Learning Toolkit for Extreme Scale (MaTEx) provides scalable algorithms for classification, unsupervised learning, and frequent pattern mining using MPI and other programming models.
In this talk, Dr. Vishnu will present a case for high performance MVAPICH2 in designing next-generation MLDM algorithms. He will present the applicability of existing features and motivate the need for missing primitives in InfiniBand and MVAPICH2 for building scalable MLDM algorithms. Specifically, he will use examples such as Support Vector Machines (SVM) and the Frequent Pattern Growth (FP-Growth) algorithm to make a case for MPI two-sided and MPI one-sided primitives.
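To make that distinction concrete, here is a small sketch (an illustration under stated assumptions, not the speaker's code) that aggregates per-rank partial counts first with a two-sided collective, MPI_Allreduce, and then with the one-sided MPI_Accumulate into a window owned by rank 0 --- the kind of one-sided update pattern that irregular algorithms such as FP-Growth can exploit.

```c
/* Contrast of two-sided (collective) and one-sided aggregation of
 * partial counts; the 4-element count table is hypothetical. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    enum { NITEMS = 4 };
    long local[NITEMS], global[NITEMS];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < NITEMS; i++)
        local[i] = rank + i;               /* stand-in partial counts */

    /* (1) Two-sided / collective style: everyone gets the global sum */
    MPI_Allreduce(local, global, NITEMS, MPI_LONG, MPI_SUM,
                  MPI_COMM_WORLD);

    /* (2) One-sided style: accumulate into a table owned by rank 0 */
    long *table;
    MPI_Win win;
    MPI_Win_allocate(NITEMS * sizeof(long), sizeof(long), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &table, &win);
    for (int i = 0; i < NITEMS; i++)
        table[i] = 0;

    MPI_Win_fence(0, win);
    MPI_Accumulate(local, NITEMS, MPI_LONG, 0 /* target rank */,
                   0, NITEMS, MPI_LONG, MPI_SUM, win);
    MPI_Win_fence(0, win);

    if (rank == 0)
        printf("one-sided total for item 0: %ld (collective: %ld)\n",
               table[0], global[0]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```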
Bio
Abhinav Vishnu is a senior research scientist at Pacific Northwest National Laboratory. Dr. Vishnu's primary interests are in designing scalable, fault-tolerant, and energy-efficient programming models, with specific applications to machine learning and data mining algorithms. Dr. Vishnu has served as a co-editor for several journals --- Parallel Computing (ParCo), the International Journal of High Performance Computing Applications (IJHPCA), and the Journal of Supercomputing (JoS). He has served as a Program Co-chair for several workshops --- Programming Models and Systems Software (P2S2) and ParLearning. He has published over 50 journal and conference publications, and his research has been disseminated in several open-source software packages --- MVAPICH2 (High Performance MPI over InfiniBand), Communication Runtime on Extreme Scale (ComEx), and Machine Learning Toolkit for Extreme Scale (MaTEx). Dr. Vishnu completed his PhD at The Ohio State University in 2007, under Dr. Dhabaleswar (DK) Panda.
Abstract
The DMTCP project (Distributed MultiThreaded CheckPointing) originated in 2004, with a goal of supporting checkpointing on the desktop for general applications. The number of application domains has since proliferated. This is possible only because of a re-design of the DMTCP architecture, based on modular user-space plugins. Each plugin virtualizes a different external interface, and extends transparent checkpoint-restart to support the corresponding resource. Planned or ongoing extension domains include: supercomputing, engineering desktops (license servers, etc.), cloud management, virtual machines, HaaS ("Hardware as a Service", re-boot in seconds to a new Linux kernel and associated systems processes), GPU-accelerated graphics, big data (e.g., Hadoop/Spark), GPGPU computing, and Android apps.
This talk describes recent experience in transparently supporting generic MPI implementations over many environments, with an emphasis on examples from MVAPICH. Success in this area will transform the traditional multiple HPC "batch queues" into a single "batch pool" of running and suspended jobs. Such an environment raises new questions. How does one intelligently assign job priorities? How should jobs be co-located on one or more many-core computers? We propose the use of autonomic computing to autonomously fine-tune the low-level parameters, so that systems administrators can concentrate on high-level goals, such as high throughput, low energy use, absolute and relative job priorities with soft or hard deadlines, fairness policies, and so on.
Bio
Professor Cooperman works in high-performance computing and scalable applications for computational algebra. He received his Ph.D. from Brown University in 1978, and spent six years in basic research at GTE Laboratories, before coming to Northeastern University in 1986. He has been a full professor since 1992. In 2014, he was awarded a five-year IDEX Chair of Attractivity at the University of Toulouse Capitole 1, and LAAS-CNRS.
Since 2004, he has led the DMTCP project (Distributed MultiThreaded CheckPointing). DMTCP has over 9,000 downloads of the source code. He also has a 15-year relationship with CERN, where his work on semi-automatic thread parallelization of task-oriented software is included in the million-line Geant4 high-energy physics simulator. Current research questions include the limits of transparent checkpoint-restart for: supercomputing, engineering desktops (license servers, etc.), virtual machines, GPU-accelerated graphics, cloud management, big data (e.g., Hadoop/Spark), and GPGPU computing.
5:00 – 5:30
Open Mic Session
6:30 –
Banquet Dinner at Bravo Restaurant
1803 Olentangy River Rd
Columbus, OH 43212
Friday, August 21st
7:45 –
Registration
7:45 – 8:30
Breakfast
Abstract
This talk will highlight enabling technology trends in high performance computing, including hardware and software perspectives for the MVAPICH2 stack. As we know, the total power budget that can realistically be deployed for future exascale systems requires energy-efficient innovations across all facets of supercomputing design, ranging from on-node enhancements (e.g., CPU and memory) and improvements in rack design and thermal cooling to enhanced high-speed communication layers that support high-bandwidth, low-latency message transfers efficiently. In this talk, we will highlight some of Intel's approaches for addressing these challenges, along with the software and hardware implications for future generation systems.
Bio
Karl W. Schulz received his Ph.D. in Aerospace Engineering from the University of Texas in 1999. After completing a one-year post-doc, he transitioned to the commercial software industry working for the CD-Adapco group as a Senior Project Engineer to develop and support engineering software in the field of computational fluid dynamics (CFD). After several years in industry, Karl returned to the University of Texas in 2003, joining the research staff at the Texas Advanced Computing Center (TACC), a leading research center for advanced computational science, engineering and technology. During his 10-year term at TACC, Karl was actively engaged in HPC research, scientific curriculum development and teaching, technology evaluation and integration, and strategic initiatives serving on the Center's leadership team as an Associate Director and leading TACC's HPC group and Scientific Applications group during his tenure. He was a Co-principal investigator on multiple Top-25 system deployments serving as application scientist and principal architect for the cluster management software and HPC environment. Karl also served as the Chief Software Architect for the PECOS Center within the Institute for Computational Engineering and Sciences, a research group focusing on the development of next-generation software to support multi-physics simulations and uncertainty quantification. Karl joined the Technical Computing Group at Intel in January 2014 and is presently a Principal Engineer engaged in the architecture, development, and validation of HPC system software.
Abstract
MPI over InfiniBand has demonstrated very good scalability in the past few decades, leading up to multi-Petaflop class systems. Intel® Omni-Path Architecture is designed to deliver the performance for tomorrow’s high performance computing (HPC) workloads and the ability to scale to tens—and eventually hundreds—of thousands of nodes. In the past, MVAPICH2 has used the PSM interface to operate on TrueScale Fabric. In this talk we present techniques to design MVAPICH2 on the next generation Intel® Omni-Path Architecture in a scalable fashion.
Bio
Sayantan Sur is a Software Engineer at Intel Corp. in Hillsboro, Oregon. His work involves high performance computing, specializing in scalable interconnection fabrics and message passing software (MPI). Before joining Intel Corp., Dr. Sur was a Research Scientist in the Department of Computer Science and Engineering at The Ohio State University. In the past, he held a post-doctoral position at IBM T. J. Watson Research Center, NY. He has published more than 20 papers in major conferences and journals related to these research areas. Dr. Sur received his Ph.D. degree from The Ohio State University in 2007.
Abstract
SDSC's newest computational resource, Comet, features Intel Xeon E5-2680v3 (Haswell) processors with an FDR InfiniBand interconnect. The GPU nodes in the cluster each feature two NVIDIA K80 GPU accelerators. The cluster has MVAPICH2, MVAPICH2-X, and MVAPICH2-GDR installations available for users. We will present microbenchmark and application performance results on Comet using both the standard compute nodes and the GPU nodes. Applications will include P3DFFT, HOOMD-blue, AMBER, and WRF.
Bio
Mahidhar Tatineni received his M.S. and Ph.D. in Aerospace Engineering from UCLA. He currently leads the User Services group at SDSC. He has led the deployment and support of high performance computing and data applications software on several NSF and UC resources, including Comet and Gordon at SDSC. He has worked on many NSF-funded optimization and parallelization research projects, such as petascale computing for magnetosphere simulations, MPI performance tuning frameworks, hybrid programming models, topology-aware communication and scheduling, and big data middleware.
10:15 – 10:45
Break
Abstract
Georgia Tech uses a centrally managed, federated model for HPC clusters to meet the highly varying computational requirements of a large number of researchers across the entire campus. Computations running on these resources do not necessarily share any computational patterns or workload characteristics, yet the expectation remains that each researcher should utilize these systems fully and at utmost efficiency. With its power, versatility, and relevance, MVAPICH2 has become an indispensable tool in our arsenal for meeting the demands of this challenging scientific computing environment. MVAPICH2 is at the heart of a large collective software repository serving this diverse research community and continues to be the MPI stack of choice for the majority of researchers. In this talk, we will take a close look at the GT federated model and then discuss how MVAPICH2 is enabling science and discovery for GT researchers on a daily basis. We will also talk about how the HPC support specialists in our team benefit not only from this tool but also from its exceptional community, through reproducible software builds, sustainable performance, and improved troubleshooting capabilities.
Bio
Mehmet Belgin studied Naval Architecture and Ocean Engineering at Istanbul Technical University (ITU), Turkey. After his graduation, he completed two M.Sc. degrees, in Ocean Engineering and Computer Science, at ITU, followed by a Ph.D. in Computer Science and Applications in 2010 at Virginia Tech, USA. Mehmet's Ph.D. research primarily focused on developing new pattern-based representations to improve the performance of the sparse matrix-vector multiply kernels dominating most Krylov subspace solvers. During his work as a GRA at Virginia Tech, Mehmet provided research computing support for the SystemX cluster, ranked 3rd in the 2003 Top500 list. Since his employment as a Research Scientist in 2011, Mehmet has been serving Georgia Tech researchers with all kinds of HPC-related needs, ranging from systems management to scientific computing consultation.
Abstract
MVAPICH2-X supports both MPI and PGAS programming models on InfiniBand clusters. PGAS models represent a great alternative for future implementations, but questions regarding how much effort is required to use these models, as well as the performance that PGAS can achieve on real applications, need to be addressed. We will present our experience implementing OpenSHMEM versions of existing scientific applications, from the amount of work required for an initial implementation to more advanced techniques aimed at achieving good performance and scalability compared to the existing MPI implementations of the applications. Results showing when and how OpenSHMEM, with the unified runtime provided by MVAPICH2-X, can be an optimal option for scientific applications encourage further efforts in using these models.
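For context on what the OpenSHMEM side of MVAPICH2-X looks like at the source level, here is a minimal sketch using the standard OpenSHMEM C API (not taken from the applications discussed in the talk); the names follow the OpenSHMEM 1.2-style interface, and the program is assumed to be built with the OpenSHMEM compiler wrapper shipped with MVAPICH2-X.

```c
/* Minimal OpenSHMEM sketch: each PE writes its id into one slot of
 * a symmetric array on PE 0. Standard OpenSHMEM 1.2-style API. */
#include <shmem.h>
#include <stdio.h>

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric allocation: remotely accessible on every PE */
    int *table = (int *)shmem_malloc(npes * sizeof(int));
    table[me] = -1;

    /* One-sided put of my PE id into slot 'me' of PE 0's table */
    shmem_int_put(&table[me], &me, 1, 0);
    shmem_barrier_all();

    if (me == 0) {
        for (int i = 0; i < npes; i++)
            printf("slot %d = %d\n", i, table[i]);
    }

    shmem_free(table);
    shmem_finalize();
    return 0;
}
```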
Bio
Antonio Gomez-Iglesias has been a Research Associate at the Texas Advanced Computing Center, where he is a member of the HPC Tools group, since 2014. His current research interests are performance analysis and optimization, profiling and debugging tools, large-scale problem solving, task scheduling, and heterogeneous infrastructures. Before TACC, Antonio was a Postdoctoral Fellow in the Operations Research group of CSIRO (Melbourne, Australia), where he worked as a specialist in large-scale problem solving in the area of operations research. Prior to joining CSIRO, as a member of the National Fusion Laboratory in Spain, he participated in several international European projects focused on grid computing and HPC, and was a member of a dedicated HPC team supporting nuclear fusion scientists running their codes on the HPC-FF cluster. He obtained his PhD in Computer Science in 2011 from the University of Extremadura (Spain).
Abstract
This talk focuses on the experience of checkpointing large-scale applications running on clusters and supercomputers. In particular, we show the performance of DMTCP (Distributed MultiThreaded CheckPointing) in cluster environments where high-performance hardware (InfiniBand) and software (MPI libraries such as MVAPICH2, Open MPI, and Intel MPI; resource managers such as Slurm and Torque) are used together. While BLCR requires an additional kernel module as well as modification of software to support distributed applications, we are trying to do this in a transparent way: independent of the MPI implementation, the resource manager, and the Linux version or configuration. Further, we do not require a separate checkpoint-restart service, since DMTCP already natively supports distributed processes.
Bio
Jiajun Cao and Rohan Garg are Ph.D. students in the Computer and Information Science Department at Northeastern University. They are working in the High-Performance Computing Lab with Prof. Gene Cooperman. Specifically, they have been working on transparent checkpointing support for distributed applications, including checkpointing of 3D-graphics applications, checkpointing for a network of virtual machines, checkpointing of InfiniBand-based applications, etc.