MUG'14

Conference Location: Ohio Supercomputer Center Bale Theater

MUG'14 meeting attendees gather for a group photo.


Monday, August 25th

12:00 –

Registration

12:00 – 1:00

Lunch

Abstract

A large number of HPC users rely on high-performance MPI libraries like MVAPICH2, MVAPICH2-X, MVAPICH2-GDR and MVAPICH2-MIC on a daily basis to run their MPI, PGAS and accelerator/co-processor enabled applications. However, many of these users and the corresponding system administrators are not fully aware of all the features, optimizations and tuning techniques associated with these libraries. To make matters worse, many users tend to treat these libraries as a "black box", leading to sub-par performance. This tutorial aims to address these concerns and educate users and administrators of HPC systems on how to adapt MVAPICH2, MVAPICH2-X, MVAPICH2-GDR and MVAPICH2-MIC to deliver peak performance for different classes of HPC applications on different system configurations. During the course of the tutorial, we will give guidelines for developing a set of "best practices" for running different classes of HPC applications on different system configurations with MVAPICH2, MVAPICH2-X, MVAPICH2-GDR and MVAPICH2-MIC.

The tutorial will start with an overview of popular MPI libraries for modern homogeneous and heterogeneous systems (such as MVAPICH2, MVAPICH2-X, MVAPICH2-GDR and MVAPICH2-MIC) and their features. Next, we will take an in-depth look at the different runtime optimizations users can take advantage of to extract the best performance from these libraries, such as:

  • Impact of process mapping on inter-node and intra-node performance
  • Using InfiniBand hardware multicast to enhance collective performance
  • Transport protocol (RC, UD, XRC and Hybrid) tuning
  • Taking advantage of newer MPI-3 features like MPI-3 RMA and non-blocking collectives (see the sketch after this list)
  • Utilizing GPUDirect (GDR) RDMA, gdrcopy and similar techniques to enhance GPU-GPU communication
  • Enhanced optimization strategies for Intel-MIC based clusters
  • Enhancing the performance of PGAS and hybrid MPI+PGAS applications using MVAPICH2-X
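
As a quick illustration of the MPI-3 features mentioned in the list above, the following minimal C sketch (illustrative only, not taken from the tutorial material itself) overlaps a non-blocking MPI_Iallreduce with independent local work, using only standard MPI-3 calls:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        double local = 1.0, global = 0.0;
        MPI_Request req;

        /* Start the reduction without blocking (an MPI-3 non-blocking collective). */
        MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, &req);

        /* ... independent computation can proceed here while the collective progresses ... */

        /* Complete the collective before using the result. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            printf("sum = %f\n", global);

        MPI_Finalize();
        return 0;
    }

Runtime knobs such as process mapping and transport selection are typically controlled through MV2_* environment variables at job launch; the exact variable names depend on the MVAPICH2 release in use and are documented in its user guide.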

In addition, an overview of the configuration and debugging support in MVAPICH2, MVAPICH2-X, MVAPICH2-GDR and MVAPICH2-MIC will be presented. Advanced optimization and tuning of MPI applications using the new MPI-T interface (as defined by the MPI-3 standard) in MVAPICH2 will also be discussed. The performance impact of the various features and optimization techniques will be discussed in an integrated fashion.
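
For readers unfamiliar with the MPI-T interface referenced above, the short C sketch below simply lists the control variables an MPI library exposes; it relies only on the standard MPI_T calls defined by MPI-3, and the set of variables reported is entirely implementation-specific:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, num_cvars;

        /* The MPI tool information interface is initialized separately from MPI itself. */
        MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
        MPI_Init(&argc, &argv);

        /* Query how many control variables (cvars) this MPI library exposes. */
        MPI_T_cvar_get_num(&num_cvars);
        printf("This MPI library exposes %d control variables\n", num_cvars);

        /* Print the names of the first few cvars. */
        for (int i = 0; i < num_cvars && i < 10; i++) {
            char name[256], desc[256];
            int name_len = sizeof(name), desc_len = sizeof(desc);
            int verbosity, bind, scope;
            MPI_Datatype dtype;
            MPI_T_enum enumtype;

            MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                                &enumtype, desc, &desc_len, &bind, &scope);
            printf("cvar %d: %s\n", i, name);
        }

        MPI_Finalize();
        MPI_T_finalize();
        return 0;
    }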

The tutorial is organized along the following topics:

  • Overview and Features of MVAPICH2, MVAPICH2-X, MVAPICH2-GDR, and MVAPICH2-MIC
  • Runtime Optimization and Tuning Flexibility in MVAPICH2
    • Job start-up
    • Process mapping
    • Pt-to-pt Inter-node protocol (Eager, Rendezvous and Put/Get/R3)
    • Transport type selection (RC, XRC, UD and Hybrid (RC/XRC/UD))
    • Pt-to-pt Intra-node protocol and Scheme (LiMIC2 and CMA)
    • MPI-3 RMA Operations
    • Collectives (blocking and non-blocking)
    • Fault-tolerance (Checkpoint Restart (CR), Migration and Scalable Checkpoint Restart (SCR))
  • Runtime Optimization and Tuning Flexibility in MVAPICH2-GDR
    • GPU Support (pt-to-pt, GDR, gdrcopy and datatypes)
  • Runtime Optimization and Tuning Flexibility in MVAPICH2-MIC
    • MIC Support (Inter-/Intra- Host/MIC pt-to-pt, and collectives)
  • Runtime Optimization and Tuning Flexibility in MVAPICH2-X
    • UPC applications
    • OpenSHMEM applications
    • Hybrid MPI+(UPC/OpenSHMEM) applications
  • Advanced Optimization and Tuning of MPI Applications using MPI-T features in MVAPICH2, MVAPICH2-X, MVAPICH2-GDR, and MVAPICH2-MIC
  • Overview of Installation, Configuration and Debugging Support
  • Conclusions, Final Q&A, and Discussion

Bio

    MVAPICH

    The tutorial will be presented by several members of the MVAPICH team. The MVAPICH project is led by the Network-Based Computing Laboratory (NBCL) at The Ohio State University. The MVAPICH2 software, supporting the MPI 3.0 standard, delivers the best performance, scalability and fault tolerance for high-end computing systems and servers using InfiniBand, 10GigE/iWARP and RoCE networking technologies. This software is being used by more than 2,150 organizations worldwide in 72 countries to extract the potential of these emerging networking technologies for modern systems. As of July 2014, more than 218,000 downloads have taken place from this project's site. This software is also being distributed by many InfiniBand, 10GigE/iWARP and RoCE vendors in their software distributions. The MVAPICH2-X software package provides support for hybrid MPI+PGAS (UPC and OpenSHMEM) programming models with a unified communication runtime for emerging exascale systems. MVAPICH2 software is powering several supercomputers in the TOP500 list.

6:30 – 8:00

Reception at Buffalo Wild Wings

2151 N High St

Columbus, OH 43201

An easy 10-minute walk from the Blackwell Hotel


Tuesday, August 26th

7:45 –

Registration

7:45 – 8:20

Breakfast

8:20 – 8:30

Opening Remarks


Abstract

The Stampede system began production operations in January 2013. The system was one of the largest deployments of MVAPICH ever, with a 6,400-node FDR InfiniBand fabric connecting more than 2 PF of Intel Xeon processors. The system was also the first large-scale installation of the Intel many-core Xeon Phi coprocessors, which also used MVAPICH for communications.

This talk will discuss the experiences over the first 1.5 years of production with MVAPICH and Stampede. The talk will cover some science results from the system, but will focus on scaling results from the base Xeon cluster and the IB network, and on experiences with both native and symmetric mode MPI from the Xeon Phi over the PCI bus. It will also show the dramatic improvements in performance made during the production life of the system due to improvements in MVAPICH as a result of testing with Stampede.

Bio

Dan Stanzione

Dan Stanzione is the Executive Director of the Texas Advanced Computing Center at the University of Texas at Austin and the Principal Investigator for Wrangler. He is also a Co-investigator for the Texas Advanced Computing Center's 10 PetaFlop Stampede supercomputer, and has previously been involved in the deployment and operation of the Ranger and Lonestar supercomputers at the Texas Advanced Computing Center. He is the Co-Director of The iPlant Collaborative, an ambitious endeavor to build cyberinfrastructure to address the grand challenges of plant science. Prior to joining the Texas Advanced Computing Center, Dr. Stanzione was the founding director of the Fulton High Performance Computing Initiative at Arizona State University. Before ASU, he served as an AAAS Science Policy Fellow at NSF and as a research professor at Clemson University, his alma mater.

Abstract


Bio

Dhabaleswar K. (DK) Panda

Dhabaleswar K. (DK) Panda is a Professor of Computer Science and Engineering at The Ohio State University. He has published over 300 papers in major journals and international conferences. Dr. Panda and his research group members have been doing extensive research on modern networking technologies including InfiniBand, High-Speed Ethernet and RDMA over Converged Enhanced Ethernet (RoCE). The MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) and MVAPICH2-X software libraries, developed by his research group (http://mvapich.cse.ohio-state.edu), are currently being used by more than 2,150 organizations worldwide (in 72 countries). This software has enabled several InfiniBand clusters to get into the latest TOP500 ranking during the last decade. More than 218,000 downloads of this software have taken place from the project's website alone. This software package is also available with the software stacks of many network and server vendors, and Linux distributors. The new RDMA-enabled Apache Hadoop package, consisting of acceleration for HDFS, MapReduce and RPC, is publicly available from http://hadoop-rdma.cse.ohio-state.edu. Dr. Panda's research has been supported by funding from the US National Science Foundation, the US Department of Energy, and several industry partners, including Intel, Cisco, Cray, SUN, Mellanox, QLogic, NVIDIA and NetApp. He is an IEEE Fellow and a member of ACM. More details about Prof. Panda are available at http://www.cse.ohio-state.edu/~panda

Abstract

MPI in the national interest. The U.S. government tasks Lawrence Livermore National Laboratory with solving the nation's and the world's most difficult problems. These range from global security, disaster response and planning, drug discovery, and energy production to climate change, to name a few. To meet this challenge, LLNL scientists utilize large-scale computer simulations on Linux clusters with InfiniBand networks. As such, MVAPICH serves a critical role in this effort. In this talk, I will highlight some of the recent work that MVAPICH has enabled.


Bio

Adam Moody

Adam Moody earned a Master's degree in Computer Science and a Master's degree in Electrical Engineering from The Ohio State University in 2003. Since then, he has worked for Livermore Computing at the Lawrence Livermore National Laboratory, where he supports MPI on large-scale Linux clusters. He leads the Scalable Checkpoint Restart (SCR) library project, which provides fast checkpointing mechanisms for large-scale MPI applications. He also manages development of the FileUtils project to provide MPI-based tools for handling large file sets on parallel file systems. An avid Buckeye fan, he serves as President of the East Bay Buckeyes Alumni Club in the San Francisco area.

10:30 – 11:00

Break

Abstract

Improving earthquake ground motion estimates based on numerical large-scale earthquake rupture simulations is one of the grand challenges in earth sciences. Earthquake faulting is a complex process occurring on multiple scales and at depths that cannot be probed directly. By generating predictions of earthquake behavior under controllable but realistic conditions, applied earthquake research impacts the design of seismic hazard mitigation measures as well as the risk management of industrial assets. A better understanding of earthquake faulting and the simulation of future hypothetical seismic events will give us a chance to prepare appropriately for the next 'big one'. To this end, we present the highly optimized and scalable Arbitrary high-order DERivative Discontinuous Galerkin (ADER-DG) code SeisSol used for earthquake simulations. The implementation exploits unstructured meshes to flexibly adapt to complex geometries in realistic geological models. Seismic wave propagation is solved in combination with earthquake faulting in a multiphysical manner, leading to a heterogeneous solver structure. Our optimizations cover all software levels, including state-of-the-art kernel tuning down to the machine-instruction level and overlapping hybrid MPI-OpenMP communication that shadows the multi-physics computations. We demonstrate SeisSol's excellent parallel efficiency on heterogeneous supercomputers featuring Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors, achieving sustained multi-petaflop performance on the Stampede supercomputer.
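
The communication shadowing mentioned in the abstract follows the familiar pattern of overlapping a non-blocking halo exchange with independent computation. A deliberately simplified C fragment of that generic pattern is shown below; the function and parameter names are illustrative and it is not taken from SeisSol itself:

    #include <mpi.h>
    #include <omp.h>

    /* Generic overlap pattern (sketch): post non-blocking halo exchange,
       update the interior with OpenMP threads while messages are in flight,
       then wait and update the boundary elements. */
    void step(double *interior, int n_interior,
              double *send_halo, double *recv_halo, int n_halo,
              int left, int right, MPI_Comm comm)
    {
        MPI_Request reqs[2];
        MPI_Irecv(recv_halo, n_halo, MPI_DOUBLE, left,  0, comm, &reqs[0]);
        MPI_Isend(send_halo, n_halo, MPI_DOUBLE, right, 0, comm, &reqs[1]);

        /* Interior elements do not depend on the incoming halo,
           so this work "shadows" the communication. */
        #pragma omp parallel for
        for (int i = 0; i < n_interior; i++) {
            interior[i] += 1.0;   /* placeholder for the real element update */
        }

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        /* ... boundary elements that depend on recv_halo are updated here ... */
    }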


Bio

Alexander Heinecke

Alexander Heinecke studied Computer Science and Finance and Information Management at Technische Universität München, Germany. In 2010 and 2012, he completed internships in the High Performance and Throughput Computing team at Intel in Munich, Germany, and at Intel Labs in Santa Clara, CA, USA, working on the Intel MIC architecture. In 2013 he completed his Ph.D. studies at Technische Universität München, Germany. The core research topic of his thesis was the use of multi- and many-core architectures in advanced scientific computing applications.

Alexander Heinecke was awarded the Intel Doctoral Student Honor Programme Award in 2012. Together with Intel Labs, Appro Computer International (now Cray) and NICS, he placed the Beacon system at #1 on the November 2012 Green500 list. In 2013 he and his co-authors received the PRACE ISC 2013 Award for their publication "591 TFLOPS Multi-Trillion Particle Simulation on SuperMUC". He and his co-authors were also awarded the 2014 PRACE ISC Award for their paper "Sustained Petascale Performance of Seismic Simulations with SeisSol on SuperMUC".

Abstract

MPI implementations are evolving rapidly for emerging hybrid nodes containing GPGPU and Intel Xeon Phi devices. In order to improve the efficiency and portability of applications on hybrid and accelerated architectures, developers attempt to harness innovations in MPI extensions such as GPU-aware MPI, which exploits NVIDIA GPUDirect and unified memory architectures. However, as configurations of accelerated nodes diversify by incorporating, for instance, multiple accelerator devices per node, hierarchical PCIe connectivity, and deep cache and NUMA architectures, application developers are faced with wide-ranging parameters for tuning and scaling applications, while HPC service providers face challenges in diagnosing sources of performance degradation and interruptions in 24/7 operational environments. In this talk, I will provide insight into current GPU-aware implementations of MVAPICH on clusters with up to eight GPU devices per node and offer an overview of current solutions and a future outlook for ensuring a robust operational environment for accelerated MPI applications.
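
As a concrete illustration of the GPU-aware MPI usage described above, the minimal C sketch below passes CUDA device pointers directly to MPI calls. With CUDA-aware builds of MVAPICH2 this path is typically enabled at run time (for example via MV2_USE_CUDA=1 in the releases current at the time; consult the user guide of the release in use). The buffer size and ranks here are arbitrary:

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* With a CUDA-aware MPI library, device buffers can be handed directly
       to MPI calls; the library stages data or uses GPUDirect RDMA internally. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *d_buf;
        cudaMalloc((void **)&d_buf, 1024 * sizeof(double));

        if (rank == 0)
            MPI_Send(d_buf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* device pointer */
        else if (rank == 1)
            MPI_Recv(d_buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }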


Bio

Sadaf Alam

Sadaf Alam is Head of HPC Operations and Chief Architect at the Swiss National Supercomputing Centre. She was previously a computer scientist at the Oak Ridge National Laboratory, USA. She holds a PhD in Computer Science from the University of Edinburgh, UK.

12:00 – 1:00

Lunch

Abstract


Bio

Gilad Shainer

Gilad Shainer has served as vice president of marketing at Mellanox Technologies since March 2013. Previously, Mr. Shainer was Mellanox's vice president of marketing development from March 2012 to March 2013. Mr. Shainer joined Mellanox in 2001 as a design engineer and later served in senior marketing management roles between July 2005 and February 2012. Mr. Shainer holds several patents in the field of high-speed networking and contributed to the PCI-SIG PCI-X and PCIe specifications. Gilad Shainer holds an MSc degree (2001, Cum Laude) and a BSc degree (1998, Cum Laude) in Electrical Engineering from the Technion - Israel Institute of Technology.

Abstract

Power-efficient accelerators, most prominently GPUs, have a tight grip on the HPC arena. In spite of that, when it comes to high-performance interconnects, the common feeling is that there is still a lot of work to do in bridging the gap with respect to traditional supercomputer architectures. In this talk, we will try to examine the current state of the matter, both in the corporate and in the academic world. Finally, we will briefly discuss a few lines of research which we are currently exploring.


Bio

Davide Rossetti

Davide Rossetti has a degree in Theoretical Physics from Sapienza University of Rome and is currently a senior engineer at NVIDIA Corp. His main research activities are in the fields of design and development of parallel computing and high-speed networking architectures optimized for numerical simulations, while his interests span different areas such as HPC, computer graphics, operating systems, I/O technologies, embedded systems, and real-time systems.

Abstract

Cray, Inc. leverages many open source products in our cluster stack. The MVAPICH2 products (MVAPICH2, MVAPICH2-X and MVAPICH2-GDR) allow our users to have an expanded set of capabilities in a robust, open source environment. This discussion highlights typical additional coverage and includes points where additional convergence of runtime environments would be beneficial to open source HPC stacks.


Bio

David Race serves as Senior Application Engineer, Cray Cluster Solutions (CCS), responsible for open source product investigation, optimization and tuning across the current CCS product lines. These activities include both the direct study of open source products as well as the impacts of the latest products on the CCS ecosystem and customer applications. Dr. Race has more than 20 years of experience in the high performance computing industry, beginning with Thinking Machines in the 1980s. Prior to joining Cray in 2012, Dr. Race worked with products from several HPC companies, including Cray, SGE, Sun, HP and NEC. In addition to an extensive background in the hardware history of the HPC industry, Dr. Race has worked in several industry segments including Oil & Gas, weather, aerospace and government research. Dr. Race received a B.S. in mathematics from Arkansas Tech University, and an M.A. and Ph.D. in mathematics from the University of North Texas.

Abstract

High performance computing is an evolving space, encapsulating newer and interesting technologies as it grows. There are many options to choose from in terms of compute, storage, networking, software stack, middleware, etc. With all these choices at hand, it can be quite challenging to obtain the optimal configuration for the problem you are trying to solve. Apart from the above-mentioned choices, which contribute to an HPC solution's performance, there is also a dependency on energy efficiency or power consumption. At the Dell HPC labs, we work towards making these choices easier by providing best-value solutions for a number of HPC domains. We evaluate new HPC technologies and selectively adopt them for integration into our solutions. We also create turnkey architectures which are aimed at providing optimal performance and energy efficiency for specific types of HPC workloads. This presentation is aimed at giving a broad overview of the studies we perform at the Dell HPC labs, which include creating best practices for applications from various domains, generating performance and scalability studies, and tuning and optimizing compute, storage and networking for optimal results.


Bio

Onur Celebioglu is the Engineering Director for Dell's High Performance Computing (HPC) solutions engineering team. His responsibilities include the design, development, integration and performance characterization of Dell's HPC and business analytics solutions. His primary areas of focus are performance analysis of scale-out systems, parallel file systems, cluster management tools, high speed network interconnects, accelerators, and generation of best practices on the use of these technologies. He holds an M.S. in Electrical and Computer Engineering from Carnegie Mellon University.

3:00 – 3:30

Break

Abstract

I will discuss how the need to perform large-scale simulations of block copolymer liquids led to the development of a scalable GPU molecular dynamics code. In particular, I will present the bottlenecks observed and lessons learned in scaling the HOOMD-blue code to thousands of GPUs. Because GPUs are extremely fast, optimizing communication latency is key to obtaining good strong-scaling performance. A comparison to the LAMMPS molecular dynamics code will be presented for the Titan supercomputer, as well as a demonstration of how GPUDirect RDMA can further improve the scaling performance of HOOMD-blue.


Bio

Jens Glaser

Dr. Jens Glaser is a Research Fellow at the University of Michigan, Ann Arbor, in the group of Prof. Glotzer in the Chemical Engineering Department. He graduated from the University of Leipzig, Germany, in 2011 with a degree in Theoretical Physics. During a two-year stay at the University of Minnesota he carried out simulations of block copolymer melts and was involved in the development of HOOMD-blue. He contributed MPI scaling capabilities to the code, and now works on self-assembly simulations of nano-objects such as biological molecules.

Contributed Presentation Session

Abstract

Sandia National Laboratories has a diverse set of HPC applications and a variety of platforms for which MPI library robustness and performance are critical. The presentation will cover Sandia's needs, a few case studies of performance and scaling challenges seen with our applications, Sandia's Mantevo miniapps and their use, and a study of the impact of alternate MPI libraries on our application performance. A particular focus is the performance impact of the inter-node MPI message rate on our implicit codes.


Bio

Mahesh Rajan

Mahesh is a Distinguished Member of the Technical Staff at Sandia National Laboratories. He has been with Sandia since February 2002. Prior to that, he worked on HPC at Caltech, IBM, Intel SSD, MasPar and Supercomputing Solutions. He was a tenured faculty member at ASU, serving from 1981 to 1988.

Abstract

Fast, scalable, low-cost, and low-power execution of parallel graph algorithms is important for a wide variety of commercial and public sector applications. Breadth First Search (BFS) imposes an extreme burden on memory bandwidth and network communications and has been proposed as a benchmark that may be used to evaluate current and future parallel computers. Hardware trends and manufacturing limits strongly imply that many-core devices, such as NVIDIA® GPUs and the Intel® Xeon Phi™, will become central components of such future systems. GPUs are well known to deliver the highest FLOPS/watt and enjoy a very significant memory bandwidth advantage over CPU architectures. Recent work has demonstrated that GPUs can deliver high performance for parallel graph algorithms and, further, that it is possible to encapsulate that capability in a manner that hides the low-level details of the GPU architecture and the CUDA language but preserves the high throughput of the GPU. We extend previous research on GPUs and on scalable graph processing on supercomputers and demonstrate that a high-performance parallel graph machine can be created using commodity GPUs and networking hardware by making use of MVAPICH and GPUDirect.


Bio

Harish Dasari

Harish is a Graduate Research Assistant working with Dr. Martin Berzins and his group at the Scientific Computing and Imaging Institute, University of Utah. Harish is part of a DARPA-funded project, led by SYSTAP, on large-scale graph processing platforms on GPU clusters using recent advances in MPI for GPU communications (like GPUDirect). He received his Bachelor's degree from K L University, India. Prior to joining the University of Utah, he worked as a software engineer at HCL Technologies, India. His interests lie in high performance computing, parallel computing and GPU computing.

Abstract

DMTCP (Distributed MultiThreaded CheckPointing) is a project with a large user community. This open source package is available in most major Linux distributions. DMTCP supports dynamic compression of checkpoints, forked checkpointing, and on-demand paging for fast restart. Further, a plugin architecture enables users to easily build their own extensions to checkpointing, including customizable support for incremental checkpointing, parity-based checkpointing, remote checkpointing, remote restart, and in-memory checkpointing.

Recent extensions to DMTCP, such as support for the Torque and SLURM resource managers, now make it especially suitable for MPI. DMTCP transparently checkpoints an MPI application "from the outside", without any modification needed to MPI itself. The underlying assumption is that the MPI implementation does not interact with "external" non-user-space dependencies such as specialized kernel modules or hardware such as InfiniBand. Recently, a DMTCP plugin was created to transparently checkpoint InfiniBand from user space. Other externalities can be dealt with by creating plugins specializing in individual external dependencies. This talk will explore further challenges involved in integrating DMTCP with MVAPICH to provide fault tolerance and process migration.


Bio

Professor Cooperman researches high-performance computing and scalable applications in computational algebra. He received his Ph.D. from Brown University in 1978, and spent six years in basic research at GTE Laboratories before coming to Northeastern University in 1986. He has been a full professor since 1992. His current areas of interest include checkpoint-restart, virtualization, high performance computing, and cloud computing. He created the DMTCP checkpoint-restart project in late 2004. DMTCP (Distributed MultiThreaded CheckPointing) is now the most widely used transparent user-space checkpointing package. He also has a 15-year-long relationship with CERN, where his investigation of semi-automatic thread parallelization of task-oriented software has recently been incorporated into version 10.0 of Geant4, a million-line program for Monte Carlo simulations of particle-matter interactions.

5:00 – 5:30

Open Mic Session

6:30 –

Banquet Dinner at Bravo Restaurant

1803 Olentangy River Rd

Columbus, OH 43212


Wednesday, August 27th

7:45 –

Registration

7:45 – 8:30

Breakfast

Abstract

The increasing complexity of systems from a hardware standpoint is creating new challenges for their efficient utilization. Processor and memory heterogeneity, increased levels of concurrency, the promise of new technologies, and the need to consider performance, energy efficiency and reliability together make this no easy task. In this talk we will discuss several components of PNNL's workload that are MPI based and how these are efficiently using two of our main systems in place today: a lab-wide institutional computing cluster and a multi-petaflop Intel Phi system. We will also exemplify several aspects of current research being applied to these applications that are aimed at further improving performance and energy efficiency. We will also touch upon the need to be aware of new technologies and thus to help prepare applications for yet more advanced systems.


Bio

Darren Kerbyson

Darren Kerbyson is a Laboratory Fellow and Lead of the High Performance Computing group at PNNL. Prior to this he was the Team Lead of the Performance and Architecture Lab (PAL) at Los Alamos National Laboratory, and until 2001 he was a Senior Lecturer in Computer Science at the University of Warwick in the UK. He received his BSc in Computer Systems Engineering in 1988 and his PhD in Computer Science in 1993, both from the University of Warwick (UK). Between 1993 and 2001 he was a senior faculty member in Computer Science at Warwick. His research interests include performance evaluation, performance and power modeling, and optimization of applications on high performance systems, as well as image analysis. He has published over 140 papers in these areas over the last 20 years.

Abstract

NWChem is a well-known quantum chemistry package designed for massively parallel supercomputers. The basis for NWChem's parallelism is the Global Arrays programming model, which supports distributed arrays, dense linear algebra, flexible one-sided communication and dynamic load-balancing. The low-level communication runtime of Global Arrays is called ARMCI. Dinan and coworkers first mapped ARMCI to MPI-2 remote memory access (RMA), which helped drive the development of the MPI-3 standard. We will describe our implementation of ARMCI using MPI-3 RMA and performance results showing the scalability of NWChem on multiple platforms. In particular, the MVAPICH2 implementation of MPI-3 delivers excellent performance and scalability on InfiniBand systems.
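
For context, the one-sided style that ARMCI maps onto looks roughly like the following standard MPI-3 RMA sketch in C. It is illustrative only, not the actual ARMCI-over-MPI implementation; the window size and target choice are arbitrary:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Allocate a window: each process exposes 1024 doubles for one-sided access. */
        double *base;
        MPI_Win win;
        MPI_Win_allocate(1024 * sizeof(double), sizeof(double),
                         MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);

        /* Passive-target epoch: put one value into the neighbor's window. */
        int target = (rank + 1) % nprocs;
        double val = (double)rank;
        MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
        MPI_Put(&val, 1, MPI_DOUBLE, target, 0 /* displacement */, 1, MPI_DOUBLE, win);
        MPI_Win_flush(target, win);   /* complete the operation at the target */
        MPI_Win_unlock(target, win);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }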


Bio

Jeff Hammond

Jeff Hammond is a Research Scientist in the Parallel Computing Lab at Intel Labs. His research interests include one-sided and global-view programming models, load balancing for irregular algorithms, and shared- and distributed-memory tensor contractions. He has a long-standing interest in enabling the simulation of physical phenomena - primarily the behavior of molecules and materials at atomistic resolution - with massively parallel computing.

Prior to joining Intel, Jeff was an Assistant Computational Scientist at the Argonne Leadership Computing Facility and a Fellow of the University of Chicago Computation Institute. He was a Director's Postdoctoral Fellow at Argonne from 2009 to 2011. In 2009, Jeff received his PhD in chemistry from the University of Chicago as a Department of Energy Computational Science Graduate Fellow. He graduated from the University of Washington with degrees in chemistry and mathematics in 2003.

The IEEE Technical Committee on Scalable Computing named Jeff a Young Achiever in Scalable Computing in 2014 for his work on massively parallel scientific applications and runtime systems.

Abstract

MVAPICH2-X provides a unified high-performance runtime that supports both MPI and PGAS programming models on InfiniBand clusters. We will present microbenchmark and application performance results on Gordon using MPI, MPI+OpenMP, pure UPC, pure OpenSHMEM, and hybrid (MPI+PGAS) approaches. Applications will include the NAS parallel benchmarks (Scalable Penta-diagonal (SP), Conjugate Gradient (CG), and Multi-Grid (MG)), P3DFFT (MPI+OpenSHMEM), and CP2K (MPI+OpenMP).
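
A minimal sketch of the hybrid MPI+OpenSHMEM style evaluated here is shown below. It assumes a unified runtime such as MVAPICH2-X that allows both models to coexist in one program, and uses only basic OpenSHMEM calls with the OpenSHMEM 1.2+ names (older versions spell these start_pes/shmalloc/_my_pe); it is illustrative rather than taken from any of the applications named above:

    #include <mpi.h>
    #include <shmem.h>
    #include <stdio.h>

    /* Hybrid sketch: OpenSHMEM provides one-sided puts into symmetric memory,
       MPI handles a collective over the same set of processes. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        shmem_init();

        int pe = shmem_my_pe();
        int npes = shmem_n_pes();

        /* Symmetric allocation: the same remotely accessible buffer on every PE. */
        int *sym = (int *)shmem_malloc(sizeof(int));
        *sym = -1;
        shmem_barrier_all();

        /* One-sided put of this PE's id into the next PE's symmetric buffer. */
        shmem_int_put(sym, &pe, 1, (pe + 1) % npes);
        shmem_barrier_all();   /* also completes outstanding puts */

        /* MPI collective combining the values just delivered via OpenSHMEM. */
        int sum = 0;
        MPI_Allreduce(sym, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        if (pe == 0)
            printf("sum of received PE ids = %d\n", sum);

        shmem_free(sym);
        shmem_finalize();
        MPI_Finalize();
        return 0;
    }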


Bio

Mahidhar Tatineni

Mahidhar Tatineni received his M.S. and Ph.D. in Aerospace Engineering from UCLA. He currently leads the User Services group at the San Diego Supercomputer Center. He has led the deployment and support of high performance computing and data applications software on several NSF and UC resources. He has worked on many NSF-funded optimization and parallelization research projects, such as petascale computing for magnetosphere simulations, MPI performance tuning frameworks, hybrid programming models, and topology-aware communication and scheduling. Additionally, he has taught several classes and been involved in workshops focusing on high performance computing, numerical modeling, and data-intensive computing, including the deployment and use of Hadoop on San Diego Supercomputer Center resources.

10:30 – 11:00

Break

Abstract

MPI over InfiniBand has demonstrated very good scalability in the past few decades, leading up to Petaflop-class systems. Research results produced by the MVAPICH/MVAPICH2 team at The Ohio State University have contributed significantly to the continued scalability of MPI over InfiniBand. As we head into an era of increasing levels of integration of fabrics with the processing elements, it is worthwhile to re-examine the lessons learned from past research and their continued applicability in the future.

In this talk, we present the Scalable Fabric Interfaces being developed in the OpenFabrics Interfaces Working Group and describe how they incorporate some of the lessons learned from past research results from The Ohio State University.


Bio

Sayantan Sur

Sayantan Sur is a Software Engineer at Intel Corp. in Hillsboro, Oregon. His work involves high performance computing, specializing in scalable interconnection fabrics and message passing software (MPI). Before joining Intel, Dr. Sur was a Research Scientist at the Department of Computer Science and Engineering at The Ohio State University. In the past, he held a post-doctoral position at the IBM T. J. Watson Research Center, NY. He has published more than 20 papers in major conferences and journals related to these research areas. Dr. Sur received his Ph.D. degree from The Ohio State University in 2007.

Abstract

Each release of MVAPICH2 provides many new and interesting features that help TACC users improve the communication performance of their parallel applications. A range of elements, from the reduction of startup times to intra-/inter-node communication improvements and tuning, is important to the performance of users' applications. During this presentation, we will cover some of these recent improvements and explain how these features can help applications running every day on TACC HPC clusters.


Bio

Jerome Vienne

Jerome Vienne is a Research Associate at the Texas Advanced Computing Center (TACC) at the University of Texas at Austin. Jerome's research interests are in performance analysis and modeling, high performance computing, high performance networking, benchmarking and exascale computing. Before joining TACC, Dr. Vienne was a Postdoctoral Researcher at the Department of Computer Science and Engineering at The Ohio State University. He received his Ph.D. degree from Joseph Fourier University (France) in 2010 and has published more than 15 papers in major conferences.

Abstract

Quantum ESPRESSO (QE) is an integrated suite of computer codes for electronic-structure calculations and materials modeling at the nano-scale. As in many other plane-wave quantum chemistry codes, a dual-space formalism is exploited during the iterative diagonalization. A parallel 3D FFT is used to transform electronic wave functions from reciprocal to real space and vice versa. The 3D FFT is parallelized by distributing planes of the 3D grid in real space to processors (in reciprocal space, it is columns of G-vectors that are distributed to processors). Such a data distribution facilitates linear algebra operations using a friendly space representation but introduces heavy communication, especially when the number of processors exceeds the number of FFT planes. Tackling the 3D FFT efficiently on current GPU-accelerated systems has been a long-term objective that is hopefully coming to fruition. A new design that combines CUFFT with MPI communication leveraging GPUDirect RDMA is presented. This design aims to be compatible with all parallelization levels implemented, from pools down to task groups. Simple ideas come with challenges in implementation, portability and performance. MVAPICH2-GDR is the software that allows us to explore and push forward all these aspects. This new code is available upon request, while extensive testing is still being performed to understand the performance implications and extensibility of this approach.


Bio

Filippo Spiga

Filippo Spiga is an HPC Application Specialist working at the High Performance Computing Service (HPCS) at the University of Cambridge. Previously he worked at top-level research institutes and high performance computing centres (ICHEC, CINECA, CERN) and in enterprise R&D (IBM Research), as well as in wide multi-institutional collaborations (PRACE and EUAsiaGrid). At HPCS he is mainly responsible for a set of work packages related to HPC in the Programme for Simulation Innovation (PSi), a 5-year project funded by EPSRC in partnership with Jaguar Land Rover. As a member of the Quantum ESPRESSO Foundation, he is responsible for several aspects of the GPU-accelerated Quantum ESPRESSO project: new developments, bug fixing, code maintenance and dissemination. His main interests cover general high performance computing topics (especially mixed programming), GPGPU programming, application optimization and, recently, remote visualization technologies and low-power micro-architectures.

12:15 – 12:30

Closing Remarks and Future MUG Planning

12:30 – 1:30

Lunch

1:30 – 3:30

Interactive/Hands-on Session with MVAPICH Developers

MUG'14 Sponsors