MUG'18

Final Program

Conference Location: Ohio Supercomputer Center Bale Theater

MUG'18 meeting attendees gather for a group photo.

Monday, August 06

8:45 - 9:30

Registration and Continental Breakfast

Abstract

Arm Forge is a cross-platform performance engineering toolkit comprised of Arm DDT and Arm MAP. DDT is a parallel debugger supporting a wide range of parallel architectures and models including MPI, UPC, CUDA and OpenMP, and MAP is a low-overhead line-level profiler for MPI, OpenMP and scalar programs. This tutorial will present Arm Forge and demonstrate how performance problems in applications using MVAPICH2 can be identified and resolved. An approach for identifying communication bottlenecks will be presented that builds from the basics up to advanced features in Arm Forge. We will explore custom metrics for MPI profiling and demonstrate how Arm Forge may be used on extreme-scale applications with extremely low overhead and little or no loss of capability.


Bio

John Linford

John Linford is a principal applications engineer at Arm. He has extensive experience creating, using, supporting, and deploying high performance computing applications and technologies. His research interests include emerging computer architectures, compilers, code generation, performance analysis, and numerical simulation (particularly atmospheric chemistry). He has developed tools for chemical kinetic simulation, rotorcraft engineering, software performance analysis, and software environment management.

10:30 - 11:00

Break


Abstract

High performance computing has begun scaling beyond Petaflop performance towards the Exaflop mark. One of the major concerns throughout the development toward such performance capability is scalability - at the component level, system level, middleware and the application level. Mellanox's Co-Design approach, which couples the development of the software libraries with the underlying hardware, can help to overcome those scalability issues and enable a more efficient design approach towards the Exascale goal. In this tutorial session we will review the latest development areas within the Co-Design architecture: SHArP Technology, Unified Communication X (UCX), Hierarchical Collectives, GPU architecture support, etc.


Bio

Devendar Bureddy

Devendar Bureddy is a Sr. Staff Engineer at Mellanox Technologies. At Mellanox, Devendar has been instrumental in building several key technologies such as SHArP, HCOLL, and GPU acceleration. Previously, he was a software developer at The Ohio State University in the Network-Based Computing Laboratory led by Dr. D. K. Panda. At NOWLAB, Devendar was involved in the design and development of MVAPICH2, an open-source high-performance implementation of MPI over InfiniBand and 10GigE/iWARP. He received his Master's degree in Computer Science and Engineering from the Indian Institute of Technology, Kanpur. His research interests include high-speed interconnects, parallel programming models and HPC software.

12:30 - 1:30

Lunch


Abstract

Significant growth has been witnessed during the last few years in HPC clusters with multi-/many-core processors, accelerators, and high-performance interconnects (such as InfiniBand, Omni-Path, iWARP, and RoCE). To alleviate the cost burden, sharing HPC cluster resources with end users through virtualization is becoming more and more attractive. The recently introduced Single-Root I/O Virtualization (SR-IOV) technique for InfiniBand and High-Speed Ethernet on HPC clusters provides native I/O virtualization capabilities and opens up many opportunities to design efficient HPC clouds. However, SR-IOV also brings additional design challenges arising from the lack of support for locality-aware communication and virtual machine migration. This tutorial will first present an efficient approach to build HPC clouds based on MVAPICH2 over SR-IOV enabled HPC clusters. High-performance designs of the virtual machine (KVM) and container (Docker, Singularity) aware MVAPICH2 library (called MVAPICH2-Virt) will be introduced. This tutorial will also present a high-performance virtual machine migration framework for MPI applications on SR-IOV enabled InfiniBand clouds. The second part of the tutorial will present advanced designs with cloud resource managers such as OpenStack and SLURM that make it easier for users to deploy and run their applications with the MVAPICH2 library on HPC clouds. A demo will be provided to guide the usage of the MVAPICH2-Virt library.


Bio

Xiaoyi Lu

Dr. Xiaoyi Lu is a Research Scientist in the Department of Computer Science and Engineering at the Ohio State University, USA. His current research interests include high performance interconnects and protocols, Big Data, the Hadoop/Spark/Memcached ecosystem, parallel computing models (MPI/PGAS), virtualization, cloud computing, and deep learning. He has published more than 100 papers in international journals and conferences related to these research areas. He has been actively involved in various professional activities (PC Co-Chair, PC Member, and Reviewer) in academic journals and conferences. Dr. Lu is currently leading the research and development of RDMA-based accelerations for Apache Hadoop, Spark, HBase, and Memcached, and the OSU HiBD micro-benchmarks, which are publicly available from http://hibd.cse.ohio-state.edu. These libraries are currently being used by more than 285 organizations from 34 countries. More than 26,950 downloads of these libraries have taken place from the project site. He is a core member of the MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) project and he is leading the research and development of MVAPICH2-Virt (high-performance and scalable MPI for hypervisor- and container-based HPC clouds). He is a member of IEEE and ACM. More details about Dr. Lu are available at http://web.cse.ohio-state.edu/~lu.932/.

3:00 - 3:30

Break


6:00 - 9:30

Reception Dinner at Brazenhead

1027 W 5th Ave

Columbus, OH 43212

Tuesday, August 07

7:45 - 8:15

Registration and Continental Breakfast

8:15 - 8:30

Opening Remarks

Dave Hudak, Executive Director, Ohio Supercomputer Center
Dhabaleswar K (DK) Panda, The Ohio State University

Abstract

The MPI Forum is currently working towards version 4.0 of the MPI standard, which is likely to include major feature additions like persistent collectives, improved error handling, large count extensions and an events-based tools interface. In the first part of this talk I will highlight these promising directions and discuss their state with respect to standardization. I will also discuss the overall direction of the MPI Forum as well as the anticipated timeline for the next standard release. However, defining new features in the standard is not sufficient on its own; they must also be adopted in implementations. This can be a lengthy process requiring significant amounts of work. Open source distributions, like MVAPICH, play a critical role in these efforts and have already been major drivers of adoption. In the second part of my talk I will highlight this importance and use the adoption in MVAPICH of the MPI_T interface, added in MPI 3.0, as an example. The availability of a wide diversity of MPI_T variables offered by MVAPICH has not only demonstrated the value of the interface to the end user, but has also enabled matching developments of tools making use of MPI_T.
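
As a concrete illustration of the MPI_T tools interface mentioned above, the minimal C sketch below (not from the talk; return codes are not checked) simply enumerates the performance variables that an MPI library such as MVAPICH2 exposes through MPI_T.

```c
/* List the MPI_T performance variables exported by the MPI library.
 * Minimal sketch: error handling omitted for brevity. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, num_pvar;

    MPI_Init(&argc, &argv);
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);

    MPI_T_pvar_get_num(&num_pvar);
    printf("MPI_T performance variables exported: %d\n", num_pvar);

    for (int i = 0; i < num_pvar; i++) {
        char name[256], desc[256];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, var_class, binding, readonly, continuous, atomic;
        MPI_Datatype datatype;
        MPI_T_enum enumtype;

        /* Query metadata for performance variable i. */
        MPI_T_pvar_get_info(i, name, &name_len, &verbosity, &var_class,
                            &datatype, &enumtype, desc, &desc_len,
                            &binding, &readonly, &continuous, &atomic);
        printf("  [%d] %s : %s\n", i, name, desc);
    }

    MPI_T_finalize();
    MPI_Finalize();
    return 0;
}
```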


Bio

Martin Schulz

Martin Schulz is a Full Professor at the Technische Universität München (TUM), which he joined in 2017. Prior to that, he held positions at the Center for Applied Scientific Computing (CASC) at Lawrence Livermore National Laboratory (LLNL) and Cornell University. He earned his Doctorate in Computer Science in 2001 from TUM and a Master of Science in Computer Science from UIUC. Martin has published over 200 peer-reviewed papers and currently serves as the chair of the MPI Forum, the standardization body for the Message Passing Interface. His research interests include parallel and distributed architectures and applications; performance monitoring, modeling and analysis; memory system optimization; parallel programming paradigms; tool support for parallel programming; power-aware parallel computing; and fault tolerance at the application and system level. Martin was a recipient of the IEEE/ACM Gordon Bell Award in 2006 and an R&D 100 award in 2011.

Abstract

This talk will provide an overview of the MVAPICH project (past, present, and future). Future roadmap and features for upcoming releases of the MVAPICH2 software family (including MVAPICH2-X, MVAPICH2-GDR, and MVAPICH2-Virt) will be presented. Current status and future plans for OSU INAM, OEMT, and OMB will also be presented.


Bio

Dhabaleswar K (DK) Panda

DK Panda is a Professor and University Distinguished Scholar of Computer Science and Engineering at the Ohio State University. He has published over 450 papers in the area of high-end computing and networking. The MVAPICH2 (High-Performance MPI and PGAS over InfiniBand, iWARP, and RoCE) libraries, designed and developed by his research group (http://mvapich.cse.ohio-state.edu), are currently being used by more than 2,900 organizations worldwide (in 85 countries). More than 480,000 downloads of this software have taken place from the project's site. This software is empowering several InfiniBand clusters (including the 2nd, 12th, 15th, 24th and 62nd ranked ones) in the TOP500 list. The RDMA packages for Apache Spark, Apache Hadoop and Memcached together with OSU HiBD benchmarks from his group (http://hibd.cse.ohio-state.edu) are also publicly available. These libraries are currently being used by more than 285 organizations in 34 countries. More than 26,900 downloads of these libraries have taken place. He is an IEEE Fellow. The group has also been focusing on co-designing Deep Learning Frameworks and MPI Libraries. A high-performance and scalable version of the Caffe framework is available from High-Performance Deep Learning (HiDL) Project site. More details about Prof. Panda are available at http://www.cse.ohio-state.edu/~panda.

10:15 - 10:45

Break

Abstract

An overview of the recently awarded NSF Tier 1 system at TACC will be presented. The presentation will also include a discussion of the MVAPICH collaboration on past systems and on this upcoming system at TACC.


Bio

Dan Stanzione

Dr. Stanzione has been the Executive Director of the Texas Advanced Computing Center (TACC) at The University of Texas at Austin since July 2014, previously serving as Deputy Director. He is the principal investigator (PI) for a National Science Foundation (NSF) grant to deploy and support Stampede2, a large scale supercomputer, which will have over twice the system performance of TACC's original Stampede system. Stanzione is also the PI of TACC's Wrangler system, a supercomputer for data-focused applications. For six years he was co-director of CyVerse, a large-scale NSF life sciences cyberinfrastructure. Stanzione was also a co-principal investigator for TACC's Ranger and Lonestar supercomputers, large-scale NSF systems previously deployed at UT Austin. Stanzione received his bachelor's degree in electrical engineering and his master's degree and doctorate in computer engineering from Clemson University.

Abstract

At Lawrence Livermore National Laboratory, high-performance computing is an essential tool for scientists and researchers who work to solve problems of national and global interest. MVAPICH serves a critical role in this effort. In this updated, fan-favorite*, world-famous* talk, I highlight recent advances in science and technology that MVAPICH has enabled. I also discuss Livermore Computing's experience with a support contract under which the MVAPICH developers make improvements to MPI for our scientists.


Bio

Adam Moody

Adam is a member of the Development Environment Group within Livermore Computing. His background is in MPI development, collective algorithms, networking, and parallel I/O. He is responsible for supporting MPI on Livermore's Linux clusters. He is a project lead for the Scalable Checkpoint / Restart library and mpiFileUtils -- two projects that use MPI to help users manage large data sets. In recent work, he has been investigating how to employ MPI and fast storage in deep learning frameworks like LBANN.

Abstract

The requirements for interactive visualization at scale rely on efficient communication strategies: distributed data must be transformed on-demand to form the desired view, which must then be rendered to pixels and sent to a remote display, all within a 10 to 100 ms time window. Such analysis has traditionally been done as a post-process step, but the increasing relative cost of touching disk has driven adoption of visual analysis methods that read data directly from simulation memory. These "in situ" methods complicate communication and data movement, as they operate alongside a simulation and often share physical and logical resources. Since both simulation and visualization processes typically use MPI for interprocess communication, the MPI layer presents a natural avenue to facilitate exchange between simulation and analysis. This talk discusses opportunities for communication-layer assistance to in situ applications through extensions to the current MPI specification.
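
One common way to realize this kind of simulation/analysis coupling at the MPI level is to split the world communicator by role, so that each side has its own communicator for internal collectives while the parent communicator remains available for data exchange. The C sketch below is purely illustrative (not taken from the talk), and the choice of which ranks run the analysis is an arbitrary assumption.

```c
/* Illustrative sketch: partition MPI_COMM_WORLD into a simulation
 * communicator and an in situ analysis communicator. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_rank, world_size;
    MPI_Comm role_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Assumption for this sketch: the last quarter of the ranks run the
     * visualization/analysis code, the rest run the simulation. */
    int is_analysis = (world_rank >= 3 * world_size / 4);

    MPI_Comm_split(MPI_COMM_WORLD, is_analysis, world_rank, &role_comm);

    int role_rank;
    MPI_Comm_rank(role_comm, &role_rank);
    printf("world rank %d -> %s rank %d\n", world_rank,
           is_analysis ? "analysis" : "simulation", role_rank);

    MPI_Comm_free(&role_comm);
    MPI_Finalize();
    return 0;
}
```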


Bio

Paul Navrátil

Dr. Paul A. Navrátil is an expert in high-performance visualization technologies, accelerator-based computing and advanced rendering techniques at the Texas Advanced Computing Center (TACC) at The University of Texas at Austin. His research interests include efficient algorithms for large-scale parallel visualization and data analysis (VDA) and innovative design for large-scale VDA systems. Dr. Navrátil's recent work includes algorithms for large-scale distributed-memory ray tracing. This work enables photo-realistic rendering of the largest datasets produced on supercomputers today. He is Director of Visualization at TACC, where he leads research and resource provision for remote systems and local human-data interaction (HDI) environments. Dr. Navrátil's work has been featured in numerous venues, both nationally and internationally, including the New York Times, Discover, and PBS News Hour. He holds BS, MS, and Ph.D. degrees in Computer Science and a BA in Plan II Honors from the University of Texas at Austin.

Abstract

The Juelich Supercomputing Centre operates some of the largest supercomputers in Europe, with an installed capacity of over 17 petaflops. Like many other supercomputing centers, it has a variety of MPI runtimes available to its users. JSC is also the proponent of the Modular Supercomputing Architecture (MSA), which has certain implications for the development of MPI runtimes. This talk will present a brief analysis of the performance of the MPI runtimes presently installed on the different supercomputers/modules. Looking to the future, the particularities of JSC's roadmap and users, and how they affect its choice of MPI runtimes, will also be introduced.


Bio

Damian Alvarez

Dr. Damian Alvarez joined the Jülich Supercomputing Centre (JSC) in 2011, where he is the scientific software manager of production systems. He is also part of the ExaCluster Laboratory, a collaboration between JSC, Intel and ParTec that investigates novel technologies to reach Exascale, including the DEEP, DEEP-ER and DEEP-EST projects. His research interests include optimization on manycore processors, system architecture, novel programming models for high performance computing, PGAS languages, collectives optimization and the management of scientific software on supercomputers.

12:30 - 12:45

Group Photo

12:45 - 1:30

Lunch

Abstract

The latest revolution in HPC and AI is the co-design effort, a collaboration among industry thought leaders, academia, and manufacturers to reach Exascale performance by taking a holistic system-level approach to fundamental performance improvements. Co-design recognizes that the CPU has reached the limits of its scalability, and offers an intelligent network as the new “co-processor” to share the responsibility for handling and accelerating application workloads. The session will describe the latest technology developments and performance results from the latest large-scale deployments.


Bio

Gilad Shainer

Gilad Shainer has served as the vice president of marketing at Mellanox Technologies since March 2013. Previously, Mr. Shainer was Mellanox's vice president of marketing development from March 2012 to March 2013. Mr. Shainer joined Mellanox in 2001 as a design engineer and later served in senior marketing management roles between July 2005 and February 2012. Mr. Shainer holds several patents in the field of high-speed networking and contributed to the PCI-SIG PCI-X and PCIe specifications. Gilad Shainer holds an MSc degree (2001, Cum Laude) and a BSc degree (1998, Cum Laude) in Electrical Engineering from the Technion Institute of Technology in Israel.

Abstract

The Open Fabrics Interface (OFI) was envisioned and created to provide applications with high-level, application-oriented communication semantics. Several application domains, such as MPI, PGAS, streaming and RPC models, are considered when deciding which semantics OFI should support. While OFI provides applications with a simpler-to-use interface, it also gives fabric vendors a faster innovation cycle and reduced software maintenance costs. OFI is completely vendor agnostic and fits a variety of underlying fabric hardware design models. Of the many hardware design models available, the Verbs-based model has been around for several decades, starting with the VIA interface. In this talk, we focus on the development status and performance of OFI over fabrics that were designed using the Verbs model, such as InfiniBand, iWARP and RoCE.
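
To give a flavor of the interface, the hedged C sketch below queries libfabric (the OFI implementation) for providers that can supply reliable-datagram endpoints with messaging and RMA capabilities. The choice of the "verbs" provider name, the API version, and the capability bits are assumptions made for illustration; see the fi_getinfo(3) man page for the authoritative interface.

```c
/* Query libfabric for verbs-based providers supporting FI_EP_RDM
 * endpoints with messaging and one-sided RMA capabilities. */
#include <rdma/fabric.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    struct fi_info *hints, *info, *cur;
    int ret;

    hints = fi_allocinfo();
    if (!hints)
        return EXIT_FAILURE;

    hints->ep_attr->type = FI_EP_RDM;         /* reliable datagram endpoint */
    hints->caps = FI_MSG | FI_RMA;            /* messaging + one-sided RMA  */
    hints->fabric_attr->prov_name = strdup("verbs");  /* assumed provider */

    ret = fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, hints, &info);
    if (ret) {
        fprintf(stderr, "fi_getinfo: %s\n", fi_strerror(-ret));
        fi_freeinfo(hints);
        return EXIT_FAILURE;
    }

    for (cur = info; cur; cur = cur->next)
        printf("provider: %s, fabric: %s\n",
               cur->fabric_attr->prov_name, cur->fabric_attr->name);

    fi_freeinfo(info);
    fi_freeinfo(hints);
    return EXIT_SUCCESS;
}
```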


Bio

Sayantan Sur

Sayantan Sur is a Software Engineer at Intel Corp. in Hillsboro, Oregon. His work involves high performance computing, specializing in scalable interconnection fabrics and message passing software (MPI). Before joining Intel Corp., Dr. Sur was a Research Scientist in the Department of Computer Science and Engineering at The Ohio State University. In the past, he held a post-doctoral position at IBM T. J. Watson Research Center, NY. He has published more than 20 papers in major conferences and journals related to these research areas. Dr. Sur received his Ph.D. degree from The Ohio State University in 2007.

Abstract

Applications, programming languages, and libraries that leverage sophisticated network hardware capabilities have a natural advantage when used in today's and tomorrow's high-performance and data center computing environments. Modern RDMA-based network interconnects provide incredibly rich functionality (RDMA, atomics, OS-bypass, etc.) that enables low-latency and high-bandwidth communication services. This functionality is supported by a variety of interconnect technologies such as InfiniBand, RoCE, iWARP, Intel OPA, Cray's Aries/Gemini, and others. Over the last decade, the HPC community has developed a variety of user- and kernel-level protocols and libraries that enable high-performance applications over RDMA interconnects, including MPI, SHMEM, UPC, etc. With the emerging availability of HPC solutions based on the ARM CPU architecture, it is important to understand how ARM integrates with the RDMA hardware and the HPC network software stack. In this talk, we will update the MVAPICH community on the current state of the art in ARM architecture and HPC software stack development. We will share the most recent performance results for MPI and OpenSHMEM programming models on ARM and share our experience in enabling the RDMA software stack and one-sided communication libraries on ARM.
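
As a flavor of the one-sided communication libraries mentioned above, the following minimal OpenSHMEM sketch (illustrative only, not taken from the talk) performs an RDMA-style put of an integer from PE 0 into the symmetric heap of PE 1.

```c
/* One-sided put with OpenSHMEM: PE 0 writes into PE 1's symmetric buffer. */
#include <shmem.h>
#include <stdio.h>

int main(void)
{
    shmem_init();

    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric allocation: the same remotely accessible buffer on every PE. */
    int *dst = (int *) shmem_malloc(sizeof(int));
    *dst = -1;
    shmem_barrier_all();

    if (me == 0 && npes > 1) {
        int value = 42;
        shmem_int_put(dst, &value, 1, 1);   /* write into PE 1's copy of dst */
    }
    shmem_barrier_all();

    printf("PE %d of %d sees dst = %d\n", me, npes, *dst);

    shmem_free(dst);
    shmem_finalize();
    return 0;
}
```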


Bio

Pavel Shamis

Pavel is a Principal Research Engineer at ARM with over 18 years of experience in developing HPC solutions. His work is focused on the co-design of software and hardware building blocks for high-performance interconnect technologies, the development of communication middleware, and novel programming models. Prior to joining ARM, he spent five years at Oak Ridge National Laboratory (ORNL) as a research scientist in the Computer Science and Math Division (CSMD). In this role, Pavel was responsible for research and development on multiple projects in the high-performance communication domain, including Collective Communication Offload (CORE-Direct & Cheetah), OpenSHMEM, and OpenUCX. Before joining ORNL, Pavel spent ten years at Mellanox Technologies, where he led the Mellanox HPC team and was responsible for the development of the HPC software stack, including the OFA software stack, OpenMPI, MVAPICH, OpenSHMEM, and others. Pavel is a recipient of the prestigious R&D 100 award for his contribution to the development of the CORE-Direct collective offload technology.

3:00 - 3:45

Break and Student Poster Session

Reduction Operations on Modern Supercomputers: Challenges and Solutions - Mohammadreza Bayatpour, The Ohio State University
Parallel Read Error Correction for Big Genomic Datasets - Sriram Chockalingam, Georgia Institute of Technology
Accelerating Big Data Processing in the Cloud with Scalable Communication and I/O Schemes - Shashank Gugnani, The Ohio State University
Designing Shared Address Space MPI libraries in the Many-core Era - Jahanzeb Hashmi, The Ohio State University
Checkpoint/Restart MPI applications over Cray network via Proxies - Twinkle Jain, Northeastern University
High-Performance and Scalable Fabric Analysis, Monitoring and Introspection Infrastructure for HPC - Pouya Kousha, The Ohio State University
MPI Performance Engineering and Monitoring using the MPI Tools Interface - Aurele Maheo, University of Oregon
Dynamically Configurable Data-path for Memory Objects and Memory Object Flow - Rafael Oliveira, Georgia Institute of Technology
Efficient Asynchronous Communication Progress for MPI without Dedicated Resources - Amit Ruhela, The Ohio State University
Designing High-Performance, Resilient and Hybrid Key-Value Storage for Modern HPC Clusters - Dipti Shankar, The Ohio State University
Approximate Sequence Matching Algorithm to Handle Bounded Number of Errors - Neda Tavakoli, Georgia Institute of Technology

Abstract

In this talk, we will focus on our experiences with the MVAPICH2 MPI libraries and demonstrate some results of using MVAPICH2 on Huawei systems, which may be beneficial for the MVAPICH2 community.


Bio

Pak Lui

Pak Lui is a Principal Architect in the Silicon Valley Research Lab at Huawei. He has been involved in demonstrating application performance on various open source and commercial applications. His HPC experience involves characterizing HPC workloads, analyzing MPI profiles to optimize HPC applications, and exploring new technologies and solutions and their effectiveness on real HPC workloads. Previously Pak worked as a Senior Manager at Mellanox Technologies in Silicon Valley, where his main focus was optimizing HPC applications on its products and exploring new technologies and solutions and their effect on real workloads. Pak has been working in the HPC industry for over 17 years. Prior to joining Mellanox Technologies, Pak worked as a Cluster Engineer at Penguin Computing, responsible for building HPC cluster configurations from different vendor hardware and ISV software. Pak holds a B.Sc. in Computer Systems Engineering and an M.Sc. in Computer Science from Boston University.

Abstract

SDSC's Comet cluster features 1944 nodes with Intel Haswell processors and two types of GPU nodes: 1) 36 nodes with Intel Haswell CPUs (2-socket, 24 cores) and 4 NVIDIA K80 GPUs (two accelerator cards) each, and 2) 36 nodes with Intel Broadwell CPUs (2-socket, 28 cores) and 4 NVIDIA P100 GPUs each. The system supports containers via Singularity and also has a capability for spinning up virtual clusters. Comet has several MVAPICH2 installations available to users. We will present microbenchmark and application performance results using MVAPICH2-GDR on the GPU nodes, and MVAPICH2-Virt with Singularity.
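
As an illustration of the CUDA-aware MPI support that MVAPICH2-GDR provides on such GPU nodes, the hedged C sketch below hands device pointers directly to MPI point-to-point calls. It assumes two ranks with one GPU each and a GDR-enabled library (typically enabled at run time with MV2_USE_CUDA=1); it is a sketch, not a tuned benchmark.

```c
/* CUDA-aware point-to-point: MPI_Send/MPI_Recv on GPU device buffers. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define N (1 << 20)

int main(int argc, char **argv)
{
    int rank;
    float *d_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Allocate and zero a buffer in GPU memory. */
    cudaMalloc((void **) &d_buf, N * sizeof(float));
    cudaMemset(d_buf, 0, N * sizeof(float));

    if (rank == 0)
        MPI_Send(d_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);   /* device buffer */
    else if (rank == 1)
        MPI_Recv(d_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (rank <= 1)
        printf("rank %d done\n", rank);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```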


Bio

Mahidhar Tatineni

Mahidhar Tatineni received his M.S. & Ph.D. in Aerospace Engineering from UCLA. He currently leads the User Services group at SDSC. He has led the deployment and support of high performance computing and data applications software on several NSF and UC resources, including Comet and Gordon at SDSC. He has worked on many NSF-funded optimization and parallelization research projects such as petascale computing for magnetosphere simulations, MPI performance tuning frameworks, hybrid programming models, topology-aware communication and scheduling, big data middleware, and application performance evaluation using next generation communication mechanisms for emerging HPC systems.

Abstract

NVIDIA was motivated by the need to create an analog of uniform virtual memory across both the host and the GPU device. This relieves the programmer from the burden of memory management, since the CUDA device driver transparently copies memory between GPU device and CPU host. While several successes were previously reported in the literature, progress on checkpointing GPUs came to a halt with the need to support CUDA's new unified virtual memory model, analogous to traditional virtual memory. A new system, CRUM, built on top of DMTCP, addresses this challenge by using the hardware virtual memory subsystem of the host to decouple computation state from the GPU device driver context. We then briefly observe how a similar strategy can be used on the Cori supercomputer at NERSC to support transparent checkpointing of the Cray-proprietary GNI interconnection network for MPI. This approach also allows one to checkpoint over one implementation of MPI, and then restart under a second and different implementation of MPI.


Bio

Gene Cooperman and Rohan Garg

Professor Cooperman works in high-performance computing and scalable applications for computational algebra. He received his B.S. from U. of Michigan in 1974, and his Ph.D. from Brown University in 1978. He then spent six years in basic research at GTE Laboratories. He came to Northeastern University in 1986, and has been a full professor since 1992. In 2014, he was awarded a five-year IDEX Chair of Attractivity from the Université Fédérale Toulouse Midi-Pyrénées, France. Since 2004, he has led the DMTCP project (Distributed MultiThreaded CheckPointing). Prof. Cooperman also has a 15-year relationship with CERN, where his work on semi-automatic thread parallelization of task-oriented software is included in the million-line Geant4 high-energy physics simulator. His current research interests emphasize studying the limits of transparent checkpoint-restart. Some current domains of interest are: supercomputing, cloud computing, engineering desktops (license servers, etc.), GPU-accelerated graphics, GPGPU computing, and the Internet of Things.

4:45 - 5:15

Open MIC Session

6:00 - 9:30

Banquet Dinner at Bravo Restaurant

1803 Olentangy River RD

Columbus, OH 43212

Wednesday, August 08

7:45 - 8:30

Registration and Continental Breakfast

Abstract

This talk will describe how supercomputers have been enabling scientific discovery at SDSC over a decade: 2011 – 2021. Tens of thousands of researchers across a wide range of domains have performed simulations and data processing on the three NSF-funded supercomputers at SDSC – Trestles, Gordon and Comet. Researchers from the traditional fields of astrophysics, biochemistry, geosciences, and engineering; the non-traditional fields of neuroscience, social science, humanities and arts; the data science fields of biomedical image processing, text processing, and genomics deep learning; and the (high-throughput computing based) data processing fields of high energy physics (e.g. ATLAS, CMS experiments), multi-messenger astronomy (e.g. LIGO), and high-precision neutrino measurements (e.g. IceCube experiment) have all utilized these machines effectively. High performance interconnects and large scale parallel I/O subsystems have played a tremendous role in enabling these simulations and data processing. The MVAPICH2 library and the RDMA-based High-Performance Big Data project have been integral parts of the system software stack to make optimal use of these high performance hardware components.


Bio

Amitava Majumdar

Amit Majumdar is the Division Director of the Data Enabled Scientific Computing (DESC) division at the San Diego Supercomputer Center (SDSC) and an Associate Professor in the Department of Radiation Medicine and Applied Sciences at the University of California San Diego. His research interests are in high performance computing, computational science, cyberinfrastructure and science gateways. He has developed parallel algorithms and implemented them on various kinds of HPC machines and is interested in understanding performance and scalability of scientific applications on HPC machines. He is the PI of multiple research projects funded by the NSF, NIH, DOD, AFOSR and industry partners such as Intel and Microsoft. He received his bachelor's degree in Electronics and Telecommunication Engineering from Jadavpur University, Calcutta, India, his master's degree in Nuclear Engineering from Idaho State University, Pocatello, ID, and his doctoral degree in the interdisciplinary program of Nuclear Engineering and Scientific Computing from the University of Michigan, Ann Arbor, MI.

Abstract

Over the last several years, OpenHPC has emerged as a community-driven stack providing a variety of common, pre-built ingredients to deploy and manage an HPC Linux cluster including provisioning tools, resource management, I/O clients, runtimes, development tools, and a variety of scientific libraries. Formed initially in November 2015 and formalized as a Linux Foundation project in June 2016, OpenHPC has been adding new software components and now supports multiple OSes and architectures. This presentation will present an overview of the project, currently available software, and highlight recent changes to packaging conventions along with general project updates and future plans.


Bio

Karl Schulz

Karl W. Schulz received his Ph.D. in Aerospace Engineering from the University of Texas in 1999. After completing a one-year post-doc, he transitioned to the commercial software industry, working for the CD-Adapco group as a Senior Project Engineer to develop and support engineering software in the field of computational fluid dynamics (CFD). After several years in industry, Karl returned to the University of Texas in 2003, joining the research staff at the Texas Advanced Computing Center (TACC), a leading research center for advanced computational science, engineering and technology. During his 10-year tenure at TACC, Karl was actively engaged in HPC research, scientific curriculum development and teaching, technology evaluation and integration, and strategic initiatives, serving on the Center's leadership team as an Associate Director and leading TACC's HPC group and Scientific Applications group. Karl also served as the Chief Software Architect for the PECOS Center within the Institute for Computational Engineering and Sciences (ICES), a research group focusing on the development of next-generation software to support multi-physics simulations and uncertainty quantification. In 2014, Karl joined the Data Center Group at Intel, working on the architecture, development, and validation of HPC system software. Serving as a Principal Engineer, Karl led the technical design and release of OpenHPC, a Linux Foundation community project focused on the integration of common building blocks for HPC systems. He continues to be actively engaged in this project, currently serving as the overall Project Leader. In 2018, Karl returned to the University of Texas in an interdisciplinary role as a Research Associate Professor within ICES and Associate Professor within the Women's Health Department at the Dell Medical School.

Abstract

High-performance computing systems have historically been designed to support applications composed of mostly monolithic, single-job workloads. Pilot systems decouple workload specification, resource selection, and task execution via job placeholders and late binding, and help to satisfy the resource requirements of workloads comprised of multiple tasks. We discuss the importance of scalable and general-purpose Pilot systems for supporting high-performance workflows. We describe the formal properties of a Pilot system and discuss the design, architecture and implementation of RADICAL-Pilot. We discuss how RADICAL-Pilot has been integrated with other application-level tools as a runtime system, and thus its value as an important “building block” for high-performance workflows.


Bio

Shantenu Jha

Shantenu Jha is an Associate Professor of Computer Engineering at Rutgers University, and Department Chair for Data Driven Discovery at Brookhaven National Laboratory. His research interests are at the intersection of high-performance distributed computing and computational science. Shantenu leads the RADICAL-Cybertools project, a suite of middleware building blocks used to support large-scale science and engineering applications. He collaborates extensively with scientists from multiple domains -- including but not limited to Molecular Sciences, Earth Sciences and High-Energy Physics. He was appointed a Rutgers Chancellor's Scholar (2015-2020) and was the recipient of the inaugural Chancellor's Excellence in Research award (2016) for his cyberinfrastructure contributions to computational science. He is a recipient of the NSF CAREER Award (2013) and several prizes at SC'xy and ISC'xy. More details can be found at http://radical.rutgers.edu/shantenu

10:30 - 11:00

Break

Abstract


Bio

Sushil Prasad

Sushil K. Prasad is a Program Director at National Science Foundation in its Office of Advanced Cyberinfrastructure (OAC) in Computer and Information Science and Engineering (CISE) directorate. He is an ACM Distinguished Scientist and a Professor of Computer Science at Georgia State University. He is the director of Distributed and Mobile Systems Lab carrying out research in Parallel, Distributed, and Data Intensive Computing and Systems. He has been twice-elected chair of IEEE-CS Technical Committee on Parallel Processing (TCPP), and leads the NSF-supported TCPP Curriculum Initiative on Parallel and Distributed Computing for undergraduate education.

Abstract

The TAU Performance System is a powerful and highly versatile profiling and tracing tool ecosystem for performance analysis of parallel programs at all scales. TAU has evolved with each new generation of HPC systems and presently scales efficiently to hundreds of thousands of cores on the largest machines in the world. To meet the needs of computational scientists to evaluate and improve the performance of their applications, we present TAU's new features including support for the MPI Tools (MPI_T) interface for interfacing with MPI's performance and control variables exported by MVAPICH, OMPT TR6 for OpenMP instrumentation, and APIs for instrumentation of Python, Kokkos, and CUDA applications. TAU uses these interfaces on unmodified binaries without the need for recompilation. This talk will describe these new instrumentation techniques to simplify the usage of performance tools including support for compiler-based instrumentation, rewriting binary files, preloading shared objects, automatic instrumentation at the source-code level, CUDA, OpenCL, and OpenACC instrumentation. The talk will also highlight TAU's analysis tools including its 3D Profile browser, ParaProf and cross-experiment analysis tool, PerfExplorer. http://tau.uoregon.edu
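
Alongside the automatic instrumentation routes described above, TAU also offers a manual source-level API. The minimal C sketch below is a hedged illustration of that manual route; it assumes the code is built with TAU's compiler wrappers (e.g. tau_cc.sh) so that TAU.h and the TAU runtime are available.

```c
/* Manual TAU instrumentation of a single code region (sketch). */
#include <TAU.h>
#include <stdio.h>

static void compute(void)
{
    TAU_START("compute");            /* dynamic timer around this region */
    double s = 0.0;
    for (int i = 0; i < 1000000; i++)
        s += (double) i;
    printf("s = %f\n", s);
    TAU_STOP("compute");
}

int main(int argc, char **argv)
{
    TAU_PROFILE_INIT(argc, argv);
    TAU_PROFILE_SET_NODE(0);         /* single-process example */

    compute();
    return 0;
}
```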


Bio

Sameer Shende

Dr. Sameer Shende has helped develop the TAU Performance System, the Program Database Toolkit (PDT), and the HPCLinux distro. His research interests include tools and techniques for performance instrumentation, measurement, analysis, runtime systems, and compiler optimizations. He serves as the Director of the Performance Research Laboratory at the University of Oregon, and as the President and Director of ParaTools, Inc., ParaTools, SAS, and ParaTools, Ltd.

Abstract

Weather and hydrological models like the Weather Research and Forecasting (WRF) Model and its hydrological counterpart WRF-Hydro represent an important class of software used for both research and operational forecasting applications. Both models rely heavily on MPI for inter-process communication, and their performance is strongly tied to MPI performance. In this talk, WRF and WRF-Hydro are used to evaluate the performance of the MPI implementations installed on Cheyenne, the current NCAR supercomputer.


Bio

Alessandro Fanfarillo

Dr. Alessandro Fanfarillo is a Senior Software Engineer at the National Center for Atmospheric Research. His work is mostly devoted to performance enhancement of parallel weather models, GPGPU computing and software design/refactoring of scientific codes. His research focuses on how to exploit heterogeneous architectures (CPU + accelerators) and Partitioned Global Address Space (PGAS) languages (in particular coarray Fortran) for scientific purposes. He is also the lead developer of OpenCoarrays, the open-source library that implements the coarray support in the GNU Fortran compiler.

Abstract

The Joint Center for Advanced High Performance Computing (JCAHPC), a joint organization of the University of Tsukuba and the University of Tokyo, has been operating the Oakforest-PACS system, one of the largest clusters coupling Intel Knights Landing (KNL) CPUs with the Intel Omni-Path Architecture (OPA), providing 25 PFLOPS of theoretical peak performance. It was ranked No. 6 in the TOP500 list in Nov. 2016 and No. 1 in the IO500 list in Nov. 2017 and Jun. 2018. The system is built on a number of new technologies, such as a new many-core architecture, a new interconnect, and high-throughput I/O. In this talk, several challenges in achieving efficient computation, communication, and I/O performance on the Oakforest-PACS system are presented. I will also introduce a couple of results on the Oakforest-PACS system using MVAPICH2.


Bio

Toshihiro Hanawa

Toshihiro Hanawa is an associate professor in the Supercomputing Research Division of the Information Technology Center, The University of Tokyo, which he joined in 2013. Before joining the University of Tokyo, he was an associate professor at the Center for Computational Sciences, University of Tsukuba, where he served as the chief architect for the HA-PACS/TCA system, a high-density GPU cluster employing proprietary FPGA-based inter-GPU communication across compute nodes. Since 2016, he has been a member of the Oakforest-PACS system operation team under JCAHPC. He received the M.E. degree and the Ph.D. degree in computer science from Keio University.

12:45 - 1:30

Lunch

Abstract

Communication traces are increasingly important, both for parallel applications' performance analysis/optimization, and for designing next-generation HPC systems. Meanwhile, the problem size and the execution scale on supercomputers keep growing, producing prohibitive volumes of communication traces. To reduce the size of communication traces, existing dynamic compression methods introduce compression overhead that grows large with the job scale. We propose a hybrid static-dynamic method that leverages information acquired from static analysis to facilitate more effective and efficient dynamic trace compression. Our proposed scheme, CYPRESS, extracts a program communication structure tree at compile time using inter-procedural analysis. This tree naturally contains crucial iterative computing features such as the loop structure, allowing subsequent runtime compression to “fill in”, in a “top-down” manner, event details into the known communication template. Results show that CYPRESS reduces intra-process and inter-process compression overhead by up to 5X and 9X respectively over state-of-the-art dynamic methods, while only introducing very low compile-time overhead.
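
The general intuition can be conveyed with a toy far simpler than CYPRESS itself: if static analysis tells us a loop issues the same communication operation every iteration, the per-iteration trace records can be folded into (record, repeat count) pairs. The C sketch below shows only this generic run-length idea under that assumption; it is not the authors' algorithm and omits the compile-time structure tree entirely.

```c
/* Toy run-length compression of a repetitive communication trace. */
#include <stdio.h>
#include <string.h>

typedef struct {
    int op;      /* e.g. 0 = send, 1 = recv */
    int peer;    /* partner rank             */
    int bytes;   /* message size             */
} comm_event;

typedef struct {
    comm_event ev;
    int count;   /* consecutive iterations producing this event */
} compressed_event;

/* Fold identical consecutive events; returns the compressed length. */
static int compress_trace(const comm_event *in, int n, compressed_event *out)
{
    int m = 0;
    for (int i = 0; i < n; i++) {
        if (m > 0 && memcmp(&out[m - 1].ev, &in[i], sizeof(comm_event)) == 0) {
            out[m - 1].count++;
        } else {
            out[m].ev = in[i];
            out[m].count = 1;
            m++;
        }
    }
    return m;
}

int main(void)
{
    /* 1000 iterations of the same halo-exchange send, as a loop template
     * derived from static analysis would predict. */
    comm_event trace[1000];
    for (int i = 0; i < 1000; i++)
        trace[i] = (comm_event){ .op = 0, .peer = 1, .bytes = 4096 };

    compressed_event out[1000];
    int m = compress_trace(trace, 1000, out);
    printf("%d raw events -> %d compressed entries\n", 1000, m);
    return 0;
}
```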


Bio

Jidong Zhai

Jidong Zhai is an associate professor in the Computer Science Department of Tsinghua University and was a visiting assistant professor at Stanford University (2015-2016). He received the Ph.D. degree in Computer Science from Tsinghua University in 2010, with the Excellent Ph.D. Graduate Student Award of Tsinghua University. His research focuses on high performance computing, compiler optimization, and performance analysis and optimization of large-scale parallel applications. His research received a Best Paper Finalist at SC'14. The team led by him has won 7 champion titles in student supercomputing challenges at SC, ISC, and ASC. He has served or is serving as a TPC member or reviewer for IEEE TPDS, IEEE TCC, SC, ICS, PPOPP, ICPP, NAS, LCPC, and HPCC. He is a co-chair of the ICPP PASA workshop 2015 and a program co-chair of NPC 2018. He was named a 2010 Siebel Scholar and received the CCF outstanding doctoral dissertation award.

Abstract

Accelerating modelling and simulation in the data deluge era requires the appropriate hardware and infrastructure at scale. The University of Luxembourg has been active since 2007 in developing its own infrastructure and expertise in the HPC and Big Data (BD) domains. The current state of developments will be briefly reviewed in the context of the national and European HPC strategy, in which Luxembourg is starting to play a role.


Bio

Sébastien Varrette

Sébastien Varrette obtained his joint Ph.D. degree with great distinction in Computer Science between the University of Luxembourg (UL) and the University of Grenoble (INPG, France) in 2007. Afterwards, he joined Prof. Pascal Bouvry within the Parallel Computing and Optimization Group (PCOG) at the University of Luxembourg as a Research Scientist, where he is leading the development of the University's HPC platform, as well as the associated expert team of system administrators managing and supporting it. His main research interests lie in the domains of the security and performance of parallel and distributed computing platforms, such as HPC or Cloud Computing infrastructures. Dr Varrette is active as general chair, track chair or scientific committee member for various reference conferences and technical workgroups within his area of expertise (IEEE CloudCom, CloudNet, IPDPS, Europar, Big Data Congress, Optim, PCGrid, ICPADS, HPCS, CLUS, J. Supercomputing, etc.). He takes part in the management committees of, and represents Luxembourg within, multiple EU projects, such as PRACE (acting advisor), ETP4HPC and several COST actions (for instance IC1305 NESUS: Network for Sustainable Ultrascale Computing). He is also one of the national ICT standardization delegates within ISO/TC 307: Blockchain and distributed ledger technologies. Up to now, Dr. Varrette has co-authored four books and more than nine book chapters in Computer Science. He has also written about 80 research or popularization articles in scientific journals or international conference proceedings.

Abstract

Eulerian-Lagrangian couplings are nowadays widely used to address engineering and technical problems. In particular, CFD-DEM couplings have been successfully applied to study several configurations ranging from mechanical, to chemical and environmental engineering. However, such simulations are normally very computationally intensive, and the execution time represents a major issue for the applicability of this numerical approach to complex scenarios. With this work, we introduce a novel coupling approach aimed at improving the performance of parallel CFD-DEM simulations. This strategy relies on two points. First, we propose a new partition-collocation strategy for the parallel execution of CFD-DEM couplings, which can considerably reduce the amount of inter-process communication between the CFD and DEM parts; however, this strategy imposes some alignment constraints on the CFD mesh. Second, we adopt a dual-grid multiscale scheme for the CFD-DEM coupling, which is known to offer better numerical properties and gives us more flexibility in the domain partitioning, overcoming the alignment constraints. We assess the correctness and performance of our approach on elementary benchmarks and at large scale with a realistic test case. The results show a significant performance improvement compared to other state-of-the-art CFD-DEM couplings presented in the literature.


Bio

Xavier Besseron

Xavier Besseron is a permanent Research Scientist at the University of Luxembourg. He graduated in 2010 from Grenoble University (France) with a PhD in Computer Science. His PhD work was on fault tolerance and dynamic reconfiguration for large-scale distributed applications. From October 2010 to September 2011, he was a postdoctoral researcher at the Ohio State University (USA) under the supervision of Prof. D.K. Panda in the Network-Based Computing Lab (NOWLAB). During that time, he contributed to the MVAPICH project, a high-performance implementation of MPI for InfiniBand clusters. In October 2011, he joined the University of Luxembourg, first as a postdoctoral researcher in the Parallel Computing & Optimisation Group (PCOG) of Prof. Pascal Bouvry. He is now part of the Luxembourg XDEM Research Centre (LuXDEM) and works under the supervision of Prof. Bernhard Peters on the optimisation and parallelization of the eXtended Discrete Element Method (XDEM). His research interests are High Performance Computing and Computational Sciences, and in particular the parallelization, optimization, debugging and coupling of scientific HPC applications.

2:40 - 3:15

Break

Abstract

OSU INAM monitors InfiniBand clusters consisting of several thousands of nodes in real time by querying various subnet management entities in the network. It is also capable of interacting with the MVAPICH2-X software stack to gain insights into the communication pattern of the application and classify the data transferred into point-to-point, collective and Remote Memory Access (RMA) traffic. OSU INAM can also remotely monitor several parameters of MPI processes, such as CPU/memory utilization and intra- and inter-node communication buffer utilization, in conjunction with MVAPICH2-X. OSU INAM provides the flexibility to analyze and profile collected data at the process, node, job, and network level, as specified by the user. In this demo, we demonstrate how users can take advantage of the various features of INAM to analyze and visualize the communication happening in the network in conjunction with data obtained from the MPI library. We will, for instance, demonstrate how INAM can 1) filter the traffic flowing on a link on a per-job or per-process basis in conjunction with MVAPICH2-X, 2) analyze and visualize the traffic in a live or historical fashion at various user-specified granularities, and 3) identify the various entities that utilize a given network link.


Bio

Hari Subramoni

Dr. Hari Subramoni is a research scientist in the Department of Computer Science and Engineering at the Ohio State University. His current research interests include high performance interconnects and protocols, parallel computer architecture, network-based computing, exascale computing, network topology aware computing, QoS, power-aware LAN-WAN communication, fault tolerance, virtualization, big data and cloud computing. He has published over 50 papers in international journals and conferences related to these research areas. He has been actively involved in various professional activities in academic journals and conferences. He is a member of IEEE. More details about Dr. Subramoni are available at http://www.cse.ohio-state.edu/~subramon.

Abstract

The tutorial will start with an overview of the MVAPICH2 libraries and their features. Next, we will focus on installation guidelines, runtime optimizations and tuning flexibility in depth. An overview of the configuration and debugging support in the MVAPICH2 libraries will be presented. High-performance support for GPU-enabled clusters in MVAPICH2-GDR and for KNL-based systems in MVAPICH2-X will be presented. The impact on performance of the various features and optimization techniques will be discussed in an integrated fashion. Best practices for a set of common applications will be presented, along with a set of case studies on redesigning applications to take advantage of hybrid MPI+PGAS programming models.


Bio

Hari Subramoni

Dr. Hari Subramoni is a research scientist in the Department of Computer Science and Engineering at the Ohio State University. His current research interests include high performance interconnects and protocols, parallel computer architecture, network-based computing, exascale computing, network topology aware computing, QoS, power-aware LAN-WAN communication, fault tolerance, virtualization, big data and cloud computing. He has published over 50 papers in international journals and conferences related to these research areas. He has been actively involved in various professional activities in academic journals and conferences. He is a member of IEEE. More details about Dr. Subramoni are available at http://www.cse.ohio-state.edu/~subramon.