MUG'20

(Final Program)

All Times Are U.S. EDT

Monday, August 24

Abstract

This hands-on tutorial will jumpstart developers with Arm's performance engineering tools for MVAPICH2 and Arm's Scalable Vector Extension (SVE). Arm Forge is a cross-platform performance engineering toolkit comprising Arm DDT and Arm MAP. DDT is a parallel debugger supporting a wide range of parallel architectures and models including MPI, UPC, CUDA, and OpenMP; MAP is a low-overhead, line-level profiler for MPI, OpenMP, and scalar programs. We will present Arm Forge and demonstrate how performance problems in applications using MVAPICH2 can be identified and resolved. We will also explore custom metrics for MPI profiling, demonstrate how Arm Forge may be used on extreme-scale applications, and introduce Arm's tools for performance investigation of MPI programs that use SVE together with MVAPICH2.


Bio

John Linford

John Linford is Arm's director for HPC applications. He leads Arm's global HPC field engineering team and explores and predicts the performance of future architectures relevant to HPC. His research interests include emerging computer architectures, compilers, code generation, performance analysis, and numerical simulation. He has developed tools for chemical kinetic simulation, rotorcraft engineering, software performance analysis, and software environment management.

Abstract

Increased system size and a greater reliance on system parallelism to meet computational needs require innovative system architectures to address today's simulation challenges. As a step toward a new class of network co-processors (intelligent network devices that manipulate data traversing the data-center network), SHARP technology is designed to offload collective operation processing to the network. This tutorial will provide an overview of SHARP technology, new features including high-bandwidth Streaming Aggregation, and performance results on the Selene supercomputer.
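
The collectives SHARP offloads are ordinary MPI calls; the offload is enabled by the MPI library and fabric, not by application code. The minimal, purely illustrative allreduce below is the kind of operation that SHARP-capable switches can aggregate in the network:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each rank contributes one value; with SHARP enabled in the
           MPI library, the reduction can be aggregated in the switches. */
        double local = (double)rank;
        double global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d ranks = %f\n", size, global);

        MPI_Finalize();
        return 0;
    }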


Bio

Devendar Bureddy

Devendar Bureddy is a Principal SW Engineer at Mellanox Technologies. At Mellanox, Devendar has been instrumental in building several key technologies such as SHARP, UCX, and HCOLL. Previously, he was a software developer at The Ohio State University in the Network-Based Computing Laboratory led by Dr. D. K. Panda, where he was involved in the design and development of MVAPICH. He received his Master's degree in Computer Science and Engineering from the Indian Institute of Technology, Kanpur. His research interests include high-speed interconnects, parallel programming models, and HPC/DL software.

Abstract

OSU INAM can remotely monitor several parameters of MPI processes such as CPU/Memory utilization, Lustre I/O, intra- and inter-node communication buffer utilization etc. in conjunction with MVAPICH2-X. It also provides the flexibility to analyze and profile collected data at process-level, node-level, job-level, and network-level as specified by the user.
In this talk, we demonstrate how users can take advantage of the various features of OSU INAM to analyze and visualize the communication happening in the network and the I/O fabric in conjunction with data obtained from the MPI library. We will, for instance, demonstrate how INAM can 1) filter the traffic flowing on a link on a per-job or per-process basis in conjunction with MVAPICH2-X, 2) analyze and visualize traffic, live or historical, at various user-specified granularities, 3) identify the various entities that utilize a given network link, and 4) help users performance-engineer their applications by identifying communication operations in specified "areas of interest".


Bio

Hari Subramoni

Dr. Hari Subramoni has been a research scientist in the Department of Computer Science and Engineering at The Ohio State University, USA, since September 2015. His current research interests include high-performance interconnects and protocols, parallel computer architecture, network-based computing, exascale computing, network-topology-aware computing, QoS, power-aware LAN-WAN communication, fault tolerance, virtualization, big data, and cloud computing. He has published over 70 papers in international journals and conferences related to these research areas and has been actively involved in various professional activities in academic journals and conferences. Dr. Subramoni is doing research on the design and development of the MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) and MVAPICH2-X (Hybrid MPI and PGAS (OpenSHMEM, UPC and CAF)) software packages. He is a member of IEEE. More details about Dr. Subramoni are available at http://web.cse.ohio-state.edu/~subramoni.1/.

12:30 - 1:00

Break

Abstract

The rapid adoption of Cloud Computing is one of the biggest changes the HPC community has witnessed over the last few years, in terms of both business model and technology. It has also created an avenue of access to levels of performance many HPC teams didn't have even just a couple of years ago. One of the key reasons for this rapid adoption is that Cloud Computing technologies have matured to the point that the cloud now offers performance similar to that of on-prem environments. Moreover, cloud technologies offer unprecedented flexibility for users, who can spin up VMs with specific architectural requirements and/or use cloud bursting to increase the capacity of their on-prem infrastructure.

However, cloud-specific jargon, a myriad of deployment models, subtle differences in architecture, nuances in networking configurations, and so on may make for a taxing first experience. This tutorial aims to alleviate these challenges and make the onboarding process smooth.

We will start this tutorial with an overview of the various Microsoft Azure HPC offerings, followed by a discussion of different VM deployment models for HPC workloads. We will also provide a live demo of creating an Azure HPC cluster that supports flexible scaling. We will discuss performance and scalability characteristics of Azure HPC using MPI microbenchmarks as well as real-world HPC applications. We will also cover best practices and recommendations for getting the best performance and scalability on the Microsoft Azure HPC platform.


Bio

Jithin Jose and Jon Shelley

Dr. Jithin Jose is a Senior Software Engineer at Microsoft. His work focuses on the co-design of software and hardware building blocks for high-performance computing platforms and on designing communication runtimes that seamlessly expose hardware capabilities to programming models and middleware. His research interests include high-performance interconnects and protocols, parallel programming models, big data, and cloud computing. Before joining Microsoft, he worked at Intel and IBM Research. He has published more than 25 papers in major conferences and journals related to these research areas. Dr. Jose received his Ph.D. degree from The Ohio State University in 2014.

Jon Shelley is the HPC and AI Benchmarking Principal Manager for the Azure Compute Team. He works closely with customers and independent software vendors to help them run successfully in Azure. Jon has been working in the HPC field since 2002, when he joined Pratt and Whitney doing CFD and FEA simulations on Linux clusters. In 2007, he joined Idaho National Laboratory (INL) and led the HPC user support group, where he helped configure clusters and applications for the user community. In 2013, he joined the PBSPro team at Altair and worked with HPC centers around the world to help them manage their current HPC workloads and design for their future workloads.

Abstract

Thor is the industry's first PCIe 4.0 200 Gigabit Ethernet controller designed for mainstream high-performance computing, networking, and storage applications. Based on the NetXtreme® E-Series architecture, Thor provides higher performance and scalability for HPC and Machine Learning (ML) applications. Thor has a unique set of RDMA over Converged Ethernet (RoCE), virtualization, flow processing, stateless offload, multi-host, security, and system management features. In this tutorial, we will provide an architectural overview of Thor and highlight key Thor features for HPC applications.


Bio

Hemal Shah, Karen Schramm, and Moshe Voloshin

Hemal Shah is a Distinguished Engineer and Systems/Software/Standards architect in the Compute and Connectivity (CCX) division at Broadcom Inc., where he leads and manages a team of architects. Hemal is responsible for the definition of Ethernet NIC product architecture and the software roadmap/architecture of performance NICs/SmartNICs. Hemal led the architecture definition of several generations of NetXtreme® E-Series/NetXtreme I server product lines and NetXtreme I client product lines. Hemal spearheaded the system architecture development of TruFlow™ technology for vSwitch acceleration/packet processing software frameworks, TruManage™ technology for system and network management, device security features, virtualization, and stateless offloads. Hemal has defined the system architecture of RDMA hardware/software solutions for more than two decades.
Before joining Broadcom in 2005, Hemal worked at Intel Corporation, where he led the development of system/silicon/software architecture of communication processors, 10 Gigabit Ethernet controllers, TCP/iSCSI/RDMA offloads, and IPsec/SSL/firewall/VPN acceleration. Hemal is the lead technical representative/contributor from Broadcom Inc. in the Open Compute Project (OCP) and the Distributed Management Task Force (DMTF). Hemal serves as Senior VP of Technology in the DMTF and as a project co-lead of the OCP Hardware Management project. Hemal has co-authored several OCP specifications, 70+ DMTF specifications, four IETF RFCs, and more than 10 technical conference/journal papers. Hemal is a named inventor on 40+ patents with several more pending. Hemal holds Ph.D. (computer engineering) and M.S. (computer science) degrees from Purdue University, an M.S.E.E. degree from The University of Arizona, and a B.S. (electronics and communication engineering) degree from Gujarat University, India.

Karen Schramm is VP of Architecture and ASICs for Broadcom's CCX Division. She leads a team of architects and silicon engineers in defining and designing industry-leading performance NICs, SmartNICs, and compute offload engines. Karen led the definition of CCX's cloud-optimized NIC architecture, including driving the TruFlow packet processor and RoCE architectures.
Previously, Karen was an architect in Broadcom's CSG switch division, VP of Engineering at Sandburst (a semiconductor start-up building NPUs and switches), Principal Engineer at Ironbridge (a terabit router start-up), and Principal Engineer at GTE, working on cryptography and networking systems. She holds patents in networking and CDMA technology and an MSEE from Northeastern University.

Moshe Voloshin is a systems architect in the Compute and Connectivity (CCX) division at Broadcom Inc. Moshe spearheaded the system architecture development of RoCE and congestion control in Broadcom NICs and is involved in the definition of Ethernet NIC product architecture, performance NICs/SmartNICs, modeling, and system simulations.
Previously, Moshe was a director, manager, and ASIC/HW engineer in Cisco's high-end router division, where he developed and managed the development of network processing unit (NPU), QoS, and fabric ASICs in products such as the GSR and CRS.

Abstract

The tutorial will start with an overview of the MVAPICH2 libraries and their features. Next, we will focus on installation guidelines, runtime optimizations, and tuning flexibility in depth. An overview of configuration and debugging support in the MVAPICH2 libraries will be presented. High-performance support for NVIDIA/AMD GPU-enabled clusters in MVAPICH2-GDR and for many-core systems in MVAPICH2-X will be presented. The impact of the various features and optimization techniques on performance will be discussed in an integrated fashion. "Best Practices" for a set of common applications will be presented. A set of case studies on example HPC and AI applications will demonstrate how one can effectively take advantage of MVAPICH2 using MPI and CUDA/OpenACC.
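
As a minimal illustration of the GPU support mentioned above, the sketch below shows the CUDA-aware usage pattern in which device pointers are passed directly to MPI calls. The buffer size is arbitrary, and the exact build and runtime settings for enabling GPU support are described in the MVAPICH2-GDR user guide; this is not code from the tutorial itself.

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Minimal sketch (run with two ranks): with a CUDA-aware build such as
       MVAPICH2-GDR, device pointers can be passed directly to MPI calls and
       the library moves the data without a manual host staging copy. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;                  /* illustrative message size */
        double *d_buf = NULL;
        cudaMalloc((void **)&d_buf, n * sizeof(double));
        cudaMemset(d_buf, 0, n * sizeof(double));

        if (rank == 0)
            MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }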


Bio

Hari Subramoni

Dr. Hari Subramoni has been a research scientist in the Department of Computer Science and Engineering at The Ohio State University, USA, since September 2015. His current research interests include high-performance interconnects and protocols, parallel computer architecture, network-based computing, exascale computing, network-topology-aware computing, QoS, power-aware LAN-WAN communication, fault tolerance, virtualization, big data, and cloud computing. He has published over 70 papers in international journals and conferences related to these research areas and has been actively involved in various professional activities in academic journals and conferences. Dr. Subramoni is doing research on the design and development of the MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) and MVAPICH2-X (Hybrid MPI and PGAS (OpenSHMEM, UPC and CAF)) software packages. He is a member of IEEE. More details about Dr. Subramoni are available at http://web.cse.ohio-state.edu/~subramoni.1/.

Tuesday, August 25

10:00 - 10:15

Opening Remarks

Dave Hudak, Executive Director, Ohio Supercomputer Center
Dhabaleswar K (DK) Panda, The Ohio State University

Abstract

Scientific computing can generate data sets with massive numbers of enormous samples. Efficiently training deep neural networks for these applications requires combinations of model-, data-, and ensemble-parallelism to reduce the time to train a converged model. We will present LBANN's unique capabilities that leverage scalable distributed-memory training algorithms as well as large-scale platforms such as the Sierra supercomputer at LLNL for better strong scaling. Specifically, we will describe the communication challenges and solutions when parallelizing across these multiple dimensions, including examples from generalized parallel convolutions and sub-graph parallelism. We demonstrate these challenges on multiple applications, including scaling up the size of the 3D data cube used for training a neural network that predicts cosmological constants.
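
For readers unfamiliar with the data-parallel dimension, its core communication step is an allreduce of locally computed gradients after each mini-batch. The sketch below is a plain-MPI illustration of that step under assumed names and sizes, not LBANN's actual implementation.

    #include <mpi.h>

    /* Illustrative data-parallel step: average locally computed gradients
       across all ranks before applying the weight update. */
    void average_gradients(double *grad, int count, MPI_Comm comm)
    {
        int size;
        MPI_Comm_size(comm, &size);

        /* Sum each gradient entry over every rank in place, then scale. */
        MPI_Allreduce(MPI_IN_PLACE, grad, count, MPI_DOUBLE, MPI_SUM, comm);
        for (int i = 0; i < count; ++i)
            grad[i] /= (double)size;
    }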


Bio

Brian van Essen

Brian Van Essen is the informatics group leader and a computer scientist at the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory (LLNL). He is pursuing research in large-scale deep learning for scientific domains and training deep neural networks using high-performance computing systems. He is the project leader for the Livermore Big Artificial Neural Network (LBANN) open-source deep learning toolkit, and the LLNL lead for the ECP ExaLearn and CANDLE projects. Additionally, he co-leads an effort to map scientific machine learning applications to neural network accelerator co-processors as well as neuromorphic architectures. He joined LLNL in 2010 after earning his Ph.D. and M.S. in computer science and engineering at the University of Washington. He also has an M.S. and B.S. in electrical and computer engineering from Carnegie Mellon University.

Abstract

This talk will provide an overview of the MVAPICH project (past, present, and future). The future roadmap and features for upcoming releases of the MVAPICH2 software family (including MVAPICH2-X and MVAPICH2-GDR) for HPC and Deep Learning will be presented. Features and releases for AMD GPUs, Broadcom RoCEv2 adapters, Microsoft Azure, and Amazon AWS (with EFA adapters) will also be presented. The current status and future plans for OSU INAM, OMB, and the Best Practices page will also be covered.


Bio

Dhabaleswar K (DK) Panda

DK Panda is a Professor and University Distinguished Scholar of Computer Science and Engineering at The Ohio State University. He has published over 450 papers in the area of high-end computing and networking. The MVAPICH2 (High-Performance MPI and PGAS over InfiniBand, iWARP, RoCE, Omni-Path and EFA) libraries, designed and developed by his research group (http://mvapich.cse.ohio-state.edu), are currently being used by more than 3,100 organizations worldwide (in 89 countries). More than 797,000 downloads of this software have taken place from the project's site. This software is empowering several InfiniBand clusters (including the 4th, 8th, 12th, 18th, 19th, 22nd and 40th ranked ones) in the TOP500 list. The group has also been focusing on accelerating popular Deep Learning frameworks (TensorFlow, PyTorch and MXNet) using the MVAPICH2-GDR library and co-designing these frameworks. These solutions are available from the High-Performance Deep Learning (HiDL, http://hidl.cse.ohio-state.edu) project site. The RDMA packages for Apache Spark, Apache Hadoop and Memcached, together with the OSU HiBD benchmarks from his group (http://hibd.cse.ohio-state.edu), are also publicly available. These libraries are currently being used by more than 330 organizations in 36 countries. More than 37,600 downloads of these libraries have taken place. He is an IEEE Fellow. More details about Prof. Panda are available at http://www.cse.ohio-state.edu/~panda.

Abstract

High performance computing and Artificial Intelligence are the most essential tools fueling the advancement of science. NVIDIA Networking technologies are the engine of the modern HPC data center. Mellanox HDR InfiniBand enables extremely low latencies and high data throughput, and includes high-value features such as smart In-Network Computing acceleration engines via Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) technology, high network resiliency through SHIELD's self-healing network capabilities, MPI offloads, enhanced congestion control, and adaptive routing. These capabilities deliver leading performance and scalability for compute- and data-intensive applications, and a dramatic boost in throughput and cost savings, paving the way to scientific discovery.


Bio

Gilad Shainer

Gilad Shainer serves as senior vice president of marketing for Mellanox networking at NVIDIA, focusing on high-performance computing, artificial intelligence, and InfiniBand technology. Mr. Shainer joined Mellanox in 2001 as a design engineer and has served in senior marketing management roles since 2005. Mr. Shainer serves as the chairman of the HPC-AI Advisory Council, the president of the UCF and CCIX consortia, a member of the IBTA, and a contributor to the PCI-SIG PCI-X and PCIe specifications. Mr. Shainer holds multiple patents in the field of high-speed networking. He is a recipient of the 2015 R&D 100 award for his contribution to the CORE-Direct In-Network Computing technology and the 2019 R&D 100 award for his contribution to the Unified Communication X (UCX) technology. Gilad Shainer holds M.Sc. and B.Sc. degrees in Electrical Engineering from the Technion (Israel Institute of Technology).

12:30 - 1:00

Break

Abstract

Amazon Web Services launched Elastic Fabric Adapter (EFA), a high-performance network interface for EC2 instances, in early 2019. This talk will present an overview of the EFA architecture and the underlying Scalable Reliable Datagram (SRD) protocol, review the features and enhancements developed since the initial launch, survey the set of new server types that support EFA, and present case studies of real customer workloads that leverage EFA to satisfy demanding networking needs. The talk will also dive deep into the MPI ecosystem on AWS and the MVAPICH team's EFA-enhanced MV2-X-AWS solution.


Bio

Raghunath Rajachandrasekar

Raghunath is a Senior Engineer at Amazon Web Services where he is the Technical Lead for a team that builds HPC technologies to run applications more efficiently in the cloud. He leads the libfabric development efforts at AWS and serves as the maintainer for the Elastic Fabric Adapter provider. Prior to AWS, he was a Senior Engineer at Cray, conducting research on next-generation HPC storage technologies as part of an Advanced Development group. Raghunath received his PhD from The Ohio State University. He has authored more than 20 scientific publications, presented talks at several international venues, and serves on the Technical Program Committees and Steering Committees of multiple HPC conferences and workshops.

Abstract

Cloud Computing is democratizing cutting-edge technology with unprecedented performance, scalability, and cost-efficiency. The cloud offers ever-ready resources with flexible, instant scaling options, and it has made the latest technologies and architectures available to everyone, even on the very first day they are announced publicly. As Cloud HPC platforms become more and more advanced, it is critical to have the right software ecosystem, such as MPI libraries, to get the best performance out of the platform. This talk focuses on achieving the best performance and scalability with MVAPICH2 on the Azure HPC platform. It will provide an overview of the latest HPC offerings in Microsoft Azure along with their performance characteristics. It will also cover the Microsoft Azure HPC marketplace images that include the MVAPICH2-Azure MPI libraries, as well as recommendations and best practices for using MVAPICH2 and MVAPICH2-X on Microsoft Azure. We will also discuss performance and scalability characteristics using microbenchmarks and HPC applications. Finally, we will demonstrate how to quickly deploy an MVAPICH2-powered cluster on Microsoft Azure HPC VMs.


Bio

Jithin Jose

Dr. Jithin Jose is a Senior Software Engineer at Microsoft. His work focuses on the co-design of software and hardware building blocks for high-performance computing platforms and on designing communication runtimes that seamlessly expose hardware capabilities to programming models and middleware. His research interests include high-performance interconnects and protocols, parallel programming models, big data, and cloud computing. Before joining Microsoft, he worked at Intel and IBM Research. He has published more than 25 papers in major conferences and journals related to these research areas. Dr. Jose received his Ph.D. degree from The Ohio State University in 2014.

Abstract

Applications, programming languages, and libraries that leverage sophisticated network hardware capabilities have a natural advantage when used in today's and tomorrow's high-performance and data-center computing environments. Modern RDMA-based network interconnects provide incredibly rich functionality (RDMA, atomics, OS bypass, etc.) that enables low-latency and high-bandwidth communication services. This functionality is supported by a variety of interconnect technologies such as InfiniBand, RoCE, iWARP, Cray's Aries/Gemini, and others. Over the last decade, the HPC community has developed a variety of user- and kernel-level protocols and libraries that enable high-performance applications over RDMA interconnects, including MPI, SHMEM, UPC, and others. Most recently, HPC platforms based on the Arm architecture have demonstrated that Arm-based SoCs scale from smartphone platforms to the leading system on the TOP500 list. In this talk, we present recent advances in interconnect research on Arm platforms. We discuss SmartNIC applications for HPC programming models, software and hardware overheads for MPI on Arm, and SVE support in MPI.
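
As one concrete illustration of this functionality at the MPI level, the sketch below uses standard MPI-3 one-sided operations, which RDMA-capable interconnects can map to hardware RDMA writes. It is a minimal example (run with two ranks), not code from the talk.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Each rank exposes one double through an RMA window; on RDMA
           fabrics the MPI_Put below can become a hardware RDMA write. */
        double local = -1.0;
        MPI_Win win;
        MPI_Win_create(&local, sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        if (rank == 0) {
            double value = 42.0;
            MPI_Put(&value, 1, MPI_DOUBLE, 1 /* target rank */,
                    0, 1, MPI_DOUBLE, win);
        }
        MPI_Win_fence(0, win);

        if (rank == 1)
            printf("rank 1 received %f via RMA\n", local);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }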


Bio

Pavel Shamis

Pavel is a Principal Research Engineer at Arm. His work is focused on the co-design of software and hardware building blocks for high-performance interconnect technologies, the development of communication middleware, and novel programming models. Prior to joining Arm, he spent five years at Oak Ridge National Laboratory (ORNL) as a research scientist in the Computer Science and Mathematics Division (CSMD). In this role, Pavel was responsible for research and development of multiple projects in high-performance communication domains, including Collective Communication Offload (CORE-Direct & Cheetah), OpenSHMEM, and OpenUCX. Before joining ORNL, Pavel spent ten years at Mellanox Technologies, where he led the HPC team and was one of the key drivers in the enablement of the Mellanox HPC software stack, including the OFA software stack, Open MPI, MVAPICH, OpenSHMEM, and others. Pavel is a recipient of two R&D 100 awards for his contributions to the development of the CORE-Direct collective offload technology and the OpenUCX communication framework.

Abstract

Heterogeneous architectures are becoming more prevalent as mainstream compute resources. oneAPI is an industry initiative that Intel is driving along with the community to develop standards-based programming models and libraries that deliver cross-architecture support. Within oneAPI, Level Zero provides low-level, direct-to-metal interfaces tailored to the devices in a oneAPI platform. Level Zero supports broad language features such as function pointers, virtual functions, unified memory, and I/O capabilities, while also providing the fine-grain explicit controls needed by higher-level runtimes. In this talk, we will go over specific Level Zero functionality that is of interest to implementers of MPI libraries.
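
For illustration, the sketch below shows the first steps an MPI implementation might take with Level Zero, namely enumerating drivers and devices. It is a minimal example based on the public Level Zero headers, with error handling omitted; it is not Intel's reference code.

    #include <level_zero/ze_api.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Minimal sketch: enumerate Level Zero drivers and devices, a first
       step before creating command queues or registering device memory. */
    int main(void)
    {
        zeInit(ZE_INIT_FLAG_GPU_ONLY);

        uint32_t driver_count = 0;
        zeDriverGet(&driver_count, NULL);

        ze_driver_handle_t *drivers = malloc(driver_count * sizeof(*drivers));
        zeDriverGet(&driver_count, drivers);

        for (uint32_t d = 0; d < driver_count; ++d) {
            uint32_t device_count = 0;
            zeDeviceGet(drivers[d], &device_count, NULL);
            printf("driver %u exposes %u device(s)\n", d, device_count);
        }

        free(drivers);
        return 0;
    }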


Bio

Sayantan Sur

Sayantan Sur is a Principal Engineer at Intel Corp. His work involves high-performance computing, specializing in scalable interconnection fabrics, MPI, PGAS, and the OpenFabrics Interfaces. Before joining Intel Corp, Dr. Sur was a Research Scientist in the Department of Computer Science and Engineering at The Ohio State University. In the past, he held a post-doctoral position at the IBM T. J. Watson Research Center, NY. He has published more than 20 papers in major conferences and journals related to these research areas. Dr. Sur received his Ph.D. degree from The Ohio State University in 2007.

Abstract

The MVAPICH2 software family provides high-performance MPI libraries for HPC and machine learning applications. With the emergence of Thor as a mainstream 100/200 Gigabit Ethernet controller, HPC/ML applications can now realize the benefits of high-end proprietary networks using standard Ethernet at much lower cost. In this talk, we will share our experience running MVAPICH2 on Thor. We will highlight key Thor features for MPI applications and provide preliminary performance measurements.


Bio

Hemal Shah, Moshe Voloshin, and Devesh Sharma

Hemal Shah is a Distinguished Engineer and Systems/Software/Standards architect in the Compute and Connectivity (CCX) division at Broadcom Inc., where he leads and manages a team of architects. Hemal is responsible for the definition of Ethernet NIC product architecture and the software roadmap/architecture of performance NICs/SmartNICs. Hemal led the architecture definition of several generations of NetXtreme® E-Series/NetXtreme I server product lines and NetXtreme I client product lines. Hemal spearheaded the system architecture development of TruFlow™ technology for vSwitch acceleration/packet processing software frameworks, TruManage™ technology for system and network management, device security features, virtualization, and stateless offloads. Hemal has defined the system architecture of RDMA hardware/software solutions for more than two decades.
Before joining Broadcom in 2005, Hemal worked at Intel Corporation, where he led the development of system/silicon/software architecture of communication processors, 10 Gigabit Ethernet controllers, TCP/iSCSI/RDMA offloads, and IPsec/SSL/firewall/VPN acceleration. Hemal is the lead technical representative/contributor from Broadcom Inc. in the Open Compute Project (OCP) and the Distributed Management Task Force (DMTF). Hemal serves as Senior VP of Technology in the DMTF and as a project co-lead of the OCP Hardware Management project. Hemal has co-authored several OCP specifications, 70+ DMTF specifications, four IETF RFCs, and more than 10 technical conference/journal papers. Hemal is a named inventor on 40+ patents with several more pending. Hemal holds Ph.D. (computer engineering) and M.S. (computer science) degrees from Purdue University, an M.S.E.E. degree from The University of Arizona, and a B.S. (electronics and communication engineering) degree from Gujarat University, India.

Moshe Voloshin is a systems architect in the Compute and Connectivity (CCX) division at Broadcom Inc. Moshe spearheaded the system architecture development of RoCE and congestion control in Broadcom NICs and is involved in the definition of Ethernet NIC product architecture, performance NICs/SmartNICs, modeling, and system simulations.
Previously, Moshe was a director, manager, and ASIC/HW engineer in Cisco's high-end router division, where he developed and managed the development of network processing unit (NPU), QoS, and fabric ASICs in products such as the GSR and CRS.

Devesh Sharma is a software engineer in the Compute and Connectivity (CCX) division at Broadcom Inc. Devesh has worked on RDMA technology for more than 14 years. During this time, he led the development of the RoCE driver and RDMA provider library for several generations of multiple Ethernet controller product lines. Devesh has expertise in RDMA application software stacks and libraries such as IB-uVerbs, libfabric, uDAPL, and OFED, and in kernel-space RDMA consumers such as NFS-RDMA, iSER, and NVMe over Fabrics. His work experience also extends to real-life use cases of various MPI libraries, including MVAPICH2. Devesh holds a postgraduate diploma in embedded systems design and a Bachelor's degree in information technology.

Abstract

The TAU Performance System is a powerful and highly versatile profiling and tracing tool ecosystem for performance analysis of parallel programs at all scales. TAU has evolved with each new generation of HPC systems and presently scales efficiently to hundreds of thousands of cores on the largest machines in the world. To meet the needs of computational scientists who evaluate and improve the performance of their applications, we present TAU's support for key MVAPICH features, including the MPI Tools (MPI_T) interface with the ability to set MPI_T control variables on a per-communicator basis, and path-based profiling that uses PVARs to accurately track the path taken by MPI messages on hybrid CPU-GPU systems. TAU's support for GPUs, including CUDA, OpenCL, OpenACC, Kokkos, and ROCm, improves performance evaluation of heterogeneous programming models. The talk will also describe TAU's support for the performance and control variables exported by MVAPICH, its instrumentation of the OpenMP runtime, and its APIs for instrumentation of Python programs. TAU uses these interfaces on unmodified binaries without the need for recompilation. The talk will describe new instrumentation techniques that simplify the usage of performance tools, including compiler-based instrumentation, binary rewriting, and preloading of shared objects. It will also highlight TAU's analysis tools, including its 3D profile browser ParaProf and its cross-experiment analysis tool PerfExplorer. http://tau.uoregon.edu
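
As a minimal illustration of the MPI_T interface referenced above, the sketch below enumerates the control variables (CVARs) that an MPI library such as MVAPICH2 exports. It uses only standard MPI-3 calls and is not TAU's internal code.

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal sketch: list the MPI_T control variables exposed by the MPI
       library, the same interface tools like TAU use to read and tune
       runtime settings on a per-communicator basis. */
    int main(int argc, char **argv)
    {
        int provided, num_cvar;
        MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
        MPI_Init(&argc, &argv);

        MPI_T_cvar_get_num(&num_cvar);
        for (int i = 0; i < num_cvar; ++i) {
            char name[256], desc[1024];
            int name_len = sizeof(name), desc_len = sizeof(desc);
            int verbosity, binding, scope;
            MPI_Datatype datatype;
            MPI_T_enum enumtype;

            MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &datatype,
                                &enumtype, desc, &desc_len, &binding, &scope);
            printf("CVAR %d: %s\n", i, name);
        }

        MPI_Finalize();
        MPI_T_finalize();
        return 0;
    }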


Bio

Sameer Shende

Dr. Sameer Shende has helped develop the TAU Performance System, the Program Database Toolkit (PDT), the Extreme-scale Scientific Software Stack (E4S) [https://e4s.io] and the HPCLinux distro. His research interests include tools and techniques for performance instrumentation, measurement, analysis, runtime systems, HPC container runtimes, and compiler optimizations. He serves as the Director of the Performance Research Laboratory at the University of Oregon, and as the President and Director of ParaTools, Inc., ParaTools, SAS, and ParaTools, Ltd.

4:00 - 4:30

Short Presentations

Hardware Tag Matching in MVAPICH2-X, Mohammadreza Bayatpour, The Ohio State University
Efficient Support of Large Message Neighborhood Collectives in MVAPICH2-X, Mahdieh Ghazimirsaeed, The Ohio State University
Accelerating CPU-Based Training for Very large Deep Neural Networks using MVAPICH2-X, Arpan Jain, The Ohio State University

Wednesday, August 26

Abstract

Expanse is a new 5-PF hybrid Dell cluster at SDSC serving the "long tail of science" through the XSEDE program. Expanse follows in the successful tradition of the current Comet system at SDSC, supporting thousands of users and diverse scientific workloads via multiple access methodologies, including a traditional batch scheduler, science gateways, the Open Science Grid, and interactive notebooks. Expanse adds to this array of capabilities seamless public cloud integration and distributed composable systems managed by Kubernetes, effectively extending the system's capabilities beyond the machine room. In this talk, I will describe the system hardware architecture, software environment, and operational policies designed to support Comet's current user base and to expand to new user communities. Expanse is one of the first XSEDE systems to deploy AMD's 7nm EPYC processors with 128 cores/node, which are well suited to long-tail workloads. Expanse also incorporates NVIDIA V100 GPUs for the emerging class of ML/DL workloads. Expanse is scheduled to go into production on October 1, 2020. For more information, see http://expanse.sdsc.edu. Expanse is funded by the National Science Foundation under award OAC-1928224.


Bio

Michael L. Norman

Dr. Michael L. Norman, named SDSC interim director in June 2009 and appointed to the position of director in September 2010, is a distinguished professor of physics at UC San Diego and a globally recognized astrophysicist. Dr. Norman is a pioneer in using advanced computational methods to explore the universe and its beginnings. In this capacity, he has directed the Laboratory for Computational Astrophysics -- a collaborative effort between UC San Diego and SDSC resulting in the Enzo community code for astrophysics and cosmology in use worldwide. Dr. Norman is the author of over 300 research articles in diverse areas of astrophysics, including star and galaxy formation, the evolution of intergalactic medium, as well as numerical methods. Dr. Norman's work has earned him numerous honors, including Germany's prestigious Alexander von Humboldt Research Prize, the IEEE Sidney Fernbach Award, and several HPCC Challenge Awards. He also is a Fellow of the American Academy of Arts and Sciences, and the American Physical Society. He holds an M.S. and Ph.D. in engineering and applied sciences from UC Davis, and in 1984 completed his post-doctoral work at the Max Planck Institute for Astrophysics in Garching, Germany. From 1986 to 2000, Dr. Norman held numerous positions at the University of Illinois in Urbana, as an NCSA associate director and senior research scientist under Larry Smarr, and as a professor of astronomy. From 1984 to 1986, he was a staff member at Los Alamos National Laboratory. Dr. Norman is the Principal Investigator of two of SDSC’s leading HPC systems—Comet and Expanse—which together represent more than $50M in NSF funding.

Abstract

KISTI's Nurion supercomputer features 8,305 nodes with Intel Xeon Phi KNL (Knights Landing) processors (68 cores) and 132 nodes with Intel Skylake CPUs (2-socket, 40 cores). Nurion consists of compute nodes, CPU-only nodes, Omni-Path interconnect networks, burst-buffer high-speed storage, a Lustre-based parallel file system, and a water-cooling system based on Rear Door Heat Exchangers (RDHx). We will present microbenchmark and application performance results using MVAPICH2-X with XPMEM on the KNL nodes.


Bio

Minsik Kim

Minsik Kim is a researcher in the Supercomputing Infrastructure Center of the Korea Institute of Science and Technology Information (KISTI). He received the Ph.D. degree in Electrical and Electronic Engineering from Yonsei University in 2019. His research interests include deep learning optimization on GPUs, computer architecture, and high-performance computing. He is a member of IEEE. More details about Dr. Kim are available at http://minsik-kim.github.io.

Abstract

To provide high-performance MPI intra-node communication on multi-/many-core systems, kernel-level support has been studied over the past fifteen years. The System Software Laboratory at Konkuk University has developed kernel-level support called LiMIC2 for MVAPICH2. LiMIC2 reduces the number of data copies by means of memory mapping and provides high-bandwidth, low-latency intra-node communication for large messages. However, additional challenging issues for MPI intra-node communication arise in exascale computing. To address these issues, the Post-LiMIC2 project was launched in July 2020 as part of a supercomputer research project funded by the Korean government. This talk will present an overview of the Post-LiMIC2 project, highlight its main research focus areas (e.g., power efficiency, skew tolerance, and manageability), and discuss early experimental results.


Bio

Hyun-Wook Jin

Hyun-Wook Jin is a Professor in the Department of Computer Science and Engineering at Konkuk University, Seoul, Korea, where he leads the System Software Research Laboratory (SSLab). Before joining Konkuk University in 2006, he was a Research Associate in the Department of Computer Science and Engineering at The Ohio State University. He received his Ph.D. degree from Korea University in 2003. His main research focus is on operating systems for high-end computing systems and cyber-physical systems.

Abstract


Bio

John Cazes

John Cazes joined TACC in March 2005. Prior to TACC, he served as the lead for climate, weather, and ocean HPC modeling at the Navy's supercomputing center (part of the Department of Defense High Performance Computing Modernization Program). He has over 25 years of experience in high performance computing in public and private industry, and currently serves as the director of HPC for TACC where he leads application performance efforts for TACC's 10,000+ user community.

12:30 - 1:00

Break

Abstract

SDSC's Comet and Expanse are both targeted at long-tail workloads, with both CPU- and GPU-based nodes. MVAPICH2 and MVAPICH2-GDR enable many applications on these systems, and this talk will present results from performance studies on both machines. Expanse is one of the first NSF-funded (award OAC-1928224) XSEDE systems to deploy AMD's 7nm EPYC processors with 128 cores/node. The GPU nodes incorporate NVIDIA V100 GPUs supporting a broad range of applications in molecular dynamics, bioinformatics, and machine learning/deep learning. Expanse is scheduled for early user access in September, with production starting October 1, 2020. This talk will provide an early look at the performance of applications such as NEURON, RAxML, WRF, TensorFlow, and HOOMD-Blue.


Bio

Mahidhar Tatineni

Mahidhar Tatineni received his M.S. and Ph.D. in Aerospace Engineering from UCLA. He currently leads the User Services group at SDSC. He has led the deployment and support of high-performance computing and data applications software on several NSF and UC resources, including Comet and Gordon at SDSC. He has worked on many NSF-funded optimization and parallelization research projects, such as petascale computing for magnetosphere simulations, MPI performance tuning frameworks, hybrid programming models, topology-aware communication and scheduling, big data middleware, and application performance evaluation using next-generation communication mechanisms for emerging HPC systems. He is co-PI on the Comet and Expanse HPC system projects at SDSC.

Abstract

The human brain is a complex multi-level system with 100 billion neurons and 100 trillion synapses. Building large-scale brain circuits is a scientific endeavor and a computationally expensive task. The goal of the Blue Brain Project at EPFL is to use a data-driven software pipeline to reconstruct biologically detailed models of rodent brain tissue. Once built, the model is simulated to faithfully reproduce an array of laboratory experiments in silico. BBP utilizes MPI throughout its software stack, and this talk will give an overview of the communication patterns and the challenges of detecting connectivity between detailed representations of neurons with hundreds of branches, targeting brain regions of up to 80 million neurons.


Bio

Matthias Wolf

Matthias Wolf is a Big Data software engineer in the High Performance Computing division at the Blue Brain Project. He leads the development of the circuit building software stack to establish the neural connectivity of digitally reconstructed brain regions, and to maintain scalability for future scientific endeavors. To this end, Matthias is working with both MPI as well as modern data analysis frameworks like Apache Spark. Before joining the Blue Brain Project, Matthias concluded his doctoral studies in High Energy Physics at the University of Notre Dame.

Abstract

The OSU InfiniBand Network Analysis and Monitoring (OSU INAM) tool has been running on OSC’s production systems for more than a year. In this talk we’ll give an overview of OSC’s HPC environment and IB fabric and discuss our INAM deployment. We'll include a discussion of OSC’s INAM configuration, support for our job scheduling environment, and experience using INAM to understand communication characteristics for HPC jobs. We’ll follow with a short demo of INAM at OSC.


Bio

Karen Tomko and Heechang Na

Karen Tomko is the Director of Research Software Applications and serves as manager of the Scientific Applications group at the Ohio Supercomputer Center where she oversees deployment of software for data analytics, modeling and simulation. Her research interests are in the field of parallelization and performance improvement for High Performance Computing applications. She has been with OSC since 2007 and has been collaborating with DK Panda and the MVAPICH team for about 10 years.

Heechang Na is a Senior Scientific Applications Engineer at the Ohio Supercomputer Center. He is interested in performance analysis, system monitoring, and research environment development. He received his Ph.D. in computational high-energy physics from Indiana University, Bloomington. Before joining the Ohio Supercomputer Center in 2015, he worked with MILC and HPQCD collaborations in Lattice QCD.

Abstract

The talk will describe research, learning and workforce development (LWD) programs within the Office of Advanced Cyberinfrastructure (OAC) in the CISE directorate at the National Science Foundation. OAC's mission is to support advanced cyberinfrastructure to accelerate discovery and innovation across all science and engineering disciplines. The programs specifically addressed include the CAREER program for faculty early career development, the CISE Research Initiation Initiative (CRII) for early career faculty who have not yet been a PI on a Federal grant, the Cybertraining program for research workforce preparation, and the OAC Core Research Program that is part of the CISE Core Research programs solicitation.


Bio

Alan Sussman

Alan Sussman is currently a program director in the Office of Advanced Cyberinfrastructure at NSF in charge of learning and workforce development programs, and is also active in software and data related cyberinfrastructure programs. He is on leave from his permanent position as a Computer Science professor at the University of Maryland. His research interests have focused on systems software support for large-scale applications that require high performance parallel and distributed computing. In addition, since 2010, he has been helping coordinate a curriculum initiative in parallel and distributed computing with the premise that every undergraduate student in Computer Science or Computer Engineering must acquire basic parallel computing skills. This curriculum has had wide adoption including its direct impact on ACM’s CS2013 Curriculum.

Abstract

This talk will present an overview of two products with enhanced capabilities, developed by X-ScaleSolutions using the MVAPICH2 libraries: 1) DeepIntrospect (DI) for deep learning applications and 2) the SMART communication accelerator module (SMART-CAM) for Mellanox BlueField adapters.

DeepIntrospect (DI): Modern CPUs/GPUs and high-performance interconnects are creating breakthrough opportunities for AI and Deep Learning (DL) applications. The existing approach to monitoring and analyzing the performance of, and the interplay between, the various components in DL applications requires 1) a plethora of disjoint tools, 2) significant manual effort, and 3) expertise to correlate statistics. DeepIntrospect (DI) is a new tool from X-ScaleSolutions that enhances existing capabilities to monitor, analyze, and understand the performance of DL applications and frameworks. It provides a holistic viewpoint on application behavior and characteristics via an integrated, user-friendly GUI, and allows end users as well as runtime developers to identify performance bottlenecks and optimize their designs to get the best performance and scalability for their DL/AI applications on modern clusters.

SMART-CAM: MPI has been the programming model of choice for large-scale parallelism in HPC systems due to its support for efficient network communication. New network adapter SoCs (like Mellanox BlueField) further improve communication by integrating a traditional adapter with a sizable array of Arm processors on a single chip. SMART-CAM takes advantage of the features of such an adapter to offload communication components of the MPI library and deliver best-in-class scale-up and scale-out performance for HPC and DL applications. It integrates key components enabling new forms of computation and communication overlap, and it enhances the existing production-quality MVAPICH2 MPI library.

This talk will present an overview of the software architectures of the DI and SMART-CAM products, discuss the underlying designs and protocols, and demonstrate and highlight some salient features and enhancements developed so far.


Bio

Donglai Dai

Dr. Donglai Dai is a Chief Engineer at X-ScaleSolutions and leads the company's R&D team. His current work focuses on developing scalable, efficient communication libraries and performance analysis tools for distributed and parallel HPC and deep learning applications on HPC systems. He has more than 20 years of industry experience in engineering management and the development of computer systems, VLSI, IoT, and interconnection networks while working at Intel, Cray, SGI, and startups. He holds more than 10 granted US patents, with several more pending, and has published more than 30 technical papers or book chapters. He has a PhD in computer science from The Ohio State University.

3:15 - 4:15

Short Presentations

Exploiting AMD GPUs with MVAPICH2-GDR, Jahanzeb Hashmi, The Ohio State University
High-Performance Alltoall and Allgather Support for Dense GPU Systems in MVAPICH2-GDR, Kawthar Shafie Khorassani, The Ohio State University
Scaling Distributed PyTorch and DeepSpeed with MVAPICH2-GDR, Quentin Anthony, The Ohio State University
Training of Very Large Pathology Images using MVAPICH2-GDR, Arpan Jain, The Ohio State University
Efficient Communication support for DASK using MVAPICH2-GDR, Aamir Shafi, The Ohio State University
Benefits of GPU-Assisted On-the-fly Message Compression with MVAPICH2-GDR, Qinghua Zhou, The Ohio State University
Parallelizing GPU-accelerated Machine Learning Applications using MVAPICH2-GDR, Aamir Shafi, The Ohio State University

4:15 - 4:30

Open MIC & Conclusions