MUG'23

(Final Program)

All Times Are U.S. EDT

Conference Location: OSU Translational Data Analytics Institute (TDAI), Pomerene Hall, Room #320


MUG'23 conference attendees gather for a group photo.

Monday, August 21

7:30 - 8:30

Registration and Continental Breakfast

Abstract

The OPX Libfabric provider began as a clone of the bgq provider and has been transformed into the provider for Omni-Path 100 HPC fabrics. OPX is a native Libfabric provider that requires no additional messaging libraries and is suitable for use with MPI- and OpenSHMEM-based applications. OPX maintains an embedded-systems focus so that it remains lightweight and minimally intrusive to running applications.

In this tutorial, we will explore some observability features of OPX and see how they can be useful for debugging HPC applications. We will then dive into some tunable parameters that affect messaging performance, explaining what each parameter does and why performance is affected, and then test these settings by running microbenchmarks and observing the behavior and results. We will also discuss why some of the parameters exist and consider possible message-layer improvements that could enable better performance in certain cases.


Bio


Tim Thompson began his software career in 2000 at IBM, working on the AS/400 kernel. Later IBM projects included Hypervisor firmware for POWER hardware, Simics behavior-based hardware simulation, and a networking project that targeted the Power7 IH (PERCS) supercomputer interconnect at the non-HPC market. After leaving IBM in 2012, Tim worked for a small ecommerce company (Zaneray.com) doing web programming and then co-founded a bioinformatics SaaS startup (https://truwl.com). Tim is currently a senior software engineer at Cornelis Networks working on the OPX Libfabric provider.

Abstract

The TAU Performance System is a versatile performance evaluation toolkit supporting both profiling and tracing modes of measurement. It supports performance evaluation of applications running on CPUs and GPUs and supports runtime preloading of a Dynamic Shared Object (DSO), which allows users to measure performance without modifying the source code or binary. This tutorial will describe how TAU may be used with MVAPICH to support advanced performance introspection capabilities at the runtime layer. TAU's support for tracking the idle time spent in implicit barriers within collective operations will be demonstrated. TAU also supports event-based sampling at the function, file, and statement level. TAU's support for runtime systems such as CUDA (for NVIDIA GPUs), Level Zero (for Intel oneAPI DPC++/SYCL), ROCm (for AMD GPUs), OpenMP (with support for OMPT and Target Offload directives), Kokkos, and MPI allows instrumentation at the runtime-system layer while using sampling to evaluate statement-level performance data. A hands-on session/demo on AWS with the Extreme-scale Scientific Software Stack (E4S) will be shown.


Bio


Prof. Sameer Shende has helped develop the TAU Performance System, the Program Database Toolkit (PDT), the Extreme-scale Scientific Software Stack (E4S) [https://e4s.io] and the HPCLinux distro. His research interests include tools and techniques for performance instrumentation, measurement, analysis, runtime systems, software stacks, HPC container runtimes, and compiler optimizations. He serves as a Research Professor and the Director of the Performance Research Laboratory at the University of Oregon, and as the President and Director of ParaTools, Inc. and ParaTools, SAS.

10:30 - 11:00

Morning Coffee Break

Abstract

NVIDIA's BlueField DPU is a system-on-a-chip with both network and computational capabilities. In this presentation we will introduce the BlueField device and capabilities that make it suitable for application acceleration. Application performance gains will be discussed from the perspective of communication library acceleration and the enablement of computation-communication overlap.


Bio


Dr. Richard Graham is a Senior Director, HPC Technology at NVIDIA's Networking Business unit. His primary focus is on HPC network software and hardware capabilities for current and future HPC technologies. Prior to moving to Mellanox/NVIDIA, Rich spent thirteen years at Los Alamos National Laboratory and Oak Ridge National Laboratory, in computer science technical and administrative roles, with a technical focus on communication libraries and application analysis tools. He is cofounder of the Open MPI collaboration and was chairman of the MPI 3.0 standardization efforts.

12:00 - 1:00

Lunch Break

Abstract

The MVAPICH2-DPU, X-ScaleHPL-DPU, and X-ScaleAI-DPU packages take advantage of DPU features to offload communication components in the MPI library and the PyTorch DL framework and to accelerate HPC and AI applications. The packages enhance and integrate key components of the MPI library and the PyTorch framework, enabling a high degree of overlap among computation, communication, and I/O operations. This tutorial will provide an overview of the MVAPICH2-DPU, X-ScaleHPL-DPU, and X-ScaleAI-DPU products, their main features, and their acceleration capabilities for a set of representative HPC and AI applications and benchmarks. Live demos will demonstrate the capabilities of the latest versions of these products.
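
As a rough illustration of the computation/communication overlap described above, the sketch below posts a non-blocking allreduce with plain mpi4py and overlaps it with independent computation. It is only a generic example, not the MVAPICH2-DPU offload machinery; the array size and the NumPy computation are arbitrary placeholders.

    # Illustrative sketch of computation/communication overlap with a non-blocking
    # allreduce (generic mpi4py; not the MVAPICH2-DPU offload implementation).
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD

    send = np.random.rand(1 << 20)
    recv = np.empty_like(send)

    # Start the collective; with offload support it can progress in the background.
    req = comm.Iallreduce(send, recv, op=MPI.SUM)

    # Independent computation proceeds while the reduction is in flight.
    partial = np.square(send).sum()

    req.Wait()                          # complete the collective before using recv
    if comm.Get_rank() == 0:
        print("local partial:", partial, "reduced[0]:", recv[0])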


Bio


Dr. Donglai Dai is a Chief Engineer at X-ScaleSolutions and leads the company’s R&D team. His current work focuses on developing scalable, efficient communication libraries, checkpointing and restart libraries, and performance analysis tools for distributed and parallel HPC and deep learning applications on HPC systems and clouds. He has more than 20 years of industry experience in engineering management and development of computer systems, VLSI, IoT, and interconnection networks while working at Intel, Cray, SGI, and startups. He holds more than 10 granted US patents and has published more than 40 technical papers or book chapters. He has a PhD degree in computer science from The Ohio State University.

Kyle Schaefer is a Software Engineer at X-ScaleSolutions. His current work focuses on continuing the testing, design, and development of the MVAPICH2-DPU project.

Abstract

The tutorial will start with an overview of the MVAPICH libraries and their features. Next, we will focus in depth on installation guidelines, runtime optimizations, and tuning flexibility. An overview of configuration and debugging support in the MVAPICH2 libraries will be presented, along with high-performance support for NVIDIA/AMD GPU-enabled clusters in MVAPICH-Plus/MVAPICH2-GDR and for many-core systems in MVAPICH-Plus/MVAPICH2-X. The performance impact of the various features and optimization techniques will be discussed in an integrated fashion. "Best practices" for a set of common applications will be presented, together with case studies that demonstrate how one can effectively take advantage of MVAPICH for high-end computing applications using MPI and CUDA/OpenACC.

Bio


Dr. Hari Subramoni is an assistant professor in the Department of Computer Science and Engineering at the Ohio State University, USA. His current research interests include high performance interconnects and protocols, parallel computer architecture, network-based computing, exascale computing, network topology aware computing, QoS, power-aware LAN-WAN communication, fault tolerance, virtualization, big data, deep learning and cloud computing. He has published over 100 papers in international journals and conferences related to these research areas. He has been actively involved in various professional activities in academic journals and conferences. Dr. Subramoni is doing research on the design and development of MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) and MVAPICH2-X (Hybrid MPI and PGAS (OpenSHMEM, UPC and CAF)) software packages. He is a member of IEEE.

Nat Shineman is a software engineer in the Department of Computer Science and Engineering at the Ohio State University. His current development work includes high performance interconnects, parallel computing, scalable startup mechanisms, and performance analysis and debugging of the MVAPICH2 library.

Abstract

The OSU Microbenchmark (OMB) suite is a popular set of benchmarks for evaluating the performance of HPC systems. In this tutorial, we will take attendees through the new features that have been added to OMB, such as support for Java-based and Python-based benchmarks. We will also discuss the enhancements made to the C benchmark suite, such as support for user-defined data types, graphical representation of output, data validation, support for profiling tools like PAPI, and support for newer MPI primitives such as persistent operations and MPI Sessions.
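
A minimal sketch of the kind of measurement the Python-based benchmarks perform is shown below, written with mpi4py. It is illustrative only and is not the OMB code itself; the message size and iteration count are arbitrary choices.

    # Minimal mpi4py ping-pong latency sketch (illustrative only; not OMB-Py).
    # Run with two ranks, e.g.: mpirun -np 2 python latency.py
    from mpi4py import MPI
    import time

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    msg_size = 8 * 1024          # 8 KiB message (arbitrary choice)
    iterations = 1000
    buf = bytearray(msg_size)

    comm.Barrier()
    start = time.perf_counter()
    for _ in range(iterations):
        if rank == 0:
            comm.Send([buf, MPI.BYTE], dest=1, tag=0)
            comm.Recv([buf, MPI.BYTE], source=1, tag=0)
        else:
            comm.Recv([buf, MPI.BYTE], source=0, tag=0)
            comm.Send([buf, MPI.BYTE], dest=0, tag=0)
    elapsed = time.perf_counter() - start

    if rank == 0:
        # One-way latency is half the average round-trip time.
        print(f"{msg_size} bytes: {elapsed / iterations / 2 * 1e6:.2f} us")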


Bio


Dr. Hari Subramoni is an assistant professor in the Department of Computer Science and Engineering at the Ohio State University, USA. His current research interests include high performance interconnects and protocols, parallel computer architecture, network-based computing, exascale computing, network topology aware computing, QoS, power-aware LAN-WAN communication, fault tolerance, virtualization, big data, deep learning and cloud computing. He has published over 100 papers in international journals and conferences related to these research areas. He has been actively involved in various professional activities in academic journals and conferences. Dr. Subramoni is doing research on the design and development of MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) and MVAPICH2-X (Hybrid MPI and PGAS (OpenSHMEM, UPC and CAF)) software packages. He is a member of IEEE.

Aamir Shafi is currently a Research Scientist at the Ohio State University where he is involved in the High Performance Big Data project. Dr. Shafi was a Fulbright Visiting Scholar at MIT where he worked on the award-winning Cilk technology. Dr. Shafi received his PhD in Computer Science from the University of Portsmouth, UK in 2006. Dr. Shafi’s current research interests include architecting robust libraries and tools for Big Data computation with emphasis on Machine and Deep Learning applications. Dr. Shafi co-designed and co-developed a Java-based MPI-like library called MPJ Express.

Akshay Paniraja Guptha is a software engineer in the Department of Computer Science and Engineering at the Ohio State University. His current development work includes high-performance interconnects, parallel computing, scalable startup mechanisms, and performance analysis and debugging of the OSU Microbenchmarks and the MVAPICH2 library.

3:00 - 3:30

Afternoon Coffee Break

Abstract

The fields of Machine and Deep Learning (ML/DL) have witnessed remarkable advances in recent years, paving the way for cutting-edge technologies and leading to exciting challenges and opportunities. Modern ML/DL frameworks, including TensorFlow, PyTorch, and cuML, have emerged to offer high-performance training and deployment for various types of ML models and Deep Neural Networks (DNNs). This tutorial provides an overview of recent trends in ML/DL leveraging powerful hardware architectures, interconnects, and distributed frameworks to accelerate the training of ML/DL models, especially as they grow larger and more complicated. We present an overview of different DNN architectures, focusing on parallelization strategies for model training. We highlight new challenges and opportunities for communication runtimes to exploit high-performance CPU/GPU architectures to support large-scale distributed training efficiently. We also highlight some of our co-design efforts to utilize MPI for large-scale DNN training on cutting-edge CPU/GPU architectures available on modern HPC clusters.
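
To make the role of MPI in data-parallel training concrete, the sketch below averages per-rank gradients with an allreduce using mpi4py. It is a generic illustration, not the HiDL or MVAPICH co-design work itself; the gradient array is a stand-in for a real layer's gradients.

    # Illustrative sketch of MPI-based data-parallel gradient averaging with mpi4py.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    size = comm.Get_size()

    # Stand-in for the gradient of one layer, computed on this rank's local minibatch.
    local_grad = np.random.rand(1024)

    # Sum the gradients across all ranks in place, then divide to obtain the average
    # that every rank applies to its copy of the model.
    comm.Allreduce(MPI.IN_PLACE, local_grad, op=MPI.SUM)
    local_grad /= size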


Bio


Aamir Shafi is currently a Research Scientist at the Ohio State University where he is involved in the High Performance Big Data project. Dr. Shafi was a Fulbright Visiting Scholar at MIT where he worked on the award-winning Cilk technology. Dr. Shafi received his PhD in Computer Science from the University of Portsmouth, UK in 2006. Dr. Shafi’s current research interests include architecting robust libraries and tools for Big Data computation with emphasis on Machine and Deep Learning applications. Dr. Shafi co-designed and co-developed a Java-based MPI-like library called MPJ Express.

Nawras Alnaasan is a Graduate Research Associate at the Network-Based Computing Laboratory, Columbus, OH, USA. He is currently pursuing a Ph.D. degree in computer science and engineering at The Ohio State University. His research interests lie at the intersection of deep learning and high-performance computing. He works on advanced parallelization techniques to accelerate the training of Deep Neural Networks and exploit underutilized HPC resources covering a wide range of DL applications including supervised learning, semi-supervised learning, and hyperparameter optimization. He is actively involved in several research projects including HiDL (High-performance Deep Learning) and ICICLE (Intelligent Cyberinfrastructure with Computational Learning in the Environment). Alnaasan received his B.S. degree in computer science and engineering from The Ohio State University. Contact him at alnaasan.1@osu.edu.

4:00 - 5:30

Short Talks

  • Designing In-network Computing Aware Reduction Collectives in MPI, Bharath Ramesh, The Ohio State University
  • A Novel Framework for Efficient Offloading of Communication Operations to BlueField SmartNICs, Kaushik Kandadi Suresh, The Ohio State University
  • High Performance & Scalable MPI library over Broadcom RoCEv2, Shulei Xu, The Ohio State University
  • Implementing an MPI Library over Collective Communication Libraries for Habana Accelerators, Chen-Chun Chen, The Ohio State University
  • Bringing MPI support to FPGAs, Nick Contini, The Ohio State University
  • Can I Talk to my SuperComputer?, Pouya Kousha, The Ohio State University
  • DPU-Bench: A New Microbenchmark Suite to Measure the Offload Efficiency of SmartNICs, Ben Michalowicz, The Ohio State University

4:30 - 6:30

Visit to the State of Ohio Computer Center, SOCC (Optional)

6:30 - 9:30

Reception and Dinner at Endeavor Brewing and Spirits

909 W 5th Ave,

Columbus, OH 43212

Tuesday, August 22

7:30 - 8:20

Registration and Continental Breakfast

8:20 - 8:30

Opening Remarks

Tanya Berger-Wolf, Director, Translational Data Analytics Institute (TDAI)
Anish Arora, Chair, Department of Computer Science, The Ohio State University
Dhabaleswar K (DK) Panda, The Ohio State University

Abstract

This talk will cover plans for new systems at TACC and elsewhere deployed as part of the Leadership Class Computing Facility. Special emphasis will be placed on MVAPICH, and the role of interconnects and MPI libraries for the evolving architectures and workloads for the new systems and the open science community.


Bio


Dr. Dan Stanzione, Associate Vice President for Research at The University of Texas at Austin since 2018 and Executive Director of the Texas Advanced Computing Center (TACC) since 2014, is a nationally recognized leader in high performance computing. He is the principal investigator (PI) for a National Science Foundation (NSF) grant to acquire and deploy Frontera, which will be the fastest supercomputer at any U.S. university. Stanzione is also the PI of TACC's Stampede2 and Wrangler systems, supercomputers for high performance computing and for data-focused applications, respectively. For six years he was co-PI of CyVerse, a large-scale NSF life sciences cyberinfrastructure. Stanzione was also a co-PI for TACC's Ranger and Lonestar supercomputers, large-scale NSF systems previously deployed at UT Austin. Stanzione received his bachelor's degree in electrical engineering and his master's degree and doctorate in computer engineering from Clemson University.

Abstract

This talk will provide an overview of the MVAPICH project (past, present, and future). Future roadmap and features for upcoming releases of the MVAPICH software family (including MVAPICH-Plus) will be presented. Current status and future plans for OSU INAM and OMB will also be presented.


Bio


DK Panda is a Distinguished Professor of Engineering and University Distinguished Scholar at the Ohio State University. He has published over 500 papers in the area of high-end computing and networking. The MVAPICH (High-Performance MPI and PGAS over InfiniBand, iWARP, RoCE, EFA, Rockport Networks, and Slingshot) libraries, designed and developed by his research group (mvapich.cse.ohio-state.edu), are currently being used by more than 3,325 organizations worldwide (in 90 countries). More than 1.69 million downloads of this software have taken place from the project's site. This software is empowering several InfiniBand clusters (including the 7th, 21st, 36th and 49th ranked ones) in the TOP500 list. High-performance and scalable solutions for deep learning and machine learning from his group are available from hidl.cse.ohio-state.edu. High-performance and scalable libraries for Big Data stacks (Spark, Hadoop, and Memcached) and Data science applications from his group (hibd.cse.ohio-state.edu) are also publicly available. These libraries are currently being used by more than 360 organizations in 39 countries. More than 47,000 downloads of these libraries have taken place. He is an IEEE Fellow and a recipient of the 2022 IEEE Charles Babbage Award. More details about Prof. Panda are available at cse.ohio-state.edu/~panda.

10:15 - 10:45

Morning Coffee Break

Abstract

The high-performance NVIDIA Quantum InfiniBand network provides computing services via in-network computing acceleration engines, such as data aggregation and reduction, Message Passing Interface (MPI) Tag Matching, MPI All-to-All, and more. Offloading these data algorithms to the network decreases the amount of data traversing the network, dramatically reduces the time of communication framework operations, enables compute-communication overlap, and increases data center efficiency.


Bio


Gilad Shainer serves as senior vice president of networking at NVIDIA. Mr. Shainer serves as the chairman of the HPC-AI Advisory Council, the president of the UCF consortium, a member of the IBTA, and a contributor to the PCI-SIG PCI-X and PCIe specifications. Mr. Shainer holds multiple patents in the field of high-speed networking. He is a recipient of the 2015 R&D100 award for his contribution to the CORE-Direct in-network computing technology and the 2019 R&D100 award for his contribution to the Unified Communication X (UCX) technology. Gilad Shainer holds MSc and BSc degrees in Electrical Engineering from the Technion Institute of Technology in Israel.

Abstract

It is Dell Technologies’ mission to make HPC systems available to everyone, with an emphasis on ease of use and standards compliance without vendor lock-in, while also advancing HPC through research. This presentation shows a novel way of reducing the power footprint of distributed-memory, communication-intensive workloads, supported by measured results on real-life workloads. The presentation concludes with a case study on the use of MVAPICH for a large-scale ocean model.


Bio


Martin Hilgeman joined Dell Technologies in 2011, after having worked as an HPC application specialist for 12 years at SGI and IBM. In 2019, he joined AMD as a senior manager and worked on porting and optimizing major HPC applications for the “Rome” microarchitecture. Martin returned to Dell Technologies in May 2020 as the HPC performance lead and Distinguished Member of Technical Staff in Dell ISG. He holds a master’s degree in physical chemistry from VU University Amsterdam.

Abstract

The OpenFabrics Interfaces (OFI) define a layered communication API for high-performance fabrics. Its modular design enables fabrics to offer specific features to middleware implementations via a structured interface. This allows middleware to achieve a tight semantic map to the underlying hardware without having to deal with individual hardware features directly. The implementation of OFI, libfabric, achieves this by separating the low-level “provider” layer from the user-level API. This presentation will introduce the motivations for and features of libfabric using Omni-Path as an example and discuss some details of the OPX provider used by Omni-Path.


Bio


Brian Smith is the Director of Technology at Cornelis Networks. Brian has more than 25 years of experience in HPC, starting as an undergrad researcher and sysadmin at the US DOE Ames Lab facility. He received two bachelor's degrees in Computer Engineering and Electrical Engineering (with two minors) and a Master's degree in Computer Engineering from Iowa State.

He moved from there to IBM, where he led the communications teams for Blue Gene/L, /P, and /Q. He has over 100 patents and was an IBM Master Inventor. Brian went from IBM to Oak Ridge National Laboratory, where he worked on several projects in parallelizing data analysis for climate science codes. He also worked in user support for the Oak Ridge Leadership Computing Facility, NOAA, and Air Force Weather.

In his spare time, Brian is very active in FIRST FRC robotics where he mentors a high school team and serves as the Lead Robot Inspector for Tennessee.

12:15 - 12:30

Group Photo

12:30 - 1:30

Lunch Break

Abstract

In this talk, we will review recent developments of multi-GPU FFT implementations for large-scale systems and study communication frameworks for general parallel transposition of multi-dimensional arrays. We will then evaluate asymptotic scalability behavior, the impact of selecting MPI distributions and types of collective routines for accelerating FFT performance. Finally, we will present experiments on modern supercomputers with different network topologies.


Bio


Dr. Alan Ayala is a Software Engineer at AMD. His work focuses on the design of FFT software for GPUs and high-performance computing systems. His research interests include GPU and parallel programming, FFT applications, performance optimization, profiling tools, and network interconnects. Before joining AMD, he worked at the Innovative Computing Laboratory as a research scientist and developed the heFFTe library for FFT computation on exascale systems. Dr. Ayala received his Ph.D. degree in Computational Mathematics in 2018 from Sorbonne University in Paris, France.

Abstract

The MPI Forum is considering the standardization of an Application Binary Interface (ABI). A standard ABI would allow the interoperability of all MPI implementations, extending the convenience provided by the MPICH ABI Initiative to all implementations, including Open MPI. This talk will describe the current proposal for the standard ABI, which attempts to capture the best features of existing ABIs with uncompromising portability. The ecosystem benefits include container usage, third-party language support, the MPI tools ecosystem, and binary package managers like Spack.


Bio


Jeff Hammond is a Principal Engineer at NVIDIA based in Helsinki, Finland. He has advanced the MPI standard and its ecosystem for over a decade, including the development of MPI-3 RMA and its use as a back end for Global Arrays and OpenSHMEM, as well as initiating the “large count” initiative and leading the ABI working group. Jeff has a PhD in Chemistry from the University of Chicago and has contributed to the development of NWChem for almost 20 years.

Abstract

The performance and feature gap between bare-metal and cloud HPC/AI clusters is almost imperceptible on clouds such as Azure. This is evident as Azure supercomputers have climbed into the top HPC/AI cluster rankings. Public clouds democratize HPC/AI supercomputers with a focus on performance, scalability, and cost-efficiency. As cloud platform technologies and features continue to evolve, middleware such as MPI libraries and communication runtimes play a key role in enabling applications to make use of the technology advancements with high performance. This talk focuses on how MVAPICH2 efficiently enables the latest technology advancements, such as SR-IOV, GPU-Direct RDMA, and DPUs, in virtualized HPC and AI clusters. This talk will also provide an overview of the latest HPC and AI offerings in Microsoft Azure HPC - specifically HBv4 with AMD Genoa-X + NDR400 and NDv5 with Intel SPR, H100 GPUs + NDR400 - along with their performance characteristics with MVAPICH2/MVAPICH2-X.


Bio


Dr. Jithin Jose is a Principal Software Engineer at Microsoft. His work is focused on the co-design of software and hardware building blocks for high performance computing platforms and on performance optimizations. His research interests include high performance interconnects and protocols, parallel programming models, big data, and cloud computing. Before joining Microsoft, he worked at Intel and IBM Research. He has published more than 25 papers in major conferences and journals related to these research areas. Dr. Jose received his Ph.D. degree from The Ohio State University in 2014.

3:00 - 4:00

Student Poster Session (In-Person) and Coffee Break

Spark Meets MPI: Towards High-Performance Communication Framework for Spark using MPI, Kinan Al Attar, The Ohio State University
Impact of System-level Parameters on I/O Performance of HPC Applications, Debasmita Biswas, Virginia Polytechnic Institute and State University
MPI-xCCL: A Portable MPI Library over Collective Communication Libraries for Various Accelerators, Chen-Chun Chen, The Ohio State University
Scaling Bayesian Group Testing with MPI, Weicong Chen, University of California, Merced
Performance Characterization of using Quantization for DNN Inference on Edge Devices, Tian Chen, The Ohio State University
MV2-FPGA: Enabling Reconfigurable Computing with MPI Support, Nick Contini, The Ohio State University
Towards using Model Architecture-aware Characterization and Hardware-aware Predictions for Intelligent Batch Job Scheduling in HPC, Shruti Dongare, Virginia Polytechnic Institute and State University
Cold Start and Execution Time in Serverless Computing, Yankai Jiang, Northeastern University
Comparative Evaluation of Database for Profiling and Storing Exascale HPC Communication Data, Pouya Kousha, The Ohio State University
Role of Large-Scale Storage Architectures for Heterogeneous Workloads, Olga Kogiou, Florida State University
Understanding Hot Interconnects with an Extensive Benchmark Survey, Yuke Li, University of California, Merced
Battle of the BlueFields: How Much Smarter is the BlueField-3 Than its Predecessor?, Ben Michalowicz, The Ohio State University
TAU Performance System and MVAPICH, Jordi Alcaraz Rodríguez, University of Oregon
Exploring Performance Optimizations Throughout a Comprehensive Search Space Navigation System, Miguel Romero Rosas, University of Delaware
Bliss: Auto-tuning Complex Applications using a Pool of Diverse Lightweight Learning Models, Rohan Basu Roy, Northeastern University
Towards Characterizing DNNs to Estimate Training Time using ‘HARP’ - HPC Application Resource (runtime) Predictor, Manika Swathi Vallabhajosyula, The Ohio State University
Accelerating Large Language Model Training using Hybrid Compression Scheme, Lang Xu, The Ohio State University
High Performance & Scalable MPI library over Broadcom RoCEv2, Shulei Xu, The Ohio State University
MPI4Dask: Efficient MPI-based Communication for Scalable Accelerated Dask Applications, Jinghan Yao, The Ohio State University
High Performance Communication Middleware with On-the-fly GPU-based Compression for HPC and Deep Learning Applications, Qinghua Zhou, The Ohio State University

Abstract

HPC applications benefit from low-latency, high-message-rate communication for small messages. As network speeds increase and hardware data-path latency decreases, the software data-path overhead of upper-layer libraries like MPI becomes significant. In this talk, we will propose a set of MPI library enhancements that help improve overall latency and message rate for small messages.
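
For context, the sketch below shows a simple windowed message-rate measurement for small messages written with mpi4py. It illustrates the metric being discussed, not the proposed MPI-internal enhancements; the message size, window, and iteration count are arbitrary.

    # Illustrative mpi4py sketch of a small-message message-rate measurement.
    # Run with two ranks, e.g.: mpirun -np 2 python msg_rate.py
    from mpi4py import MPI
    import time

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    msg_size = 8          # small message (arbitrary choice)
    window = 64           # messages kept in flight per iteration
    iterations = 1000
    bufs = [bytearray(msg_size) for _ in range(window)]
    ack = bytearray(1)

    comm.Barrier()
    start = time.perf_counter()
    for _ in range(iterations):
        if rank == 0:
            reqs = [comm.Isend([b, MPI.BYTE], dest=1, tag=0) for b in bufs]
            MPI.Request.Waitall(reqs)
            comm.Recv([ack, MPI.BYTE], source=1, tag=1)   # ack keeps ranks in step
        else:
            reqs = [comm.Irecv([b, MPI.BYTE], source=0, tag=0) for b in bufs]
            MPI.Request.Waitall(reqs)
            comm.Send([ack, MPI.BYTE], dest=0, tag=1)
    elapsed = time.perf_counter() - start

    if rank == 0:
        print(f"{iterations * window / elapsed:.0f} messages/s at {msg_size} bytes")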


Bio


Hemal Shah is a Distinguished Engineer and Systems/Software/Standards architect in the Data Center Solutions Group (DCSG) division at Broadcom Inc. He leads and manages a team of architects. Hemal is responsible for the definition of product architecture and the software roadmap/architecture of all existing Broadcom Ethernet NIC product lines. Hemal led the architecture definition of several generations of NetXtreme® E-Series/NetXtreme I server product lines and NetXtreme I client product lines. Hemal spearheaded the system architecture development of HPC/ML clusters with RoCE, TruFlowTM technology for vSwitch acceleration/packet processing software frameworks, TruManageTM technology for system and network management, device security features, virtualization, and stateless offloads. Hemal has defined the system architecture of RDMA hardware/software solutions for more than two decades. Before joining Broadcom in 2005, Hemal worked at Intel Corporation, where he led the development of system/silicon/software architecture of communication processors, 10 Gigabit Ethernet controllers, TCP/iSCSI/RDMA offloads, and IPsec/SSL/firewall/VPN accelerations. Hemal is the lead technical representative/contributor from Broadcom Inc. in the Open Compute Project (OCP) and the Distributed Management Task Force (DMTF). Hemal serves as Senior VP of Technology in the DMTF and as a project co-lead of the OCP Hardware Management project. Hemal has co-authored several OCP specifications, 70+ DMTF specifications, four IETF RFCs, and more than 10 technical conference/journal papers. Hemal is a named inventor on 40+ patents with several patents pending. Hemal holds Ph.D. (computer engineering) and M.S. (computer science) degrees from Purdue University, an M.S.E.E. degree from The University of Arizona, and a B.S. (electronics and communication engineering) degree from Gujarat University, India.

Abstract

Software containers are a key channel for delivering portable and reproducible scientific software in high performance computing (HPC) environments. HPC environments differ from other types of computing environments primarily due to their use of the message passing interface (MPI) and drivers for specialized hardware to enable distributed computing capabilities. This distinction directly impacts how software containers are built for HPC applications and can complicate software quality assurance efforts, including portability and performance, especially when utilizing specific MPI implementations like MVAPICH2 and large frameworks like MOOSE and its numerous applications. This work introduces a strategy for building containers for HPC applications that adopts layering as a mechanism for software quality assurance. The strategy is demonstrated across three different HPC systems, two of them petaflop-scale, with entirely different interconnect technologies and/or processor chipsets but running the same container. The performance cost of the containerization strategy is found to be no more than 5-14%, while still achieving portable and reproducible containers for HPC systems.

Bio


Matthew Sgambati is a High Performance Computing (HPC) System Administrator for Idaho National Laboratory (INL) with over 15 years of experience. He has administered several Top500 supercomputers, including Falcon, Lemhi, and Sawtooth (#37 as of November 2019). He has always had a passion for minimizing the barrier to entry for HPC, and to this end he implemented and set up INL HPC’s Open OnDemand instances. More recently, he has been exploring containers and their viability in replacing bare-metal installs on HPC systems. He has a B.S. and an M.S. in Computer Science from the University of Nevada, Reno and is currently pursuing his Ph.D. in Computer Science at the University of Idaho, focusing on reinforcement learning schedulers.

Abstract

The TAU Performance System is a powerful and highly versatile profiling and tracing tool ecosystem for performance analysis of parallel programs at all scales. TAU has evolved with each new generation of HPC systems and presently scales efficiently to hundreds of thousands of cores on the largest machines in the world. To meet the needs of computational scientists who want to evaluate and improve the performance of their applications, we present TAU's support for key MVAPICH features, including the MPI Tools (MPI_T) interface with support for setting MPI_T control variables on a per-communicator basis. TAU's support for GPUs, including CUDA, DPC++/SYCL, OpenCL, OpenACC, Kokkos, and HIP/ROCm, improves performance evaluation of heterogeneous programming models. The talk will also describe TAU's support for MPI's performance and control variables exported by MVAPICH, its support for instrumentation of the OpenMP runtime, and its APIs for instrumentation of Python programs. TAU uses these interfaces on unmodified binaries without the need for recompilation. This talk will describe these new instrumentation techniques to simplify the usage of performance tools, including support for an LLVM plugin for selective compiler-based instrumentation, support for tracking the paths taken by a message, timing synchronization costs in collective operations, rewriting binary files, and preloading shared objects. The talk will also highlight TAU's analysis tools, including its 3D profile browser, ParaProf, and its cross-experiment analysis tool, PerfExplorer, and its usage with MVAPICH2 on Amazon AWS using the Extreme-scale Scientific Software Stack (E4S) AWS image.


Bio


Prof. Sameer Shende has helped develop the TAU Performance System, the Program Database Toolkit (PDT), the Extreme-scale Scientific Software Stack (E4S) and the HPCLinux distro. His research interests include tools and techniques for performance instrumentation, measurement, analysis, runtime systems, software stacks, HPC container runtimes, and compiler optimizations. He serves as a Research Professor and the Director of the Performance Research Laboratory at the University of Oregon, and as the President and Director of ParaTools, Inc. and ParaTools, SAS.

5:30 - 5:45

Open MIC Session

4:30 - 6:30

Visit to the State of Ohio Computer Center, SOCC (Optional)

6:30 - 9:30

Banquet Dinner at Bravo Restaurant

1803 Olentangy River RD

Columbus, OH 43212

Wednesday, August 23

7:30 - 8:30

Registration and Continental Breakfast

Abstract

In this talk, we introduce Stable-TopK, a novel distributed deep learning algorithm that exploits both gradient temporal stability and gradient sparsity to significantly improve training throughput while maintaining accuracy on downstream tasks. Stable-TopK leverages two observations: only a small fraction of gradients contributes significantly to model updates, and the gradient regions containing these extreme gradient elements are stable over time. By sampling the TopK elements only periodically, our method reduces communication overhead and accelerates training without sacrificing model performance. Experimental results demonstrate that Stable-TopK outperforms state-of-the-art gradient-sparse training methods in terms of throughput (66.2% and 42.1% reduction in convergence time for BERT and MAE, respectively) and achieves competitive accuracy on various natural language understanding and image classification tasks, showcasing its effectiveness and scalability in training large-scale BERT and MAE models.
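
To illustrate the general idea of top-k gradient sparsification (not the Stable-TopK algorithm itself), the sketch below selects the largest-magnitude gradient entries with PyTorch; the tensor shape and the 1% density are arbitrary placeholders.

    # Generic top-k gradient sparsification sketch in PyTorch, included only to
    # illustrate communicating the largest-magnitude gradient entries; it is NOT
    # the Stable-TopK implementation described in the talk.
    import math
    import torch

    def topk_compress(grad: torch.Tensor, k: int):
        """Return indices and values of the k largest-magnitude gradient entries."""
        flat = grad.flatten()
        _, idx = torch.topk(flat.abs(), k)
        return idx, flat[idx]

    def topk_decompress(idx: torch.Tensor, vals: torch.Tensor, shape):
        """Scatter the selected values back into a dense, mostly-zero gradient."""
        flat = torch.zeros(math.prod(shape), dtype=vals.dtype)
        flat[idx] = vals
        return flat.reshape(shape)

    # The talk's key observation is that the index set changes slowly over time,
    # so it could be refreshed only periodically and reused between training steps.
    grad = torch.randn(1024, 1024)
    idx, vals = topk_compress(grad, k=grad.numel() // 100)   # keep ~1% of entries
    sparse_grad = topk_decompress(idx, vals, grad.shape)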


Bio


Dr. Zhao Zhang is a computer scientist and leads the machine learning group within the Data Intensive Computing group at the Texas Advanced Computing Center (TACC). Prior to joining TACC in 2016, he was a postdoc researcher at AMPLab, UC Berkeley, and a data science fellow at the Berkeley Institute for Data Science. Dr. Zhang received his Ph.D. from the Department of Computer Science at UChicago in 2014. Dr. Zhang has extensive experience in high performance computing (HPC) and big data systems. His recent research focus is the fusion of HPC and deep learning (DL), covering a wide range of topics including optimization algorithms, I/O, architecture, and domain applications.

Abstract

In this work, we present heFFTe 2.3, a Highly Efficient FFT library tailored for exascale computing on heterogeneous architectures. We showcase the exceptional performance of our GPU kernels, achieving an impressive 40x speedup over traditional CPU libraries, with support for NVIDIA, AMD, and Intel GPUs. Introducing batched 2-D/3-D FFTs, FFT convolution, and sine and cosine transforms, our library significantly enhances crucial applications like particle simulations. We demonstrate heFFTe's efficacy through extensive scalability experiments on more than 6,000 GPUs and detailed analysis of algorithmic settings. Furthermore, by addressing communication bottlenecks using mathematical models, and through our collaboration with MPI developers and TAU tools under the new PESA project led by Prof. DK Panda, we derive guidelines for next-generation HPC systems. Finally, we underscore heFFTe's pivotal role in optimizing performance and scalability for exascale computing environments by showcasing results and comparisons on top supercomputers, including Oak Ridge National Laboratory's Summit supercomputer (24,576 IBM Power9 cores and 6,144 NVIDIA V100 GPUs), the Frontier supercomputer with AMD GPUs, and precursors of the Aurora supercomputer with Intel GPUs at ANL.


Bio


Stan Tomov received an M.S. degree in Computer Science from Sofia University, Bulgaria, and a Ph.D. in Mathematics from Texas A&M University. He is a Research Director in ICL and a Research Assistant Professor in the EECS Department at UTK. Tomov's research interests are in parallel algorithms, numerical analysis, and high performance scientific computing (HPC). Currently, his work is concentrated on the development of numerical linear algebra software, in particular MAGMA, for emerging HPC architectures, and heFFTe for distributed FFT computations.

Abstract

Livermore Computing (LC), Lawrence Livermore National Laboratory's (LLNL's) supercomputing center, and HPE are deploying the first US exascale system focused on national security. This talk will provide an overview of the preparations for LC's first exascale system, as well as details of its system architecture. Throughout, the talk will explore considerations for energy efficiency in large-scale systems.


Bio


As Chief Technology Officer (CTO) for Livermore Computing (LC) at Lawrence Livermore National Laboratory (LLNL), Bronis R. de Supinski formulates LLNL's large-scale computing strategy and oversees its implementation. He frequently interacts with supercomputing leaders and oversees many collaborations with industry and academia. In addition to his work with LLNL, Bronis is also a Professor of Exascale Computing at Queen's University Belfast. He is a Fellow of the ACM and the IEEE.

10:30 - 11:00

Morning Coffee Break

Abstract

The Gordon Bell-winning AWP-ODC application has a long history of boosted performance with MVAPICH on both CPU- and GPU-based architectures. This talk will highlight recent compression support implemented by the MVAPICH team and its benefits for large-scale earthquake simulation on leadership-class computing systems. The presentation will conclude with a discussion of the opportunities and technical challenges associated with the development of earthquake simulation software for exascale computing.


Bio

Dr. Yifeng Cui heads the High Performance GeoComputing Lab at SDSC, and helped to establish the Southern California Earthquake Center (SCEC) as a world leader in advancing high performance computing in earthquake system science. Cui's groundbreaking work includes enabling TeraShake, ShakeOut, and M8, some of the worst-case scenarios on the San Andreas fault, revealing order-of-magnitude LA wave-guide amplification. He is a recipient of several HPC awards, including the 2015 NVIDIA Global Impact Award, the 2013 IDC HPC Innovation Excellence Award, and 2009/2011 SciDAC OASCR awards. He also directed an Intel Parallel Computing Center on earthquake research. Cui earned his Ph.D. in Hydrology from the University of Freiburg, Germany.

Abstract

MPI is usually implemented as a user-level library and, by its nature, executes in the same context as the parallel processes. Thus, MPI internal operations that are performed synchronously may hinder overlap between communication and computation (in the case of synchronous blocking operations) and increase energy consumption (in the case of synchronous non-blocking operations). From this perspective, introducing asynchronism into the MPI library is important. In this talk, we present our implementations of asynchronous non-blocking data copy and an asynchronous blocking progress engine for MPI intra-node communication. The asynchronous non-blocking data copy offloads the copy operation to copy engines for intra-node collective communication, so that collective communication can overlap with application-level computation. The asynchronous blocking progress engine can save energy by forcing processes to block without (or with much less) busy waiting.


Bio


Hyun-Wook Jin is a Professor in the Department of Computer Science and Engineering at Konkuk University, Seoul, Korea. He is leading the System Software Research Laboratory at Konkuk University. Before joining Konkuk University in 2006, he was a Research Associate in the Department of Computer Science and Engineering at The Ohio State University. He received his Ph.D. degree from Korea University in 2003. His main research focus is on system software for high-end computing systems and cyber-physical systems.

Abstract

Package management tools like Spack keep careful provenance for software, and Software Bills of Materials (SBOMs) are becoming popular to track software provenance for security. However, software relying on MPI is often built for one implementation and run against another, as the ABI compatibility issues for MPI are fairly well studied. This introduces a tension between maintaining the relationship between build provenance and reality, and flexibility in user deployment. Spack has introduced a splicing model to track changes between the build and deployment dependencies of a package. Recent work in Spack allows users to select a package to install and an MPI to redeploy it with, without losing any provenance information. This talk will explain how the splicing model works, how it is integrated into the commands to download and install Spack packages, and what that means for the deployers of MPI packages and their dependencies for reproducible installs from source and binary in a variety of environments.


Bio


Gregory Becker is a computer scientist at Lawrence Livermore National Laboratory. His focus is on bridging the gap between research and production software at LLNL. His work in software productization has led him to work on Spack, a package manager for high performance computing, as well as scalable I/O formats for performance tools. Gregory has been at LLNL since 2015. He received his B.A. in Computer Science and Mathematics from Williams College in 2015.

Abstract

The talk will describe research/development and learning/workforce development (LWD) programs within the Office of Advanced Cyberinfrastructure (OAC) in the CISE Directorate at the National Science Foundation. OAC's mission is to support advanced cyberinfrastructure to accelerate discovery and innovation across all science and engineering disciplines. The programs specifically addressed include: the CyberTraining program for research workforce preparation, including the new Cyberinfrastructure (CI) Professional track; the OAC Core Research Program that is part of the CISE Core Research programs solicitation; the Cyberinfrastructure for Sustained Scientific Innovation (CSSI) program for creating software and data CI products and services; the CAREER program for faculty early career development, and the CISE Research Initiation Initiative (CRII) for early career faculty who have not yet been a PI on a Federal grant.


Bio


Ashok Srinivasan is a Program Director in the Office of Advanced Cyberinfrastructure at the National Science Foundation and is involved in the CyberTraining, CSSI, and OAC Core programs. Srinivasan has a permanent position as a Professor of Computer Science and the William Nystul Eminent Scholar Chair at the University of West Florida and is a Fulbright Fellow. His research interests focus on the applications of high performance computing to science and public health policy. Results of his research to protect public health, especially during air travel, have been highlighted in over 300 news outlets around the world and cited in testimony to the US Congress.

12:40 - 1:30

Lunch Break

Abstract

Although direct GPU-to-GPU communication has been possible in MPI libraries for over a decade, the limited availability of compatible hardware at academic HPC centers has discouraged the development of algorithms in scientific applications that take advantage of this capability. In this talk, we take Amber, a molecular dynamics code used to simulate proteins and nucleic acids, as a test case. We demonstrate the modifications necessary to implement GPU-to-GPU communication. Compared to the previous implementation, these modifications show an average of approximately 36% improvement in performance overall and 84% for the important explicit solvent subset of the benchmarks.
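
The kind of change involved can be pictured with a small mpi4py/CuPy sketch that sends a GPU-resident buffer directly through MPI. This is a generic illustration under the assumption of a CUDA-aware MPI build (e.g., MVAPICH2-GDR) and mpi4py 3.1 or later; it is not the Amber code.

    # Minimal sketch of direct GPU-to-GPU MPI communication with mpi4py and CuPy.
    # Assumes a CUDA-aware MPI library; run with two ranks, e.g.:
    #   mpirun -np 2 python gpu_sendrecv.py
    from mpi4py import MPI
    import cupy as cp

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    n = 1 << 20
    buf = cp.zeros(n, dtype=cp.float64)   # buffer resides in GPU memory

    if rank == 0:
        buf += 1.0
        cp.cuda.Device().synchronize()    # make sure the data is ready to send
        # The device buffer is handed to MPI directly; no staging through host memory.
        comm.Send(buf, dest=1, tag=0)
    elif rank == 1:
        comm.Recv(buf, source=0, tag=0)
        print("rank 1 received sum =", float(buf.sum()))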


Bio


Samuel Khuvis is a Scientific Applications Engineer at the Ohio Supercomputer Center. He received his PhD in Applied Mathematics at the University of Maryland Baltimore County in 2016. Before joining OSC he was an HPC Software Engineer at ParaTools, Inc. He is interested in performance analysis and optimization, scientific computing, and parallel computing.

Abstract

In this presentation, we will explain the k-NN accelerator using near-memory processing and MPI technology. The memory wall problem, caused by the performance gap between the CPU and memory, is becoming increasingly severe in data-intensive applications in the fields of AI and HPC. The Memory EXpander (MEX) is an on-board device with large memory and an MPI collective communication accelerator being developed by ETRI to address these issues. ETRI is collaborating with The Ohio State University to jointly develop an MVAPICH2 library optimized for MEX. MVAPICH2-MEX is an MPI library optimized for MEX: it uses MEX memory as the communication buffer and the MEX accelerator as an MPI collective communication accelerator, enabling significant MPI performance improvement through near-memory processing. We are working on this library with the expectation that it will significantly improve the performance of data-intensive applications. We are also developing the k-NN accelerator, a representative data-intensive application, as a target use case for MVAPICH2-MEX to validate the proposed techniques. In today's presentation, we will describe these technologies that ETRI and OSU are working on together.


Bio


HooYoung Ahn received B.S. and M.S. degrees in computer science from Sookmyung Women's University, Seoul, Republic of Korea, in 2007 and 2009, respectively, and a Ph.D. degree from the School of Computing at the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea, in 2016. She is currently a Senior Researcher with the Supercomputing Technology Research Center, Electronics and Telecommunications Research Institute, Daejeon, Republic of Korea. Her research interests include parallel processing, artificial intelligence, and high performance computing.

Abstract

SDSC operates and supports several NSF-funded clusters, ranging from the Expanse supercomputer (NSF Award #OAC-1928224), targeted at long-tail workloads with both CPU- and GPU-based nodes, to Voyager (NSF Award #OAC-2005369), featuring custom deep-learning-focused processors. Expanse's standard compute nodes are each powered by two 64-core AMD EPYC 7742 processors and have 256 GB of DDR4 memory, while each GPU node contains four NVIDIA V100s (32 GB SXM2) connected via NVLINK and dual 20-core Intel Xeon 6248 CPUs. Voyager features 42 Intel Habana Gaudi training nodes, each with 8 training processors (336 in total). The training processors feature on-chip networking with RoCE support, scaled up using a 400 GigE switch. In addition, there are 2 nodes with first-generation inference processors and 36 compute nodes with Intel Xeon processors. We will present results using MVAPICH2 and MVAPICH2-GDR on the general-purpose Expanse supercomputer and the custom MVAPICH2 implementation on Voyager. In addition, INAM was recently deployed on SDSC's Comet supercomputer, and our experiences with the deployment and usage will be discussed.


Bio


Mahidhar Tatineni received his M.S. & Ph.D. in Aerospace Engineering from UCLA. He currently leads the User Services group at SDSC as a Computational and Data Science Research Specialist Manager. He has led the support of high-performance computing and data applications software on several NSF and UC funded HPC and AI supercomputers including Voyager, Expanse, Comet, and Gordon at SDSC. His research interests are in HPC architecture and systems, performance and scalability, benchmarking and HPC middleware. He has worked on many NSF funded optimization and parallelization research projects such as MPI performance tuning frameworks, hybrid programming models, big data middleware, and application performance evaluation using next generation communication mechanisms for emerging HPC systems. He is co-PI on the NSF funded Expanse HPC system and the National Research Platform projects at SDSC.

Abstract

This talk will present an overview of three products with enhanced capabilities from X-ScaleSolutions. The products are: 1) the MVAPICH2-DPU communication library using NVIDIA BlueField DPUs, 2) the X-ScaleHPL-DPU package for an enhanced HPL benchmark, and 3) the X-ScaleAI-DPU package for Deep Learning applications. The MVAPICH2-DPU library takes advantage of DPU features to offload communication components in the MPI library and deliver best-in-class scale-up and scale-out performance for HPC and DL applications. It integrates key components enabling full computation and communication overlap, especially with non-blocking collectives. The X-ScaleHPL-DPU package optimizes the performance of the HPL benchmark using the advanced features of the MVAPICH2-DPU package. The X-ScaleAI-DPU package incorporates and optimizes state-of-the-art open-source and in-house components to support popular DL frameworks with noticeable performance improvement. It achieves excellent out-of-the-box performance with one-click deployment and execution. In addition, X-ScaleAI-DPU supports efficient and scalable checkpoint-restart operations.


Bio


Dr. Donglai Dai is a Chief Engineer at X-ScaleSolutions and leads the company’s R&D team. His current work focuses on developing scalable, efficient communication libraries, checkpointing and restart libraries, and performance analysis tools for distributed and parallel HPC and deep learning applications on HPC systems and clouds. He has more than 20 years of industry experience in engineering management and development of computer systems, VLSI, IoT, and interconnection networks while working at Intel, Cray, SGI, and startups. He holds more than 10 granted US patents and has published more than 40 technical papers or book chapters. He has a PhD degree in computer science from The Ohio State University.

3:30 - 4:00

Afternoon Coffee Break

Abstract

Most of the largest currently deployed HPC systems are based on GPU accelerators. Direct intra-node and inter-node peer-to-peer communication between GPUs is crucial for obtaining good performance on such systems using MPI. This talk will present results of the first experiments performed in the Cambridge Open Zettascale Lab using these features on Intel Data Center GPU Max Series cards.


Bio


Kacper received a Master's degree in Astronomy from the University of Warsaw, followed by a PhD in theoretical astrophysics from the Nicolaus Copernicus Astronomical Center in Warsaw. He then worked at several research institutions before joining the University of Cambridge in 2015 in the Department of Applied Mathematics and Theoretical Physics as a parallel programmer and HPC administrator. He moved to Research Computing Services as a Senior Research Software Engineer in 2020.

Abstract

As computing, networking, heterogeneous hardware, and storage technologies continue to evolve in HEC platforms, understanding the full-stack performance tradeoffs and the interplay between HPC applications, MPI libraries, the communication fabric, the file system, and the job scheduler becomes a more challenging endeavor. Such understanding will enable all involved parties to identify bottlenecks and to maximize the efficiency and performance of the individual components that comprise a modern HPC system, in order to solve different grand-challenge problems. Through this tutorial, participants will learn how to use the OSU InfiniBand Network Analysis and Monitoring (INAM) tool in conjunction with live jobs running on various remote clusters at OSC and OSU to visualize, analyze, and correlate how the MPI runtime, high-performance network, I/O filesystem, and job scheduler interact, and to identify potential bottlenecks online. Emphasis is placed on how the tools are used in combination for identifying performance problems and investigating optimization alternatives. We will request remote access to the Pitzer system at OSC and the RI/RI2 clusters at OSU for hands-on exercises. This will help prepare participants to locate and diagnose performance bottlenecks in their own clusters and parallel programs.


Bio


Dr. Hari Subramoni is an assistant professor in the Department of Computer Science and Engineering at the Ohio State University, USA. His current research interests include high performance interconnects and protocols, parallel computer architecture, network-based computing, exascale computing, network topology aware computing, QoS, power-aware LAN-WAN communication, fault tolerance, virtualization, big data, deep learning and cloud computing. He has published over 100 papers in international journals and conferences related to these research areas. He has been actively involved in various professional activities in academic journals and conferences. Dr. Subramoni is doing research on the design and development of MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) and MVAPICH2-X (Hybrid MPI and PGAS (OpenSHMEM, UPC and CAF)) software packages.

Pouya Kousha is currently a PhD student at the Ohio State University, supervised by Prof. DK Panda. His research interests are parallel algorithms and distributed systems, high performance computing, and profiling tools. His work is primarily focused on scalable, real-time analysis, monitoring, and profiling tools, and on using them to identify bottlenecks and optimize the MPI library and applications. For more information, contact him at kousha.2@osu.edu.

Abstract

The High Performance Computing (HPC) community has widely adopted Message Passing Interface (MPI) libraries to exploit high-speed and low-latency networks like InfiniBand, Omni-Path, Slingshot, and others. This talk provides an overview of MPI4Spark and MPI4Dask, which are enhanced versions of the Spark and Dask frameworks, respectively. These stacks can utilize MPI for communication in a parallel and distributed setting on HPC systems connected via fast interconnects. MPI4Spark can launch the Spark ecosystem using MPI launchers to utilize MPI communication. It also maintains isolation for application execution on worker nodes by forking new processes using Dynamic Process Management (DPM). It bridges the semantic differences between the event-driven communication in Spark and the application-driven communication engine in MPI. MPI4Dask is an MPI-based custom Dask framework targeted at modern HPC clusters built with CPUs and NVIDIA GPUs. MPI4Dask provides point-to-point asynchronous I/O communication coroutines, which are non-blocking concurrent operations defined using the async/await keywords from Python's asyncio framework. The talk concludes by evaluating the performance of MPI4Spark and MPI4Dask on state-of-the-art HPC systems.
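
The coroutine style mentioned above can be sketched with mpi4py and asyncio: a non-blocking MPI request is polled inside an async function so other coroutines can run while the transfer progresses. This is an illustrative sketch, not the MPI4Dask source; the polling interval and buffer size are arbitrary.

    # Illustrative sketch (not the MPI4Dask code): wrapping a non-blocking MPI
    # transfer in an asyncio coroutine so it can be awaited alongside other work.
    # Run with two ranks, e.g.: mpirun -np 2 python async_mpi.py
    import asyncio
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    async def wait_for(request, poll_interval=0.001):
        """Poll an MPI request without blocking the asyncio event loop."""
        while not request.Test():
            await asyncio.sleep(poll_interval)

    async def main():
        buf = np.zeros(1 << 16, dtype=np.float64)
        if rank == 0:
            buf += 1.0
            req = comm.Isend(buf, dest=1, tag=0)
        else:
            req = comm.Irecv(buf, source=0, tag=0)
        # Other coroutines could run here while the transfer progresses.
        await wait_for(req)
        if rank == 1:
            print("received", buf[0])

    asyncio.run(main())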


Bio


Aamir Shafi is currently a Research Scientist at the Ohio State University where he is involved in the High Performance Big Data project. Dr. Shafi was a Fulbright Visiting Scholar at MIT where he worked on the award-winning Cilk technology. Dr. Shafi received his PhD in Computer Science from the University of Portsmouth, UK in 2006. Dr. Shafi’s current research interests include architecting robust libraries and tools for Big Data computation with emphasis on Machine and Deep Learning applications. Dr. Shafi co-designed and co-developed a Java-based MPI-like library called MPJ Express.

5:00

Closing Remarks