MUG'24

(Final Program)

All Times Are U.S. EDT

Bale Theater at Ohio Supercomputer Center


MUG'24 conference attendees gather for a group photo.

Monday, August 19

7:30 - 8:30

Registration and Continental Breakfast

Abstract

As Cornelis Networks gears up to release CN5000, the next step in the Omni-Path Architecture family of interconnects, we would like to take the opportunity to explore how the hardware works. Understanding the hardware is crucial to writing the best-performing software. This talk will discuss the major building blocks of the ASIC and explore how they are exposed in software. Our goal is to provide a technical understanding of the hardware and its capabilities to allow for the highest message-passing performance.


Bio


Dennis Dalessandro is a kernel engineer for Cornelis Networks leading the development of Omni-Path Architecture hardware drivers. He received a BS in Computer Science from The Ohio State University. Over the past 18 years he has been a researcher at the Ohio Supercomputer Center, a performance engineer at NetApp, and a driver developer at Intel. Dennis is a very active supporter of open-source software and enjoys working closely with the Kernel.org community and with various Linux distributions.

Abstract

The talk will present an HPC-AI environment on AWS and Google Cloud that includes NVIDIA NeMo optimized for GPUs. ParaTools Pro for E4S is a cloud image that integrates a performant remote desktop (based on ODDC), MVAPICH MPI, and a Torque-based scheduler available from Adaptive Computing, Inc. The Extreme-scale Scientific Software Stack (E4S) is a curated, Spack-based software distribution of 100+ HPC, EDA, and AI/ML packages. It features AI tools such as TensorFlow, PyTorch, NVIDIA NeMo, JAX, a chatbot based on Google's Gemini API, and Horovod, along with supporting tools including LangChain, pandas, and scikit-learn, and it supports AWS EFA and Google's IPUs with the optimized MVAPICH MPI distribution. It includes Codium (an IDE), Jupyter notebooks, and visualization tools such as VisIt and ParaView, all launched from the Chrome web browser without any additional software. This multi-user, multi-node, multi-GPU cloud image uses E4S and Spack as the core components for product integration and deployment of a range of HPC and AI/ML tools. These include performance evaluation tools such as TAU, HPCToolkit, DyninstAPI, and PAPI, with support for both bare-metal and containerized deployment on CPU and GPU platforms. Container runtimes featured in the image include Docker, Singularity, Charliecloud, and Kubernetes (microk8s). E4S is a community effort to provide open-source software packages for developing, deploying, and running scientific applications and tools on HPC platforms. It has built a comprehensive, extensible, coherent software stack that enables application developers to productively develop highly parallel applications that effectively target commercial cloud platforms.


Bio


Sameer Shende serves as a Research Professor and the Director of the Performance Research Laboratory at the University of Oregon and the President and Director of ParaTools, Inc. (USA) and ParaTools, SAS (France). He serves as the lead developer of the Extreme-scale Scientific Software Stack (E4S), TAU Performance System, Program Database Toolkit (PDT), and HPC Linux. His research interests include scientific software stacks, performance instrumentation, compiler optimizations, measurement, and analysis tools for HPC. He served as the General Co-Chair for ICPP 2021 and is serving as the General Co-Chair of EuroPar'24. He was the vice chair for technical papers for SC22 and has served as the chair of the Performance Measurement, Modeling, and Tools track at the SC17 conference. He received his B.Tech. in Electrical Engineering from IIT Bombay in 1991, and his M.S. and Ph.D. in Computer and Information Science from the University of Oregon in 1996 and 2001 respectively.

11:00 - 11:30

Morning Coffee Break

Abstract

NVIDIA's BlueField Data Processing Unit (DPU) family of system-on-a-chip devices provides the ingredients needed for moving communication algorithm management from the host to the DPU. NVIDIA's DOCA environment provides run-time support for DPU-offloaded algorithms. In this presentation we will describe the DOCA architecture and capabilities, the offloaded implementations of MPI_Ialltoallv and MPI_Allgatherv and their performance characteristics, and the impact of the offloaded algorithms on application performance.
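
As a point of reference for the offloaded non-blocking collectives discussed above, the hedged mpi4py sketch below shows the overlap pattern that an MPI_Ialltoallv enables: the application starts the collective, performs independent computation, and only then waits. This is not NVIDIA's DOCA code; the buffer sizes and the overlapped computation are illustrative assumptions, and any DPU offload happens transparently inside the MPI library.

    # Illustrative overlap of computation with a non-blocking MPI_Ialltoallv (mpi4py).
    # DPU offload, where available, progresses the collective inside the MPI library.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    size = comm.Get_size()

    block = 1024                                    # assumed per-peer block size
    counts = [block] * size
    displs = [i * block for i in range(size)]
    sendbuf = np.random.rand(size * block)
    recvbuf = np.empty(size * block)

    req = comm.Ialltoallv([sendbuf, (counts, displs), MPI.DOUBLE],
                          [recvbuf, (counts, displs), MPI.DOUBLE])

    local = np.linalg.norm(sendbuf)                 # independent work overlapped with the exchange

    req.Wait()                                      # collective results are needed from here on
    print(comm.Get_rank(), local, recvbuf[:2])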


Bio


Dr. Richard Graham is a Senior Director, HPC Technology at NVIDIA's Networking Business unit. His primary focus is on HPC network software and hardware capabilities for current and future HPC technologies. Prior to moving to Mellanox/NVIDIA, Rich spent thirteen years at Los Alamos National Laboratory and Oak Ridge National Laboratory, in computer science technical and administrative roles, with a technical focus on communication libraries and application analysis tools. He is cofounder of the Open MPI collaboration and was chairman of the MPI 3.0 standardization efforts.

12:30 - 1:30

Lunch Break

Abstract

The X-ScaleAI package provides an optimized and integrated software stack for high-performance distributed pre-training, fine-tuning, and inference. It supports models defined in PyTorch or HuggingFace, including large language models and vision models. The X-ScaleSecureMPI library adds security protection in MPI communication for HPC applications with minimal performance overhead. This tutorial will discuss the key features, performance benefits, and some basic steps in using these products.


Bio


Dr. Donglai Dai is Chief Engineer at X-ScaleSolutions, where he leads the company’s R&D team. His current work focuses on developing and enhancing communication libraries, checkpointing and restart libraries, secure communication libraries, performance analysis tools for distributed and parallel HPC and AI applications on HPC systems and clouds. He is the principal investigator for several DOE SBIR grants and a member of the PC committee of SuperCheck-SC23 workshop. Before joining X-Scale Solutions, he worked at Intel, Cray, and SGI. He holds over 10 US patents and has published over 40 technical papers, presentations, or book chapters. He earned a PhD in computer science from Ohio State University.

Abstract

The tutorial will start with an overview of the MVAPICH libraries and their features. Next, we will focus on installation guidelines, runtime optimizations, and tuning flexibility in depth. An overview of configuration and debugging support in the MVAPICH2 libraries will be presented. High-performance support for NVIDIA/AMD GPU-enabled clusters in MVAPICH-Plus/MVAPICH2-GDR and many-core systems in MVAPICH-Plus/MVAPICH2-X will be presented. The impact of the various features and optimization techniques on performance will be discussed in an integrated fashion. "Best Practices" for a set of common applications will be presented. A set of case studies related to example applications will also be presented to demonstrate how one can effectively take advantage of MVAPICH for high-end computing applications using MPI and CUDA/OpenACC.
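
To make the GPU-support portion concrete, here is a minimal, hedged sketch of the kind of CUDA-aware transfer that MVAPICH-Plus/MVAPICH2-GDR accelerates: GPU-resident buffers (CuPy arrays here) are passed directly to MPI calls with no explicit staging through host memory in user code. It assumes an MPI build with CUDA support, mpi4py, and CuPy; the message size and tags are arbitrary.

    # Minimal CUDA-aware ping-pong sketch (assumes a CUDA-aware MPI such as
    # MVAPICH2-GDR or MVAPICH-Plus, mpi4py built against it, and CuPy).
    # Run with two ranks, e.g.: mpirun -np 2 python gpu_pingpong.py
    from mpi4py import MPI
    import cupy as cp

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    buf = cp.arange(1 << 20, dtype=cp.float32)      # 4 MB GPU-resident buffer

    if rank == 0:
        comm.Send([buf, MPI.FLOAT], dest=1, tag=0)  # GPU buffer handed straight to MPI
        comm.Recv([buf, MPI.FLOAT], source=1, tag=1)
    elif rank == 1:
        comm.Recv([buf, MPI.FLOAT], source=0, tag=0)
        buf *= 2.0                                  # compute on the GPU between transfers
        comm.Send([buf, MPI.FLOAT], dest=0, tag=1)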

Bio


Dr. Hari Subramoni is an assistant professor in the Department of Computer Science and Engineering at the Ohio State University, USA. His current research interests include high performance interconnects and protocols, parallel computer architecture, network-based computing, exascale computing, network topology aware computing, QoS, power-aware LAN-WAN communication, fault tolerance, virtualization, big data, deep learning and cloud computing. He has published over 100 papers in international journals and conferences related to these research areas. He has been actively involved in various professional activities in academic journals and conferences. Dr. Subramoni is doing research on the design and development of MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) and MVAPICH2-X (Hybrid MPI and PGAS (OpenSHMEM, UPC and CAF)) software packages. He is a member of IEEE.

Nat Shineman is a software engineer in the Department of Computer Science and Engineering at the Ohio State University. His current development work includes high performance interconnects, parallel computing, scalable startup mechanisms, and performance analysis and debugging of the MVAPICH2 library.

3:00 - 3:30

Afternoon Coffee Break

Abstract

The OSU Microbenchmark (OMB) suite is a popular set of benchmarks for evaluating the performance of HPC systems. In this tutorial, we will take the attendees through the new set of features that have been added to OMB, such as support for Java-based and Python-based benchmarks. We will also discuss the enhancements made to the C benchmarking suite, such as support for user-defined data types, graphical representations of output, data validation, support for profiling tools like PAPI, and support for newer MPI primitives such as persistent operations and MPI sessions.
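
For a flavor of what the Python-based benchmarks measure, the hedged sketch below times a simple ping-pong latency loop with mpi4py. It is not the actual OMB code; the message size, iteration count, and warmup count are arbitrary assumptions, and the real suite adds multiple message sizes, validation, and reporting.

    # Toy ping-pong latency loop in the spirit of the OMB Python benchmarks.
    # Run with exactly two ranks, e.g.: mpirun -np 2 python latency.py
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    size, iters, skip = 8, 1000, 100            # assumed message size (bytes) and loop counts
    sbuf = np.zeros(size, dtype=np.uint8)
    rbuf = np.zeros_like(sbuf)

    t0 = 0.0
    for i in range(iters + skip):
        if i == skip:                           # start timing only after warmup
            comm.Barrier()
            t0 = MPI.Wtime()
        if rank == 0:
            comm.Send(sbuf, dest=1, tag=1)
            comm.Recv(rbuf, source=1, tag=1)
        elif rank == 1:
            comm.Recv(rbuf, source=0, tag=1)
            comm.Send(sbuf, dest=0, tag=1)

    if rank == 0:
        lat_us = (MPI.Wtime() - t0) * 1e6 / (2 * iters)
        print(f"{size} bytes: {lat_us:.2f} us one-way latency")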


Bio


Dr. Hari Subramoni is an assistant professor in the Department of Computer Science and Engineering at the Ohio State University, USA. His current research interests include high performance interconnects and protocols, parallel computer architecture, network-based computing, exascale computing, network topology aware computing, QoS, power-aware LAN-WAN communication, fault tolerance, virtualization, big data, deep learning and cloud computing. He has published over 100 papers in international journals and conferences related to these research areas. He has been actively involved in various professional activities in academic journals and conferences. Dr. Subramoni is doing research on the design and development of MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) and MVAPICH2-X (Hybrid MPI and PGAS (OpenSHMEM, UPC and CAF)) software packages. He is a member of IEEE.

Aamir Shafi is currently a Research Scientist at the Ohio State University where he is involved in the High Performance Big Data project. Dr. Shafi was a Fulbright Visiting Scholar at MIT where he worked on the award-winning Cilk technology. Dr. Shafi received his PhD in Computer Science from the University of Portsmouth, UK in 2006. Dr. Shafi’s current research interests include architecting robust libraries and tools for Big Data computation with emphasis on Machine and Deep Learning applications. Dr. Shafi co-designed and co-developed a Java-based MPI-like library called MPJ Express.

Akshay Paniraja Guptha is a software engineer in the Department of Computer Science and Engineering at the Ohio State University. His current development work includes high-performance interconnects, parallel computing, scalable startup mechanisms, and performance analysis and debugging of the OSU Microbenchmarks and the MVAPICH2 library.

Abstract

The fields of Machine and Deep Learning (ML/DL) have witnessed remarkable advances in recent years, paving the way for cutting-edge technologies and leading to exciting challenges and opportunities. Modern ML/DL frameworks, including TensorFlow, PyTorch, and cuML, have emerged to offer high-performance training and deployment for various types of ML models and Deep Neural Networks (DNNs). This tutorial provides an overview of recent trends in ML/DL leveraging powerful hardware architectures, interconnects, and distributed frameworks to accelerate the training of ML/DL models, especially as they grow larger and more complicated. We present an overview of different DNN architectures, focusing on parallelization strategies for model training. We highlight new challenges and opportunities for communication runtimes to exploit high-performance CPU/GPU architectures to support large-scale distributed training efficiently. We also highlight some of our co-design efforts to utilize MPI for large-scale DNN training on cutting-edge CPU/GPU architectures available on modern HPC clusters.
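
To make the data-parallel strategy concrete, the toy sketch below shows the core communication step in distributed training: each rank computes gradients on its own mini-batch, and the gradients are averaged with an MPI allreduce before the update. The least-squares model, batch shapes, and learning rate are placeholders invented for illustration; production frameworks additionally bucket gradients and overlap the allreduce with the backward pass.

    # Toy data-parallel training step: local gradient computation + MPI allreduce.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    size = comm.Get_size()

    rng = np.random.default_rng(comm.Get_rank())    # each rank draws its own mini-batches
    w = np.zeros(128)                               # replicated model weights (placeholder)

    for step in range(10):
        x = rng.standard_normal((32, 128))          # local mini-batch
        y = rng.standard_normal(32)
        grad = x.T @ (x @ w - y) / len(y)           # gradient of a least-squares loss

        avg = np.empty_like(grad)
        comm.Allreduce(grad, avg, op=MPI.SUM)       # sum gradients across all ranks
        w -= 0.01 * (avg / size)                    # every rank applies the same averaged update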


Bio


Aamir Shafi is currently a Research Scientist at the Ohio State University where he is involved in the High Performance Big Data project. Dr. Shafi was a Fulbright Visiting Scholar at MIT where he worked on the award-winning Cilk technology. Dr. Shafi received his PhD in Computer Science from the University of Portsmouth, UK in 2006. Dr. Shafi’s current research interests include architecting robust libraries and tools for Big Data computation with emphasis on Machine and Deep Learning applications. Dr. Shafi co-designed and co-developed a Java-based MPI-like library called MPJ Express.

Nawras Alnaasan is a Graduate Research Associate at the Network-Based Computing Laboratory, Columbus, OH, USA. He is currently pursuing a Ph.D. degree in computer science and engineering at The Ohio State University. His research interests lie at the intersection of deep learning and high-performance computing. He works on advanced parallelization techniques to accelerate the training of Deep Neural Networks and exploit underutilized HPC resources covering a wide range of DL applications including supervised learning, semi-supervised learning, and hyperparameter optimization. He is actively involved in several research projects including HiDL (High-performance Deep Learning) and ICICLE (Intelligent Cyberinfrastructure with Computational Learning in the Environment). Alnaasan received his B.S. degree in computer science and engineering from The Ohio State University. Contact him at alnaasan.1@osu.edu.

4:15 - 5:00

Short Talks

  • Characterizing Communication in Distributed Parameter-Efficient Fine-Tuning for Large Language Models, Nawras Alnaasan, The Ohio State University
  • OHIO: Improving RDMA Network Scalability in MPI_Alltoall through Optimized Hierarchical and Intra/Inter-Node Communication Overlap Design, Tu Tran, The Ohio State University
  • Design and Implementation of an IPC-based Collective MPI Library for Intel GPUs, Chen-Chun Chen, The Ohio State University
  • Demystifying the Communication Characteristics for Distributed Transformer Models, Ben Michalowicz, The Ohio State University
  • Infer-HiRes: Accelerating Inference for High-Resolution Images with Quantization and Distributed Deep Learning, Quentin Anthony, The Ohio State University

5:00 - 6:30

Visit to the State of Ohio Computer Center, SOCC (Optional)

6:30 - 9:30

Reception and Dinner at Endeavor Brewing and Spirits

909 W 5th Ave,

Columbus, OH 43212

Tuesday, August 20

7:30 - 8:20

Registration and Continental Breakfast

8:20 - 8:30

Opening Remarks

David Hudak, Ohio Supercomputer Center and
Dhabaleswar K (DK) Panda, The Ohio State University

Abstract

Demand for Artificial Intelligence (AI) compute using GPUs is increasing, with unprecedented impacts on capital and operational expenditures (CapEx and OpEx) and on the environment, such as greenhouse gas (GHG) emissions. The University of Bristol is in the process of deploying a national AI Research Resource (AI RR) called Isambard-AI, which has already begun providing leadership-class compute capability for academic and research communities in the UK. The Isambard-AI phase 1 system is ranked number 2 on the June 2024 Green500 list of supercomputing platforms. Isambard-AI has been setting and breaking records as one of the quickest systems to be deployed, going from procurement to operations in less than 6 months. Machine learning models for biomolecular design, AI models for translational discovery research, and an investigation into skin tone biases for skin cancer are among the projects leveraging the Isambard-AI phase 1 platform. This talk overviews a set of architectural, procurement, and operational design recipes that account for full life-cycle costs over the system's entire lifetime, including reusability, eventual decommissioning, and recyclability. The University of Bristol's target of NetZero by 2030 was among the key motivators for the design of the Isambard-AI sustainability solution.


Bio


Sadaf Alam is Chief Technology Officer (CTO) for Isambard supercomputing Digital Research Infrastructures (DRIs) and director of strategy and academia in the Advanced Computing Research Centre at the University of Bristol, UK. She is responsible for digital transformation for research computing and data assets management services. Prior to joining Bristol, Dr Alam was the CTO at CSCS, the Swiss National Supercomputing Centre. She was chief architect for two generations of Piz Daint innovative flagship supercomputing facilities and MeteoSwiss operational weather forecasting platforms. She was technical lead for European supercomputing centres’ federation project called Fenix. From 2004-2009 Dr Alam was a computer scientist at Oak Ridge National Laboratory (ORNL), USA, and a staff scientist at the ORNL Leadership Computing Facility (OLCF). She studied computer science at the University of Edinburgh, UK, where she received her PhD. She was a founding member of the Swiss Chapter of Women in HPC.

Abstract

This talk will provide an overview of the MVAPICH project (past, present, and future). The future roadmap and features for upcoming releases of the MVAPICH software family (including MVAPICH-Plus) will be presented. The current status of and future plans for OMB will also be presented.


Bio


DK Panda is a Distinguished Professor of Engineering and University Distinguished Scholar at the Ohio State University. He has published over 500 papers in the area of high-end computing and networking. The MVAPICH (High-Performance MPI and PGAS over InfiniBand, iWARP, RoCE, EFA, Rockport Networks, and Slingshot) libraries, designed and developed by his research group (mvapich.cse.ohio-state.edu), are currently being used by more than 3,400 organizations worldwide (in 92 countries). More than 1.81 million downloads of this software have taken place from the project's site. This software is empowering several InfiniBand clusters (including the 13th, 33rd, and 57th ranked ones) in the TOP500 list. High-performance and scalable solutions for deep learning and machine learning from his group are available from hidl.cse.ohio-state.edu. High-performance and scalable libraries for Big Data stacks (Spark, Hadoop, and Memcached) and Data science applications from his group (hibd.cse.ohio-state.edu) are also publicly available. These libraries are currently being used by more than 365 organizations in 39 countries. More than 49,000 downloads of these libraries have taken place. He is an IEEE Fellow and a recipient of the 2022 IEEE Charles Babbage Award and the 2024 IEEE TCPP Outstanding Service and Contributions Award. More details about Prof. Panda are available at cse.ohio-state.edu/~panda.

10:15 - 10:45

Morning Coffee Break

Abstract

NVIDIA networking technologies are designed for training AI at scale. In-network computing, highly effective bandwidth, and noise isolation capabilities have facilitated the creation of larger and more complex foundational models. We'll dive deep into the recent technology announcements and their essential roles in next-generation AI data center designs.


Bio


Gilad Shainer serves as senior vice president of networking at NVIDIA, focusing on high-performance computing and artificial intelligence. He holds multiple patents in the field of high-speed networking. Gilad Shainer holds an M.S. and a B.S. in electrical engineering from the Technion Institute of Technology in Israel.

Abstract

As Cornelis Networks prepares to release its CN5000 high-performance network fabric, this presentation will discuss hardware and software enhancements with an eye toward optimizing message libraries.


Bio


As vice president of software engineering at Cornelis Networks, Doug oversees the entire software stack, including Omni-Path Architecture drivers, messaging software, and embedded device control systems. Prior to Cornelis, Doug led software engineering teams at Red Hat in cloud storage and data services. Doug’s career in HPC and cloud computing began at Ames National Laboratory’s Scalable Computing Laboratory. He later worked at the US Department of Energy’s Oak Ridge National Laboratory, where he developed and integrated new technologies at the Oak Ridge Leadership Computing Facility. Doug holds bachelor’s and master’s degrees in computer science from Iowa State University. He first installed MVAPICH 0.9 on the main compute cluster at Arizona State University.

Abstract

The TAU Performance System is a versatile performance evaluation toolkit supporting both profiling and tracing modes of measurement. It supports performance evaluation of applications running on CPUs and GPUs and supports runtime preloading of a Dynamic Shared Object (DSO) that allows users to measure performance without modifying the source code or binary. This talk will describe how TAU may be used with MVAPICH to support advanced performance introspection capabilities at the runtime layer. TAU's support for tracking the idle time spent in implicit barriers within MPI collective operations will be demonstrated. TAU also supports event-based sampling at the function, file, and statement level. Its support for runtime systems such as CUDA (for NVIDIA GPUs), Level Zero (for Intel oneAPI DPC++/SYCL), ROCm (for AMD GPUs), OpenMP (with OMPT and Target Offload directives), Kokkos, and MPI allows instrumentation at the runtime-system layer while using sampling to evaluate statement-level performance data. The talk will cover TAU's support for key MVAPICH features, including the MPI Tools (MPI_T) interface and setting MPI_T control variables on a per-communicator basis. It will also describe TAU's support for the performance and control variables exported by MVAPICH, its instrumentation of the OpenMP runtime, and its APIs for instrumentation of Python programs. TAU uses these interfaces on unmodified binaries without the need for recompilation. The talk will describe new instrumentation techniques that simplify the usage of performance tools, including an LLVM plugin for selective compiler-based instrumentation, tracking paths taken by a message, timing synchronization costs in collective operations, rewriting binary files, and preloading shared objects. It will also highlight TAU's analysis tools, including its 3D profile browser ParaProf and its cross-experiment analysis tool PerfExplorer, and their usage with MVAPICH2 on Amazon AWS and GCP using the ParaTools Pro for E4S image.


Bio


Sameer Shende serves as a Research Professor and the Director of the Performance Research Laboratory at the University of Oregon and the President and Director of ParaTools, Inc. (USA) and ParaTools, SAS (France). He serves as the lead developer of the Extreme-scale Scientific Software Stack (E4S), TAU Performance System, Program Database Toolkit (PDT), and HPC Linux. His research interests include scientific software stacks, performance instrumentation, compiler optimizations, measurement, and analysis tools for HPC. He served as the General Co-Chair for ICPP 2021 and is serving as the General Co-Chair of EuroPar'24. He was the vice chair for technical papers for SC22 and has served as the chair of the Performance Measurement, Modeling, and Tools track at the SC17 conference. He received his B.Tech. in Electrical Engineering from IIT Bombay in 1991, and his M.S. and Ph.D. in Computer and Information Science from the University of Oregon in 1996 and 2001 respectively.

12:15 - 12:30

Group Photo

12:30 - 1:30

Lunch Break

Abstract

Butterfly valve performance factors are crucial to the pressurized water reactor industry, and computational fluid dynamics studies of these valves are critical both from a design and safety standpoint. This talk presents results using the Navier-Stokes module in the MOOSE framework built on MVAPICH for simulating butterfly valve performance factors and compares those simulations with empirical results from a nuclear reactor butterfly valve in operation. Multiphysics operations using MVAPICH are discussed.


Bio


Matthew Anderson is the manager for the High Performance Computing group at Idaho National Laboratory, which maintains and operates the principal HPC datacenter supporting nuclear energy research for the Department of Energy. He came to INL in 2019 from Indiana University, where he was an assistant research scientist. He is a co-author of over 40 peer-reviewed publications and one textbook on High Performance Computing and has over 15 years' experience working in the HPC industry. He holds a Ph.D. in physics from the University of Texas at Austin.

Abstract

In this presentation, I will delve into the critical aspects of scaling and optimizing training for Large Language Models (LLMs) using the DeepSpeed framework. DeepSpeed has emerged as a pivotal tool in the field, integrating advanced communication backends such as NCCL and MPI to facilitate efficient GPU-to-GPU communication. This talk will commence with an overview of DeepSpeed's capabilities, emphasizing its communication backend module that supports NCCL, MPI, and hybrid/mixed modes.

Drawing inspiration from MCR-DL, a project led by an OSU student while interning at Microsoft, we will showcase the latest advancements: ZeRO++ and DeepSpeed-Ulysses. These cutting-edge techniques, pioneered by the DeepSpeed team, address key challenges in LLM training by optimizing memory and communication patterns. The presentation will explore the nuanced communication characteristics, challenges encountered, and innovative solutions in both ZeRO++ and Ulysses. The goal is to ignite ideas and discussions among graduate students and researchers on harnessing the synergy and cooperation between different communication runtimes (MVAPICH and NCCL) to further enhance collaboration and push the boundaries of performance at scale and efficiency.
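
As a hedged sketch of how the pieces described above fit together, the snippet below selects a DeepSpeed communication backend and enables a ZeRO configuration before training; the model, batch size, and config values are placeholders, the ZeRO++ and Ulysses-specific options are omitted, and the exact keys should be checked against the DeepSpeed documentation.

    # Hedged DeepSpeed setup sketch: explicit communication backend + ZeRO config.
    import torch
    import deepspeed

    deepspeed.init_distributed(dist_backend="nccl")      # "mpi" is another supported backend

    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024))

    ds_config = {
        "train_micro_batch_size_per_gpu": 4,
        "zero_optimization": {"stage": 3},               # ZeRO-3 partitions params/grads/optimizer state
        "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    }

    engine, optimizer, _, _ = deepspeed.initialize(
        model=model, model_parameters=model.parameters(), config=ds_config)

    x = torch.randn(4, 1024, device=engine.device)       # toy input batch
    loss = engine(x).square().mean()
    engine.backward(loss)                                # DeepSpeed drives the gradient communication
    engine.step()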


Bio


Ammar Ahmad Awan is a Principal Research Manager at Microsoft leading a team of researchers and software engineers investigating inference optimizations and building DeepSpeed-FastGen, an open-source inference library to serve AI models in production. In the past, he led the development of DeepSpeed-MoE, which supports both training and inference of MoE models at scale. He received his B.S., M.S., and Ph.D. degrees in Computer Science from the National University of Science and Technology (NUST), Pakistan, Kyung Hee University (KHU), South Korea, and The Ohio State University, respectively. His current research focus lies at the intersection of high-performance systems and large-scale training and inference of deep learning (DL) models. He previously worked on a Java-based Message Passing Interface (MPI) and nested parallelism with OpenMP and MPI for scientific applications. He has published several papers in conferences and journals related to these research areas. He actively contributed to various projects like MVAPICH2-GDR (High Performance MPI for GPU clusters), OMB (OSU Micro Benchmarks), and HiDL (High Performance Deep Learning) during his graduate studies at OSU.

Abstract

In this talk, we will survey recent developments of multi-GPU FFT implementations on large-scale systems, and study communication frameworks for general parallel transposition of multi-dimensional arrays. We will then evaluate asymptotic scalability behavior, the impact of selecting MPI distributions and types of collective routines for accelerating FFT performance. Finally, we will present several experiments on modern supercomputers leveraging different MPI libraries, such as MVAPICH2, and network topologies.
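
To illustrate the communication pattern at the heart of these libraries, here is a hedged mpi4py sketch of a slab-decomposed 2D FFT: each rank transforms its local rows, the data is transposed globally with MPI_Alltoall, and the remaining dimension is transformed locally. The array sizes are toy values and the packing logic is deliberately simplified compared to heFFTe or other production FFT libraries.

    # Slab-decomposed 2D FFT sketch: local FFTs + global transpose via MPI_Alltoall.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    P, rank = comm.Get_size(), comm.Get_rank()

    N = 8 * P                                   # toy global size, divisible by the rank count
    n = N // P
    local = np.random.rand(n, N) + 0j           # this rank's slab of n rows

    local = np.fft.fft(local, axis=1)           # FFT along the locally contiguous axis

    # Global transpose: pack per-destination column blocks, exchange, reassemble.
    sendbuf = np.ascontiguousarray(local.reshape(n, P, n).transpose(1, 0, 2))
    recvbuf = np.empty_like(sendbuf)
    comm.Alltoall(sendbuf, recvbuf)
    transposed = recvbuf.transpose(2, 0, 1).reshape(n, N)

    result = np.fft.fft(transposed, axis=1)     # completes the 2D FFT (result stored transposed)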


Bio


Dr. Alan Ayala is a member of the technical staff at AMD. His work focuses on the design of FFT software for GPUs and high-performance computing systems. His research interests include GPU and parallel programming, FFT applications, performance optimization, profiling tools, and network interconnects. Before joining AMD, he worked at the Innovative Computing Laboratory as a research scientist and developed the heFFTe library for FFT computation on exascale systems. Dr. Ayala received his Ph.D. degree in Computational Mathematics in 2018 from Sorbonne University in Paris, France.

3:00 - 3:45

Student Poster Session (In-Person) and Coffee Break

  • AccDP: Accelerated Data-Parallel Distributed DNN Training for Modern GPU-Based HPC Clusters, Nawras Alnaasan, The Ohio State University
  • MCR-DL: Mix and Match Communication Runtime for Deep Learning, Quentin Anthony, The Ohio State University
  • Exploring Design Factors for Optimizers in Large-Batch and Second-Order Deep Learning, Shuyuan Fan, Rutgers University
  • High Performance and Carbon-Aware Serverless Workloads Scheduling via Multi-generation Hardware, Yankai Jiang, Northeastern University
  • DPU-Bench: A New Microbenchmark Suite to Measure the Offload Efficiency of SmartNICs, Ben Michalowicz, The Ohio State University
  • PULSE: Using Mixed-Quality Models for Reducing Serverless Keep-Alive Cost, Kausalya Sankaranarayanan, Northeastern University
  • Characterizing the relationship between GPU performance and aggregate ECC errors, Yu Sun, George Mason University
  • Enhancing Earthquake Simulation Accuracy Through Optimized Checkpointing and Compression Techniques Using MVAPICH, Arnav Talreja, University of California, San Diego
  • Performance Implications of MPI-IO Consistency, Chen Wang, Lawrence Livermore National Laboratory
  • High Performance & Scalable MPI Library Over Broadcom RoCE, Shulei Xu, The Ohio State University
  • A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference, Jinghan Yao, The Ohio State University
  • Waveguide or not? Revised ground motion simulations for greater Los Angeles from the M7.8 ShakeOut earthquake scenario, Te-Yang Yeh, San Diego State University
  • Optimizing Subgraph Matching on Temporal Graphs, Hyunjin Yi, Florida State University
  • Characterizing GPU Memory Errors: Insights from a Cross-supercomputer Study, Zhu Zhu, George Mason University

Abstract

The talk will introduce the audience to C-DAC’s indigenously developed Trinetra interconnect and will focus on enabling MVAPICH2 over Trinetra interconnect. Performance numbers with MVAPICH2 over Trinetra will also be discussed during the talk.


Bio


Mr. Yogeshwar Sonawane has been associated with the Centre for Development of Advanced Computing (C-DAC), Pune, for the last 20 years. He works with the HPC Technologies group and holds the position of Scientist F. He leads the team involved in system software development for the Trinetra network and firmware development for the Rudra server platform. His research interests include High Performance Interconnects, Programming Models and Performance Optimizations. Yogeshwar has a Bachelor of Engineering (B.E.) degree in Electronics & Telecommunications from Govt. College of Engineering, Pune (COEP), India.

Abstract

With the exponential growth of geospatial data coming from various sources, the need for more robust methods to handle these data at large scale has become a necessity. The challenges are manifold, from geospatial data modeling to weather and climate forecasting and detecting excursion sets in spatial and spatio-temporal data. By improving the accuracy and efficiency of large-scale geospatial processing techniques, we can significantly enhance the accuracy and reliability of these applications, leading to better decision-making and more accurate predictions. Thus, HPC can play a vital role in efficiently processing and analyzing this massive influx of data. Modern hardware architectures such as GPUs are increasingly used to enhance computational capabilities on multi- and many-core architectures. This powerful hardware can offer a more sustainable solution when combined with approximation methods that provide high accuracy while requiring less computational power. Through our recent research, we developed robust and sustainable solutions for geospatial applications, minimizing energy consumption and reducing the environmental impact of large-scale data analysis. These solutions rely on HPC systems and incorporate low-rank and mixed-precision approximations. This talk will highlight the benefits of these solutions in terms of sustainability and efficiency in the context of several geospatial applications.


Bio


Sameh Abdulah obtained his MS and Ph.D. degrees from Ohio State University, Columbus, USA, in 2014 and 2016, respectively. Presently, he serves as a research scientist at the Extreme Computing Research Center (ECRC), King Abdullah University of Science and Technology, Saudi Arabia. His research focuses on various areas, including high-performance computing applications, big data, bitmap indexing, handling large spatial datasets, parallel spatial statistics applications, algorithm-based fault tolerance, and machine learning and data mining algorithms. Sameh was a part of the KAUST team nominated for the ACM Gordon Bell Prize in 2022 and 2024 (climate track) for their work on large-scale climate/weather modeling and prediction.

4:45 - 5:00

Open MIC Session

5:00 - 6:30

Visit to the State of Ohio Computer Center, SOCC (Optional)

6:30 - 9:30

Banquet Dinner at Bravo Restaurant

1803 Olentangy River RD

Columbus, OH 43212

Wednesday, August 21

7:30 - 8:30

Registration and Continental Breakfast

Abstract

In this talk, I’ll cover upcoming developments in computing research infrastructure to support future big data/AI/HPC platforms, including design inputs for future systems. I’ll also talk about early experience on the Vista systems, one of the first deployments of NVIDIA’s Grace CPU and Grace-Hopper CPU/GPU integration. I’ll also touch on the programming models we are supporting, shifts in user workloads, and upcoming research directions.


Bio


Dr. Dan Stanzione, Associate Vice President for Research at The University of Texas at Austin and Executive Director of the Texas Advanced Computing Center (TACC), is a nationally recognized leader in high performance computing, and has been involved in supercomputing for more than 30 years. He is the principal investigator (PI) for a number of the National Science Foundation (NSF) supercomputers, including the current Frontera system, which is the fastest supercomputer at a U.S. university, and is leading the upcoming NSF Leadership Class Computing Facility. Stanzione received his bachelor's degree in electrical engineering and his master's degree and doctorate in computer engineering from Clemson University.

Abstract

Supercomputers are not only the core infrastructure that enables simulations in all areas where experiments are difficult, but they are also increasingly being used to develop AI technologies. In this talk, we will introduce the Supreme-K project led by the Korean government. I will introduce the accelerators and the software and hardware platforms for the supercomputer, which is the first designed and developed in-house in Korea, as well as the future development direction.


Bio


Yoomi Park received her Ph.D. from Chungnam National University, Republic of Korea, in 2010. She is currently a principal researcher at the Supercomputing System Research Section, Electronics and Telecommunications Research Institute, Daejeon, Korea. Her research interests include high performance computing, artificial intelligence, and distributed and parallel computing.

Abstract

SDSC operates and supports several NSF funded clusters ranging from the Expanse supercomputer (NSF Award# OAC 1928224) targeted at long-tail workloads with both CPU and GPU based nodes to Voyager (NSF Award# OAC 2005369) featuring custom deep learning focused processors. Expanse's standard compute nodes are each powered by two 64-core AMD EPYC 7742 processors and have 256 GB of DDR4 memory, while each GPU node contains four NVIDIA V100s (32 GB SMX2) connected via NVLINK and dual 20-core Intel Xeon 6248 CPUs. Recently the cluster has been expanded with an additional rack to support the Center for Western Weather and Water Extremes (CW3E). Voyager features 42x Intel Habana Gaudi training nodes, each with 8 training processors (336 in total). The training processors feature on-chip networking with RoCE support, scaled up using a 400GigE switch. We will present updated results using MVAPICH2, MVAPICH2-GDR on SDSC systems. In addition, our experiences with the deployment and usage of INAM on the Expanse supercomputer will be discussed. An overview of the upcoming Cosmos system (NSF Award# OAC 2404323) featuring the AMD Instinct MI300A APUs and HPE Slingshot Interconnect will be provided.


Bio


Mahidhar Tatineni received his M.S. and Ph.D. in Aerospace Engineering from UCLA. He currently leads the User Services group at SDSC as a Computational and Data Science Research Specialist Manager. He has led the support of high-performance computing and data applications software on several NSF and UC funded HPC and AI supercomputers, including Voyager, Expanse, Comet, and Gordon at SDSC. He has worked on many NSF funded optimization and parallelization research projects such as MPI performance tuning frameworks, hybrid programming models, big data middleware, and application performance evaluation using next generation communication mechanisms for emerging HPC systems. He has also led tutorials on AI, HPC, and Kubernetes topics at several PEARC and SC conferences. He is co-PI on the NSF funded Expanse HPC system and the Prototype National Research Platform (PNRP) projects at SDSC. He is the PI on an NSF funded Category II system, Cosmos, which will feature AMD Instinct MI300A accelerated processing units (APUs) that combine CPU and GPU capabilities with a unified memory architecture.

10:30 - 11:00

Morning Coffee Break

Abstract

The Gordon Bell-winning AWP-ODC application continues to push the boundaries of earthquake simulation by leveraging the enhanced performance of MVAPICH on both CPU and GPU-based architectures. This presentation highlights the recent improvements to the code and its application to broadband deterministic 3D wave propagation simulations of earthquake ground motions, incorporating high-resolution surface topography and detailed underground structures. The results of these simulations provide critical insights into the potential impacts of major earthquakes, contributing to more effective disaster preparedness and mitigation strategies. Additionally, the presentation will address the scientific and technical challenges encountered during the process and discuss the implications for future large-scale seismic studies on Exascale computing systems.


Bio


Dr. Yifeng Cui heads the High Performance GeoComputing Lab at SDSC and helped to establish the Southern California Earthquake Center (SCEC) as a world leader in advancing high performance computing in earthquake system science. Cui's groundbreaking work includes enabling TeraShake, ShakeOut, and M8, some of the worst-case scenarios on the San Andreas fault, revealing order-of-magnitude LA wave-guide amplification. He is a recipient of several HPC awards, including the 2015 NVIDIA Global Impact Award, the 2013 IDC HPC Innovation Excellence Award, and 2009/2011 SciDAC OASCR awards. He also directed an Intel Parallel Computing Center on earthquake research. Cui earned his Ph.D. in Hydrology from the University of Freiburg, Germany.

Dr. Te-Yang Yeh is a postdoctoral research scholar at San Diego State University, working in collaboration with the San Diego Supercomputer Center (SDSC) on the development of advanced wave propagation simulation codes. Dr. Yeh has been deeply involved in earthquake ground motion simulations using high-performance computing (HPC) to refine the understanding of the Earth's elastic properties and enhance the accuracy of predicted ground motions in Southern California and other regions across the United States. Leveraging the exceptional performance of the Gordon Bell Award-winning AWP-ODC application, Dr. Yeh conducts large-scale deterministic 3D wave propagation simulations at frequencies up to 10Hz, significantly advancing the use of numerical simulations for real-world seismic risk assessments.

Abstract

At LLNL, we are using Spack to build, test, and deploy MVAPICH on our production clusters. This presentation will go over our strategy, including lessons learned, benefits of using Spack for MPI deployments, and our future plans using new Spack features.


Bio


Nathan Hanford is a Computer Scientist in the Livermore Computing Division at Lawrence Livermore National Laboratory. His research is currently focused on application and development environment portability for parallel software applications at the application binary interface (ABI). His operational work is focused on development environment design and verification, message passing interface (MPI) support and development, and system-wide accelerator-aware interconnect benchmarking for codesign, system acceptance, and strategic decision-making support.

Towards accomplishing these goals, Nathan collaborates with Groupe EOLEN and the Commissariat à l'énergie atomique et aux énergies alternatives (CEA), leveraging the Wi4MPI project, which dynamically translates ABI-incompatible MPI operations at runtime. He also works closely with multiple vendor partners to increase their middleware portability to a variety of compute clusters, and participates in the MPI Forum.

Nathan came from a high-performance networking background. While earning his PhD at University of California Davis, he was a perennial summer student with ESnet at Lawrence Berkeley Laboratory. During this time, he focused on end-system optimizations and congestion avoidance for high-speed, long-distance networking.

Abstract

The Office of Advanced Cyberinfrastructure (OAC) supports and coordinates the development, acquisition, and provision of state-of-the-art cyberinfrastructure resources, tools and services essential to the advancement and transformation of science and engineering. OAC also supports forward-looking research and education to expand the future capabilities of cyberinfrastructure specific to science and engineering. By fostering a vibrant ecosystem of technologies and a skilled workforce of developers, researchers, staff and users, OAC serves the growing community of scientists and engineers, across all disciplines, whose work relies on the power of an advanced research cyberinfrastructure. In pursuit of this mission, OAC supports the exploration, development, and deployment of a wide range of cyberinfrastructure technologies within a highly interoperable and collaborative ecosystem. These include: advanced computing, networks and services for computational and data-intensive science and engineering research; trustworthy, reusable and sustainable community software for science and engineering; robust and reusable data tools to aid all research communities in their management and use of digital information. OAC also supports training programs, scholarly exchanges, and virtual organizations, to enable the productive use, sustainable maintenance, and effective governance of these systems. In these efforts, OAC collaborates with all NSF Offices and Directorates to develop models, prototypes, and approaches to research cyberinfrastructure that open new frontiers for discovery, furthering the mission of the National Science Foundation and national science and engineering priorities. The goal of this talk is to expose faculty/researchers to opportunities in the NSF’s CISE/OAC division programs, providing tips for writing successful NSF proposals, and fostering collaboration.


Bio


Sheikh Ghafoor is a professor of Computer Science at Tennessee Tech University. Currently he is serving as a program director at the US National Science Foundation in the Office of Advanced Cyberinfrastructure in the Directorate of Computer and Information Science and Engineering. He received his Ph.D. in Computer Science from Mississippi State University. His main research interests are: 1) High Performance Computing, 2) Computer Security, 3) Computational Earth Science, and 4) Computer Science Education. Dr. Ghafoor has multiple active, externally funded research projects in each of these areas. He has been the principal investigator on grants from the National Science Foundation, National Aeronautics and Space Administration, Department of Energy, National Security Agency, and other agencies, and is currently mentoring several graduate and undergraduate students working on these research projects at Tennessee Tech University (TTU). Dr. Ghafoor has secured 29 externally funded grants as either PI (19) or Co-PI (10) with a total budget of approximately $8.7 million (TTU portion $5.44 million). As a professor, Dr. Ghafoor has taught a wide variety of courses, with a primary teaching interest in parallel and distributed computing and computer networks. In addition, Dr. Ghafoor has developed and taught multiple new courses at TTU in the fields of parallel and distributed computing and networking. He has mentored many graduate and undergraduate students and junior faculty.

Abstract

This talk presents heFFTe as a cross-platform library for scaling up the computation of the three-dimensional Fast Fourier Transform (FFT) on large scale heterogeneous systems with GPUs. With its ability to utilize different communication patterns and different vendor backends, heFFTe also serves as a good benchmark for MPI implementations, such as MVAPICH and OpenMPI, on different system architectures. The talk will present the strong scaling behavior of heFFTe on different systems using both point-to-point and collective communication primitives. We also plan to show preliminary results utilizing the data compression capability in MVAPICH-Plus.


Bio


Ahmad Abdelfattah, research assistant professor at the Innovative Computing Laboratory at the University of Tennessee, received his PhD in computer science from King Abdullah University of Science and Technology (KAUST) in 2015, where he was a member of the Extreme Computing Research Center (ECRC). His research interests span high performance computing, parallel numerical algorithms, and general purpose GPU computing. He currently serves as the principal investigator of the MAGMA library. Abdelfattah has been acknowledged by NVIDIA and AMD for contributing to their numerical BLAS libraries, cuBLAS and rocBLAS, respectively.

12:45 - 1:30

Lunch Break

Abstract

In this presentation, we will discuss the research on improving collective communication performance using the CXL interconnect, a joint effort between OSU and ETRI. CXL (Compute Express Link) is a cutting-edge high-speed interconnect technology that enhances system scalability by efficiently supporting communication among computing resources such as CPUs, memory, accelerators, and storage. This technology is gaining significant attention as it enables composable computing architectures, allowing data center and HPC systems to configure computing resources in a pool, utilizing only the necessary amount to maximize resource efficiency. In particular, this presentation will focus on techniques to improve MPI allgather and reduce-scatter performance using iMEX. The proposed technique is expected to significantly enhance communication performance in CXL-enabled multi-node computing environments at the rack scale, compared to conventional methods.
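
For reference, the two collectives targeted by this work appear below in a minimal mpi4py sketch at the MPI API level; the CXL/iMEX-based optimization itself is internal to the MPI library and invisible to this code. The per-rank block size is an arbitrary assumption.

    # The collectives targeted by the CXL work, expressed at the MPI API level.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    P = comm.Get_size()
    block = 4096                               # assumed per-rank block size (elements)

    # MPI_Allgather: each rank contributes one block; every rank receives all blocks.
    mine = np.full(block, comm.Get_rank(), dtype=np.float64)
    gathered = np.empty(P * block)
    comm.Allgather(mine, gathered)

    # MPI_Reduce_scatter_block: element-wise sum, with the result scattered in equal blocks.
    contrib = np.random.rand(P * block)
    my_block = np.empty(block)
    comm.Reduce_scatter_block(contrib, my_block, op=MPI.SUM)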


Bio


HooYoung Ahn received the Ph.D. degree in the School of Computing from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea, in 2016. She is currently a Senior Researcher with the Supercomputing System Research Section, Electronics and Telecommunications Research Institute, Daejeon, Republic of Korea. Her research interests include distributed and parallel computing, artificial intelligence, and high performance computing.

Abstract

This talk will present a brief overview of four software products of X-ScaleSolutions for HPC and AI applications with advanced capabilities and significant performance benefits: 1) MVAPICH2-DPU, 2) X-ScaleSecureMPI, 3) X-ScalePETSc, and 4) X-ScaleAI packages. The MVAPICH2-DPU library takes advantage of the features of NVIDIA Bluefield DPUs to offload communication components in the MPI library and deliver best-in-class scale-up and scale-out performance for HPC and AI applications. It integrates key components enabling full computation and communication overlap, especially with non-blocking collectives. The X-ScaleSecureMPI library adds security protection in MPI communication for HPC applications with minimal performance overhead. The X-ScalePETSc package offers the same PETSc APIs with performance optimizations that leverage the salient features of MVAPICH. The X-ScaleAI package provides an optimized and integrated software stack for high-performance distributed pre-training, fine-tuning, and inference. It supports models defined in PyTorch or HuggingFace, including large language models and vision models. This talk will present the key features and performance benefits of these products.


Bio


Dr. Donglai Dai is Chief Engineer at X-ScaleSolutions, where he leads the company’s R&D team. His current work focuses on developing and enhancing communication libraries, checkpointing and restart libraries, secure communication libraries, performance analysis tools for distributed and parallel HPC and AI applications on HPC systems and clouds. He is the principal investigator for several DOE SBIR grants and a member of the PC committee of SuperCheck-SC23 workshop. Before joining X-Scale Solutions, he worked at Intel, Cray, and SGI. He holds over 10 US patents and has published over 40 technical papers, presentations, or book chapters. He earned a PhD in computer science from Ohio State University.

2:30 - 3:00

Afternoon Coffee Break

Abstract

The High Performance Computing (HPC) community has widely adopted Message Passing Interface (MPI) libraries to exploit high-speed and low-latency networks like InfiniBand, Omni-Path, Slingshot, and others. This talk provides an overview of MPI4Spark and MPI4Dask, which are enhanced versions of the Spark and Dask frameworks, respectively. These stacks can utilize MPI for communication in a parallel and distributed setting on HPC systems connected via fast interconnects. MPI4Spark can launch the Spark ecosystem using MPI launchers to utilize MPI communication. It also maintains isolation for application execution on worker nodes by forking new processes using Dynamic Process Management (DPM). It bridges semantic differences between the event-driven communication in Spark and the application-driven communication engine in MPI. MPI4Dask is an MPI-based custom Dask framework targeted at modern HPC clusters built with CPUs and NVIDIA GPUs. MPI4Dask provides point-to-point asynchronous I/O communication coroutines, which are non-blocking concurrent operations defined using the async/await keywords from Python's asyncio framework. The talk concludes by evaluating the performance of MPI4Spark and MPI4Dask on state-of-the-art HPC systems.
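
The async/await communication style mentioned above can be illustrated with a small, hedged sketch (not the actual MPI4Dask code): a non-blocking mpi4py transfer is wrapped in an asyncio coroutine that yields control to the event loop until the MPI request completes, so other coroutines can run while the message is in flight. The two-rank setup, tag, and buffer size are illustrative assumptions.

    # Hedged illustration of asyncio-style non-blocking MPI communication.
    # Run with two ranks, e.g.: mpirun -np 2 python async_mpi.py
    import asyncio
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    async def wait_for(request):
        """Yield to the event loop until a non-blocking MPI request completes."""
        while not request.Test():
            await asyncio.sleep(0)

    async def main():
        buf = np.arange(1024, dtype=np.float64)
        if rank == 0:
            req = comm.Isend(buf, dest=1, tag=7)
        elif rank == 1:
            req = comm.Irecv(buf, source=0, tag=7)
        else:
            return                               # sketch assumes exactly two ranks
        await wait_for(req)                      # other coroutines may run while waiting
        print(rank, "transfer complete")

    asyncio.run(main())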


Bio


Aamir Shafi is currently a Research Scientist at the Ohio State University where he is involved in the High Performance Big Data project. Dr. Shafi was a Fulbright Visiting Scholar at MIT where he worked on the award-winning Cilk technology. Dr. Shafi received his PhD in Computer Science from the University of Portsmouth, UK in 2006. Dr. Shafi’s current research interests include architecting robust libraries and tools for Big Data computation with emphasis on Machine and Deep Learning applications. Dr. Shafi co-designed and co-developed a Java-based MPI-like library called MPJ Express.

3:45 - 5:00

Short Talks

  • PML-MPI: A Pre-Trained ML Framework for Efficient Collective Algorithm Selection in MPI, Goutham Kuncham, The Ohio State University
  • Accelerating MPI AllReduce Communication with Efficient GPU-Based Compression Schemes on Modern GPU Clusters, Bharath Ramesh, The Ohio State University
  • Profiling, Storing and Monitoring HPC Communication Data at Scale by OSU INAM, Hari Subramoni, The Ohio State University
  • OMB-CXL: A Micro-Benchmark Suite for Evaluating MPI Communication Utilizing Compute Express Link Memory Devices, Tu Tran, The Ohio State University
  • OMB-FPGA: A Microbenchmark Suite for FPGA-aware MPIs using OpenCL and SYCL, Nick Contini, The Ohio State University
  • HINT: Designing Cache-Efficient MPI_Alltoall using Hybrid Memory Copy Ordering and Non-Temporal Instructions, Bharath Ramesh, The Ohio State University

5:00

Closing Remarks