MUG'19

(Preliminary Schedule)

Conference Location: Ohio Supercomputer Center Bale Theater

Monday, August 19

8:30 - 9:00

Registration and Continental Breakfast

Abstract

This hands-on tutorial will jumpstart developers with Arm's performance engineering tools for MVAPICH2 and Arm's Scalable Vector Extension (SVE). Arm Forge is a cross-platform performance engineering toolkit comprising Arm DDT and Arm MAP. DDT is a parallel debugger supporting a wide range of parallel architectures and models including MPI, UPC, CUDA and OpenMP, and MAP is a low-overhead line-level profiler for MPI, OpenMP and scalar programs. We will present Arm Forge and demonstrate how performance problems in applications using MVAPICH2 can be identified and resolved. We will also explore custom metrics for MPI profiling and demonstrate how Arm Forge may be used on extreme-scale applications with extremely low overhead and little or no loss of capability. Finally, we will introduce Arm's tools for performance investigation of MPI programs that use SVE and demonstrate how these tools may be used with MVAPICH2.
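
As a point of reference for the hands-on material, below is a minimal sketch (not part of the tutorial materials) of an MPI program in C with a deliberate rank-dependent load imbalance, the kind of hotspot that a line-level profiler such as Arm MAP attributes to individual source lines and that DDT can inspect rank by rank.

```c
/* imbalance.c - hypothetical test program, not part of the tutorial materials.
 * Each rank does rank-proportional work before a barrier, producing the kind
 * of load imbalance that a line-level profiler such as Arm MAP highlights.
 * Typical MPI workflow: mpicc imbalance.c -o imbalance && mpirun -np 4 ./imbalance
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Rank-dependent busy work: higher ranks spin longer. */
    double sum = 0.0;
    long iters = 20000000L * (rank + 1);
    for (long i = 1; i <= iters; i++)
        sum += 1.0 / (double)i;

    /* All ranks wait here for the slowest one; a profiler attributes the
     * waiting time to this line. */
    MPI_Barrier(MPI_COMM_WORLD);

    double total = 0.0;
    MPI_Reduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}
```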


Bio

John Linford

John Linford is a principal applications engineer at Arm. He has extensive experience creating, using, supporting, and deploying high performance computing applications and technologies. His research interests include emerging computer architectures, compilers, code generation, performance analysis, and numerical simulation (particularly atmospheric chemistry). He has developed tools for chemical kinetic simulation, rotorcraft engineering, software performance analysis, and software environment management.

10:30 - 11:00

Break

12:30 - 1:30

Lunch

Abstract

This lecture will help you understand what FPGA hardware acceleration provides and when it can be used to complement or replace GPUs. The SNAP framework provides software engineers with a means to use this technology in a snap! The unique advantages of POWER technology, including CAPI / OpenCAPI coupled with the SNAP framework, will be presented. The memory coherency and low latency that FPGAs bring will be explored through very simple examples.


Bio

Alexandre Castellane

3:00 - 3:30

Break

6:00 - 9:30

Reception Dinner

Tuesday, August 20

7:45 - 8:30

Registration and Continental Breakfast

10:30 - 11:00

Break

Abstract

In-Network Computing transforms the data center interconnect into a "distributed CPU" and "distributed memory", enabling users to overcome performance barriers and to perform faster, more scalable data analysis. HDR 200G InfiniBand In-Network Computing technology includes several elements: the Scalable Hierarchical Aggregation and Reduction Protocol (SHARP), smart tag matching and rendezvous protocols, and more. These technologies are in use at several of the recent large-scale supercomputers around the world, including top TOP500 platforms. The session will discuss the InfiniBand In-Network Computing technology and performance results, as well as a view of the future roadmap.
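
For context, the sketch below (an assumed example, not from the talk) shows a plain MPI_Allreduce in C; on a SHARP-capable HDR InfiniBand fabric the reduction can be offloaded to the switches without any change to the application code. The MV2_ENABLE_SHARP variable named in the comment is taken from MVAPICH2 documentation; verify the exact name and supported releases for your installation.

```c
/* allreduce_sharp.c - minimal sketch, not from the talk.
 * A plain MPI_Allreduce; on a SHARP-capable InfiniBand fabric the reduction
 * can be offloaded to the switches, transparently to this code. With MVAPICH2
 * this is typically enabled at run time via an environment variable
 * (e.g. MV2_ENABLE_SHARP=1 -- check the MVAPICH2 user guide for your release).
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank contributes its rank id; the reduction sums them. */
    int contribution = rank, sum = 0;
    MPI_Allreduce(&contribution, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks = %d\n", sum);

    MPI_Finalize();
    return 0;
}
```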


Bio

Gilad Shainer

Gilad Shainer has served as Mellanox's senior vice president of marketing since March 2019. Previously, Mr. Shainer was Mellanox's vice president of marketing from March 2013 to March 2019, and vice president of marketing development from March 2012 to March 2013. Mr. Shainer joined Mellanox in 2001 as a design engineer and later served in senior marketing management roles between July 2005 and February 2012. Mr. Shainer serves as the chairman of the HPC-AI Advisory Council organization and the president of the UCF consortium. Mr. Shainer holds multiple patents in the field of high-speed networking. He is also a recipient of the 2015 R&D100 award for his contribution to the CORE-Direct collective offload technology. Gilad Shainer holds MSc and BSc degrees in Electrical Engineering from the Technion – Israel Institute of Technology.

Abstract

The TAU Performance System is a powerful and highly versatile profiling and tracing tool ecosystem for performance analysis of parallel programs at all scales. TAU has evolved with each new generation of HPC systems and presently scales efficiently to hundreds of thousands of cores on the largest machines in the world. To meet the needs of computational scientists to evaluate and improve the performance of their applications, we present TAU's support for key MVAPICH features, including the MPI Tools (MPI_T) interface and the ability to set MPI_T control variables on a per-communicator basis. TAU's support for GPUs, including CUDA, OpenCL, OpenACC, Kokkos, and ROCm, improves performance evaluation of heterogeneous programming models. The talk will also describe TAU's support for the MPI performance and control variables exported by MVAPICH, its instrumentation of the OpenMP runtime, and its APIs for instrumenting Python programs. TAU uses these interfaces on unmodified binaries without the need for recompilation. The talk will describe these instrumentation techniques, which simplify the use of performance tools through compiler-based instrumentation, binary rewriting, and preloading of shared objects. It will also highlight TAU's analysis tools, including its 3D profile browser, ParaProf, and its cross-experiment analysis tool, PerfExplorer. http://tau.uoregon.edu
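
As an illustration of the MPI_T interface mentioned above, the sketch below (an assumed example, not TAU code) uses the standard MPI Tools Information Interface to enumerate the control variables an MPI library such as MVAPICH2 exports; tools like TAU read and set these variables through the same interface.

```c
/* mpit_list_cvars.c - minimal sketch (assumed example, not from the talk).
 * Enumerates the MPI_T control variables exposed by the MPI library; MVAPICH2
 * exports many of its runtime knobs this way.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, num_cvars = 0;

    /* MPI_T can be initialized before (and independently of) MPI_Init. */
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Init(&argc, &argv);

    MPI_T_cvar_get_num(&num_cvars);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        printf("%d control variables exported\n", num_cvars);
        for (int i = 0; i < num_cvars; i++) {
            char name[256], desc[256];
            int name_len = sizeof(name), desc_len = sizeof(desc);
            int verbosity, bind, scope;
            MPI_Datatype dtype;
            MPI_T_enum enumtype;
            MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                                &enumtype, desc, &desc_len, &bind, &scope);
            printf("  cvar %d: %s\n", i, name);
        }
    }

    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}
```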


Bio

Sameer Shende

Dr. Sameer Shende has helped develop the TAU Performance System, the Program Database Toolkit (PDT), the Extreme-scale Scientific Software Stack (E4S) [https://e4s.io] and the HPCLinux distro. His research interests include tools and techniques for performance instrumentation, measurement, analysis, runtime systems, HPC container runtimes, and compiler optimizations. He serves as the Director of the Performance Research Laboratory at the University of Oregon, and as the President and Director of ParaTools, Inc., ParaTools, SAS, and ParaTools, Ltd.

12:45 - 1:30

Lunch

Abstract

'HPC in the Cloud' has always been a dream of HPC users, as it offers ever-ready, instantly scalable compute resources and unlimited storage. Moreover, the ever-growing complexity of resource-hungry applications and their massive data requirements continue to drive a natural embrace of the cloud for HPC, Big Data and Deep Learning workloads. However, performance concerns in a cloud environment have traditionally discouraged adoption for HPC workloads. This talk focuses on how HPC offerings in Azure address these challenges and explains the design pillars that allow Microsoft to offer "bare-metal performance and scalability" on the Microsoft Azure cloud. The talk also covers the features of the latest Microsoft Azure HPC offerings and provides in-depth performance insights and recommendations for using MVAPICH2 and MVAPICH2-X on Microsoft Azure. Finally, we will demonstrate how to quickly deploy an MVAPICH2-powered cluster on the Microsoft Azure HPC offerings.
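
As a simple illustration, the sketch below (an assumed example, not from the talk) is a small point-to-point latency check in the spirit of the OSU micro-benchmarks that one might run to sanity-check an MVAPICH2 deployment, for example on Azure HPC instances.

```c
/* pingpong.c - minimal sketch of a point-to-point latency check, in the spirit
 * of the OSU micro-benchmarks (not the actual benchmark suite). Useful as a
 * quick sanity check after bringing up an MVAPICH2 cluster, e.g. on Azure.
 */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define ITERS 1000
#define MSG_SIZE 8

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    char buf[MSG_SIZE];
    memset(buf, 0, sizeof(buf));

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg one-way latency: %.2f us\n",
               (t1 - t0) * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}
```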


Bio

Jithin Jose

Dr. Jithin Jose is a Senior Software Engineer at Microsoft. His work is focused on the co-design of software and hardware building blocks for high performance computing platforms, and on designing communication runtimes that seamlessly expose hardware capabilities to programming models and middleware. His research interests include high performance interconnects and protocols, parallel programming models, virtualization, big data and cloud computing. Before joining Microsoft, he worked at Intel and IBM Research. He has published more than 25 papers in major conferences and journals related to these research areas. Dr. Jose received his Ph.D. degree from The Ohio State University in 2014.

Abstract

Checkpointing is the ability to save the state of a running process to stable storage and later restart that process from the point at which it was checkpointed. Transparent checkpointing (also known as system-level checkpointing) refers to the ability to checkpoint a (possibly MPI-parallel or distributed) application without modifying the binaries of that target application. Traditional wisdom has assumed that the transparent checkpointing approach has some natural restrictions. Examples of long-held restrictions are: (i) the need for a separate network-aware checkpoint-restart module for each network that will be targeted (e.g., one for TCP, one for InfiniBand, one for Intel Omni-Path, etc.); (ii) the impossibility of transparently checkpointing a CUDA-based GPU application that uses NVIDIA UVM (UVM is "unified virtual memory", which allows the host CPU and the GPU device to each access the same virtual address space at the same time); and (iii) the impossibility of transparently checkpointing an MPI application that was compiled for one MPI library implementation (e.g., for MPICH or for Open MPI) and then restarting it under an MPI implementation with targeted optimizations (e.g., MVAPICH2-X or MVAPICH2-EA). This talk breaks free from the restrictions described above and presents an efficient new software architecture: split processes. The "MANA for MPI" software demonstrates this split-process architecture. The MPI application code resides in "upper-half memory", and the MPI/network libraries reside in "lower-half memory". The tight coupling of the upper and lower halves ensures low runtime overhead. And yet, when restarting from a checkpoint, "MANA for MPI" allows one to replace the original lower half with a different MPI library implementation. This different MPI implementation may offer such specialized features as enhanced intra- and inter-node point-to-point performance and enhanced performance of collective communication (e.g., with MVAPICH2-X), or perhaps better energy awareness (e.g., with MVAPICH2-EA). Further, the new lower-half MPI may be optimized to run on different hardware, including a different network interconnect, a different number of CPU cores, a different configuration of ranks per node, etc. This makes cross-cluster migration both efficient and practical. This talk represents joint work with Rohan Garg and Gregory Price.
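
To make the split-process idea concrete, here is a conceptual C sketch; it is not MANA's implementation or API. The "upper half" calls MPI only through a small dispatch table that is bound at run time (via dlopen/dlsym) to whichever MPI shared library the "lower half" loads, so a restart could in principle bind a different implementation. Real systems such as MANA must also virtualize MPI handles and restore network state, which is omitted here.

```c
/* split_dispatch.c - conceptual sketch only; NOT MANA's implementation or API.
 * The application ("upper half") never links MPI directly: it calls through a
 * dispatch table that is (re)bound to whichever MPI shared library the
 * "lower half" loads, e.g. a different implementation after restart.
 * Build with: cc split_dispatch.c -o split_dispatch -ldl
 */
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

/* Function-pointer table the upper half calls through. */
struct mpi_dispatch {
    int (*init)(int *, char ***);
    int (*finalize)(void);
};

/* Lower half: bind the table to a given MPI shared library at run time.
 * The library path is a parameter, so a restart can pick a different one. */
static int bind_lower_half(struct mpi_dispatch *d, const char *libpath)
{
    void *h = dlopen(libpath, RTLD_NOW | RTLD_GLOBAL);
    if (!h) { fprintf(stderr, "dlopen: %s\n", dlerror()); return -1; }
    d->init     = (int (*)(int *, char ***))dlsym(h, "MPI_Init");
    d->finalize = (int (*)(void))           dlsym(h, "MPI_Finalize");
    return (d->init && d->finalize) ? 0 : -1;
}

int main(int argc, char **argv)
{
    /* "libmpi.so" is a hypothetical default; pass the real path as argv[1]. */
    const char *lib = (argc > 1) ? argv[1] : "libmpi.so";
    struct mpi_dispatch mpi;
    if (bind_lower_half(&mpi, lib) != 0) return EXIT_FAILURE;

    mpi.init(&argc, &argv);
    /* A real split-process runtime also virtualizes communicators, requests,
     * and other opaque MPI handles across implementations; omitted here. */
    mpi.finalize();
    return 0;
}
```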


Bio

Gene Cooperman

Professor Cooperman works in high-performance computing and scalable applications for computational algebra. He received his B.S. from the University of Michigan in 1974, and his Ph.D. from Brown University in 1978. He then spent six years in basic research at GTE Laboratories. He came to Northeastern University in 1986, and has been a full professor there since 1992. His visiting research positions include a 5-year IDEX Chair of Attractivity at the University of Toulouse/CNRS in France, and sabbaticals at Concordia University, at CERN, and at Inria. He is one of the more than 100 co-authors on the foundational Geant4 paper, whose current citation count is at 25,000. The extension of the million-line code of Geant4 to use multi-threading (Geant4-MT) was accomplished in 2014 on the basis of joint work with his PhD student, Xin Dong. Prof. Cooperman currently leads the DMTCP project (Distributed Multi-Threaded CheckPointing) for transparent checkpointing. The project began in 2004, and has benefited from a series of PhD theses. Over 100 refereed publications cite DMTCP as having contributed to their research project. Prof. Cooperman's current interests center on the frontiers of extending transparent checkpointing to new architectures. His work has been applied to VLSI circuit simulators, circuit verification (e.g., by Intel, Mentor Graphics, and others), formalization of mathematics, bioinformatics, network simulators, high energy physics, cyber-security, big data, middleware, mobile computing, cloud computing, virtualization of GPUs, and of course high performance computing (HPC).

3:00 - 3:45

Break and Student Poster Session

Abstract

As the number of cores integrated in a modern processor package increases, the scalability of MPI intra-node communication becomes more important. To reduce intra-node messaging overheads, MVAPICH2 provides intra-node communication channels such as shared memory and memory mapping. The memory-mapping channel was designed particularly for large messages with kernel-level assistance (i.e., CMA and LiMIC2). In this talk, I will introduce new interfaces for the kernel-level assistance in MVAPICH2 that improve the performance of intra-node collective communications. Our preliminary results show that the new interfaces can reduce the latency of MPI_Bcast() by up to 84% on a 120-core machine by perfectly overlapping the data copy operations.
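
For reference, the sketch below (an assumed example, not the author's benchmark) measures average MPI_Bcast latency for a large message; with MVAPICH2 the intra-node channel used (shared memory versus kernel-assisted copy via CMA or LiMIC2) is selected by runtime parameters, so the same code can be used to compare channels. The parameter name in the comment is an assumption to verify against the MVAPICH2 user guide.

```c
/* bcast_latency.c - minimal sketch, not the author's benchmark.
 * Measures average MPI_Bcast latency for a large message on one node. Which
 * intra-node channel MVAPICH2 uses (shared memory vs. kernel-assisted copy
 * via CMA/LiMIC2) is chosen by runtime parameters such as MV2_SMP_USE_CMA
 * (name assumed from MVAPICH2 documentation; verify for your release).
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES (4 * 1024 * 1024)   /* 4 MiB: exercises the large-message path */
#define ITERS 100

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(MSG_BYTES);

    /* Warm-up broadcast so buffers and channels are set up. */
    MPI_Bcast(buf, MSG_BYTES, MPI_CHAR, 0, MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Bcast(buf, MSG_BYTES, MPI_CHAR, 0, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg MPI_Bcast latency (%d bytes): %.2f us\n",
               MSG_BYTES, (t1 - t0) * 1e6 / ITERS);

    free(buf);
    MPI_Finalize();
    return 0;
}
```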


Bio

Hyun-Wook Jin

Hyun-Wook Jin is a Professor in the Department of Computer Science and Engineering at Konkuk University, Seoul, Korea. He is leading the System Software Research Laboratory (SSLab) at Konkuk University. Before joining Konkuk University in 2006, He was a Research Associate in the Department of Computer Science and Engineering at The Ohio State University. He received Ph.D. degree from Korea University in 2003. His main research focus is on operating systems for high-end computing systems and cyber-physical systems.

4:45 - 5:00

Open MIC Session

6:00 - 9:30

Banquet Dinner

Wednesday, August 21

7:45 - 8:30

Registration and Continental Breakfast

Abstract

The EPFL Blue Brain Project (BBP) has been pushing the boundaries of the size, complexity and biological faithfulness of brain tissue simulations. A data-driven software pipeline is used to digitally reconstruct brain tissue that faithfully reproduces an array of laboratory experiments. To enable this, the Blue Brain Project operates a dedicated computing system (BB5) consisting of different computing and storage elements (Intel KNLs, NVIDIA GPUs, Intel CPUs, DDN IME). In this talk we present the role of MPI in different software pipelines, including circuit building, simulation, 3-D visualization and large-scale analysis. We especially focus on how the NEURON simulator is being optimised for large-scale simulations using the latest compiler technologies and MPI stacks (from the vendor and the MVAPICH2 team).


Bio

Pramod Kumbhar

Pramod Kumbhar is an HPC Architect in the Computing Division of the Blue Brain Project. His focus is on the development of the NEURON/CoreNEURON simulator within the Blue Brain Project. Over the years Pramod has been working on the parallelisation, performance optimisation and scaling of scientific codes on various supercomputing architectures. Pramod has strong hands-on experience with a variety of performance analysis tools at scale and with micro-architecture-level performance tuning. He also has a keen interest in domain-specific languages (DSLs) and modern compiler technologies. Before joining the Blue Brain Project, Pramod worked at the Jülich Research Centre, Germany.

10:30 - 11:00

Break

Abstract

AI Bridging Cloud Infrastructure (ABCI) is the world's first large-scale Open AI Computing Infrastructure, constructed and operated by the National Institute of Advanced Industrial Science and Technology (AIST), Japan. It delivers 19.9 petaflops of HPL performance and, as of July 2019, the world's fastest training time of 1.17 minutes for ResNet-50 training on the ImageNet dataset. ABCI consists of 1,088 compute nodes, each equipped with two Intel Xeon Gold Scalable processors, four NVIDIA Tesla V100 GPUs, two InfiniBand EDR HCAs and an NVMe SSD. ABCI offers a sophisticated high-performance AI development environment built on CUDA, Linux containers, an on-demand parallel filesystem, and MPI (including MVAPICH). In this talk, we focus on ABCI's network architecture and the communication libraries available on ABCI, and show their performance and recent research achievements.


Bio

Shinichiro Takizawa

Shinichiro Takizawa, Ph.D., is a senior research scientist in the AI Cloud Research Team at the AI Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Japan. His research interests are data processing and resource management on large-scale parallel systems. He also works as a member of the AI Bridging Cloud Infrastructure (ABCI) operation team and designs future ABCI services. Shinichiro Takizawa received his Ph.D. in Science from the Tokyo Institute of Technology in 2009.

12:15 - 1:15

Lunch

Abstract

KISTI's Nurion supercomputer features 8,305 nodes with Intel Xeon Phi KNL (Knights Landing) processors (68 cores) and 132 nodes with Intel Skylake CPUs (2-socket, 40 cores). Nurion is a system consisting of compute nodes, CPU-only nodes, Omni-Path interconnect networks, burst-buffer high-speed storage, a Lustre-based parallel file system, and water-cooling devices based on the Rear Door Heat Exchanger (RDHx). We will present microbenchmark and application performance results using MVAPICH on the KNL nodes.


Bio

Minsik Kim

Minsik Kim is a researcher in the Supercomputing Infrastructure Center of the Korea Institute of Science and Technology Information (KISTI). He received his Ph.D. degree in Electrical and Electronic Engineering from Yonsei University in 2019. His research interests include neural network optimization on GPUs, computer architecture, and high-performance computing. He is a member of IEEE. More details about Dr. Kim are available at http://minsik-kim.github.io.

Abstract

SDSC supports HPC and Deep Learning applications on systems featuring K80, P100, and V100 GPUs. On the NSF-funded Comet cluster there are primarily two types of GPU nodes: 1) 36 nodes with Intel Haswell CPUs (2-socket, 24 cores) with four NVIDIA K80 GPUs (two accelerator cards) each, and 2) 36 nodes with Intel Broadwell CPUs (2-socket, 28 cores) with four NVIDIA P100 GPUs each. Additionally, one node with four V100 GPUs is available for benchmarking and testing. Some of the deep learning applications are supported via the Singularity containerization solution. Application testing and performance results using MVAPICH2-GDR on the various types of nodes and with containerization will be presented.
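
As background on what MVAPICH2-GDR enables, the sketch below (an assumed example, not SDSC's benchmark) passes CUDA device pointers directly to MPI calls from C, which a CUDA-aware MPI build supports without explicit host staging; the MV2_USE_CUDA setting mentioned in the comment is taken from MVAPICH2 documentation and should be verified for your build.

```c
/* gpu_send.c - minimal sketch (assumed example, not SDSC's benchmark).
 * With a CUDA-aware MPI such as MVAPICH2-GDR, device pointers can be passed
 * directly to MPI calls, avoiding explicit staging through host buffers.
 * Build with nvcc, or mpicc plus the CUDA include path and -lcudart; CUDA
 * support is typically enabled at run time (e.g. MV2_USE_CUDA=1 -- verify
 * the exact setting for your MVAPICH2-GDR build).
 */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define N (1 << 20)   /* 1M floats */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    float *d_buf = NULL;
    cudaMalloc((void **)&d_buf, N * sizeof(float));
    cudaMemset(d_buf, 0, N * sizeof(float));

    /* The device pointer goes straight into MPI; MVAPICH2-GDR moves the data
     * over the fabric (using GPUDirect RDMA where available). */
    if (rank == 0)
        MPI_Send(d_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (rank == 1)
        printf("received %d floats directly into GPU memory\n", N);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```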


Bio

Mahidhar Tatineni

Mahidhar Tatineni received his M.S. and Ph.D. in Aerospace Engineering from UCLA. He currently leads the User Services group at SDSC. He has led the deployment and support of high performance computing and data applications software on several NSF and UC resources, including Comet and Gordon at SDSC. He has worked on many NSF-funded optimization and parallelization research projects, such as petascale computing for magnetosphere simulations, MPI performance tuning frameworks, hybrid programming models, topology-aware communication and scheduling, big data middleware, and application performance evaluation using next-generation communication mechanisms for emerging HPC systems.

2:40 - 3:15

Break

Abstract

The OSU InfiniBand Network Analysis and Monitoring (OSU INAM) tool has been running on OSC’s production systems for several months. In this talk we’ll give an overview of OSC’s HPC environment, IB fabric and our INAM deployment. It will include a discussion of OSC’s INAM configuration and improvements to the scalability of fabric discovery and optimization of database insertion/query rates resulting from our deployment. We'll also discuss integration with OSC's Torque/MOAB resource management and early experiences in analysis of job communication characteristics. At the end of the talk we'll give a short demo of INAM at OSC.


Bio

Karen Tomko and Heechang Na

Karen Tomko is the Director of Research Software Applications and serves as manager of the Scientific Applications group at the Ohio Supercomputer Center where she oversees deployment of software for data analytics, modeling and simulation. Her research interests are in the field of parallelization and performance improvement for High Performance Computing applications. She has been with OSC since 2007 and has been collaborating with DK Panda and the MVAPICH team for about 10 years.