MUG'25

(Preliminary Program)

All Times Are U.S. EDT

Bale Theater at Ohio Supercomputer Center

Monday, August 18, 2025

Abstract

NVIDIA’s network endpoints provide asynchronous engines with computational capabilities in close proximity to the network. These include the Data Path Accelerator (DPA), part of the ConnectX network core, and the BlueField Data Processing Unit (DPU) family of system-on-a-chip devices. Together they provide the ingredients needed for moving communication-algorithm management from the host to these devices. NVIDIA’s DOCA environment provides run-time support for DPU- and DPA-offloaded algorithms. In this presentation we will describe these capabilities and how to access them.


Bio

Richard Graham

Dr. Richard Graham is a Senior Director in NVIDIA's Networking Business Unit. His primary focus is on HPC and AI network software and hardware capabilities for current and future HPC and AI technologies. Prior to moving to Mellanox/NVIDIA, Rich spent thirteen years at Los Alamos National Laboratory and Oak Ridge National Laboratory, in computer science technical and administrative roles, with a technical focus on communication libraries and application analysis tools. He is a co-founder of the Open MPI collaboration and was chairman of the MPI 3.0 standardization efforts.

Abstract

In this tutorial we will demonstrate a shared, network-attached memory pool solution powered by Enfabrica’s Accelerated Compute Fabric (ACF) device. With the explosive growth of LLMs in recent years, the AI research and development community constantly faces challenges stemming from the limited memory available to GPUs for holding model data, KV-cache values, and tokens. These challenges are being addressed in a number of ways, including the use of a centralized flash storage system acting as a higher-tier cache available to multiple GPUs. Typically, these systems act as slow backup storage behind much faster local DRAM. Enfabrica’s Memory Pool solution removes these bottlenecks by aggregating large amounts of CXL memory with fully shared RDMA connectivity running at 3.2 Tb/s.

We will show how this high-speed shared memory pool can be leveraged for KV cache offload, accelerating inference processing in a multi-conversational AI environment using standard vLLM and LMCache frameworks with the Enfabrica plugin.
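As a conceptual illustration of why a shared KV-cache tier helps multi-conversational inference, the sketch below models a pool keyed by token prefixes: a second conversation that shares a system prompt reuses the cached entry instead of recomputing an expensive prefill. This is a minimal plain-Python stand-in; the pool class, hashing scheme, and `prefill` helper are invented for illustration and are not Enfabrica, vLLM, or LMCache APIs.

```python
import hashlib

class SharedKVPool:
    """Toy stand-in for a network-attached KV-cache pool shared by many GPUs.

    Keys are hashes of token prefixes; values stand in for the KV tensors
    that attention layers would otherwise have to recompute.
    """
    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(tokens):
        return hashlib.sha256(" ".join(tokens).encode()).hexdigest()

    def lookup(self, tokens):
        k = self._key(tokens)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        self.misses += 1
        return None

    def offload(self, tokens, kv):
        self._store[self._key(tokens)] = kv

def prefill(pool, tokens):
    """Return (source, kv) for a prompt, reusing the longest cached prefix."""
    for n in range(len(tokens), 0, -1):   # longest prefix first
        kv = pool.lookup(tokens[:n])
        if kv is not None:
            return "cache", kv
    kv = f"kv({' '.join(tokens)})"        # pretend this is an expensive GPU prefill
    pool.offload(tokens, kv)
    return "compute", kv

pool = SharedKVPool()
system = ["you", "are", "a", "helpful", "assistant"]
src1, _ = prefill(pool, system)           # first conversation: must compute
src2, _ = prefill(pool, system)           # second conversation: shared-prefix hit
print(src1, src2)                         # compute cache
```

In the real system the pool would hold actual KV tensors in CXL memory reachable over RDMA; the point of the sketch is only the reuse pattern across conversations.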


Bio

Boris Shpolyansky

Boris is the VP of Customer Engineering at Enfabrica, responsible for leading technical development and engineering support for the company's customer engagements, from initial architectural discussions, through technical evaluations, customer system design, and integration processes, to final production deployments and post-production support. Prior to joining Enfabrica, Boris was a key leader for 6 years at Pensando Systems, including after its acquisition by AMD, during which he formed and led their Customer Engineering team, building effective processes and assets for customer training, technical support, RMA and failure analysis. Previously he spent 19 years at Mellanox in multiple senior chip, software development, solutions, and field applications roles.

Boris holds an MS degree from the Technion - Israel Institute of Technology and has one issued patent.

10:00 - 10:30

Coffee Break, Posters and Demos

  • Design and Implementation of a GPU-Aware MPI Collective Library for Intel GPUs, Chen-Chun Chen, (Presented by Goutham Kuncham), OSU
  • Unified Designs of Multi-rail-aware MPI Allreduce and Alltoall Operations Across Diverse GPU and Interconnect Systems, Chen-Chun Chen, OSU
  • Low-Power Hybrid Analog-Digital Acceleration on Edge-Class RISC-V Platforms, Cameron Durbin, University of Oregon
  • Water Footprint Modeling, Characterization, and Analysis Toward Water-aware HPC System Design and Operations, Yankai Jiang, Northeastern University
  • Dynamic Sparsification and Comparative Analysis of KV Cache Management for Large-Scale LLM Inference, Oteo Mamo, Florida State University
  • Using BlueField-3 SmartNICs for offloading 1-sided communication, Ben Michalowicz, OSU
  • ANN-to-SNN Conversion: Enabling Energy-Efficient Machine Learning for Edge Devices, Asiful Hoque Prodhan, UTSA
  • A Survey of Scheduling Policies in High-Performance Computing Systems, Kausalya Sankaranarayanan, Northeastern University
  • Enhancing Earthquake Simulation Performance and Efficiency Through GPU-Aware Memory Management and Compression Using MVAPICH, Shijie Wang, Tom Zhang, UCSD
  • Unmasking Performance Variability in GPU Codes on Production Supercomputers, Cunyang Wei, University of Maryland
  • Design and Implementation of MPI Collective Operations for Large Message Communication on AMD GPUs, Lang Xu, OSU
  • Exploiting Inter-Layer Expert Affinity for Mixture-of-Experts Model Inference, Jinghan Yao, OSU

Abstract

HPC and AI clusters increasingly leverage packet-switched Ethernet/IP-based architectures. The benefits of leveraging commodity technology come with their own set of challenges. We trace the challenges encountered when high-throughput, latency-sensitive workloads run over a lossy underlay (such as Ethernet/IP) and the solutions the industry has adopted over the last decade to overcome them. We present KAI DC Builder, a tool used successfully to validate, characterize, and tune these fabrics.


Bios

Alex Bortok

Alex Bortok is a lead product manager at Keysight Technologies, overseeing the development of the product portfolio for AI data center infrastructure validation since 2021. He is a contributor to the Ultra Ethernet Consortium compliance and performance working groups, has publications on AI infrastructure benchmarking methodologies, and is a regular speaker at technology forums.

Alex has experience developing proposals for the Open Compute Project, Open Traffic Generator, OpenConfig, and SONiC projects. He is an advocate of test-driven network automation and network CI/CD and has made multiple contributions to the North American Network Operators Group (NANOG), serving on its Education committee, leading Hackathon teams, and evangelizing applications of network emulation and testing in network automation.

Ankur Sheth

Ankur Sheth is Senior Director for Strategic Projects at Keysight Technologies. In his current role he oversees the Network Test group’s AI initiatives including KAI DC Builder. With more than two decades of experience in the networking industry, he brings a unique perspective shaped by his expertise across engineering, product management and product marketing. His passion for networking drives him to create breakthrough products for new markets with a strong commitment to placing customer needs at the center of every decision.

Ankur holds a Master’s in Business Administration from the Indian Institute of Management, Ahmedabad, and a Master of Science degree in Electrical Engineering (Computer Networks) from the University of Southern California.

Abstract

The talk will present an HPC-AI environment on AWS, Google Cloud, Microsoft Azure, and Oracle Cloud Infrastructure, based on MVAPICH MPI with support for GPUs. ParaTools Pro for E4S(TM) is a cloud image that includes NVIDIA NeMo(TM) and NVIDIA BioNeMo(TM) optimized for GPUs and integrates a performant remote desktop based on Adaptive Computing’s Heidi AI/ODDC. It includes MVAPICH MPI and supports both SLURM- and Torque-based schedulers. On AWS, it supports x86, aarch64, and Trainium and Inferentia nodes.

E4S is a curated, Spack-based software distribution of 100+ HPC, EDA, and AI/ML packages. It features AI tools such as TensorFlow, PyTorch, NVIDIA NeMo, NVIDIA BioNeMo, vLLM, the Hugging Face CLI, JAX, OpenAI, and a chatbot based on Google's Gemini API, along with supporting tools including LangChain, pandas, and scikit-learn. It supports AWS’ EFA, Google's IPUs, and InfiniBand on Azure with the optimized MVAPICH MPI distribution. It includes Codium (an IDE), Jupyter notebooks, and visualization tools such as VisIt and ParaView, all launched from a web browser without installing any additional software.

This multi-user, multi-node, multi-GPU cloud image uses E4S and Spack as the core components for product integration and deployment of a range of HPC and AI/ML tools, including performance evaluation tools such as TAU, HPCToolkit, DyninstAPI, and PAPI, and supports both bare-metal and containerized deployment for CPU and GPU platforms. Container runtimes featured in the image include Docker, Singularity, and Charliecloud. E4S is a community effort to provide open-source software packages for developing, deploying, and running scientific applications and tools on HPC platforms. It has built a comprehensive, extensible, coherent software stack that enables application developers to productively develop highly parallel applications that effectively target commercial cloud platforms.


Bio

Sameer Shende

Sameer Shende serves as a Research Professor and the Director of the Performance Research Laboratory at the University of Oregon and the President and Director of ParaTools, Inc. (USA) and ParaTools, SAS (France). He is the lead developer of the Extreme-scale Scientific Software Stack (E4S), the TAU Performance System, the Program Database Toolkit (PDT), and HPC Linux. His research interests include scientific software stacks, performance instrumentation, compiler optimizations, measurement, and analysis tools for HPC. He served as the General Co-Chair for ICPP 2021 and as the General Co-Chair of Euro-Par'24. He was the vice chair for technical papers at SC22 and chaired the Performance Measurement, Modeling, and Tools track at SC17. He received his B.Tech. in Electrical Engineering from IIT Bombay in 1991, and his M.S. and Ph.D. in Computer and Information Science from the University of Oregon in 1996 and 2001, respectively.

12:00 - 1:30

Lunch Break, Posters and Demos (Cont'd)

Abstract

The tutorial will start with an overview of the MVAPICH and MVAPICH-Plus libraries, the OSU Micro-Benchmarks (OMB) suite, and their features. Next, we will focus in depth on installation guidelines, runtime optimizations, and the tuning flexibility of these libraries. An overview of configuration and debugging support in the MVAPICH libraries will be presented. High-performance support for NVIDIA/AMD/Intel GPU-enabled clusters in MVAPICH-Plus, with on-the-fly compression support, will be highlighted. Performance and scalability of the MVAPICH libraries for a range of applications on a set of leading systems will be presented.


Bios

Nat Shineman

Nat Shineman is a software engineer in the Department of Computer Science and Engineering at the Ohio State University. His current development work includes high performance interconnects, parallel computing, scalable startup mechanisms, and performance analysis and debugging of the MVAPICH2 library.

Ben Michalowicz

Benjamin Michalowicz is a PhD student at the Ohio State University under Prof. DK Panda and Prof. Hari Subramoni in the Network-Based Computing Laboratory. His research interests include high-performance computing (HPC), parallel/computer architectures, network-based computing for HPC, security in HPC, and parallel programming environments. Specifically, he is interested in efficiently offloading parallel programming models and computational workloads to smart network cards such as NVIDIA's BlueField DPUs. Ben actively contributes to the MVAPICH software and is a student member of the ACM and IEEE. Contact him at michalowicz.2@osu.edu.

Abstract

The field of Deep Learning (DL) has witnessed remarkable advances in recent years, paving the way for cutting-edge technologies and leading to exciting challenges and opportunities. Modern DL frameworks, like PyTorch, have emerged to offer high-performance training and deployment for various types of Deep Neural Networks (DNNs). This tutorial provides an overview of recent trends in DL, leveraging powerful hardware architectures, interconnects, and distributed frameworks to accelerate both the training of DNNs and the associated inference techniques.

We present an overview of different DNN architectures, focusing on parallelization strategies for model training and optimization for scalable inference. We highlight new challenges and opportunities for communication runtimes to exploit high-performance CPU/GPU architectures to support large-scale distributed training and inference. We present MPI-driven solutions (using the MVAPICH-Plus library) to provide large-scale DNN training and inference on modern HPC clusters in a vendor-neutral manner.
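The data-parallel core of MPI-driven training can be summarized in a few lines: each rank computes gradients on its own shard of the batch, then an allreduce averages them so every rank applies the same weight update. The sketch below simulates that step in plain Python with no MPI runtime; in a real job the averaging would be a single MPI allreduce carried out by the MPI library (e.g. MVAPICH-Plus underneath a framework such as PyTorch), and the per-rank gradient values here are invented for illustration.

```python
def allreduce_mean(grads_per_rank):
    """Average per-rank gradients, as an MPI sum-allreduce divided by the
    number of ranks would.

    grads_per_rank: one equal-length gradient list per rank.
    Returns the averaged gradient that every rank would end up holding.
    """
    nranks = len(grads_per_rank)
    return [sum(col) / nranks for col in zip(*grads_per_rank)]

# Each "rank" computed gradients on its own shard of the global batch.
rank_grads = [
    [0.25, -0.5, 1.0],   # rank 0
    [0.75, -0.5, 0.0],   # rank 1
]
avg = allreduce_mean(rank_grads)
print(avg)  # [0.5, -0.5, 0.5] -- identical on every rank after the allreduce
```

Because every rank holds the same averaged gradient afterwards, the model replicas stay in sync without any parameter server; the communication runtime's job is to make that one collective fast at scale.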


Bios

Nawras Alnaasan

Nawras Alnaasan is a Graduate Research Associate at the Network-Based Computing Laboratory, Columbus, OH, USA. He is currently pursuing a Ph.D. degree in computer science and engineering at The Ohio State University. His research interests lie at the intersection of deep learning and high-performance computing. He works on advanced parallelization techniques to accelerate the training of Deep Neural Networks and exploit underutilized HPC resources covering a wide range of DL applications including supervised learning, semi-supervised learning, and hyperparameter optimization. He is actively involved in several research projects including HiDL (High-performance Deep Learning) and ICICLE (Intelligent Cyberinfrastructure with Computational Learning in the Environment). Alnaasan received his B.S. degree in computer science and engineering from The Ohio State University. Contact him at alnaasan.1@osu.edu.

Jinghan Yao

Jinghan Yao is a Graduate Research Associate at The Ohio State University, specializing in high-performance computing (HPC) and large-scale communication optimization for AI training and inference. His research develops efficient communication runtimes and libraries aimed at accelerating foundation model workloads. Jinghan has published papers at leading conferences including MLSys, IPDPS, and NeurIPS, and has presented his work at NVIDIA GTC 2024 and 2025. He collaborates closely with industry research teams such as Microsoft DeepSpeed to advance scalable communication techniques for AI.

3:00 - 3:30

Coffee Break, Posters and Demos (Cont'd)

Abstract

The growing demand for high-performance computing (HPC) and AI training workloads is driving the evolution of Ethernet fabrics to deliver ultra-low latency, high throughput, and lossless communication. This session presents Broadcom’s differentiated Ethernet solutions for HPC and AI clusters, leveraging two architectural models: switch-scheduled fabrics and endpoint-scheduled fabrics.

In the switch-scheduled approach, the network fabric plays a central role in traffic management, including adaptive load balancing, multi-path forwarding, and congestion control, while the endpoints remain simple, supporting RDMA-based communication over RoCEv2. This model offloads complexity from the host and enables highly scalable deployments. Conversely, the endpoint-scheduled model shifts control to the NIC or XPU, which performs spraying, congestion management, and retransmission, while the switch provides assistive functions such as ECN marking, packet trimming, and telemetry. This approach enables granular, flow-aware behavior tailored to specific workload needs.

We also introduce Broadcom’s scale-up Ethernet fabric enhancements, purpose-built for tightly coupled AI/ML systems. These include Link Layer Retry (LLR) for lossless transmission, Credit-Based Flow Control (CBFC) for headroom-efficient flow regulation, and optimized packet headers that reduce protocol overhead and latency. These features are implemented in merchant silicon to preserve the openness and interoperability of Ethernet. We will present how the Tomahawk and Jericho switch families address HPC and AI scale-out and scale-up solutions.
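To make the credit-based flow control idea concrete: a sender may transmit only while it holds credits, each credit corresponding to a guaranteed receive buffer, and the receiver returns a credit as each buffer drains. The toy model below is plain Python with invented names, not a Broadcom API; it only shows how credits keep traffic from ever overrunning receive headroom.

```python
from collections import deque

class CreditLink:
    """Toy credit-based flow control: one credit == one receive buffer."""
    def __init__(self, rx_buffers):
        self.credits = rx_buffers          # credits initially equal RX headroom
        self.rx_queue = deque()
        self.blocked = 0

    def send(self, pkt):
        if self.credits == 0:
            self.blocked += 1              # a real sender stalls here; nothing
            return False                   # is ever put on the wire to be lost
        self.credits -= 1                  # consume one credit per packet in flight
        self.rx_queue.append(pkt)
        return True

    def drain(self):
        """Receiver consumes one packet and returns its credit to the sender."""
        if self.rx_queue:
            self.rx_queue.popleft()
            self.credits += 1

link = CreditLink(rx_buffers=2)
sent = [link.send(p) for p in ("p0", "p1", "p2")]   # third send finds no credit
link.drain()                                        # buffer freed, credit returned
sent.append(link.send("p2"))                        # retry now succeeds
print(sent)  # [True, True, False, True]
```

Because the sender can never have more packets in flight than the receiver has buffers, losslessness is achieved without drops or pause storms, which is the headroom-efficiency argument the abstract makes for CBFC.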


Bio

Mohan Kalkunte

Mohan Kalkunte is the Vice President of Architecture & Technology in the Core Switch Products group at Broadcom. He leads the architecture for Broadcom’s network switching and NIC products, including Trident, Tomahawk, Jericho, and Thor, across the Enterprise, Data Center, and Service Provider markets. With over 35 years of experience, Mohan holds a Ph.D. in Engineering from The Ohio State University. He began his career at AT&T Bell Laboratories, later working at AMD, Nortel Networks, and Maverick Networks, which was acquired by Broadcom in 1999. Mohan holds over 150 patents and was elected a Broadcom Fellow in 2009, an IEEE Fellow in 2013, and to the National Academy of Engineering in 2025 for his contributions to Ethernet switching. He co-authored Gigabit Ethernet: Migrating to High-Bandwidth LANs and pioneered key innovations in Ethernet switch technology.

4:15 - 5:00

Tutorial: Meta

5:30 - 6:30

Visit to the State of Ohio Computer Center, SOCC (Optional)

6:30

Reception and Dinner at Endeavor Brewing and Spirits

909 W 5th Ave, Columbus, OH 43212

Tuesday, August 19, 2025

8:20 - 8:30

Opening Remarks

David Hudak, Ohio Supercomputer Center and Dhabaleswar K (DK) Panda, The Ohio State University

Bio

Dan Stanzione

Dr. Dan Stanzione, Associate Vice President for Research at The University of Texas at Austin and Executive Director of the Texas Advanced Computing Center (TACC), is a nationally recognized leader in high performance computing, and has been involved in supercomputing for more than 30 years. He is the principal investigator (PI) for a number of the National Science Foundation (NSF) supercomputers, including the current Frontera system, which is the fastest supercomputer at a U.S. university, and is leading the upcoming NSF Leadership Class Computing Facility. Stanzione received his bachelor's degree in electrical engineering and his master's degree and doctorate in computer engineering from Clemson University.

Abstract

This talk will provide an overview of the MVAPICH project (past, present, and future). The future roadmap and features for upcoming releases of the MVAPICH and MVAPICH-Plus libraries will be presented. The current status and future plans of the OSU Micro-Benchmarks (OMB) suite will also be presented.


Bio

Dhabaleswar K Panda

DK Panda is a Distinguished Professor of Engineering and University Distinguished Scholar at the Ohio State University. He has published over 500 papers in the area of high-end computing and networking. The MVAPICH (High-Performance MPI over InfiniBand, iWARP, RoCE, Omni-Path, EFA, Rockport Networks, and Slingshot) libraries, designed and developed by his research group (mvapich.cse.ohio-state.edu), are currently being used by more than 3,450 organizations worldwide (in 92 countries). More than 1.92 million downloads of this software have taken place from the project's site. This software is empowering several InfiniBand clusters (including the 21st, 67th, and 88th ranked ones) in the TOP500 list. High-performance and scalable solutions for deep learning and machine learning from his group are available from hidl.cse.ohio-state.edu. High-performance and scalable libraries for Big Data stacks (Spark, Hadoop, and Memcached) and Data science applications from his group (hibd.cse.ohio-state.edu) are also publicly available. These libraries are currently being used by more than 370 organizations in 39 countries. More than 51,000 downloads of these libraries have taken place. He is a Fellow of ACM and IEEE, a recipient of 2022 IEEE Charles Babbage Award, and a recipient of the 2024 IEEE TCPP Outstanding Service and Contributions Award. More details about Prof. Panda are available at cse.ohio-state.edu/~panda.

10:15 - 10:45

Coffee Break, Posters and Demos (Cont'd)

Abstract

NVIDIA networking technologies are designed for AI training and inference at scale. In-network computing, high effective bandwidth, and noise-isolation capabilities have facilitated the creation of larger and more complex foundation models. We'll dive deep into the recent technology announcements and their essential roles in next-generation AI data center designs.


Bio

Richard Graham

Dr. Richard Graham is a Senior Director in NVIDIA's Networking Business Unit. His primary focus is on HPC and AI network software and hardware capabilities for current and future HPC and AI technologies. Prior to moving to Mellanox/NVIDIA, Rich spent thirteen years at Los Alamos National Laboratory and Oak Ridge National Laboratory, in computer science technical and administrative roles, with a technical focus on communication libraries and application analysis tools. He is a co-founder of the Open MPI collaboration and was chairman of the MPI 3.0 standardization efforts.

Abstract

Modern systems built for computation-intensive and highly distributed workloads pose very demanding connectivity requirements. Enfabrica’s Accelerated Compute Fabric (ACF) device is designed to address those requirements in a unique way, aggregating host-side and network-side communication at large scale and presenting a uniform data transfer model across both domains. In this presentation we will dive into the ACF architecture, its unique capabilities, and its system-level benefits, and discuss various applications and use cases of Enfabrica’s ground-breaking technology, with a special focus on AI/ML applications.


Bio

Shrijeet Mukherjee

Shrijeet is co-founder and Chief Development Officer of Enfabrica, where he comprehensively defines and drives the architecture, engineering and innovation pipeline across software, hardware, and systems. Prior to founding Enfabrica, he was an architect in Google’s network infrastructure group. Previously he was VP Engineering at Cumulus building software to revolutionize Open Networking. At Cisco UCS he was Director of Architecture and Software working on the NIC and virtualization systems later known as DPUs. At SGI, he was part of the Advanced Graphics team that invented floating point framebuffers and programmable shaders that underpin today’s GPUs and accelerators.

Shrijeet is on the Linux NetDev Society Board of Directors and has 64 issued patents. He holds an MS in Computer Science from the University of Oregon.

Abstract

The performance of Artificial Intelligence (AI), Machine Learning (ML), and high-performance computing (HPC) applications is determined by job completion times. At large scale, network performance plays a critical role in job completion time. RDMA over Converged Ethernet (RoCE) is a widely used transport protocol in AI/HPC cluster deployments. RoCE enables high-speed data communication between applications by providing low-overhead direct access to the RoCE Network Interface Controller (NIC) and enabling direct data transfer between application memories without requiring any intermediate copies. Running RoCE over a large-scale AI/HPC Ethernet network requires a set of enhancements for efficient utilization of network paths and reliable packet delivery. In this talk, we will describe RoCE enhancements including multi-path support, out-of-order data placement, selective retransmissions, selective acknowledgements, and congestion control for large-scale Ethernet networks.
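Two of the listed enhancements, out-of-order data placement and selective acknowledgements, can be illustrated with a toy receiver: chunks arriving out of order over multiple paths are written directly to their final offset in application memory (no reorder buffer, no drop), and the acknowledgement reports exactly which chunks are missing so only those are retransmitted. The Python sketch below is purely conceptual; the class and method names are invented for illustration and are not a RoCE or NIC API.

```python
class OooReceiver:
    """Toy receiver: out-of-order direct data placement + selective acks."""
    def __init__(self, nchunks, chunk_size):
        self.buf = bytearray(nchunks * chunk_size)  # application memory region
        self.chunk_size = chunk_size
        self.nchunks = nchunks
        self.received = set()

    def deliver(self, seq, payload):
        # Place data at its final offset immediately, even if earlier
        # chunks are still missing -- no reorder buffer, nothing discarded.
        off = seq * self.chunk_size
        self.buf[off:off + len(payload)] = payload
        self.received.add(seq)

    def sack(self):
        """Selective ack: report exactly which chunks still need resending."""
        return sorted(set(range(self.nchunks)) - self.received)

rx = OooReceiver(nchunks=4, chunk_size=4)
# Chunks arrive out of order over multiple paths; chunk 1 is lost in transit.
for seq, data in [(3, b"DDDD"), (0, b"AAAA"), (2, b"CCCC")]:
    rx.deliver(seq, data)
missing = rx.sack()
print(missing)                   # [1] -- only the lost chunk is retransmitted
rx.deliver(1, b"BBBB")           # retransmission fills the gap
print(bytes(rx.buf))             # b'AAAABBBBCCCCDDDD'
```

Contrast this with go-back-N semantics, where one lost packet forces retransmission of everything after it; per-chunk placement plus selective acks is what makes packet spraying over multiple paths affordable.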


Bio

Hemal Shah

Hemal Shah is a Distinguished Engineer and Systems/Software/Standards architect in the Core Switching Group (CSG) at Broadcom Inc. He is responsible for the definition of the system/software architecture and roadmap of Ethernet NIC product lines, and is one of the key architects of end-to-end Ethernet networking solutions for AI, HPC, storage, and cloud infrastructures. Hemal spearheaded the development of stateless offloads, virtualization, SR-IOV, QoS, RoCE, vSwitch offload, management, and security features of Broadcom NICs, and led the architecture definition of several generations of the NetXtreme® E-Series/NetXtreme I server product lines and NetXtreme I client product lines. He has defined the system architecture of RDMA hardware/software solutions for more than two decades. Before joining Broadcom in 2005, Hemal worked at Intel Corporation, where he led the development of the system architecture of communication processors, 10G Ethernet controllers, and TCP/iSCSI/RDMA/security offloads. Hemal is the lead technical representative/contributor from Broadcom Inc. in the Open Compute Project (OCP) and DMTF, serving as Senior VP of Technology in the DMTF and as a project co-lead of the OCP Hardware Management project. He has co-authored several OCP specifications, 70+ DMTF specifications, four IETF RFCs, 10+ technical conference/journal papers, and 40+ patents. With over 28 years of experience, Hemal holds Ph.D. (computer engineering) and M.S. (computer science) degrees from Purdue University, an M.S.E.E. degree from The University of Arizona, and a B.S. (electronics and communication engineering) degree from Gujarat University, India.

12:15 - 12:30

Group Photo

12:30 - 1:30

Lunch Break, Posters and Demos (Cont'd)

Abstract

The Ultra Ethernet Consortium has released its first standard defining a new network stack to support AI and HPC. Drawing heavily on state-of-the-art technologies, the Ultra Ethernet Specification v1.0 provides a revolutionary integration of HPC and AI network technologies, laying the groundwork for high-performance networking for decades to come. What is less well known is the journey it took to get here. From the first predecessors of the semantics in Portals over 30 years ago to the integration of ephemeral reliability state in Slingshot that enabled the first exascale system, we explore the path from HPC to the Ultra Ethernet Transport. If Ultra Ethernet had only targeted HPC, that would be the whole story; however, Ultra Ethernet targets AI deployments in the lossy Ethernet environments of large-scale commercial datacenters. We will go on to explore how the datacenter environment shaped the protocol decisions to be what they are today.


Bio

Keith Underwood

Keith Underwood is a Senior Distinguished Technologist at HPE. He was the author of the Ultra Ethernet Transport Semantic specification and currently serves as the editor of both the Semantic and the Congestion Management sections. He also leads architectural work on the next-generation Slingshot NIC at HPE. Prior to joining HPE, Keith worked at Sandia National Laboratories and then Intel, where he helped drive the hardware/software co-design between the architecture that eventually became BXI and Portals 4. He then led NIC architecture for the Omni-Path program at Intel.

Abstract

As quantum computing systems mature and make their way out of laboratories into production computing environments, we must also rethink the software environments they need. For one, we must transition from supporting individual physics researchers who are experts in quantum science to a wide range of user communities in the various science disciplines that want to use the power of quantum computing as an HPC accelerator; at the same time, we must ensure proper integration into HPC workflows, schedulers, system software, and programming environments. This talk will highlight the opportunities and challenges associated with this transition and will introduce the Munich Quantum Software Stack, developed as part of the Munich Quantum Valley (MQV) initiative, with which we tackle these challenges for the upcoming quantum systems deployed by MQV, EuroHPC, and beyond.


Bio

Martin W.J. Schulz

Martin Schulz is a Full Professor and Chair for Computer Architecture and Parallel Systems at the Technische Universität München (TUM), which he joined in 2017, as well as a member of the board of directors at the Leibniz Supercomputing Centre. Prior to that, he held positions at the Center for Applied Scientific Computing (CASC) at Lawrence Livermore National Laboratory (LLNL) and Cornell University. He earned his Doctorate in Computer Science in 2001 from TUM and a Master of Science in Computer Science from UIUC.

Martin's research interests include parallel and distributed architectures and applications; performance monitoring, modeling and analysis; memory system optimization; parallel programming paradigms; tool support for parallel programming; power-aware parallel computing; and fault tolerance at the application and system level, as well as quantum computing and quantum computing architectures and programming, with a special focus on HPC and QC integration.

Martin has published over 250 peer-reviewed papers and currently serves as the chair of the MPI Forum, the standardization body for the Message Passing Interface, one of the dominant standards in high-performance computing. He was a recipient of the ACM Gordon Bell Prize in 2006 and an R&D 100 Award in 2011. He has served on many conference and workshop organizing and program committees, including as program chair for ISC 2021, PC area chair at IPDPS 2021, and general chair of EuroMPI 2021.

Abstract

The TAU Performance System(R) is a versatile performance evaluation toolkit supporting both profiling and tracing modes of measurement. It supports performance evaluation of applications running on CPUs and GPUs, and it supports runtime preloading of a Dynamic Shared Object (DSO) that allows users to measure performance without modifying the source code or binary. This talk will describe how TAU may be used with MVAPICH to support advanced performance introspection at the runtime layer. TAU's support for tracking the idle time spent in implicit barriers within MPI collective operations will be demonstrated.

TAU also supports event-based sampling at the function, file, and statement level. Its support for runtime systems such as CUDA (for NVIDIA GPUs), Level Zero (for Intel oneAPI DPC++/SYCL), ROCm (for AMD GPUs), OpenMP (with support for OMPT and target offload directives), Kokkos, and MPI allows instrumentation at the runtime-system layer while using sampling to evaluate statement-level performance data. Recent advances include support for PC sampling on AMD and NVIDIA GPUs and access to hardware performance counters on GPUs.

The talk will cover TAU's support for key MVAPICH features, including the MPI Tools (MPI_T) interface and the ability to set MPI_T control variables on a per-communicator basis. It will also describe TAU's support for the MPI performance and control variables exported by MVAPICH, its instrumentation of the OpenMP runtime, and its APIs for instrumentation of Python programs. TAU uses these interfaces on unmodified binaries without the need for recompilation. The talk will describe new instrumentation techniques that simplify the use of performance tools, including an LLVM plugin for selective compiler-based instrumentation, tracking of the paths taken by a message, timing of synchronization costs in collective operations, binary rewriting, and preloading of shared objects.

The talk will also highlight TAU's analysis tools, including its 3D profile browser ParaProf and its cross-experiment analysis tool PerfExplorer, and their usage with MVAPICH MPI on Amazon AWS, Google Cloud, Azure, and OCI using the ParaTools Pro for E4S(TM) image.


Bio

Sameer Shende

Sameer Shende serves as a Research Professor and the Director of the Performance Research Laboratory at the University of Oregon and the President and Director of ParaTools, Inc. (USA) and ParaTools, SAS (France). He is the lead developer of the Extreme-scale Scientific Software Stack (E4S), the TAU Performance System, the Program Database Toolkit (PDT), and HPC Linux. His research interests include scientific software stacks, performance instrumentation, compiler optimizations, measurement, and analysis tools for HPC. He served as the General Co-Chair for ICPP 2021 and as the General Co-Chair of Euro-Par'24. He was the vice chair for technical papers at SC22 and chaired the Performance Measurement, Modeling, and Tools track at SC17. He received his B.Tech. in Electrical Engineering from IIT Bombay in 1991, and his M.S. and Ph.D. in Computer and Information Science from the University of Oregon in 1996 and 2001, respectively.

3:00 - 3:30

Coffee Break, Posters and Demos (Cont'd)

Abstract

This presentation will introduce C-DAC’s indigenously developed Trinetra network along with key features of its software stack. It will provide details on the integration of the MVAPICH4 MPI library, along with associated performance results. The talk will also present insights and outcomes from tuning configurable parameters within the Trinetra software stack. Additionally, preliminary experiments involving AI frameworks over Trinetra will be discussed.


Bio

Rakesh Kumar Yadav

Mr. Rakesh Kumar Yadav has been associated with the Centre for Development of Advanced Computing (C-DAC) since 2006. He is currently serving as a Scientist E in the HPC Technologies Group at C-DAC, where he leads several key initiatives in system software development, particularly for the Trinetra network. His work also involves the adaptation of MPI libraries for Trinetra. His primary areas of interest include High Performance Computing, parallel file systems, programming models, and performance tuning. Mr. Yadav holds a Bachelor of Engineering in Information Technology from Maharana Pratap College of Technology, Gwalior, India.

Abstract

Effective AI system design demands robust benchmarking and co-design. Chakra provides a unified graph schema for execution traces, detailing primitive operators, tensor objects, and their dependencies. This standardization is crucial for enabling the sharing of proprietary AI model details while maintaining intellectual property protection through obfuscation. We will delve into how these traces can be leveraged by simulators and replay tools to accelerate the development and deployment of next-generation distributed AI platforms.
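
As a rough illustration of what an execution-trace graph enables, the sketch below models a trace as operator nodes with explicit dependencies and derives a dependency-respecting replay order, which is essentially what simulators and replay tools consume. The node and field names are invented for illustration and do not follow Chakra's actual schema:

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter

@dataclass
class TraceNode:
    """One node of a toy execution trace: a primitive operator plus the
    ids of the nodes it depends on (illustrative, not Chakra's schema)."""
    node_id: int
    op: str                                   # e.g. "matmul", "allreduce"
    deps: list = field(default_factory=list)  # must complete before this op

def replay_order(nodes):
    """Return a dependency-respecting execution order for replay/simulation."""
    ts = TopologicalSorter({n.node_id: n.deps for n in nodes})
    return list(ts.static_order())

# A toy trace: two compute operators feeding one collective.
trace = [
    TraceNode(0, "matmul"),
    TraceNode(1, "matmul"),
    TraceNode(2, "allreduce", deps=[0, 1]),
]
print(replay_order(trace))  # a valid order, e.g. [0, 1, 2]
```

A simulator would walk such an order and charge compute or network cost per operator; obfuscation can then rename the `op` labels without disturbing the dependency structure.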


Bios

Dan Mihailescu

Dan Mihailescu is a senior Software Architect at Keysight Technologies based in Bucharest, Romania, with extensive experience in building networking and security verification systems. Dan currently leads teams at Keysight Technologies Romania focused on developing AI infrastructure emulation solutions and drives collaborations with academia and industry partners.

Winston Liu

Winston Liu is the Chief Architect at Keysight Technologies, leading the development of simulation and emulation solutions to evaluate AI cluster infrastructure. His extensive background includes cloud computing, network visibility, and creating network emulation and simulation testing systems. He has held key engineering leadership positions at Keysight, Ixia, and Xirrus. Winston holds Bachelor's and Master's degrees in Computer Science from UCLA.

4:30 - 5:15

Student Short Talks (OSU)

  • A Collective Streaming Interface for Scale-Out of Dataflow-centric Acceleration, Nick Contini, OSU
  • Performance Evaluation and Optimization of MVAPICH-Plus on SDSC Cosmos: Early Experience, Goutham Kuncham, OSU
  • OMB-Compr: An Extension to OSU Micro Benchmarks for Collective Compression Error Measurement, Jake Queiser, (Presented by Nick Contini), OSU
  • Characterizing Communication Patterns in Distributed Large Language Model Inference, Lang Xu, OSU
  • Training ultra long context language model with fully pipelined distributed transformer, Jinghan Yao, OSU

5:15 - 5:30

Open Mic Session

6:30 - 9:00

Banquet Dinner at Bravo Restaurant

1803 Olentangy River Rd, Columbus, OH 43212

Wednesday, August 20, 2025

8:30 - 9:30

Keynote Talk: Manoj Wadekar

Manoj Wadekar, Meta

Abstract

This presentation explores how Jump Trading designs HPC networks for a trading environment where performance and availability directly impact the bottom line. I'll examine how our network architecture balances performance with risk tolerance, accommodates exceptionally data-intensive workloads, and manages diverse computational demands. The talk will highlight how our unique design history has led to unconventional but effective architectural decisions that optimize for our specific business context in quantitative trading.


Bio

Shawn Hall

Shawn Hall is an HPC Production Engineer at Jump Trading. His experience is in large-scale system administration, having worked with high-performance computing clusters in industry and academia. He has worked on many aspects of large-scale systems, and his interests include parallel file systems, configuration management, performance analysis, system design, and security. Shawn holds B.S. and M.S. degrees in Electrical and Computer Engineering from The Ohio State University.

Abstract

This talk is about the ‘Supreme-K’ project, a significant milestone in South Korea’s pursuit of technological sovereignty in high-performance computing (HPC). As part of this national initiative, we designed and developed core components of an HPC system—including the AB21 accelerator chip, a customized compute node, and an optimized software stack—through a comprehensive hardware-software co-design approach. This talk presents the architecture and implementation of each component, describes the system-level integration and validation methodology, and outlines the future direction of the project toward broader applications and enhanced system scalability.


Bio

Yoomi Park

Dr. Park is a member of the Supercomputing System Research Section at ETRI, Daejeon, South Korea. She received her B.S. degree in computer science from Sookmyung Women's University, Seoul, South Korea in 1991, and her M.S. and Ph.D. degrees in Computer Engineering from Chungnam National University, Daejeon, South Korea in 1997 and 2010, respectively. Her research interests include high-performance computing, distributed deep learning training, and databases.

10:30 - 11:00

Coffee Break, Posters and Demos (Cont'd)

Abstract

Climate and weather modeling at high spatial and temporal resolution is increasingly central to decision-making in many fields. While traditional PDE-based solvers and AI models (e.g., transformers) currently dominate this domain, spatial statistical methods offer a robust framework with built-in uncertainty quantification and interpretable inference. Among these, Vecchia-based approximations provide a scalable path forward, but only when tightly coupled with HPC infrastructure. In this talk, we present recent advances in scalable spatial statistics, demonstrating how modern HPC systems can be leveraged to enable Vecchia approximations to operate at climate-relevant scales. We report results on datasets with millions of spatial points. These models introduce a new class of HPC workloads with fine-grained parallelism, irregular communication, and block-sparse computations. These patterns challenge traditional MPI capabilities and require enhancements in collective communication, hybrid parallelism (MPI+X), and memory-aware scheduling. Thus, MPI runtimes should evolve to efficiently support these workloads and enable the next generation of scalable, uncertainty-aware climate models.
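
For readers unfamiliar with the method, a Vecchia approximation replaces the dense joint Gaussian likelihood with a product of low-dimensional conditionals, each conditioning only on a small set of previously ordered neighbors, so the dominant cost drops from one large solve to many tiny ones. The numpy sketch below illustrates this on a tiny 1D example with an exponential kernel (where the approximation happens to be exact, since that process is Markov); it is a toy, not the HPC implementation discussed in the talk:

```python
import numpy as np

def exp_cov(x1, x2, ell=0.3):
    """Exponential (Ornstein-Uhlenbeck) covariance kernel; unit variance."""
    return np.exp(-np.abs(x1[:, None] - x2[None, :]) / ell)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 40)                        # ordered 1D locations
K = exp_cov(x, x)
y = np.linalg.cholesky(K) @ rng.standard_normal(40)  # one sample from the field

def exact_loglik(y, K):
    """Dense Gaussian log-likelihood: O(n^3) time, O(n^2) memory."""
    n = len(y)
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (y @ np.linalg.solve(K, y) + logdet + n * np.log(2 * np.pi))

def vecchia_loglik(y, x, m=3):
    """Product of conditionals, each on at most m previous neighbors:
    only m-by-m solves, which is what makes the method scale."""
    ll = 0.0
    for i in range(len(y)):
        nb = np.arange(max(0, i - m), i)   # nearest predecessors (sorted 1D)
        if len(nb) == 0:
            mu, var = 0.0, 1.0             # marginal N(0, K_ii) with K_ii = 1
        else:
            Knn = exp_cov(x[nb], x[nb])
            kin = exp_cov(x[i:i + 1], x[nb])[0]
            w = np.linalg.solve(Knn, kin)
            mu = w @ y[nb]
            var = 1.0 - w @ kin
        ll += -0.5 * (np.log(2 * np.pi * var) + (y[i] - mu) ** 2 / var)
    return ll

print(exact_loglik(y, K), vecchia_loglik(y, x, m=3))  # nearly identical here
```

At climate scale, the per-point conditionals are batched into block-sparse kernels and distributed across nodes, which is where the MPI communication patterns described above arise.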


Bio

Sameh Abdulah

Sameh Abdulah received his M.S. and Ph.D. degrees from The Ohio State University, Columbus, USA, in 2014 and 2016, respectively. He is currently a Senior Research Scientist at the Extreme Computing Research Center (ECRC) at King Abdullah University of Science and Technology (KAUST), Saudi Arabia. His research interests span high-performance computing (HPC) applications, big data analytics, climate and weather modeling, large-scale spatial datasets, parallel spatial statistics, algorithm-based fault tolerance, and machine learning and data mining algorithms. Sameh was a member of the KAUST team nominated for the ACM Gordon Bell Prize in 2022 and awarded the prize in 2024 (Climate Track) for their contributions to large-scale climate and weather modeling and prediction.

Abstract

We integrate GPU-aware MVAPICH2 into AWP-ODC, a scalable finite-difference code for wave propagation in nonlinear media. On OLCF Frontier, HIP-aware MVAPICH2 yields a 17.8% time-to-solution (T2S) improvement over the non-GPU-aware version and achieves 95% parallel efficiency on 65,536 AMD MI250X GCDs. On TACC Vista, CUDA-aware MVAPICH2 delivers a 3.5% performance gain across 2-256 NVIDIA GH200 GPUs, with parallel efficiencies of 82% in the linear case and 92% in the computationally more intense nonlinear case. We deploy the code for production-scale earthquake simulations on leadership-class systems.


Bios

Yifeng Cui

Dr. Yifeng Cui heads the High Performance GeoComputing Lab at SDSC, and helped to establish the Southern California Earthquake Center (SCEC) as a world leader in advancing high-performance computing in earthquake system science. Cui’s groundbreaking work includes enabling TeraShake, ShakeOut, and M8, some of the worst-case scenarios on the San Andreas fault, revealing order-of-magnitude LA wave-guide amplification. He is the recipient of several HPC awards, including the 2015 NVIDIA Global Impact Award, the 2013 IDC HPC Innovation Excellence Award, and 2009/2011 SciDAC OASCR awards. He also directed an Intel Parallel Computing Center on earthquake research. Cui earned his Ph.D. in Hydrology from the University of Freiburg, Germany.

Daniel Roten

Daniel Roten is the head of the Scientific Computing Platform at the Friedrich Miescher Institute for Biomedical Research in Basel, Switzerland. His research focuses on high-performance computing, scalable workflows for big data processing, and machine learning. At the San Diego Supercomputer Center and San Diego State University, he integrated nonlinear rheology and a discontinuous mesh into AWP-ODC, a highly parallel GPU-accelerated wave propagation code. He holds an M.A.S. in Data Science and Engineering from UC San Diego and a Ph.D. from ETH Zurich.

Abstract

Deep Neural Networks (DNN) have become a popular solution for numerous applications. Increasing the speed and efficiency of DNNs is necessary for real-time deployment. Numerous software and hardware strategies exist to achieve faster and more efficient DNNs. Software methods include network compression techniques like quantization, and libraries that utilize hardware to speed computations. Hardware advancements such as increased core counts and accelerators can also improve DNN performance.

Quantization is a major software challenge due to quantization noise, which affects network accuracy. Several ongoing research efforts aim to minimize quantization noise while maintaining high accuracy. Our paper explores this problem from a holistic perspective of software and evolving hardware. Our work presents novel ideas: PTR (Partial Tensor Retention), PTC (Partial Tensor Correction), and PTRAC (a combination of PTC and PTR), which exploit heterogeneous compute using current and future hardware platforms. PTR and PTC improve the accuracy and performance of DNN compute on these platforms with ultra-low-precision compute and are efficient compared to several current quantization techniques.
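
As background for the quantization-noise problem the work addresses (PTR, PTC, and PTRAC themselves are described in the paper, not here), the sketch below applies plain uniform symmetric quantization to a random weight tensor and measures how the noise grows as precision drops toward the ultra-low-bit regime:

```python
import numpy as np

def quantize(t, bits):
    """Plain uniform symmetric quantization of a float tensor."""
    qmax = 2 ** (bits - 1) - 1                # e.g. 127 for int8, 7 for int4
    scale = np.max(np.abs(t)) / qmax
    q = np.clip(np.round(t / scale), -qmax, qmax)
    return q, scale

def dequantize(q, scale):
    return q * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000).astype(np.float32)  # a stand-in weight tensor

mses = {}
for bits in (8, 4, 2):
    q, s = quantize(w, bits)
    mses[bits] = float(np.mean((w - dequantize(q, s)) ** 2))  # quantization noise
    print(f"{bits}-bit MSE: {mses[bits]:.6f}")
```

The steep growth of the error at 4 and 2 bits is exactly the accuracy gap that techniques such as the ones in this talk aim to close.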


Bios

Eashan Dash

Eashan Dash is a Machine Learning Engineer at AMD with 5+ years of experience in AI, Deep Neural Networks (DNNs), and High-Performance Computing (HPC). His primary research areas include DNN acceleration and optimization at the library and framework level, large language model optimization, numerical linear algebra algorithms, BLAS, low-precision GEMM, and SIMD AI kernel algorithms. He holds Bachelor's and Master's degrees (Dual Degree) in Computer Engineering and is an institute gold medalist from the Indian Institute of Information Technology (IIIT), Chennai.

Arun Coimbatore Ramachandran

Arun Coimbatore Ramachandran is a Principal Machine Learning Engineer at AMD with 17+ years of industry experience. He is currently pursuing a PhD at the Indian Institute of Science (IISc) Bangalore, with interests at the intersection of Machine Learning and Systems. He was one of the moderators of a popular Machine Learning course by Andrew Ng on Coursera for a short period and was an alpha tester for a deeplearning.ai course (AI for Medicine). He has several patent applications filed with the USPTO and serves as a Board of Studies member at VIT Bhopal.

12:30 - 1:15

Lunch Break, Posters and Demos (Cont'd)

Abstract

heFFTe is an open-source library for the three-dimensional Fast Fourier Transform (FFT) on large-scale heterogeneous systems. It supports both CPU- and GPU-based systems through robust abstractions for vendor backends, with the ability to utilize GPU-aware MPI for both collective and point-to-point communications. Since FFT is a memory-bound kernel by nature, data movement through the network is the most time-consuming component, which makes heFFTe a real-world benchmark for MPI libraries. This talk presents an overview of heFFTe's capabilities and highlights some promising performance results with MVAPICH-Plus on both NVIDIA and AMD GPUs.
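
The communication bottleneck comes from the separability of the 3D FFT: it is computed as 1D FFTs along each axis in turn, and in a distributed pencil decomposition each change of axis requires a global transpose (an MPI all-to-all). The single-process numpy sketch below shows the separability that the distributed algorithm exploits; it is illustrative and does not use heFFTe's API:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 8, 8)) + 1j * rng.standard_normal((8, 8, 8))

# A 3D FFT is separable: apply 1D FFTs along one axis at a time. In a
# distributed pencil decomposition, each rank holds complete pencils along
# the current axis, so switching axes requires a global transpose (an MPI
# all-to-all) between the steps below -- the traffic that dominates runtime.
step = np.fft.fft(a, axis=0)     # a transpose would happen here ...
step = np.fft.fft(step, axis=1)  # ... and here, on a distributed system
step = np.fft.fft(step, axis=2)

print(np.allclose(step, np.fft.fftn(a)))  # True: matches a direct 3D FFT
```

Since the per-axis 1D FFTs are cheap relative to the two global transposes, the quality of the MPI all-to-all implementation largely determines end-to-end performance.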


Bio

Ahmad Abdelfattah

Ahmad Abdelfattah, research assistant professor at the Innovative Computing Laboratory at the University of Tennessee, received his PhD in computer science from King Abdullah University of Science and Technology (KAUST) in 2015, where he was a member of the Extreme Computing Research Center (ECRC). His research interests span high performance computing, parallel numerical algorithms, and general purpose GPU computing. He currently serves as the principal investigator of the MAGMA library. Abdelfattah has been acknowledged by NVIDIA and AMD for contributing to their numerical BLAS libraries, cuBLAS and rocBLAS, respectively.

Abstract

In this presentation, we will discuss the research on improving collective communication performance using CXL interconnect, a joint effort between OSU and ETRI. CXL (Compute Express Link) is a cutting-edge high-speed interconnect technology that enhances system scalability by efficiently supporting communication among computing resources such as CPUs, memory, accelerators, and storage. This technology is gaining significant attention as it enables composable computing architectures, allowing data center and HPC systems to configure computing resources in a pool, utilizing only the necessary amount to maximize resource efficiency. This presentation will focus on techniques to improve MPI collective communication performance using iMEX (Intelligent Memory Expander). iMEX enables a CXL-based in-network computing architecture that enhances performance by offloading MPI collective operations to the CXL switch. The proposed technique is expected to significantly enhance communication performance in CXL-enabled, rack-scale multi-node computing environments, compared to conventional approaches.


Bio

HooYoung Ahn

HooYoung Ahn received the Ph.D. degree in the School of Computing from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea, in 2016. She is currently a Principal Researcher with the Supercomputing System Research Section, Electronics and Telecommunications Research Institute, Daejeon, Republic of Korea. Her research interests include distributed and parallel computing, artificial intelligence, and high performance computing.

Abstract

As energy efficiency becomes increasingly important in the era of exascale computing, where power demands exceed 20 megawatts, there are significant efforts to improve power and energy efficiency at various software layers in HPC systems. Regarding MPI, researchers are trying to reduce or eliminate busy-waiting for the completion of communication, because it is a major source of CPU energy waste. This presentation describes our research over the past few years on improving the energy efficiency of MPI. We have focused on mechanisms that efficiently support multiple communication channels and flexibly allow various custom energy-saving policies.
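
To see why busy-waiting matters for energy, the toy Python sketch below compares the CPU time consumed while waiting for an event by polling in a tight loop versus blocking until notified; an MPI runtime waiting for network completions faces the same trade-off between wake-up latency and wasted cycles:

```python
import threading
import time

def wait_cpu(busy):
    """CPU time consumed while waiting ~0.2 s for a completion event."""
    ev = threading.Event()
    threading.Timer(0.2, ev.set).start()      # the "completion" arrives later
    start = time.process_time()
    if busy:
        while not ev.is_set():                # busy-wait: poll in a tight loop
            pass
    else:
        ev.wait()                             # blocking wait: the core can idle
    return time.process_time() - start

busy_cpu = wait_cpu(busy=True)
idle_cpu = wait_cpu(busy=False)
print(f"busy-wait CPU time:     {busy_cpu:.3f} s")
print(f"blocking-wait CPU time: {idle_cpu:.3f} s")
```

Polling burns roughly the full wait interval in CPU cycles, while blocking consumes almost none; the catch is that blocking adds wake-up latency, which is why MPI libraries default to polling and why energy-aware policies must balance the two.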


Bio

Hyun-Wook Jin

Hyun-Wook Jin is a Professor in the Department of Computer Science and Engineering at Konkuk University, Seoul, Korea, where he leads the System Software Research Laboratory. Before joining Konkuk University in 2006, he was a Research Associate in the Department of Computer Science and Engineering at The Ohio State University. He received his Ph.D. from Korea University in 2003. His main research focus is on system software for high-end computing systems and cyber-physical systems.

Abstract

The Office of Advanced Cyberinfrastructure (OAC) supports and coordinates the development, acquisition, and provision of state-of-the-art cyberinfrastructure resources, tools and services essential to the advancement and transformation of science and engineering. OAC also supports forward-looking research and education to expand the future capabilities of cyberinfrastructure specific to science and engineering. In these efforts, OAC collaborates with all NSF Offices and Directorates to develop models, prototypes, and approaches to research cyberinfrastructure that open new frontiers for discovery, furthering the mission of the National Science Foundation and national science and engineering priorities. The goal of this talk is to expose faculty/researchers to opportunities in the NSF’s CISE/OAC division programs.


Bio

Sheikh Ghafoor

Sheikh Ghafoor is a professor of Computer Science at Tennessee Tech University. He is currently serving as a program director at the US National Science Foundation in the Office of Advanced Cyberinfrastructure within the Directorate for Computer and Information Science and Engineering. His main research interests are: 1) High Performance Computing, 2) Computational Earth Science, 3) Computer Security, and 4) Computer Science Education. Dr. Ghafoor has published, secured external grants, and graduated Ph.D. and Master's students in all these areas. He has taught a wide variety of courses, with primary teaching interests in HPC and computer networks. Dr. Ghafoor has been principal investigator on grants from the National Science Foundation, National Aeronautics and Space Administration, Department of Energy, National Security Agency, and other agencies.

3:00 - 3:30

Coffee Break, Posters and Demos (Cont'd)

Abstract

Cosmos is an NSF Category II testbed high-performance computing system featuring AMD's MI300A accelerated processing units (APUs). The APU features memory shared between CPU and GPU resources. The system is built on the HPE Cray Supercomputing EX2500 platform, with nodes connected by the HPE Slingshot interconnect. The MVAPICH-Plus team developed and tested the library on Cosmos during the early-user period. SDSC staff are testing the MVAPICH-Plus deployment and will present their experiences and preliminary results using it for AI and HPC applications.


Bio

Mahidhar Tatineni

Mahidhar Tatineni received his M.S. and Ph.D. in Aerospace Engineering from UCLA. He currently leads the User Services group at SDSC as a Computational and Data Science Research Specialist Manager. He has led the support of high-performance computing and data applications software on several NSF- and UC-funded HPC and AI supercomputers, including Voyager, Expanse, Comet, and Gordon at SDSC. He has worked on many NSF-funded optimization and parallelization research projects, such as MPI performance tuning frameworks, hybrid programming models, big data middleware, and application performance evaluation using next-generation communication mechanisms for emerging HPC systems. He has also led tutorials on AI, HPC, and Kubernetes topics at several PEARC and SC conferences. He is co-PI on the NSF-funded Expanse HPC system and the Prototype National Research Platform (PNRP) projects at SDSC. He is the PI on Cosmos, an NSF-funded Category II system featuring AMD Instinct MI300A accelerated processing units (APUs) that combine CPU and GPU capabilities with a unified memory architecture.

Abstract

This talk will present a brief overview of X-ScaleSolutions' offerings for HPC and AI applications with advanced capabilities and significant performance benefits: 1) MVAPICH2-DPU, 2) X-ScaleSecureMPI, 3) X-ScalePETSc, and 4) X-ScaleAI. The MVAPICH2-DPU library takes advantage of the features of NVIDIA BlueField DPUs to offload communication components in the MPI library and deliver best-in-class scale-up and scale-out performance for HPC and AI applications. It integrates key components enabling full computation and communication overlap, especially with non-blocking collectives. The X-ScaleSecureMPI library adds security protection to MPI communication for HPC applications with minimal performance overhead. The X-ScalePETSc package offers the same PETSc APIs with performance optimizations that leverage the salient features of MVAPICH. The X-ScaleAI package provides an optimized and integrated software stack for high-performance distributed pre-training, fine-tuning, and inference. It supports models defined in PyTorch or HuggingFace, including large language models and vision models. This talk will present the key features and performance benefits of these products.


Bio

Kyle Schaefer

Kyle Schaefer is a Senior Software Engineer at X-ScaleSolutions. Kyle leads the development of a suite of HPC software products including X-ScaleAI, X-ScaleHPC and MVAPICH2-DPU. His expertise lies in conceptualizing, designing, and delivering end-to-end solutions for HPC and AI.

4:30 - 5:00

Student Short Talks

  • HyperSack: Distributed Hyperparameter Optimization for Deep Learning using Resource-Aware Scheduling on Heterogeneous GPU Systems, Nawras Alnassan, OSU
  • Design and Optimization of GPU-Aware MPI Allreduce Using Direct Sendrecv Communication, Chen-Chun Chen, OSU
  • Towards dynamic message passing protocols for stencil-based communication patterns, Kaushik Suresh & Bharath Ramesh, (Presented by Goutham Kuncham), OSU
  • Using BlueField-3 SmartNICs to Offload Vector Operations in Krylov Subspace Methods, Kaushik Suresh, (Presented by Ben Michalowicz), OSU

5:00

Closing Remarks