MUG'22

(Advanced Program)

All Times Are U.S. EDT

Conference Location: Ohio Supercomputer Center Bale Theater

MUG'22 meeting attendees gather for a group photo.

Monday, August 22

Abstract

Omni-Path Express (OPX) is a new libfabric provider for Omni-Path, written from the ground up. It is largely a drop-in replacement for PSM2, and has many of the same requirements as PSM2, such as the hfi1 kernel driver and the opafm service. No changes to the hfi1 driver or fabric manager are needed.

The core logic for the Omni-Path Host Fabric Adapter was written from scratch to be an optimal semantic match to libfabric-enabled applications like MPICH, Open MPI, and OpenSHMEM. Most of the embedded system constraints from BGQ have been maintained, giving OPX a small cache line footprint and favorable instruction count metrics when compared with PSM2 for many operations. OPX is still under active development and more enhancements are planned.

We will present some current benchmarks and profiling information that highlight the gains of OPX over PSM2, as well as the current status of development, stability, and scale.


Bio

Dennis Dalessandro

Dennis Dalessandro is a kernel engineer for Cornelis Networks, leading the development of Omni-Path Architecture HW drivers. He received a BS in Computer Science from The Ohio State University. Over the past 18 years he has been a researcher for the Ohio Supercomputer Center, a performance engineer at NetApp, and a driver developer at Intel. Dennis is a very active supporter of open-source software and enjoys working closely with the Kernel.org community and with various Linux distributions.

Abstract

NVIDIA has introduced a class of Data Processing Units (DPUs) named BlueField. These DPUs are systems on a chip that include an NVIDIA Host Channel Adapter interface, ARM cores, memory, and additional hardware acceleration engines. This presentation will give a brief overview of the BlueField DPU family and how this device is being used to offload the implementation of collective operations from the main host to the DPU, to leverage the asynchronous capabilities provided by these devices. The implementation will be described in the context of the open-source UCC library.


Bio

Richard Graham

Dr. Richard Graham is a Senior Director, HPC Technology at NVIDIA's Networking Business unit. His primary focus is on HPC network software and hardware capabilities for current and future HPC technologies. Prior to moving to Mellanox/NVIDIA, Rich spent thirteen years at Los Alamos National Laboratory and Oak Ridge National Laboratory, in computer science technical and administrative roles, with a technical focus on communication libraries and application analysis tools. He is cofounder of the Open MPI collaboration and was chairman of the MPI 3.0 standardization efforts.

10:30 - 11:00

Morning Coffee Break

Abstract

High Performance Computing (HPC) and Machine Learning (ML) markets are focused on solving complex and performance-intensive problems. The computing and communication demands of HPC and ML applications continue to grow. In this tutorial, we will provide an overview of micro-benchmarks and application benchmarks that are used to measure and evaluate performance of communications in HPC and ML applications. These benchmarks are typically used for evaluating the performance of point-to-point, multi-pair, and individual collective communication operations as well as comparing different communication libraries and interconnects. There is a need to include a new class of benchmarks that will evaluate application communication patterns and measure latency/bandwidth/network utilization under congestion control. We will discuss initial ideas on this new class of benchmarks in the tutorial.


Bio

Hemal Shah, Moshe Voloshin

Hemal Shah is a Distinguished Engineer and Systems/Software/Standards architect in the Data Center Solutions Group (DCSG) division at Broadcom Inc. He leads and manages a team of architects. Hemal is responsible for the definition of product architecture and software roadmap/architecture of all product lines of Ethernet NICs. Hemal led the architecture definition of several generations of NetXtreme® E-Series/NetXtreme I server product lines and NetXtreme I client product lines. Hemal spearheaded the system architecture development of TruFlow™ technology for vSwitch acceleration/packet processing software frameworks, TruManage™ technology for system and network management, device security features, virtualization and stateless offloads. Hemal has defined the system architecture of RDMA hardware/software solutions for more than two decades. Before joining Broadcom in 2005, Hemal worked at Intel Corporation where he led the development of system/silicon/software architecture of communication processors, 10 Gigabit Ethernet controllers, TCP/iSCSI/RDMA offloads, and IPsec/SSL/firewall/VPN accelerations. Hemal is the lead technical representative/contributor from Broadcom Inc. in the Open Compute Project (OCP) and Distributed Management Task Force (DMTF). Hemal serves as Senior VP of Technology in the DMTF and a project co-lead of the OCP Hardware Management project. Hemal has co-authored several OCP specifications, 70+ DMTF specifications, four IETF RFCs, and 10+ technical conference/journal papers. Hemal is a named inventor on 40+ patents with several pending patents. Hemal holds Ph.D. (computer engineering) and M.S. (computer science) degrees from Purdue University, an M.S.E.E. degree from The University of Arizona, and a B.S. (electronics and communication engineering) degree from Gujarat University, India.

Moshe Voloshin is a Systems Architect in the Data Center Solutions Group (DCSG) division at Broadcom Inc. Moshe spearheaded the system architecture development of RoCE and Congestion Control in Broadcom Ethernet NICs and is involved in the definition of product architecture, modeling, and system simulations. Previously, Moshe was a director, manager, and ASIC/HW engineer in the Cisco High End router division, where he developed and managed the development of Network Processing Unit (NPU), QoS, and fabric ASICs in products such as GSR and CRS.

12:00 - 1:00

Lunch Break

Abstract

The MVAPICH2-DPU library takes advantage of DPU features to offload communication components in the MPI library and accelerate HPC applications. It integrates key components enabling full computation and communication overlap, especially with non-blocking collectives. This tutorial will provide an overview of the MVAPICH2-DPU product, its main features, and its acceleration capabilities for a set of representative HPC applications and benchmarks. Live demos of these applications will be shown to demonstrate the capabilities of the latest version of the MVAPICH2-DPU product.


Bio

Donglai Dai, Kyle Schaefer

Dr. Donglai Dai is the Chief Engineer at X-ScaleSolutions and leads the company’s R&D team. He has been the Principal Investigator (PI) for several current and past DOE SBIR grants. His current work focuses on developing scalable, efficient communication libraries and performance analysis tools for distributed and parallel HPC and deep learning applications on HPC systems. He has more than 20 years of industry experience in engineering management and development of computer systems, VLSI, IoT, and interconnection networks while working at Intel, Cray, SGI, and startups. He holds more than 10 granted US patents and has published more than 40 technical papers or book chapters. He is a member of the steering committee for the software checkpoint restart standard. He has a PhD degree in computer science from The Ohio State University.

Kyle Schaefer is a Software Engineer at X-ScaleSolutions. His current work focuses on continuing the testing, design, and development of the MVAPICH2-DPU project.

Abstract

The tutorial will start with an overview of the MVAPICH2 libraries and their features. Next, we will focus in depth on installation guidelines, runtime optimizations, and tuning flexibility. An overview of configuration and debugging support in MVAPICH2 libraries will be presented. High-performance support for NVIDIA/AMD GPU-enabled clusters in MVAPICH2-GDR and many-core systems in MVAPICH2-X will be presented. The impact on the performance of the various features and optimization techniques will be discussed in an integrated fashion. "Best Practices" for a set of common applications will be presented. A set of case studies related to example applications to demonstrate how one can effectively take advantage of MVAPICH2 for High End Computing applications using MPI and CUDA/OpenACC will also be presented.


Bio

Hari Subramoni, Nat Shineman

Dr. Hari Subramoni has been a research scientist in the Department of Computer Science and Engineering at the Ohio State University, USA, since September 2015. His current research interests include high performance interconnects and protocols, parallel computer architecture, network-based computing, exascale computing, network topology aware computing, QoS, power-aware LAN-WAN communication, fault tolerance, virtualization, big data, deep learning and cloud computing. He has published over 100 papers in international journals and conferences related to these research areas. He has been actively involved in various professional activities in academic journals and conferences. Dr. Subramoni is doing research on the design and development of MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) and MVAPICH2-X (Hybrid MPI and PGAS (OpenSHMEM, UPC and CAF)) software packages. He is a member of IEEE. More details about Dr. Subramoni are available at http://web.cse.ohio-state.edu/~subramoni.1/.

Nat Shineman is a software engineer in the Department of Computer Science and Engineering at the Ohio State University. His current development work includes high performance interconnects, parallel computing, scalable startup mechanisms, and performance analysis and debugging of the MVAPICH2 library.

3:00 - 3:30

Afternoon Coffee Break

Abstract

As the computing, networking, heterogeneous hardware, and storage technologies continue to evolve in HEC platforms, understanding the full-stack performance tradeoffs and interplay between HPC applications, MPI libraries, the communication fabric, the file system, and the job scheduler becomes a more challenging endeavor. Such understanding will enable all involved parties to identify the bottlenecks, maximize the efficiency and performance of the individual components that comprise a modern HPC system, and solve different grand challenge problems. Through this tutorial, the participants will learn how to use the OSU InfiniBand Network Analysis and Monitoring (INAM) tool in conjunction with live jobs running on various remote clusters at OSC and OSU to visualize, analyze, and correlate how the MPI runtime, high-performance network, I/O filesystem, and job scheduler interact and identify potential bottlenecks online. Emphasis is placed on how tools are used in combination for identifying performance problems and investigating optimization alternatives. We will request remote access to the Pitzer system at OSC and the RI/RI2 clusters at OSU for hands-on exercises. This will help to prepare participants to locate and diagnose performance bottlenecks in their own clusters and parallel programs.


Bio

Hari Subramoni, Pouya Kousha

Dr. Hari Subramoni has been a research scientist in the Department of Computer Science and Engineering at the Ohio State University, USA, since September 2015. His current research interests include high performance interconnects and protocols, parallel computer architecture, network-based computing, exascale computing, network topology aware computing, QoS, power-aware LAN-WAN communication, fault tolerance, virtualization, big data, deep learning and cloud computing. He has published over 100 papers in international journals and conferences related to these research areas. He has been actively involved in various professional activities in academic journals and conferences. Dr. Subramoni is doing research on the design and development of MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) and MVAPICH2-X (Hybrid MPI and PGAS (OpenSHMEM, UPC and CAF)) software packages. He is a member of IEEE. More details about Dr. Subramoni are available at http://web.cse.ohio-state.edu/~subramoni.1/.

Pouya Kousha is currently a PhD student at the Ohio State University, supervised by Prof. DK Panda. His research interests are Parallel Algorithms and Distributed Systems, High Performance Computing, and Profiling Tools. His work primarily focuses on scalable, real-time analysis, monitoring, and profiling tools, and on using them to resolve bottlenecks and optimize MPI libraries and applications. For more information, contact him at kousha.2@osu.edu.

Abstract

Recent advances in Machine and Deep Learning (ML/DL) have led to many exciting challenges and opportunities. Modern ML/DL frameworks including TensorFlow, PyTorch, and cuML have emerged that offer high-performance training and deployment for various types of ML models and Deep Neural Networks (DNNs). This tutorial provides an overview of recent trends in ML/DL and the role of cutting-edge hardware architectures and interconnects in moving the field forward. We also present an overview of different DNN architectures and ML/DL frameworks with special focus on parallelization strategies for model training. We highlight new challenges and opportunities for communication runtimes to exploit high-performance CPU/GPU architectures to efficiently support large-scale distributed training. We also highlight some of our co-design efforts to utilize MPI for large-scale DNN training on cutting-edge CPU/GPU architectures available on modern HPC clusters. Throughout the tutorial, we include hands-on exercises to enable attendees to gain first-hand experience of running distributed ML/DL training on a modern GPU cluster.


Bio

Aamir Shafi, Arpan Jain

Aamir Shafi is currently a Research Scientist at the Ohio State University where he is involved in the High Performance Big Data project. Dr. Shafi was a Fulbright Visiting Scholar at MIT where he worked on the award-winning Cilk technology. Dr. Shafi received his PhD in Computer Science from the University of Portsmouth, UK in 2006. Dr. Shafi’s current research interests include architecting robust libraries and tools for Big Data computation with emphasis on Machine and Deep Learning applications. Dr. Shafi co-designed and co-developed a Java-based MPI-like library called MPJ Express. More details about Dr. Shafi are available at https://people.engineering.osu.edu/people/shafi.16.

Arpan Jain received his B.Tech. and M.Tech. degrees in Information Technology from ABV-IIITM, India. Currently, Arpan is working towards his Ph.D. degree in Computer Science and Engineering at The Ohio State University. His current research focus lies at the intersection of High Performance Computing (HPC) libraries and Deep Learning (DL) frameworks. He is working on parallelization and distribution strategies for large-scale Deep Neural Network (DNN) training. He previously worked on speech analysis, time series modeling, hyperparameter optimization, and object recognition. He actively contributes to projects like HiDL (high-performance deep learning), MVAPICH2-GDR software, and LBANN deep learning framework. He is a member of IEEE. More details about Arpan are available at https://u.osu.edu/jain.575.

Abstract

There is increasing interest in adopting higher-level, productive programming languages in emerging areas like Machine/Deep Learning and Data Science. The first part of this tutorial provides an overview of writing parallel MPI applications in Python and Java using mpi4py and MVAPICH2-J, respectively. mpi4py is a popular Python-based wrapper communication library, whereas MVAPICH2-J is a recently released Java wrapper library from the MVAPICH2 team. The second part of the tutorial discusses our Python and Java extensions to the popular OSU Micro-Benchmarks (OMB) suite, called OMB-Py and OMB-J, respectively. The tutorial concludes with a live demo of running Python and Java parallel applications using MVAPICH2.


Bio

Aamir Shafi, Nawras Alnaasan

Aamir Shafi is currently a Research Scientist at the Ohio State University where he is involved in the High Performance Big Data project. Dr. Shafi was a Fulbright Visiting Scholar at MIT where he worked on the award-winning Cilk technology. Dr. Shafi received his PhD in Computer Science from the University of Portsmouth, UK in 2006. Dr. Shafi’s current research interests include architecting robust libraries and tools for Big Data computation with emphasis on Machine and Deep Learning applications. Dr. Shafi co-designed and co-developed a Java-based MPI-like library called MPJ Express. More details about Dr. Shafi are available at https://people.engineering.osu.edu/people/shafi.16.

Nawras Alnaasan is currently pursuing a Ph.D. degree in computer science and engineering at The Ohio State University. Working as a graduate research associate at the Network-Based Computing Laboratory, his research interests lie at the intersection of deep learning and high-performance computing. He is actively involved in several projects such as HiDL (high-performance deep learning), OMB (OSU Micro Benchmarks), and ICICLE (Intelligent Cyberinfrastructure with Computational Learning in the Environment). Alnaasan received his B.S. degree in computer science and engineering from The Ohio State University. Contact him at alnaasan.1@osu.edu.

6:30 - 9:30

Reception at Zaftig Brewing

119 E 5th Ave,

Columbus, OH 43201

Tuesday, August 23

8:50 - 9:00

Opening Remarks

Karen Tomko, Director of Research Software Applications, Ohio Supercomputer Center
Dhabaleswar K (DK) Panda, The Ohio State University

Abstract

In the latest advanced supercomputers, one of the biggest issues for large-scale HPC and data science applications is memory capacity, both for main memory and I/O storage. While the computation performance of CPUs and GPUs has constantly increased thanks to advanced semiconductor technology and widespread SIMD instruction sets, the capacity of main memory cannot keep up, so the Byte/FLOPS ratio keeps getting smaller. High-performance computation naturally requires a large data space, but the balance between the two is becoming difficult to maintain. To enlarge main memory capacity, the last resort is to increase the number of parallel computation nodes, which does not solve the problem of performance/capacity balance.

One of the latest solutions to this problem is persistent memory (PMEM), such as the Intel Optane series, which brings storage device technology to a byte-addressable memory system and greatly enlarges the capacity of main memory from several hundred GBytes to several TBytes per node. While the memory access latency is relatively large compared with traditional DDR, the memory bandwidth is comparable, so it may be possible to achieve much higher performance than by simply increasing the node count. Moreover, state-of-the-art PMEM technology provides several powerful features: (1) it can be used simply as very large main memory, (2) traditional DDR can be used as a kind of cache for PMEM to improve memory access latency through data locality, and (3) a PMEM device can be used as byte-addressable memory, as a block-addressed I/O device like an SSD, or as both.

In the Center for Computational Sciences at University of Tsukuba, we will introduce a new supercomputer that couples the latest GPU and PMEM in each computation node, named Cygnus-BD (tentative), where BD stands for Big Data. The target applications of Cygnus-BD range over HPC applications that require large memory capacity, big data analysis including in-situ processing, and high-performance ad-hoc storage shared by computing nodes. Each node of Cygnus-BD is equipped with the latest GPU, the NVIDIA H100 PCIe, and Intel Optane3 PMEM, as well as the latest Intel Xeon (Sapphire Rapids). This system will be the world's first combination of these three components. The system size is medium class with 120 computation nodes; however, it provides approximately 6 PFLOPS of peak performance. The system will be deployed at the end of October 2022. We believe our new system opens a new world of Big Data Computing driven by the combination of high-performance GPU, CPU, and PMEM technologies. In this talk, I will introduce the motivation, concept, design, and implementation of Cygnus-BD hardware and system software as well as various target application fields.


Bio

Taisuke Boku

Taisuke Boku has been researching HPC system architecture, system software, and performance evaluation on various scientific applications since receiving his PhD degree in Electrical Engineering from Keio University, Japan. He is currently the director of the Center for Computational Sciences, University of Tsukuba, a co-designing center with both application researchers and HPC system researchers. He has played a central role in the development of original supercomputers in the center, including CP-PACS (ranked number one in the TOP500 in 1996), FIRST, PACS-CS, HA-PACS, and Cygnus systems, the representative supercomputers in Japan. The recent system Cygnus is the world's first multi-hybrid accelerated system with GPU and FPGA together. He has served as President of the HPCI (High Performance Computing Infrastructure) Consortium in Japan for 2020-2022. He was a member of the system architecture working group for the Fugaku supercomputer development. He received the ACM Gordon Bell Prize in 2011.

Abstract

This talk will provide an overview of the MVAPICH project (past, present, and future). Future roadmap and features for upcoming releases of the MVAPICH2 software family (including MVAPICH2-X and MVAPICH2-GDR) will be presented. Current status and future plans for OSU INAM and OMB will also be presented.


Bio

Dhabaleswar K (DK) Panda

DK Panda is a Distinguished Professor of Engineering and University Distinguished Scholar at the Ohio State University. He has published over 500 papers in the area of high-end computing and networking. The MVAPICH2 (High-Performance MPI and PGAS over InfiniBand, iWARP, RoCE, EFA, and Rockport Networks) libraries, designed and developed by his research group (mvapich.cse.ohio-state.edu), are currently being used by more than 3,275 organizations worldwide (in 90 countries). More than 1.6 million downloads of this software have taken place from the project's site. This software is empowering several InfiniBand clusters (including the 6th, 16th, 30th and 42nd ranked ones) in the TOP500 list. High-performance and scalable solutions for deep learning and machine learning from his group are available from hidl.cse.ohio-state.edu. High-performance and scalable libraries for Big Data stacks (Spark, Hadoop, and Memcached) and Data science applications from his group (hibd.cse.ohio-state.edu) are also publicly available. These libraries are currently being used by more than 340 organizations in 38 countries. More than 45,000 downloads of these libraries have taken place. He is an IEEE Fellow. More details about Prof. Panda are available at cse.ohio-state.edu/~panda.

10:45 - 11:15

Morning Coffee Break

Abstract

MPI in general, and MVAPICH in particular, are critical infrastructure that have enabled high-performance computing to scale for the past two decades and counting. As users execute applications on larger systems, input and output datasets have increased in scale correspondingly. For example, the data size of an application checkpoint often grows proportionally with the aggregate memory footprint of the system. While MPI applications can access the parallel file system using thousands of processes, users often resort to single-process commands like cp and rm to manage their datasets. Such an extreme resource imbalance makes even basic tasks like copying a checkpoint take significant time, hindering user productivity and workflow. In this talk, I describe how mpiFileUtils provides a solution to this problem by implementing a library and set of data management tools that utilize MPI to scale on HPC systems.


Bio

Adam Moody

Adam Moody is a member of the Development Environment Group within Livermore Computing at Lawrence Livermore National Laboratory. His background is in MPI development, collective algorithms, networking, and parallel I/O. He is a project lead for the Scalable Checkpoint / Restart library and mpiFileUtils -- two projects that use MPI to help users manage large data sets. And he has been a Buckeye fan since birth.

Abstract

Idaho National Laboratory develops and maintains the Multiphysics Object-Oriented Simulation Environment (MOOSE) framework supporting a wide range of applications in both nuclear energy and geothermal science. This talk will discuss the asynchronous communication implemented in MOOSE using MVAPICH2. We also present MVAPICH2 MOOSE benchmarks on our largest systems.


Bio

Matthew Anderson

Matt Anderson is part of the High Performance Computing group at Idaho National Laboratory with specific focus in supporting University and Industry users.

12:15 - 12:30

Group Photo

12:30 - 1:30

Lunch Break

Abstract

AI and scientific workloads demand ultra-fast processing of high-resolution simulations, extreme-size datasets, and highly parallelized algorithms. As these computing requirements continue to grow, the traditional GPU-CPU architecture further suffers from imbalanced computing, data latency, and a lack of parallel or pre-data-processing. The introduction of the Data Processing Unit (DPU) brings a new tier of computing to address these bottlenecks, and to enable, for the first time, compute overlapping and nearly zero communication latency. The session will deliver a deep dive into DPU computing, and how it can help address long-lasting performance bottlenecks. Performance results of a variety of HPC and AI applications will be presented as well.


Bio

Gilad Shainer

Gilad Shainer serves as senior vice-president of networking at NVIDIA. Mr. Shainer serves as the chairman of the HPC-AI Advisory Council organization, the president of the UCF consortium, a member of IBTA and a contributor to the PCISIG PCI-X and PCIe specifications. Mr. Shainer holds multiple patents in the field of high-speed networking. He is a recipient of the 2015 R&D100 award for his contribution to the CORE-Direct In-Network Computing technology and the 2019 R&D100 award for his contribution to the Unified Communication X (UCX) technology. Gilad Shainer holds an MSc degree and a BSc degree in Electrical Engineering from the Technion Institute of Technology in Israel.

Abstract

Broadcom Ethernet NICs are widely deployed in cloud, enterprise, and telecom markets. HPC, ML, and storage applications in these markets use RDMA via multiple software frameworks including MPI, UCX, verbs, and collective communications libraries. In this talk, we will provide an overview of RDMA software support for Broadcom Ethernet NICs in Linux environments.


Bio

Hemal Shah

Hemal Shah is a Distinguished Engineer and Systems/Software/Standards architect in the Data Center Solutions Group (DCSG) division at Broadcom Inc. He leads and manages a team of architects. Hemal is responsible for the definition of product architecture and software roadmap/architecture of all product lines of Ethernet NICs. Hemal led the architecture definition of several generations of NetXtreme® E-Series/NetXtreme I server product lines and NetXtreme I client product lines. Hemal spearheaded the system architecture development of TruFlow™ technology for vSwitch acceleration/packet processing software frameworks, TruManage™ technology for system and network management, device security features, virtualization and stateless offloads. Hemal has defined the system architecture of RDMA hardware/software solutions for more than two decades. Before joining Broadcom in 2005, Hemal worked at Intel Corporation where he led the development of system/silicon/software architecture of communication processors, 10 Gigabit Ethernet controllers, TCP/iSCSI/RDMA offloads, and IPsec/SSL/firewall/VPN accelerations. Hemal is the lead technical representative/contributor from Broadcom Inc. in the Open Compute Project (OCP) and Distributed Management Task Force (DMTF). Hemal serves as Senior VP of Technology in the DMTF and a project co-lead of the OCP Hardware Management project. Hemal has co-authored several OCP specifications, 70+ DMTF specifications, four IETF RFCs, and 10+ technical conference/journal papers. Hemal is a named inventor on 40+ patents with several pending patents. Hemal holds Ph.D. (computer engineering) and M.S. (computer science) degrees from Purdue University, an M.S.E.E. degree from The University of Arizona, and a B.S. (electronics and communication engineering) degree from Gujarat University, India.

Abstract

The performance and feature gap between bare-metal and Cloud HPC/AI clusters is almost imperceptible on Clouds such as Azure. This is quite evident as Azure Supercomputers have climbed up into the top HPC/AI cluster rankings lists. Public clouds democratize HPC/AI Supercomputers with focus on performance, scalability, and cost-efficiency. As the cloud platform technologies and features continue to evolve, middleware such as MPI libraries and communication runtimes play a key role in enabling applications to make use of the technology advancements, and with high performance. This talk focuses on how MVAPICH2 efficiently enables the latest technology advancements such as SR-IOV, GPU-Direct RDMA, DPU, etc. in virtualized HPC and AI clusters. This talk will also provide an overview of the latest HPC and AI offerings in Microsoft Azure HPC along with their performance characteristics.


Bio

Jithin Jose

Dr. Jithin Jose is a Principal Software Engineer at Microsoft. His work is focused on co-design of software and hardware building blocks for high performance computing platforms, and on performance optimizations. His research interests include high performance interconnects and protocols, parallel programming models, big data and cloud computing. Before joining Microsoft, he worked at Intel and IBM Research. He has published more than 25 papers in major conferences and journals related to these research areas. Dr. Jose received his Ph.D. degree from The Ohio State University in 2014.

Abstract

The talk will describe research/development and learning/workforce development (LWD) programs within the Office of Advanced Cyberinfrastructure (OAC) in the CISE Directorate at the National Science Foundation. OAC's mission is to support advanced cyberinfrastructure to accelerate discovery and innovation across all science and engineering disciplines. The programs specifically addressed include: the CyberTraining program for research workforce preparation, including the new Cyberinfrastructure (CI) Professional track; the OAC Core Research Program that is part of the CISE Core Research programs solicitation; the Cyberinfrastructure for Sustained Scientific Innovation (CSSI) program for creating software and data CI products and services; the CAREER program for faculty early career development, and the CISE Research Initiation Initiative (CRII) for early career faculty who have not yet been a PI on a Federal grant.


Bio

Ashok Srinivasan

Ashok Srinivasan is a Program Director in the Office of Advanced Cyberinfrastructure at the National Science Foundation and is involved in the CyberTraining, CSSI, and OAC Core programs. Srinivasan has a permanent position as a Professor of Computer Science and the William Nystul Eminent Scholar Chair at the University of West Florida and is a Fulbright Fellow. His research interests focus on the applications of high performance computing to science and public health policy. Results of his research to protect public health, especially during air travel, have been highlighted in over 300 news outlets around the world and cited in testimony to the US Congress.

3:20 - 4:00

Student Poster Session (In-Person) and Coffee Break

Nawras Alnaasan, The Ohio State University, OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries and Machine Learning Applications on HPC Systems
Buddhi Ashan Mallika Kankanamalage, The University of Texas at San Antonio, Heterogeneous Parallel and Distributed Computing for Efficient Polygon Overlay Computation over Large Polygonal Datasets
Christopher Holder, Florida State University, Layer 2 Scaling
Jurdana Masuma Iqrah, The University of Texas at San Antonio, Auto-labeling Sea Ice and Open Water Segmentation and Classification for Sentinel-2 Satellite Imagery in Polar Regions
Pouya Kousha, The Ohio State University, Cross-layer Visualization of Network Communication for HPC Clusters
Hasanul Mahmud, The University of Texas at San Antonio, Toward an Energy-efficient framework for DNN inference at the Edge
Jordi Alcaraz Rodriguez, University of Oregon, Performance Engineering using MVAPICH and TAU via the MPI Tools Interface
Tu Tran, The Ohio State University, Designing Hierarchical Multi-HCA Aware Allgather in MPI
Shulei Xu, The Ohio State University, HPC Meets Clouds: MPI Performance Characterization & Optimization on Emerging HPC Cloud Systems
Yao Xu, Northeastern University, A Hybrid Two-Phase-Commit Algorithm in Checkpointing Collective Communications
Sunyu Yao, Virginia Polytechnic Institute and State University, GenFaaS: Automated FaaSification of Monolithic Workflows
Ahmad Hossein Yazdani, Virginia Polytechnic Institute and State University, Profiling User I/O Behavior for Leadership Scale HPC Systems

Abstract

The TAU Performance System is a powerful and highly versatile profiling and tracing tool ecosystem for performance analysis of parallel programs at all scales. TAU has evolved with each new generation of HPC systems and presently scales efficiently to hundreds of thousands of cores on the largest machines in the world. To meet the needs of computational scientists to evaluate and improve the performance of their applications, we present TAU's support for the key MVAPICH features, including the MPI Tools (MPI_T) interface and the ability to set MPI_T control variables on a per-MPI-communicator basis. TAU's support for GPUs, including CUDA, DPC++/SYCL, OpenCL, OpenACC, Kokkos, and HIP/ROCm, improves performance evaluation of heterogeneous programming models. The talk will also describe TAU's support for MPI's performance and control variables exported by MVAPICH, its support for instrumentation of the OpenMP runtime, and its APIs for instrumentation of Python programs. TAU uses these interfaces on unmodified binaries without the need for recompilation. This talk will describe these new instrumentation techniques to simplify the usage of performance tools, including an LLVM plugin for selective compiler-based instrumentation, tracking the paths taken by a message, timing synchronization costs in collective operations, rewriting binary files, and preloading shared objects. The talk will also highlight TAU's analysis tools, including its 3D profile browser, ParaProf, and its cross-experiment analysis tool, PerfExplorer, as well as TAU's usage with MVAPICH2 on Amazon AWS using the Extreme-scale Scientific Software Stack (E4S) AWS image. http://tau.uoregon.edu


Bio

Sameer Shende

Dr. Sameer Shende has helped develop the TAU Performance System, the Program Database Toolkit (PDT), the Extreme-scale Scientific Software Stack (E4S) [https://e4s.io] and the HPCLinux distro. His research interests include tools and techniques for performance instrumentation, measurement, analysis, runtime systems, HPC container runtimes, and compiler optimizations. He serves as a Research Associate Professor and the Director of the Performance Research Laboratory at the University of Oregon, and as the President and Director of ParaTools, Inc., ParaTools, SAS, and ParaTools, Ltd.

Abstract

The NSF-funded (award# OAC 1928224) Expanse supercomputer at SDSC is targeted at long-tail workloads with both CPU and GPU based nodes. Expanse's standard compute nodes are each powered by two 64-core AMD EPYC 7742 processors and have 256 GB of DDR4 memory, while each GPU node contains four NVIDIA V100s (32 GB SXM2) connected via NVLink and dual 20-core Intel Xeon 6248 CPUs. The presentation will cover microbenchmark and application performance results using MVAPICH2 and MVAPICH2-GDR on Expanse.


Bio

Mahidhar

Mahidhar Tatineni received his M.S. & Ph.D. in Aerospace Engineering from UCLA. He currently leads the User Services group at SDSC as a Computational and Data Science Research Specialist Manager. He has led the support of high-performance computing and data applications software on several NSF and UC funded HPC and AI supercomputers including Voyager, Expanse, Comet, and Gordon at SDSC. His research interests are in HPC architecture and systems, performance and scalability, benchmarking and HPC middleware. He has worked on many NSF funded optimization and parallelization research projects such as MPI performance tuning frameworks, hybrid programming models, big data middleware, and application performance evaluation using next generation communication mechanisms for emerging HPC systems. He is co-PI on the NSF funded Expanse HPC system and the National Research Platform projects at SDSC.

5:00 - 5:30

Short Talks

High Performance MPI over Slingshot, Kawthar Shafie Khorassani, The Ohio State University
Accelerating MPI All-to-All Communication with Online Compression on Modern GPU Clusters, Qinghua Zhou, The Ohio State University

5:30 - 5:45

Open MIC Session

6:30 - 9:30

Banquet Dinner at Bravo Restaurant

1803 Olentangy River RD

Columbus, OH 43212

Wednesday, August 24

Abstract

The Omni-Path Architecture (OPA) is a high-bandwidth, low-latency network fabric for accelerated computing, including traditional high-performance computing, artificial intelligence and machine learning. OPA comprises network hardware, including host-fabric adapters and switches, as well as software to enable and accelerate host software communication. OPA's software stack supports all popular MPI implementations, including MVAPICH.

Cornelis Networks continues to evolve the Omni-Path Architecture with both hardware and software innovations. This presentation will describe the state of the Omni-Path Architecture, upcoming developments of interest to the community, early performance results with MVAPICH, and further plans for engagement with the MVAPICH community to ensure continued optimal support for MVAPICH in the future.


Bio

Doug

Douglas Fuller is the director of software development at Cornelis Networks. Doug joined Cornelis from Red Hat, where he served as a software engineering manager leading teams working on the Ceph distributed storage system. Doug’s career in HPC has included stints at various universities and Oak Ridge National Laboratory.

Doug holds bachelor's and master's degrees in computer science from Iowa State University. His master's work at DOE Ames Laboratory involved early one-sided communication models in supercomputers. From his undergraduate days, he remains keenly aware of the critical role of floppy diskettes in Beowulf cluster administration.

Abstract

Xcompact3d is a Fortran-based framework of high-order finite-difference flow solvers dedicated to the study of turbulent flows using incompressible Navier-Stokes equations. This equation is fully solved in spectral space via the use of relevant 3D Fast Fourier transforms (FFTs). Moreover, due to the nature of the data decomposition, MPI_Alltoall(v) primitives are used to rearrange data multiple times each iteration to make computations of derivatives fully local to each MPI rank. Starting from the latest public version of Xcompact3d, a proof-of-concept has been developed to experiment with the NVIDIA BlueField-2 DPU card. The talk aims to present the journey, the challenges, the results and future directions to enable a full application to leverage the MPI non-blocking offloading capabilities provided by the BlueField DPU product family and MVAPICH2-DPU.


Bio

Filippo

Filippo Spiga is a member of the NVIDIA EMEA HPC team, working as HPC Developer Relations manager for Arm CPU and GPU. His GPU journey and appreciation for Fortran started back in 2009, at the time of the first GPU Tesla architecture during the early years of CUDA. He is also involved in growing and nurturing a healthy HW and SW ecosystem of GPU-accelerated Arm-based systems, including the NVIDIA Grace Superchip. He is an active member of the HPC community and has been involved in the HPCAIAC Student Cluster Competition for many years.

Abstract

It is Dell Technologies' mission to make HPC systems available to everyone, with an emphasis on ease of use and standards compliance without vendor lock-in, while also advancing HPC through research. Most of Dell’s HPC research is done at the HPC and AI Innovation Lab, which is hosted on the Dell campus in Austin, TX. This presentation gives an overview of the lab’s efforts to gain insight into the power footprint of applications and MPI libraries, supplemented with selected case studies. The author also discusses the challenges that HPC platform vendors like Dell Technologies face in terms of enabling application efficiency while using massively parallel and multi-core processors, domain-specific accelerators, and large-scale parallel storage systems.


Bio

Martin

Martin Hilgerman joined Dell Technologies in 2011, after having worked as an HPC application specialist for 12 years at SGI and IBM. In 2019, he joined AMD as a senior manager and worked on porting and optimizing the major HPC applications to the “Rome” microarchitecture. Martin returned to Dell Technologies in May 2020 as the HPC performance lead and Distinguished Member of Technical Staff in Dell ISG. He holds a master’s degree in physical chemistry, obtained at the VU University of Amsterdam.

Abstract

How do we get to Zettascale? The Cambridge Open Exascale Lab was formed with the University and its partners to investigate the next generation of supercomputers. In the June 2022 TOP500 list we broke the Exaflops barrier and now the goal posts have moved another 2^10. This talk introduces the newly renamed Zettascale Lab and the themes of its research, in particular the collaboration with X-ScaleSolutions and the MVAPICH developers.


Bio

Chris Edsall

Christopher Edsall is the head of research software engineering at the University of Cambridge, co-director of the Institute of Computing for Climate Science and acting principal software engineer in the Cambridge Open Zettascale Lab. He has been involved in high performance computing since last millennium, when he sysadminned a Cray T3E. Since then he has worked with HPC systems in the area of climate science in several national research institutes, helping researchers get the most out of these large resources.

10:30 - 11:00

Morning Coffee Break

Abstract

In this talk, we present a study of the state-of-the-art developments for the computation of distributed FFT on upcoming Exascale supercomputers. We take the heFFTe library as a reference library, since it is currently the only one supporting GPU accelerators from NVIDIA, AMD, and Intel, which are expected on Exascale machines. Besides supporting most of the state-of-the-art features, heFFTe provides unique novel features such as:

  • Batched 2-D and 3-D FFTs, very useful for applications in particle and micro-mechanical simulations.
  • FFT Convolutions for digital processing.
  • Sine and Cosine transforms for wave propagation phenomena.

We analyze the effect of different combinations of parametric settings, and show experiments on scalability on over 1 million CPU cores and 6,000 GPUs using the world’s most powerful supercomputers. We analyze the effect of different algorithmic settings, such as tuned processor grids and sizes. Finally, we analyze the well-known communication bottleneck using mathematical and experimental models to study parallel efficiency by leveraging InfiniBand, NIC and NVLink interconnections.


Bio

Stan Tomov, Alan Ayala

Stan Tomov received an M.S. degree in Computer Science from Sofia University, Bulgaria, and a Ph.D. in Mathematics from Texas A&M University. He is a Research Director in ICL and Research Assistant Professor in the EECS at UTK. Tomov's research interests are in parallel algorithms, numerical analysis, and high performance scientific computing (HPC). Currently, his work is concentrated on the development of numerical linear algebra software, and in particular MAGMA, for emerging architectures for HPC, and heFFTe for FFT distributed computations.

Alan Ayala received an M.S. degree in Applied Mathematics from Université Pierre et Marie Curie, and a Ph.D. from Sorbonne Université and Inria-Paris. He is a research associate at the Innovative Computing Laboratory (ICL) at the University of Tennessee in Knoxville. Currently, Dr. Ayala's research focuses on the development of the heFFTe library for FFT computation on upcoming exascale systems, and the FFT benchmarking software initiative.

Abstract

HPC codes and workflows have increased greatly in complexity since the early days of the MPI standard. In an increasingly complex code integration environment, we rely more and more on automation to handle tasks that were once the province of all developers, and later of build system gurus. For MPI implementations in particular, build system integration is made easier for humans and more difficult to automate by the compiler wrappers bundled with the libraries. In modern containerized workflows, MPI libraries also carry special concerns because they are so tailored to particular systems.

We will discuss Spack, the package manager for HPC, and in particular the Spack design considerations for MPI packages and how those designs play out in practice with MVAPICH. We will also discuss ongoing investigations of ABI compatibility relevant to MPI integration in containerized workflows.


Bio

Greg Becker

Gregory Becker is a computer scientist at Lawrence Livermore National Laboratory. His focus is on bridging the gap between research and production software at LLNL. His work in software productization has led him to work on Spack, a package manager for high performance computing, as well as scalable I/O formats for performance tools. Gregory has been at LLNL since 2015. He received his B.A. in Computer Science and Mathematics from Williams College in 2015.

Abstract

The National Energy Research Scientific Computing Center (NERSC) is a scientific computing facility for the Office of Science in the U.S. Department of Energy. NERSC's upcoming supercomputer system Perlmutter, an HPE Cray EX supercomputer equipped with AMD EPYC CPUs and NVIDIA A100 GPUs, requires a high-performance MPI implementation capable of running workloads on CPUs and GPUs.

In this talk we will present the latest updates on MVAPICH2 at Perlmutter. Recently we published a technical report, “Software Deployment Process at NERSC”, that outlines our software deployment process via Spack. We will summarize some of the recent activities pertaining to MVAPICH2 on Perlmutter.

As part of the deployment, we are using Spack to build the HPC software stack known as the Extreme-scale Scientific Software Stack (E4S). We have built E4S products with mvapich2-gdr on Perlmutter. In this talk we will share our Spack configuration and the list of packages installed via mvapich2.


Bio

Shahzeb, Sameer, Prathmesh

Shahzeb Siddiqui is an HPC Consultant/Software Integration Specialist at Lawrence Berkeley National Laboratory at NERSC. He is part of the User Engagement Team that is responsible for engaging with the NERSC user community through user support tickets, user outreach, training, and documentation. Shahzeb is part of the Exascale Computing Project (ECP) in the Software Deployment (SD) group, where he is responsible for building the Spack Extreme-Scale Scientific Software Stack (E4S) at the DOE facilities.

Dr. Sameer Shende has helped develop the TAU Performance System, the Program Database Toolkit (PDT), the Extreme-scale Scientific Software Stack (E4S) and the HPCLinux distro. His research interests include tools and techniques for performance instrumentation, measurement, analysis, runtime systems, HPC container runtimes, and compiler optimizations. He serves as a Research Professor and the Director of the Performance Research Laboratory at the University of Oregon, and as the President and Director of ParaTools, Inc. and ParaTools, SAS.

Prathmesh Sambrekar is a master's student pursuing computer science at Arizona State University and has worked as an intern at Lawrence Berkeley National Laboratory at NERSC. He has helped with the building, deployment, and end user documentation of the Spack Extreme-Scale Scientific Software Stack (E4S).

12:30 - 1:30

Lunch Break

Abstract

TBA


Bio

Dan Stanzione

Dr. Dan Stanzione, Associate Vice President for Research at The University of Texas at Austin since 2018 and Executive Director of the Texas Advanced Computing Center (TACC) since 2014, is a nationally recognized leader in high performance computing. He is the principal investigator (PI) for a National Science Foundation (NSF) grant to acquire and deploy Frontera, which will be the fastest supercomputer at any U.S. university. Stanzione is also the PI of TACC's Stampede2 and Wrangler systems, supercomputers for high performance computing and for data-focused applications, respectively. For six years he was co-PI of CyVerse, a large-scale NSF life sciences cyberinfrastructure. Stanzione was also a co-PI for TACC's Ranger and Lonestar supercomputers, large-scale NSF systems previously deployed at UT Austin. Stanzione received his bachelor's degree in electrical engineering and his master's degree and doctorate in computer engineering from Clemson University.

Abstract

This talk will present an overview of two products with enhanced capabilities by X-ScaleSolutions. The products are: 1) the MVAPICH2-DPU communication library using NVIDIA BlueField DPUs and 2) the X-ScaleAI package for HPC and Deep Learning applications. The MVAPICH2-DPU library takes advantage of the DPU features to offload communication components in the MPI library and deliver best-in-class scale-up and scale-out performance for HPC and DL applications. It integrates key components enabling full computation and communication overlap, especially with non-blocking collectives. The X-ScaleAI package incorporates and optimizes state-of-the-art open-source and in-house components to support popular DL frameworks with outstanding performance and scalability. It focuses on tight integration between the MVAPICH2-GDR library and the Horovod stack, and achieves excellent out-of-the-box performance and one-click deployment and execution. X-ScaleAI supports diverse system architectures including x86-64 and OpenPOWER CPUs, high-performance networks, and NVIDIA GPUs. In addition, it supports efficient and scalable checkpoint-restart operations and has a unique built-in introspection tool for performance analysis.


Bio

Donglai Dai

Dr. Donglai Dai is a Chief Engineer at X-ScaleSolutions and leads the company’s R&D team. His current work focuses on developing scalable, efficient communication libraries, checkpointing and restart libraries, and performance analysis tools for distributed and parallel HPC and deep learning applications on HPC systems and clouds. He has more than 20 years of industry experience in engineering management and development of computer systems, VLSI, IoT, and interconnection networks while working at Intel, Cray, SGI, and startups. He holds more than 10 granted US patents and has published more than 30 technical papers or book chapters. He has a PhD degree in computer science from The Ohio State University.

3:00 - 3:30

Afternoon Coffee Break

Abstract

The Apache Spark software is a popular Big Data processing framework and provides an easy-to-use high-level API in different languages including Scala, Java, and Python. Spark supports parallel and distributed execution of user workloads by supporting communication using an event-driven framework called Netty. This talk presents MPI4Spark, which uses MPI for communication to enhance the performance and productivity of Big Data workloads. MPI4Spark starts the Spark ecosystem using MPI launchers to utilize MPI communication and maintains isolation for application execution on worker nodes by forking new processes using Dynamic Process Management (DPM). MPI4Spark also provides portability and performance benefits as it is capable of utilizing popular HPC interconnects including InfiniBand, Omni-Path, Slingshot, and others. The talk concludes by evaluating the performance of MPI4Spark against vanilla Spark and RDMA-Spark using OSU HiBD Benchmarks (OHB) and Intel HiBench.


Bio

Aamir Shafi

Dr. Aamir Shafi is currently a Research Scientist in the Department of Computer Science & Engineering at The Ohio State University, where he is involved in the High Performance Big Data project led by Dr. Dhabaleswar K. Panda. Dr. Shafi was a Fulbright Visiting Scholar at the Massachusetts Institute of Technology (MIT) in the 2010-2011 academic year, where he worked with Prof. Charles Leiserson on the award-winning Cilk technology. Dr. Shafi received his PhD in Computer Science from the University of Portsmouth, UK in 2006. He received his bachelor's degree in Software Engineering from NUST, Pakistan in 2003. Dr. Shafi’s current research interests include architecting robust libraries and tools for Big Data computation with emphasis on Machine and Deep Learning applications. Dr. Shafi co-designed and co-developed a Java-based MPI-like library called MPJ Express.

4:00 - 5:00

Short Talks

Hey CAI - Conversational AI Enabled User Interface for HPC Tools, Pouya Kousha, The Ohio State University
Hybrid Five-Dimensional Parallel DNN Training for Out-of-core Models, Arpan Jain, The Ohio State University
Highly Efficient Alltoall and Alltoallv Communication Algorithms for GPU Systems, Chen-Chun Chen, The Ohio State University
Network Assisted Non-Contiguous Transfers for GPU-Aware MPI Libraries, Kaushik Kandadi Suresh, The Ohio State University
Towards Architecture-aware Hierarchical Communication Trees on Modern HPC Systems, Bharath Ramesh, The Ohio State University