MUG :: Program

MUG'21

(Final Program)

All Times Are U.S. EDT

Monday, August 23

10:00 - 11:00

Tutorial: Standing up a HPC cluster on AWS using ParallelCluster

Angel Pizarro, AWS

PDF

YouTube

Abstract

This tutorial will walk through the basics of setting up a scalable Slurm-based cluster on AWS, highlighting some features such as shared filesystems, and different instance types.

Bio

Angel is a Principal Developer Advocate for HPC and scientific computing. His background is in bioinformatics application development and building system architectures for scalable computing in genomics and other high throughput life science domains.

11:00 - 12:00

Tutorial: UCC and SHARP: Software and Hardware Building Blocks for Next Generation Collectives

Devendar Bureddy and Manjunath Gorentla Venkata, NVIDIA

PDF

YouTube

Abstract

UCC and SHARP are important building blocks for collective operations for HPC and AI/DL workloads. In this talk, we will provide a brief overview of both these solutions. UCC is a community-driven effort to develop collective API and library implementation for applications in various domains, including High-Performance Computing, Artificial Intelligence, Data Center, and I/O. Over the last year, the UCC WG group has met weekly to develop the UCC specification. In this talk, we will highlight some of the design principles of the UCC v1.0 specification. Then, we will also share the status of UCC implementation and the upcoming plans of the working group. The UCC provides a user-facing public API and library implementation which leverages software protocols, and hardware solutions to implement collective operations. One of the important and successful hardware implementations of collective operations is SHARP. After introducing UCC, in the last part of the talk, we provide a brief overview of SHARP. SHARP has been successfully powering HPC and AI/WL workloads through collective libraries such as HCOLL, and SHARP.

Bio

Devendar Bureddy is a Principal SW Engineer at Mellanox Technologies. At Mellanox, Devendar was instrumental in building several key technologies like SHARP,UCX, HCOLL..etc. Previously, he was a software developer at The Ohio State University in network-Based Computing Laboratory led by Dr. D. K. Panda, involved in the design and development of MVAPICH. He had received his Master’s degree in Computer Science and Engineering from the Indian Institute of Technology, Kanpur. His research interests include high speed interconnects, parallel programming models and HPC/DL software.

Manjunath Gorentla Venkata is a Principal Software Architect at NVIDIA. His focus is on architecting features for current and next-generation NVIDIA's networking products, programming models, and network libraries to address the needs of HPC and AI/DL systems and workloads. Previously, he was a research scientist and the Languages Team lead at Oak Ridge National Laboratory. While at ORNL, he researched, designed, and developed several innovative and high-performing communication middleware for HPC systems, including InfiniBand systems and Cray (XK7, XE). He's served on open standards committees for parallel programming models, including OpenSHMEM and MPI for many years, and he is the author of more than 50 research papers in this area. Manju earned Ph.D. and M.S. degrees in computer science from the University of New Mexico.

12:00 - 1:00

Tutorial: Addressing network congestion with a switchless ultra-low latency HPC Ethernet solution from Rockport with market-leading performance under load.

Matthew (Matt) Williams, Chief Technical Officer, Rockport Networks

PDF

YouTube

Abstract

Poor workload performance due to network congestion is a well-understood challenge for HPC. This talk covers the ultra-low latency direct interconnect switchless HPC networking solution from Rockport. This switchless solution is fully supported by the latest MVAPICH2 library and delivers consistently low latency, even under heavy load from competing noisy-neighbor workloads. Matthew will provide an overview of Rockport's innovative distributed switchless architecture, advances in congestion protection, resiliency, and operational simplicity, and show best practices for benchmarking to predict performance in production environments. We will also demonstrate how to quickly deploy an MVAPICH2 powered Rockport network cluster.

Bio

Matt is a proven senior management professional with 20 years of technical leadership and engineering experience, 11 years as CTO of a successful network technology company and 8 issued US patents. Matt is an expert strategist, analyst and visionary who has delivered on strong product visions and obtained buy-in at the highest levels of Fortune 50 companies. He is an insightful and energetic communicator who enjoys product evangelization and inspiring global business and technical audiences. He has a degree in Electrical Engineering from Queen's University.

1:00 - 1:30

Break

1:30 - 2:30

Tutorial: Running Scalable Clusters on Oracle Cloud Infrastructure

Marcin Zablocki, Oracle

PDF

YouTube

Abstract

Highly scalable clusters are a critical part of HPC. Oracle Cloud's HPC cluster platform combines the bare-metal performance of an on-premises HPC system with the flexible, on-demand nature of a cloud architecture. In this tutorial, we provide a brief background on Oracle Cloud and modern cloud architectures, clusters, and scalability. Then we run through a step-by-step approach on how to provision clusters on Oracle Cloud. We demonstrate the flexibility of the architecture, along with its user-friendliness, performance, and scalability.

Bio

Marcin Zablocki is a master principal solutions architect in Oracle. He is the maintainer of cluster automation and HPC imaging -- two products used by majority or High Performance Computing customers on Oracle Cloud.

2:30 - 3:40

Tutorial: Boosting Performance of HPC Applications with MVAPICH2

Hari Subramoni, The Ohio State University

PDF

YouTube

Abstract

The tutorial will start with an overview of the MVAPICH2 libraries and their features. Next, we will focus on installation guidelines, runtime optimizations and tuning flexibility in-depth. An overview of configuration and debugging support in MVAPICH2 libraries will be presented. High-performance support for NVIDIA/AMD GPU-enabled clusters in MVAPICH2-GDR and many-core systems in MVAPICH2-X will be presented. The impact on the performance of the various features and optimization techniques will be discussed in an integrated fashion. `Best Practices' for a set of common applications will be presented. A set of case studies related to example applications to demonstrate how one can effectively take advantage of MVAPICH2 for High End Computing applications using MPI and CUDA/OpenACC will also be presented.

Bio

Dr. Hari Subramoni is a research scientist in the Department of Computer Science and Engineering at the Ohio State University, USA, since September 2015. His current research interests include high performance interconnects and protocols, parallel computer architecture, network-based computing, exascale computing, network topology aware computing, QoS, power-aware LAN-WAN communication, fault tolerance, virtualization, big data, deep learning and cloud computing. He has published over 70 papers in international journals and conferences related to these research areas. He has been actively involved in various professional activities in academic journals and conferences. Dr. Subramoni is doing research on the design and development of MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) and MVAPICH2-X (Hybrid MPI and PGAS (OpenSHMEM, UPC and CAF)) software packages. He is a member of IEEE. More details about Dr. Subramoni are available at http://web.cse.ohio-state.edu/~subramoni.1/.

3:40 - 4:55

Tutorial: Boosting Performance of Deep Learning, Machine Learning and Dask with MVAPICH2

Aamir Shafi, Arpan Jain and Quentin Anthony, The Ohio State University

PDF

YouTube

Abstract

Recent advances in Machine and Deep Learning (ML/DL) have led to many exciting challenges and opportunities. Modern ML/DL and Data Science frameworks including TensorFlow, PyTorch, and Dask have emerged that offer high-performance training and deployment for various types of ML models and Deep Neural Networks (DNNs). This tutorial provides an overview of parallelization strategies for distributed training and highlights new challenges and opportunities for communication runtimes to exploit high-performance CPU/GPU architectures to efficiently support large-scale distributed training. We also highlight some of our co-design efforts to utilize MPI for large-scale DNN training on cutting-edge CPU/GPU architectures available on modern HPC clusters. The tutorial covers training traditional ML models including -- K-Means, linear regression, nearest neighbors -- using the cuML framework accelerated using MVAPICH2-GDR. Also, the tutorial presents accelerating GPU-based Data Science applications using MPI4Dask, which is an MPI-based backend for Dask.

Bio

Dr. Aamir Shafi is currently a Research Scientist in the Department of Computer Science & Engineering at the Ohio State University where he is involved in the High Performance Big Data project led by Dr. Dhabaleswar K. Panda. Dr. Shafi was a Fulbright Visiting Scholar at the Massachusetts Institute of Technology (MIT) in the 2010-2011 academic year where he worked with Prof. Charles Leiserson on the award-winning Cilk technology. Dr. Shafi received his PhD in Computer Science from the University of Portsmouth, UK in 2006. He got his Bachelors in Software Engineering degree from NUST, Pakistan in 2003. Dr. Shafi’s current research interests include architecting robust libraries and tools for Big Data computation with emphasis on Machine and Deep Learning applications. Dr. Shafi co-designed and co-developed a Java-based MPI-like library called MPJ Express. More details about Dr. Shafi are available from here

Arpan Jain received his B.Tech. and M.Tech. degrees in Information Technology from ABV-IIITM, India. Currently, Arpan is working towards his Ph.D. degree in Computer Science and Engineering at The Ohio State University. His current research focus lies at the intersection of High-Performance Computing (HPC) libraries and Deep Learning (DL) frameworks. He is working on parallelization and distribution strategies for large-scale Deep Neural Network (DNN) training. He previously worked on speech analysis, time series modeling, hyperparameter optimization, and object recognition. He actively contributes to projects like HiDL (high-performance deep learning), MVAPICH2-GDR software, and LBANN deep learning framework. He is a member of IEEE. More details about Arpan are available here.

Quentin Anthony is a PhD student at The Ohio State University. He received a B.S. in physics from The Ohio State University. His current research is primarily focused on the intersection of Deep Learning frameworks and High Performance Computing. He actively contributes to the MVAPICH2 project and its subprojects such as MVAPICH2-GDR (High Performance MPI for GPU clusters), and HiDL (High Performance Deep Learning).

4:55 - 5:30

Tutorial and Live Demo: Accelerating HPC Applications with MVAPICH2-DPU and Live Demos

Donglai Dai, Richmond Liewand and Nick Sarkauskas, X-ScaleSolutions

PDF

YouTube

Abstract

The MVAPICH2-DPU library takes advantage of the DPU features to offload communication components in the MPI library and accelerates HPC applications. It integrates key components enabling full computation and communication overlap, especially with non-blocking collectives. This tutorial will provide an overview of the MVAPICH2-DPU product, main features, and acceleration capabilities for a set of representative HPC applications and benchmarks. Live demos of these applications will be shown to demonstrate the capabilities of the MVAPICH2-DPU product.

Bio

Dr. Donglai Dai is a Chief Engineer at X-ScaleSolutions and leads company’s R&D team. His current work focuses on developing scalable efficient communication libraries, checkpointing and restart libraries, and performance analysis tools for distributed and parallel HPC and deep learning applications on HPC systems and clouds. He has more than 20 years of industry experience in engineering management and development of computer systems, VLSI, IoT, and interconnection networks while working at Intel, Cray, SGI, and startups. He holds more than 10 granted US patents and has published more than 30 technical papers or book chapters. He has a PhD degree in computer science from The Ohio State University.

Richmond is a Junior Software Engineer at X-ScaleSolutions. His main responsibilities are centered around building the testing infrastructure and development of the MVAPICH2-DPU project.

Nick Sarkauskas is a Software Engineer at X-ScaleSolutions and a M.S. student in Computer Science and Engineering at The Ohio State University. His current work at X-ScaleSolutions is on the design and development of the MVAPICH2-DPU software stack. His research interests include High-Performance Computing, high-performance interconnects, and parallel algorithms. Nick Sarkauskas received a B.S. degree in Computer Science and Engineering from The Ohio State University in 2020. More details are available at nsarka.com.

Tuesday, August 24

10:00 - 10:15

Opening Remarks

Dave Hudak, Executive Director, Ohio Supercomputer Center

Dhabaleswar K (DK) Panda, The Ohio State University

10:15 - 11:15

Keynote: Introducing Cloud-Native Supercomputing: Bare-Metal, Secured Supercomputing Architecture

Gilad Shainer, NVIDIA/Mellanox

PDF

YouTube

Abstract

High-performance computing and artificial intelligence have evolved to be the primary data processing engines for wide commercial use, hosting a variety of users and applications. While providing the highest performance, supercomputers must also offer multi-tenancy security. Therefore they need to be designed as cloud-native platforms. The key element that enables this architecture is the data processing unit (DPU). DPU is a fully integrated data-center-on-a-chip platform that can manage the data center operating system instead of the host processor, enabling security and orchestration of the supercomputer. This architecture enables supercomputing platforms to deliver bare-metal performance, while natively supporting multi-node tenant isolation. We'll introduce the new supercomputing architecture, and include first applications performance results.

Bio

Gilad Shainer serves as senior vice-president of marketing for Mellanox networking at NVIDIA, focusing on high- performance computing, artificial intelligence and the InfiniBand technology. Mr. Shainer joined Mellanox in 2001 as a design engineer and later served in senior marketing management roles since 2005. Mr. Shainer serves as the chairman of the HPC-AI Advisory Council organization, the president of UCF and CCIX consortiums, a member of IBTA and a contributor to the PCISIG PCI-X and PCIe specifications. Mr. Shainer holds multiple patents in the field of high-speed networking. He is a recipient of 2015 R&D100 award for his contribution to the CORE-Direct In-Network Computing technology and the 2019 R&D100 award for his contribution to the Unified Communication X (UCX) technology. Gilad Shainer holds a MSc degree and a BSc degree in Electrical Engineering from the Technion Institute of Technology in Israel.

11:15 - 12:00

Overview of the MVAPICH Project and Future Roadmap

Dhabaleswar K (DK) Panda, The Ohio State University

PDF

YouTube

Abstract

This talk will provide an overview of the MVAPICH project (past, present, and future). Future roadmap and features for upcoming releases of the MVAPICH2 software family (including MVAPICH2-X and MVAPICH2-GDR) will be presented. Current status and future plans for OSU INAM and OMB will also be presented.

Bio

DK Panda is a Professor and University Distinguished Scholar of Computer Science and Engineering at the Ohio State University. He has published over 500 papers in the area of high-end computing and networking. The MVAPICH2 (High-Performance MPI and PGAS over InfiniBand, iWARP, RoCE, EFA, and Rockport Networks) libraries, designed and developed by his research group (mvapich.cse.ohio-state.edu), are currently being used by more than 3,200 organizations worldwide (in 89 countries). More than 1.4 million downloads of this software have taken place from the project's site. This software is empowering several InfiniBand clusters (including the 4th, 10th, 20th and 31st ranked ones) in the TOP500 list. High-performance and scalable solutions for deep learning and machine learning from his group are available from hidl.cse.ohio-state.edu. High-performance and scalable libraries for Big Data stacks (Spark, Hadoop and Memcached) and Data science applications from his group (hibd.cse.ohio-state.edu) are also publicly available. These libraries are currently being used by more than 340 organizations in 38 countries. More than 40,000 downloads of these libraries have taken place. He is an IEEE Fellow. More details about Prof. Panda are available at cse.ohio-state.edu/~panda.

12:00 - 12:30

How can the Scalable Checkpoint / Restart library help MPI users?

Adam Moody, Lawrence Livermore National Laboratory

PDF

YouTube

Abstract

The Scalable Checkpoint / Restart (SCR) library utilizes the speed of MPI to enable applications to write checkpoint and output datasets at very high bandwidths, up to orders of magnitude faster than the parallel file system. With a focus on broadening its user base and system portability, SCR has been extended in numerous ways in its version 3.0 release. One can now use SCR on essentially any HPC system. Support has been added for applications that read and write shared files. New redundancy schemes simplify configuration for different failure modes. The library now offers Python bindings. In this talk, I will present an overview of SCR and describe how these new capabilities and others help MPI users increase I/O performance and reduce execution turn around time.

Bio

Adam is a member of the Development Environment Group within Livermore Computing at Lawrence Livermore National Laboratory. His background is in MPI development, collective algorithms, networking, and parallel I/O. He is a project lead for the Scalable Checkpoint / Restart library and mpiFileUtils -- two projects that use MPI to help users manage large data sets. And he has been a Buckeye fan since birth.

12:30 - 1:00

MVAPICH Integration for PBS Pro

HPC Team, Idaho National Laboratory

PDF

YouTube

Abstract

Idaho National Laboratory maintains nearly 9 Petaflops of High Performance Computing resources supporting both leadership and engineering simulations using MVAPICH across a wide range of disciplines. This talk will cover recently added improved integration of MVAPICH with PBS Pro and how we use this integration for debugging and job monitoring. Finally, we present MVAPICH benchmarks on our largest systems.

Bio

Matt Anderson is part of the High Performance Computing group at Idaho National Laboratory with specific focus in supporting University and Industry users.

1:00 - 1:30

Break

1:30 - 2:00

RoCEv2 Congestion Control Evaluation for Large Scale HPC and ML Deployments

Hemal Shah and Moshe Voloshin, Broadcom

PDF

YouTube

Abstract

With the availability of 100 Gbps Ethernet and RoCEv2, Ethernet is replacing InfiniBand in High Performance Computing (HPC) and Machine Learning (ML) environments. There is a perception that RoCEv2 does not scale well and requires Priority-based Flow Control (PFC). The use of PFC only can lead to head-of-line blocking that results in traffic interference. RoCEv2 with ECN-based Congestion Control (CC) scales well without requiring PFC. In this talk, we will present performance evaluation of RoCEv2 congestion control schemes for OSU benchmarks, HPCG, LAMMPS, and GPCNeT in three different configurations: PFC without CC, PFC with CC, and CC without PFC.

Bio

Hemal Shah is a Distinguished Engineer and Systems/Software/Standards architect in the Data Center Solutions Group (DCSG) division at Broadcom Inc. He leads and manages a team of architects. Hemal is responsible for the definition of product architecture and software roadmap/architecture of all product lines of Ethernet NICs. Hemal led the architecture definition of several generations of NetXtreme® E-Series/NetXtreme I server product lines and NetXtreme I client product lines. Hemal spearheaded the system architecture development of TruFlowTM technology for vSwitch acceleration/packet processing software frameworks, TruManageTM technology for system and network management, device security features, virtualization and stateless offloads. Hemal has defined the system architecture of RDMA hardware/software solutions for more than two decades. Before joining Broadcom in 2005, Hemal worked at Intel Corporation where he led the development of system/silicon/software architecture of communication processors, 10 Gigabit Ethernet controllers, TCP/iSCSI/RDMA offloads, and IPsec/SSL/firewall/VPN accelerations. Hemal is the lead technical representative/contributor from Broadcom Inc. in the Open Compute Project (OCP) and Distributed Management Task Force (DMTF). Hemal serves as Senior VP of Technology in the DMTF and a project co-lead of OCP Hardware Management project. Hemal has co-authored several OCP specifications, 70+ DMTF specifications, four IETF RFCs, and 10 plus technical conference/journal papers. Hemal is a named inventor on 40+ patents with several pending patents. Hemal holds Ph. D. (computer engineering) and M.S. (computer science) degrees from Purdue University, M.S.E.E. degree from The University of Arizona, and B.S. (electronics and communication engineering) degree from Gujarat University, India.

Moshe Voloshin is Systems architect in Data Center Solutions Group (DCSG) division at Broadcom Inc. Moshe spearheaded the system architecture development of ROCE and Congestion Control in Broadcom Ethernet NICs, involved in definition of product architecture, modeling, and system simulations. Previously Moshe was a Director, manager, ASIC/HW engineer at Cisco High End router division where he developed and managed the development of Network Processing Unit (NPU), QOS, and fabric ASICs, in products such as GSR and CRS.

2:00 - 2:30

Upcoming MVAPICH2 Design Enhancements on the Rockport Switchless Network

Matthew (Matt) Williams, Chief Technical Officer, Rockport Networks

Abstract

This talk will focus on the upcoming performance enhancements available in the MVAPICH2 Rockport library including a discussion on advances in multiple queue pairs, and intelligent applications of ultra-high priority QoS. We will also review the Rockport supercomputer network fabric and initial benchmark performance and scalability results using OSU Micro-Benchmarks on both unloaded and loaded networks.

Bio

Matthew Williams is CTO of Rockport Networks and has 25 years of technical leadership and engineering experience, 14 years as CTO of successful network technology companies and has 21 issued US patents. He is an expert strategist, analyst and visionary who has delivered on transformational product concepts. Matthew is an insightful and energetic communicator who enjoys product evangelization and inspiring global business and technical audiences. Matthew has a B.Sc. in Electrical Engineering with First Class Honours from Queen's University, Kingston, Canada and is a registered P.Eng.

2:30 - 3:00

MVAPICH2 on Microsoft Azure HPC and AI Clusters

Jithin Jose, Microsoft Azure

PDF

YouTube

Abstract

Recent technology advancements have substantially improved the performance potential of virtualization. As a result, the performance gap between bare-metal and cloud clusters are continuing to shrink. This is quite evident as public clouds such as Microsoft Azure has climbed up into the top 20 and 30 rankings in Graph500 and Top500 list, respectively. Moreover, public clouds democratize these technology advancements with focus on performance, scalability, and cost-efficiency. Though the platform technologies and features are continuing to evolve, middlewares such as MPI libraries play a key role in enabling applications to first make use of the technology advancements, and with high performance. This talk focuses on how MVAPICH2 efficiently enables the latest technology advancements in Azure HPC and AI clusters. This talk will also provide an overview of the latest HPC and AI offerings in Microsoft Azure HPC along with their performance characteristics. It will cover the Microsoft Azure HPC marketplace images that include MVAPICH2-Azure MPI libraries, as well as recommendations and best practices for using MVAPICH2 and MVAPICH2-X on Microsoft Azure. We will also discuss the performance and scalability characteristics using microbenchmark and HPC applications.

Bio

Dr. Jithin Jose is a Principal Software Engineer at Microsoft. His work is focused on co-design of software and hardware building blocks for high performance computing platform, and performance optimizations. His research interests include high performance interconnects and protocols, parallel programming models, big data and cloud computing. Before joining Microsoft, he worked at Intel and IBM Research. He has published more than 25 papers in major conferences and journals related to these research areas. Dr. Jose received his Ph.D. degree from The Ohio State University in 2014.

3:00 - 3:30

Service options and hardware choices for HPC on AWS

Matthew Koop, AWS

PDF

YouTube

Abstract

There are a lot of hardware and software choices for answering your research questions. This talk covers some of the services, hardware choices, and underlying technologies that enable HPC on AWS, with examples of real-life workloads and benchmarks.

Bio

Matt is a Principal Solutions Architect for Compute and HPC at AWS. He draws on a broad set of experience in large-scale computing from both commercial and public sector to work with customers on their large-scale compute requirements on AWS. Matt holds a Ph.D. in Computer Science and Engineering from the Ohio State University.

3:30 - 4:00

Performance Engineering using MVAPICH and TAU

Sameer Shende, Paratools and University of Oregon

PDF

YouTube

Abstract

The TAU Performance System is a powerful and highly versatile profiling and tracing tool ecosystem for performance analysis of parallel programs at all scales. TAU has evolved with each new generation of HPC systems and presently scales efficiently to hundreds of thousands of cores on the largest machines in the world. To meet the needs of computational scientists to evaluate and improve the performance of their applications, we present TAU's support for the key MVAPICH features including its support for the MPI Tools (MPI_T) interface with support for setting MPI_T control variables on a per MPI communicator basis. TAU's support for GPUs including CUDA, DPC++/SYCL, OpenCL, OpenACC, Kokkos, and HIP/ROCm improve performance evaluation of heterogenous programming models. It will also describe TAU's support for MPI's performance and control variables exported by MVAPICH, and its support for instrumentation of OpenMP runtime, and APIs for instrumentation of Python programs. TAU uses these interfaces on unmodified binaries without the need for recompilation. This talk will describe these new instrumentation techniques to simplify the usage of performance tools including support for compiler-based instrumentation, rewriting binary files, preloading shared objects. The talk will also highlight TAU's analysis tools including its 3D Profile browser, ParaProf and cross-experiment analysis tool, PerfExplorer and its usage with MVAPICH2 under Amazon AWS using the Extreme-scale Scientific Software Stack (E4S) AWS image. http://tau.uoregon.edu

Bio

Dr. Sameer Shende has helped develop the TAU Performance System, the Program Database Toolkit (PDT), the Extreme-scale Scientific Software Stack (E4S) [https://e4s.io] and the HPCLinux distro. His research interests include tools and techniques for performance instrumentation, measurement, analysis, runtime systems, HPC container runtimes, and compiler optimizations. He serves as a Research Associate Professor and the Director of the Performance Research Laboratory at the University of Oregon, and as the President and Director of ParaTools, Inc., ParaTools, SAS, and ParaTools, Ltd.

4:00 - 4:30

Overview of the NSF-Funded AI Institute (ICICLE)

Vipin Chaudhary, Case Western Reserve University (CWRU)

Amitava Majumdar, San Diego Supercomputer Center (SDSC)

Dhabaleswar K (DK) Panda, The Ohio State University

Joe Stubbs, Texas Advanced Computing Center (TACC)

PDF

YouTube

Abstract

A short presentation on the newly funded AI Institute (ICICLE) will take place, followed by an open discussion on potential collaboration opportunities.

Bio

A veteran of High-Performance Computing (HPC), Dr. Chaudhary has been actively participating in the science, business, government, and technology innovation frontiers of HPC for almost three decades. His contributions range from heading research laboratories and holding executive management positions, to starting new technology ventures. Most recently, he was a Program Director at the National Science Foundation where he was involved in many national initiatives and the Empire Innovation Professor of Computer Science and Engineering at SUNY Buffalo. He cofounded Scalable Informatics, a leading provider of pragmatic, high performance software-defined storage and compute solutions to a wide range of markets, from financial and scientific computing to research and big data analytics. From 2010 to 2013, Dr. Chaudhary was the Chief Executive Officer of Computational Research Laboratories (CRL), a wholly owned Tata Sons company, where he grew the company globally to be an HPC cloud and solutions leader before selling it to Tata Consulting Services. Prior to this, as Senior Director of Advanced Development at Cradle Technologies, Inc., he was responsible for advanced programming tools for multi-processor chips. He was also the Chief Architect at Corio Inc., which had a successful IPO in July, 2000. Dr. Chaudhary was awarded the prestigious President of India Gold Medal in 1986 for securing the first rank amongst graduating students at the Indian Institute of Technology (IIT). He received the B.Tech. (Hons.) degree in Computer Science and Engineering from the Indian Institute of Technology, Kharagpur, in 1986 and a Ph.D. degree from The University of Texas at Austin in 1992.

Amit Majumdar is the Division Director of the Data Enabled Scientific Computing division at the San Diego Supercomputer Center and Associate Professor in the Department of Radiation Medicine and Applied Sciences at the University of California San Diego. His research interests are in high performance computing, computational science, cyberinfrastructure and science gateways. He is interested in convergence of HPC and data science. He is PI/Co-PI on multiple research projects related to HPC and AI machines/programming, neuroscience cyberinfrastructure, neuromorphic computing and education/outreach and which are funded by NSF, NIH, DOD and industry. He received bachelor’s in electronics and telecommunication engineering from the Jadavpur University, Calcutta; master's in nuclear engineering from the Idaho State University, Pocatello; Ph.D. degree in the interdisciplinary program of nuclear engineering and scientific computing from the University of Michigan, Ann Arbor. He is member of IEEE, SIAM, APS, Society for Neuroscience (SfN), and Organization for Computational Neuroscience (OCNS).

DK Panda is a Professor and University Distinguished Scholar of Computer Science and Engineering at the Ohio State University. He has published over 500 papers in the area of high-end computing and networking. The MVAPICH2 (High-Performance MPI and PGAS over InfiniBand, iWARP, RoCE, EFA, and Rockport Networks) libraries, designed and developed by his research group (mvapich.cse.ohio-state.edu), are currently being used by more than 3,200 organizations worldwide (in 89 countries). More than 1.4 million downloads of this software have taken place from the project's site. This software is empowering several InfiniBand clusters (including the 4th, 10th, 20th and 31st ranked ones) in the TOP500 list. High-performance and scalable solutions for deep learning and machine learning from his group are available from hidl.cse.ohio-state.edu. High-performance and scalable libraries for Big Data stacks (Spark, Hadoop and Memcached) and Data science applications from his group (hibd.cse.ohio-state.edu) are also publicly available. These libraries are currently being used by more than 340 organizations in 38 countries. More than 40,000 downloads of these libraries have taken place. He is an IEEE Fellow. More details about Prof. Panda are available at cse.ohio-state.edu/~panda.

Joe leads the Cloud and Interactive Computing (CIC) group, which focuses on building cloud native applications and infrastructure for computational science. CIC develops, deploys and maintains multiple national-scale clouds as part of ongoing projects funded by the National Science Foundation. Additionally, the CIC group contributes to and deploys multiple cloud platforms-as-a-service for computational science including the Agave science-as-a-service platform, TACC's custom JupyterHub, and Abaco: Functions-as-a-service via the Actor model and Linux containers. These platforms are leveraged by numerous cyberinfrastructure projects used by tens of thousands of investigators across various domains of science and engineering. Prior to joining the University of Texas, Joe received a B.S. in Mathematics from the University of Texas, Austin and a Ph.D. in Mathematics from the University of Michigan. His recent interests include distributed systems, container technologies and interactive scientific computing.

4:30 - 5:30

Short Presentations

Performance of ROCm-aware MVAPICH2-GDR on LLNL Corona Cluster with AMD GPUs, Kawthar Shafie Khorassani, The Ohio State University

PDF

Benefits of On-the-Fly Compression on GPU-to-GPU Communication for HPC and Data Science Applications, Qinghua Zhou, The Ohio State University

PDF

Performance of MVAPICH2-GDR on DGX A100, Chen-Chun Chen, The Ohio State University

PDF

Performance Studies of MVAPICH2 Libraries on AWS and Oracle Clouds, Shulei Xu, The Ohio State University

PDF

Benefits of Streaming Aggregation with SHARPv2 in MVAPICH2, Bharath Ramesh, The Ohio State University

PDF

Optimizing Communication Performance of Derived Datatypes, Kaushik Kandadi Suresh, The Ohio State University

PDF

Wednesday, August 25

10:00 - 11:00

Keynote: Scaling Out High Performance Computing

Luiz DeRose, Oracle

PDF

YouTube

Abstract

The continuous increase in complexity and scale of high-end systems, with denser nodes, longer vectors, more complex memory hierarchies, and heterogeneous processing elements, together with the evolving diversity of processor options, are forcing computational scientists to face system characteristics that can significantly impact the performance and scalability of applications. HPC users need a system infrastructure that can adapt to their workload needs, rather than having to constantly redesign their applications to adapt to new systems. In this talk, I will discuss the current trends in computer architecture and the implications in the development of HPC applications and programming and middleware environments. I will present the Oracle Cloud Infrastructure (OCI), which provides availability, resiliency, and performance at scale, so HPC users can easily choose the best option for their workloads, and will discuss hybrid on-prem/cloud options, which facilitate workload migration from on-prem to the cloud. I will finish the presentation with a discussion of some of the challenges and open research problems that still need to be addressed in this area.

Bio

Dr. Luiz DeRose is a Director of Cloud Engineering for HPC at Oracle. Before joining Oracle, He was a Sr. Science Manager at AWS, and a Senior Principal Engineer and the Programming Environments Director at Cray. Dr. DeRose has a Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign. He has more than 25 years of high-performance computing experience and a deep knowledge of programming and middleware environments for HPC. Dr. DeRose has eight patents and has published more than 50 peer-review articles in scientific journals, conferences, and book chapters, primarily on the topics of compilers and tools for high performance computing.

11:00 - 11:30

Performance of Applications on Nurion and Neuron Utilizing MVAPICH2

Minsik Kim, KISTI

PDF

YouTube

Abstract

KISTI's Nurion supercomputer features 8305 nodes with Intel Xeon Phi KNL (Knight Landing) processors (68 cores) and 132 nodes with Intel Skylake CPUs (2-socket, 40 cores). Nurion is a system consisting of compute nodes, CPU-only nodes, Omni-Path interconnect networks, burst buffer high-speed storage, Luster-based parallel file system. Also, KISTI’s Neuron supercomputer features 78 nodes with NVIDIA GPUs to support GPU computing and design the KISTI’s next supercomputer. We will present microbenchmark and application performance results using MVAPICH2 on Nurion and Neuron.

Bio

Minsik Kim is a senior researcher in the Supercomputing Infrastructure Center of the Korea Institute of Science and Technology Information (KISTI). He received the Ph.D. degree in Electrical and Electronic Engineering from Yonsei University in 2019. His research interests include deep learning optimization on GPUs, computer architecture, and high-performance computing. He is a member of IEEE. More details about Dr. Kim is available at minsik-kim.github.io

11:30 - 12:00

On the Energy Efficiency of MPI Intra-node Communication

Hyun-Wook Jin, Konkuk University

PDF

YouTube

Abstract

The progress engine in the MPI library recognizes changes in communication states, such as message arrivals, by polling. Although polling provides a low communication latency, its use results in low energy efficiency because the progress engine occupies CPU resources while performing polling. The decrease in energy efficiency induced by polling has become more severe as the skew has increased with the advent of exascale systems. In this talk, we describe a progress engine that uses both polling and signaling to perform energy-efficient intra-node communication. There have been studies on energy efficient MPI; however, existing studies do not significantly consider the intra-node communication channels that use shared memory buffers. We present that our preliminary implementation of signaling-based progress engine based on MVAPICH2 improves energy efficiency as the skew increases on a many-core system.

Bio

Hyun-Wook Jin is a Professor in the Department of Computer Science and Engineering at Konkuk University, Seoul, Korea. He is leading the System Software Research Laboratory (SSLab) at Konkuk University. Before joining Konkuk University in 2006, He was a Research Associate in the Department of Computer Science and Engineering at The Ohio State University. He received Ph.D. degree from Korea University in 2003. His main research focus is on operating systems for high-end computing systems and cyber-physical systems.

12:00 - 12:30

Accelerating the communication cost on parallel 3-D FFTs.

Stan Tomov and Alan Ayala, The University of Tennessee, Knoxville

PDF

YouTube

Abstract

The well-known communication bottleneck on parallel Fast Fourier Transforms (FFT) has been studied for several authors, and its significance has increased with the introduction of GPU accelerators. Given that FFTs are a critical dependency for various applications at exascale, it is critical to develop new communication schemes and tune between its (software and hardware) dependencies to ensure scalability. In this talk we present advancements on communication management for distributed FFT using the implementation provided on heFFTe library (an ECP project), which has support for AMD, Intel and NVIDIA GPUs.

Bio

Stan Tomov received a M.S. degree in Computer Science from Sofia University, Bulgaria, and Ph.D. in Mathematics from Texas A&M University. He is a Research Director in ICL and Research Assistant Professor in the EECS at UTK. Tomov's research interests are in parallel algorithms, numerical analysis, and high performance scientific computing (HPC). Currently, his work is concentrated on the development of numerical linear algebra software, and in particular MAGMA, for emerging architectures for HPC, and heFFTe for FFT distributed computations.

Alan Ayala received a M.S. degree in Applied Mathematics from Pierre et Marie Université, and a PhD. from Sorbonne Université and Inria-Paris. He is a research associate at the Innovative Computing Laboratory (ICL) at the University of Tennessee in Knoxville. Currently, Dr. Ayala's research focuses on the development of heFFTe library for FFT computation on upcoming exascale systems, and the FFT benchmarking software initiative.

12:30 - 1:00

Experiences with MVAPICH2 Deployment on SDSC's Expanse Supercomputer

Mahidhar Tatineni, San Diego Supercomputer Center

PDF

YouTube

Abstract

Expanse is a newly deployed system at SDSC which went into full production in December 2020. The machine is targeted at long-tail workloads with both CPU and GPU based nodes. Expanse is one of the first NSF funded (award# OAC 1928224) XSEDE systems to deploy AMD’s 7nm EPYC processors with 128 cores/node. The GPU nodes will incorporate NVIDIA V100 GPUs supporting a broad range of applications in molecular dynamics, bioinformatics, and machine learning/deep learning. This talk will discuss experiences with the MVAPICH2 deployments on Expanse. The MVAPICH2 team has made extensive contributions to developing an efficient MPI implementation on the new architecture. The talk will discuss applications performance results and gains obtained with the newest MVAPICH2 releases.

Bio

Mahidhar Tatineni received his M.S. & Ph.D. in Aerospace Engineering from UCLA. He currently leads the User Services group at SDSC. He has led the support of high performance computing and data applications software on several NSF and UC resources including Expanse, Comet, and Gordon at SDSC. He has worked on many NSF funded optimization and parallelization research projects such as petascale computing for magnetosphere simulations, MPI performance tuning frameworks, hybrid programming models, topology aware communication and scheduling, big data middleware, and application performance evaluation using next generation communication mechanisms for emerging HPC systems. He is co-PI on the Expanse HPC system and the National Research Platform projects at SDSC.

1:00 - 1:30

Break

1:30 - 2:00

Some updates on binary cache handling in Spack

Todd Gamblin, Lawrence Livermore National Laboratory

PDF

YouTube

Abstract

Some updates on binary cache handling in Spack Abstract: Spack has had the ability to host binary caches for a long time, but recently their usage has increased dramatically. MVAPICH2 GDR packages, the ECP E4S distribution, and even the Clingo solver package that Spack uses internally are shipped as binaries, and we would like to move towards providing more packages as binaries by default. This talk will discuss some of the challenges of working with binaries vs. source builds, and some recent developments in Spack that should help make binary deployment faster and more flexible.

Bio

Todd Gamblin is a Senior Principal MTS in Livermore Computing's Advanced Technology Office at Lawrence Livermore National Laboratory. He created Spack, a popular open source HPC package management tool with a rapidly growing community of contributors. He leads the Packaging Technologies Project in the U.S. Exascale Computing Project, LLNL's DevRAMP project on developer productivity, and an LLNL Strategic Initiative on software integration and dependency management. His research interests include dependency management, software engineering, parallel computing, performance measurement, and performance analysis.

2:00 - 2:20

Integration MVAPICH2 into RedHat Enterprise Linux

Honggang Li, Redhat

PDF

YouTube

Abstract

The talk will focus on the experience of integration of MVAPICH2 into RHEL, the type of tests we carry out, and issues for regression issue debugging and MVAPICH2 update for major release of RHEL.

Bio

Honggang Li is a software engineer in RedHat. He is the maintainer of user-space packages of RDMA stack for Fedora and RedHat Enterprise Linux.

2:20 - 2:40

Cyberinfrastructure Research, Learning and Workforce Development (LWD) Programs

Alan Sussman, NSF

PDF

YouTube

Abstract

The talk will describe research/development and learning/workforce development (LWD) programs within the Office of Advanced Cyberinfrastructure (OAC) in the CISE directorate at the National Science Foundation. OAC's mission is to support advanced cyberinfrastructure to accelerate discovery and innovation across all science and engineering disciplines. The programs specifically addressed include the CAREER program for faculty early career development, the CISE Research Initiation Initiative (CRII) for early career faculty who have not yet been a PI on a Federal grant, the Cybertraining program for research workforce preparation, the OAC Core Research Program that is part of the CISE Core Research programs solicitation, and the Cyberinfrastructure for Sustained Scientific Innovation (CSSI) program for creating software and data CI products and services.

Bio

Alan Sussman is a program director in the Office of Advanced Cyberinfrastructure at the National Science Foundation in charge of learning and workforce development programs, and is also active in software and data related cyberinfrastructure programs. He is on leave from his permanent position as a Professor of Computer Science at the University of Maryland. His research interests focus on systems software support for large-scale applications that require high performance parallel and distributed computing. Working with students and other collaborators at Maryland and other institutions he has published numerous conference and journal papers and received several best paper awards in various topics related to software tools for high performance parallel and distributed computing.

2:40 - 3:15

Accelerating HPC and DL applications using DPUs and efficient checkpointing

Donglai Dai, X-ScaleSolutions

PDF

YouTube

Abstract

This talk will present an overview of two products with enhanced capabilities by X-ScaleSolutions. The products are: 1) MVAPICH2-DPU communication library using NVIDIA Bluefield DPUs and 2) SCR-Exa checkpointing-restart library for HPC and Deep Learning applications. The MVAPICH2-DPU library takes advantage of the features to offload communication components in the MPI library and deliver best-in-class scale-up and scale-out performance for HPC and DL applications. It integrates key components enabling full computation and communication overlap, especially with non-blocking collectives. The SCR-Exa product enhances the existing open-source SCR library with: i) significantly increased portability and flexibility for diverse job launchers, resource managers, storage devices with a variety of underlying protocols; ii) new capabilities to launch applications with spare nodes for fast and efficient restart and resume; iii) new python interface and internal core for ease-of-use and improved maintainability and extensibility. This talk will present an overview of the software architectures of MVAPICH2-DPU and SCR-Exa products, discuss the underlying designs and benefits.

Bio

Dr. Donglai Dai is a Chief Engineer at X-ScaleSolutions and leads company’s R&D team. His current work focuses on developing scalable efficient communication libraries, checkpointing and restart libraries, and performance analysis tools for distributed and parallel HPC and deep learning applications on HPC systems and clouds. He has more than 20 years of industry experience in engineering management and development of computer systems, VLSI, IoT, and interconnection networks while working at Intel, Cray, SGI, and startups. He holds more than 10 granted US patents and has published more than 30 technical papers or book chapters. He has a PhD degree in computer science from The Ohio State University.

3:15 - 4:00

Overview of OSU INAM Deployment at Ohio Supercomputer Center and Live Demo

Karen Tomko, Heechang Na and Pouya Kousha, OSC

PDF

YouTube

Abstract

The OSU InfiniBand Network Analysis and Monitoring (OSU INAM) tool has been running on OSC’s production systems for more than two years. In this talk, we’ll give an overview of OSC’s HPC environment and IB fabric and discuss our INAM deployment. We'll include a discussion of OSC’s INAM configuration, support for our job scheduling environment, and experience using INAM to understand communication characteristics for HPC jobs. We’ll follow with a short demo of INAM at OSC and OSU clusters.

Bio

Karen Tomko is the Director of Research Software Applications and serves as manager of the Scientific Applications group at the Ohio Supercomputer Center where she oversees deployment of software for data analytics, modeling and simulation. Her research interests are in the field of parallelization and performance improvement for High Performance Computing applications. She has been with OSC since 2007 and has been collaborating with DK Panda and the MVAPICH team for about 10 years.

Heechang Na is a Senior Scientific Applications Engineer at the Ohio Supercomputer Center. He is interested in performance analysis, system monitoring, and research environment development. He received his Ph.D. in computational high-energy physics from Indiana University, Bloomington. Before joining the Ohio Supercomputer Center in 2015, he worked with MILC and HPQCD collaborations in Lattice QCD.

Pouya Kousha is a fifth year PhD at the Ohio State University, supervised by Prof. DK. Panda. His research interests are Parallel Algorithms and Distributed Systems, High Performance Computing, and Real-time Scalable Profiling Tools. His work primary focused on scalable online analysis and profiling tools to discover and solve performance bottlenecks and optimize MPI library and applications.