MUG'19

(Preliminary Schedule)

Conference Location: Ohio Supercomputer Center Bale Theater

Monday, August 19

8:30 - 9:00

Registration and Continental Breakfast

Abstract

This hands-on tutorial will jumpstart developers with Arm's performance engineering tools for MVAPICH2 and Arm's Scalable Vector Extension (SVE). Arm Forge is a cross-platform performance engineering toolkit comprising Arm DDT and Arm MAP. DDT is a parallel debugger supporting a wide range of parallel architectures and models including MPI, UPC, CUDA and OpenMP, and MAP is a low-overhead line-level profiler for MPI, OpenMP and scalar programs. We will present Arm Forge and demonstrate how performance problems in applications using MVAPICH2 can be identified and resolved. We will also explore custom metrics for MPI profiling and demonstrate how Arm Forge may be used on extreme-scale applications with extremely low overhead and little or no loss of capability. Finally, we will introduce Arm's tools for performance investigation of MPI programs that use SVE and demonstrate how these tools may be used with MVAPICH2.
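
As a point of reference for the hands-on material, below is a minimal sketch (not part of the tutorial materials) of an MPI program in C with a deliberate rank-dependent load imbalance, the kind of hotspot that a line-level profiler such as Arm MAP attributes to individual source lines and that DDT can inspect rank by rank.

```c
/* imbalance.c - hypothetical test program, not part of the tutorial materials.
 * Each rank does rank-proportional work before a barrier, producing the kind
 * of load imbalance that a line-level profiler such as Arm MAP highlights.
 * Typical MPI workflow: mpicc imbalance.c -o imbalance && mpirun -np 4 ./imbalance
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Rank-dependent busy work: higher ranks spin longer. */
    double sum = 0.0;
    long iters = 20000000L * (rank + 1);
    for (long i = 1; i <= iters; i++)
        sum += 1.0 / (double)i;

    /* All ranks wait here for the slowest one; a profiler attributes the
     * waiting time to this line. */
    MPI_Barrier(MPI_COMM_WORLD);

    double total = 0.0;
    MPI_Reduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}
```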


Bio

John Linford

John Linford is a principal applications engineer at Arm. He has extensive experience creating, using, supporting, and deploying high performance computing applications and technologies. His research interests include emerging computer architectures, compilers, code generation, performance analysis, and numerical simulation (particularly atmospheric chemistry). He has developed tools for chemical kinetic simulation, rotorcraft engineering, software performance analysis, and software environment management.

10:30 - 11:00

Break

12:30 - 1:30

Lunch

Abstract

This lecture will help you understand what FPGA hardware acceleration provides and when it can be used to complement or replace GPUs. The SNAP framework provides software engineers with a means to use this technology in a snap! The unique advantages of POWER technology, including CAPI / OpenCAPI coupled with the SNAP framework, will be presented. The memory coherency and low latency that FPGAs bring will be explored through very simple examples.


Bio

Alexandre Castellane

3:00 - 3:30

Break

6:00 - 9:30

Reception Dinner

Tuesday, August 20

7:45 - 8:30

Registration and Continental Breakfast

10:30 - 11:00

Break

Abstract

In-Network Computing transforms the data center interconnect into a "distributed CPU" and "distributed memory", enabling users to overcome performance barriers and to perform faster, more scalable data analysis. HDR 200G InfiniBand In-Network Computing technology includes several elements: the Scalable Hierarchical Aggregation and Reduction Protocol (SHARP), smart tag matching and rendezvous protocols, and more. These technologies are in use at several of the recent large-scale supercomputers around the world, including top TOP500 platforms. The session will discuss the InfiniBand In-Network Computing technology and performance results, as well as a view of the future roadmap.
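
For context, the sketch below (an assumed example, not from the talk) shows a plain MPI_Allreduce in C; on a SHARP-capable HDR InfiniBand fabric the reduction can be offloaded to the switches without any change to the application code. The MV2_ENABLE_SHARP variable named in the comment is taken from MVAPICH2 documentation; verify the exact name and supported releases for your installation.

```c
/* allreduce_sharp.c - minimal sketch, not from the talk.
 * A plain MPI_Allreduce; on a SHARP-capable InfiniBand fabric the reduction
 * can be offloaded to the switches, transparently to this code. With MVAPICH2
 * this is typically enabled at run time via an environment variable
 * (e.g. MV2_ENABLE_SHARP=1 -- check the MVAPICH2 user guide for your release).
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank contributes its rank id; the reduction sums them. */
    int contribution = rank, sum = 0;
    MPI_Allreduce(&contribution, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks = %d\n", sum);

    MPI_Finalize();
    return 0;
}
```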


Bio

Gilad Shainer

Gilad Shainer has served as Mellanox's senior vice president of marketing since March 2019. Previously, Mr. Shainer was Mellanox's vice president of marketing from March 2013 to March 2019, and vice president of marketing development from March 2012 to March 2013. Mr. Shainer joined Mellanox in 2001 as a design engineer and later served in senior marketing management roles between July 2005 and February 2012. Mr. Shainer serves as the chairman of the HPC-AI Advisory Council organization and the president of the UCF consortium. Mr. Shainer holds multiple patents in the field of high-speed networking. He is also a recipient of the 2015 R&D100 award for his contribution to the CORE-Direct collective offload technology. Gilad Shainer holds MSc and BSc degrees in Electrical Engineering from the Technion – Israel Institute of Technology.

Abstract

The TAU Performance System is a powerful and highly versatile profiling and tracing tool ecosystem for performance analysis of parallel programs at all scales. TAU has evolved with each new generation of HPC systems and presently scales efficiently to hundreds of thousands of cores on the largest machines in the world. To meet the needs of computational scientists to evaluate and improve the performance of their applications, we present TAU's support for key MVAPICH features, including the MPI Tools (MPI_T) interface and the ability to set MPI_T control variables on a per-communicator basis. TAU's support for GPUs, including CUDA, OpenCL, OpenACC, Kokkos, and ROCm, improves performance evaluation of heterogeneous programming models. The talk will also describe TAU's support for the MPI performance and control variables exported by MVAPICH, its instrumentation of the OpenMP runtime, and its APIs for instrumenting Python programs. TAU uses these interfaces on unmodified binaries without the need for recompilation. The talk will describe these instrumentation techniques, which simplify the use of performance tools through compiler-based instrumentation, binary rewriting, and preloading of shared objects. It will also highlight TAU's analysis tools, including its 3D profile browser, ParaProf, and its cross-experiment analysis tool, PerfExplorer. http://tau.uoregon.edu
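
As an illustration of the MPI_T interface mentioned above, the sketch below (an assumed example, not TAU code) uses the standard MPI Tools Information Interface to enumerate the control variables an MPI library such as MVAPICH2 exports; tools like TAU read and set these variables through the same interface.

```c
/* mpit_list_cvars.c - minimal sketch (assumed example, not from the talk).
 * Enumerates the MPI_T control variables exposed by the MPI library; MVAPICH2
 * exports many of its runtime knobs this way.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, num_cvars = 0;

    /* MPI_T can be initialized before (and independently of) MPI_Init. */
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Init(&argc, &argv);

    MPI_T_cvar_get_num(&num_cvars);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        printf("%d control variables exported\n", num_cvars);
        for (int i = 0; i < num_cvars; i++) {
            char name[256], desc[256];
            int name_len = sizeof(name), desc_len = sizeof(desc);
            int verbosity, bind, scope;
            MPI_Datatype dtype;
            MPI_T_enum enumtype;
            MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                                &enumtype, desc, &desc_len, &bind, &scope);
            printf("  cvar %d: %s\n", i, name);
        }
    }

    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}
```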


Bio

Sameer Shende

Dr. Sameer Shende has helped develop the TAU Performance System, the Program Database Toolkit (PDT), the Extreme-scale Scientific Software Stack (E4S) [https://e4s.io] and the HPCLinux distro. His research interests include tools and techniques for performance instrumentation, measurement, analysis, runtime systems, HPC container runtimes, and compiler optimizations. He serves as the Director of the Performance Research Laboratory at the University of Oregon, and as the President and Director of ParaTools, Inc., ParaTools, SAS, and ParaTools, Ltd.

12:45 - 1:30

Lunch

Abstract

'HPC in the Cloud' has always been a dream of HPC users, as it offers ever-ready, instantly scalable compute resources and unlimited storage. Moreover, the ever-growing complexity of resource-hungry applications and their massive data requirements continue to drive a natural embrace of the cloud for HPC, Big Data and Deep Learning workloads. However, performance concerns in a cloud environment have traditionally discouraged adoption for HPC workloads. This talk focuses on how HPC offerings in Azure address these challenges and explains the design pillars that allow Microsoft to offer "bare-metal performance and scalability" on the Microsoft Azure cloud. The talk also covers the features of the latest Microsoft Azure HPC offerings and provides in-depth performance insights and recommendations for using MVAPICH2 and MVAPICH2-X on Microsoft Azure. Finally, we will demonstrate how to quickly deploy an MVAPICH2-powered cluster on the Microsoft Azure HPC offerings.
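
As a simple illustration, the sketch below (an assumed example, not from the talk) is a small point-to-point latency check in the spirit of the OSU micro-benchmarks that one might run to sanity-check an MVAPICH2 deployment, for example on Azure HPC instances.

```c
/* pingpong.c - minimal sketch of a point-to-point latency check, in the spirit
 * of the OSU micro-benchmarks (not the actual benchmark suite). Useful as a
 * quick sanity check after bringing up an MVAPICH2 cluster, e.g. on Azure.
 */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define ITERS 1000
#define MSG_SIZE 8

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    char buf[MSG_SIZE];
    memset(buf, 0, sizeof(buf));

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg one-way latency: %.2f us\n",
               (t1 - t0) * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}
```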


Bio

Jithin Jose

Dr. Jithin Jose is a Senior Software Engineer at Microsoft. His work is focused on the co-design of software and hardware building blocks for high performance computing platforms, and on designing communication runtimes that seamlessly expose hardware capabilities to programming models and middleware. His research interests include high performance interconnects and protocols, parallel programming models, virtualization, big data and cloud computing. Before joining Microsoft, he worked at Intel and IBM Research. He has published more than 25 papers in major conferences and journals related to these research areas. Dr. Jose received his Ph.D. degree from The Ohio State University in 2014.

Abstract

Checkpointing is the ability to save the state of a running process to stable storage and later restart that process from the point at which it was checkpointed. Transparent checkpointing (also known as system-level checkpointing) refers to the ability to checkpoint a (possibly MPI-parallel or distributed) application without modifying the binaries of that target application. Traditional wisdom has assumed that the transparent checkpointing approach has some natural restrictions. Examples of long-held restrictions are: (i) the need for a separate network-aware checkpoint-restart module for each network that will be targeted (e.g., one for TCP, one for InfiniBand, one for Intel Omni-Path, etc.); (ii) the impossibility of transparently checkpointing a CUDA-based GPU application that uses NVIDIA UVM (UVM is "unified virtual memory", which allows the host CPU and the GPU device to each access the same virtual address space at the same time); and (iii) the impossibility of transparently checkpointing an MPI application that was compiled for one MPI library implementation (e.g., for MPICH or for Open MPI) and then restarting it under an MPI implementation with targeted optimizations (e.g., MVAPICH2-X or MVAPICH2-EA). This talk breaks free from the restrictions described above and presents an efficient new software architecture: split processes. The "MANA for MPI" software demonstrates this split-process architecture. The MPI application code resides in "upper-half memory", and the MPI/network libraries reside in "lower-half memory". The tight coupling of the upper and lower halves ensures low runtime overhead. And yet, when restarting from a checkpoint, "MANA for MPI" allows one to replace the original lower half with a different MPI library implementation. This different MPI implementation may offer such specialized features as enhanced intra- and inter-node point-to-point performance and enhanced performance of collective communication (e.g., with MVAPICH2-X), or perhaps better energy awareness (e.g., with MVAPICH2-EA). Further, the new lower-half MPI may be optimized to run on different hardware, including a different network interconnect, a different number of CPU cores, a different configuration of ranks per node, etc. This makes cross-cluster migration both efficient and practical. This talk represents joint work with Rohan Garg and Gregory Price.
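
To make the split-process idea concrete, here is a conceptual C sketch; it is not MANA's implementation or API. The "upper half" calls MPI only through a small dispatch table that is bound at run time (via dlopen/dlsym) to whichever MPI shared library the "lower half" loads, so a restart could in principle bind a different implementation. Real systems such as MANA must also virtualize MPI handles and restore network state, which is omitted here.

```c
/* split_dispatch.c - conceptual sketch only; NOT MANA's implementation or API.
 * The application ("upper half") never links MPI directly: it calls through a
 * dispatch table that is (re)bound to whichever MPI shared library the
 * "lower half" loads, e.g. a different implementation after restart.
 * Build with: cc split_dispatch.c -o split_dispatch -ldl
 */
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

/* Function-pointer table the upper half calls through. */
struct mpi_dispatch {
    int (*init)(int *, char ***);
    int (*finalize)(void);
};

/* Lower half: bind the table to a given MPI shared library at run time.
 * The library path is a parameter, so a restart can pick a different one. */
static int bind_lower_half(struct mpi_dispatch *d, const char *libpath)
{
    void *h = dlopen(libpath, RTLD_NOW | RTLD_GLOBAL);
    if (!h) { fprintf(stderr, "dlopen: %s\n", dlerror()); return -1; }
    d->init     = (int (*)(int *, char ***))dlsym(h, "MPI_Init");
    d->finalize = (int (*)(void))           dlsym(h, "MPI_Finalize");
    return (d->init && d->finalize) ? 0 : -1;
}

int main(int argc, char **argv)
{
    /* "libmpi.so" is a hypothetical default; pass the real path as argv[1]. */
    const char *lib = (argc > 1) ? argv[1] : "libmpi.so";
    struct mpi_dispatch mpi;
    if (bind_lower_half(&mpi, lib) != 0) return EXIT_FAILURE;

    mpi.init(&argc, &argv);
    /* A real split-process runtime also virtualizes communicators, requests,
     * and other opaque MPI handles across implementations; omitted here. */
    mpi.finalize();
    return 0;
}
```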


Bio

Gene Cooperman

Professor Cooperman works in high-performance computing and scalable applications for computational algebra. He received his B.S. from the University of Michigan in 1974, and his Ph.D. from Brown University in 1978. He then spent six years in basic research at GTE Laboratories. He came to Northeastern University in 1986, and has been a full professor there since 1992. His visiting research positions include a 5-year IDEX Chair of Attractivity at the University of Toulouse/CNRS in France, and sabbaticals at Concordia University, at CERN, and at Inria. He is one of the more than 100 co-authors on the foundational Geant4 paper, whose current citation count is at 25,000. The extension of the million-line code of Geant4 to use multi-threading (Geant4-MT) was accomplished in 2014 on the basis of joint work with his PhD student, Xin Dong. Prof. Cooperman currently leads the DMTCP project (Distributed Multi-Threaded CheckPointing) for transparent checkpointing. The project began in 2004, and has benefited from a series of PhD theses. Over 100 refereed publications cite DMTCP as having contributed to their research project. Prof. Cooperman's current interests center on the frontiers of extending transparent checkpointing to new architectures. His work has been applied to VLSI circuit simulators, circuit verification (e.g., by Intel, Mentor Graphics, and others), formalization of mathematics, bioinformatics, network simulators, high energy physics, cyber-security, big data, middleware, mobile computing, cloud computing, virtualization of GPUs, and of course high performance computing (HPC).

3:00 - 3:45

Break and Student Poster Session

Abstract

As the number of cores integrated in a modern processor package increases, the scalability of MPI intra-node communication becomes more important. To reduce intra-node messaging overheads, MVAPICH2 provides intra-node communication channels such as shared memory and memory mapping. The memory-mapping channel was designed particularly for large messages with kernel-level assistance (i.e., CMA and LiMIC2). In this talk, I will introduce new interfaces for the kernel-level assistance in MVAPICH2 that improve the performance of intra-node collective communications. Our preliminary results show that the new interfaces can reduce the latency of MPI_Bcast() by up to 84% on a 120-core machine by perfectly overlapping the data copy operations.
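
For reference, the sketch below (an assumed example, not the author's benchmark) measures average MPI_Bcast latency for a large message; with MVAPICH2 the intra-node channel used (shared memory versus kernel-assisted copy via CMA or LiMIC2) is selected by runtime parameters, so the same code can be used to compare channels. The parameter name in the comment is an assumption to verify against the MVAPICH2 user guide.

```c
/* bcast_latency.c - minimal sketch, not the author's benchmark.
 * Measures average MPI_Bcast latency for a large message on one node. Which
 * intra-node channel MVAPICH2 uses (shared memory vs. kernel-assisted copy
 * via CMA/LiMIC2) is chosen by runtime parameters such as MV2_SMP_USE_CMA
 * (name assumed from MVAPICH2 documentation; verify for your release).
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES (4 * 1024 * 1024)   /* 4 MiB: exercises the large-message path */
#define ITERS 100

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(MSG_BYTES);

    /* Warm-up broadcast so buffers and channels are set up. */
    MPI_Bcast(buf, MSG_BYTES, MPI_CHAR, 0, MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Bcast(buf, MSG_BYTES, MPI_CHAR, 0, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg MPI_Bcast latency (%d bytes): %.2f us\n",
               MSG_BYTES, (t1 - t0) * 1e6 / ITERS);

    free(buf);
    MPI_Finalize();
    return 0;
}
```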


Bio

Hyun-Wook Jin

Hyun-Wook Jin is a Professor in the Department of Computer Science and Engineering at Konkuk University, Seoul, Korea. He is leading the System Software Research Laboratory (SSLab) at Konkuk University. Before joining Konkuk University in 2006, He was a Research Associate in the Department of Computer Science and Engineering at The Ohio State University. He received Ph.D. degree from Korea University in 2003. His main research focus is on operating systems for high-end computing systems and cyber-physical systems.

4:45 - 5:00

Open MIC Session

6:00 - 9:30

Banquet Dinner

Wednesday, August 21

7:45 - 8:30

Registration and Continental Breakfast

Abstract

The EPFL Blue Brain Project (BBP) has been pushing the boundaries of the size, complexity and biological faithfulness of brain tissue simulations. A data-driven software pipeline is used to digitally reconstruct brain tissue that faithfully reproduces an array of laboratory experiments. To enable this, the Blue Brain Project operates a dedicated computing system (BB5) consisting of different computing and storage elements (Intel KNLs, NVIDIA GPUs, Intel CPUs, DDN IME). In this talk we present the role of MPI in different software pipelines, including circuit building, simulation, 3-D visualization and large-scale analysis. We especially focus on how the NEURON simulator is being optimised for large-scale simulations using the latest compiler technologies and MPI stacks (from the vendor and the MVAPICH2 team).


Bio

Pramod Kumbhar

Pramod Kumbhar is an HPC Architect in the Computing Division of the Blue Brain Project. His focus is on the development of the NEURON/CoreNEURON simulator within the Blue Brain Project. Over the years Pramod has been working on the parallelisation, performance optimisation and scaling of scientific codes on various supercomputing architectures. Pramod has strong hands-on experience with a variety of performance analysis tools at scale and with micro-architecture-level performance tuning. He also has a keen interest in domain-specific languages (DSLs) and modern compiler technologies. Before joining the Blue Brain Project, Pramod worked at the Jülich Research Centre, Germany.

10:30 - 11:00

Break

Abstract

AI Bridging Cloud Infrastructure (ABCI) is the world's first large-scale Open AI Computing Infrastructure, constructed and operated by the National Institute of Advanced Industrial Science and Technology (AIST), Japan. It delivers 19.9 petaflops of HPL performance and, as of July 2019, the world's fastest training time of 1.17 minutes for ResNet-50 training on the ImageNet dataset. ABCI consists of 1,088 compute nodes, each equipped with two Intel Xeon Gold Scalable processors, four NVIDIA Tesla V100 GPUs, two InfiniBand EDR HCAs and an NVMe SSD. ABCI offers a sophisticated high-performance AI development environment built on CUDA, Linux containers, an on-demand parallel filesystem, and MPI (including MVAPICH). In this talk, we focus on ABCI's network architecture and the communication libraries available on ABCI, and show their performance and recent research achievements.


Bio

Shinichiro Takizawa

Shinichiro Takizawa, Ph.D., is a senior research scientist in the AI Cloud Research Team at the AI Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Japan. His research interests are data processing and resource management on large-scale parallel systems. He also works as a member of the AI Bridging Cloud Infrastructure (ABCI) operation team and designs future ABCI services. Shinichiro Takizawa received his Ph.D. in Science from the Tokyo Institute of Technology in 2009.

12:15 - 1:15

Lunch

Abstract

KISTI's Nurion supercomputer features 8,305 nodes with Intel Xeon Phi KNL (Knights Landing) processors (68 cores) and 132 nodes with Intel Skylake CPUs (2-socket, 40 cores). Nurion is a system consisting of compute nodes, CPU-only nodes, Omni-Path interconnect networks, burst-buffer high-speed storage, a Lustre-based parallel file system, and water-cooling devices based on the Rear Door Heat Exchanger (RDHx). We will present microbenchmark and application performance results using MVAPICH on the KNL nodes.


Bio

Minsik Kim

Minsik Kim is a researcher in the Supercomputing Infrastructure Center of the Korea Institute of Science and Technology Information (KISTI). He received his Ph.D. degree in Electrical and Electronic Engineering from Yonsei University in 2019. His research interests include neural network optimization on GPUs, computer architecture, and high-performance computing. He is a member of IEEE. More details about Dr. Kim are available at http://minsik-kim.github.io.

Abstract

SDSC supports HPC and Deep Learning applications on systems featuring K80, P100, and V100 GPUs. On the NSF-funded Comet cluster there are primarily two types of GPU nodes: 1) 36 nodes with Intel Haswell CPUs (2-socket, 24 cores) with four NVIDIA K80 GPUs (two accelerator cards) each, and 2) 36 nodes with Intel Broadwell CPUs (2-socket, 28 cores) with four NVIDIA P100 GPUs each. Additionally, one node with four V100 GPUs is available for benchmarking and testing. Some of the deep learning applications are supported via the Singularity containerization solution. Application testing and performance results using MVAPICH2-GDR on the various types of nodes and with containerization will be presented.
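
As background on what MVAPICH2-GDR enables, the sketch below (an assumed example, not SDSC's benchmark) passes CUDA device pointers directly to MPI calls from C, which a CUDA-aware MPI build supports without explicit host staging; the MV2_USE_CUDA setting mentioned in the comment is taken from MVAPICH2 documentation and should be verified for your build.

```c
/* gpu_send.c - minimal sketch (assumed example, not SDSC's benchmark).
 * With a CUDA-aware MPI such as MVAPICH2-GDR, device pointers can be passed
 * directly to MPI calls, avoiding explicit staging through host buffers.
 * Build with nvcc, or mpicc plus the CUDA include path and -lcudart; CUDA
 * support is typically enabled at run time (e.g. MV2_USE_CUDA=1 -- verify
 * the exact setting for your MVAPICH2-GDR build).
 */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define N (1 << 20)   /* 1M floats */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    float *d_buf = NULL;
    cudaMalloc((void **)&d_buf, N * sizeof(float));
    cudaMemset(d_buf, 0, N * sizeof(float));

    /* The device pointer goes straight into MPI; MVAPICH2-GDR moves the data
     * over the fabric (using GPUDirect RDMA where available). */
    if (rank == 0)
        MPI_Send(d_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (rank == 1)
        printf("received %d floats directly into GPU memory\n", N);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```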


Bio

Mahidhar Tatineni

Mahidhar Tatineni received his M.S. and Ph.D. in Aerospace Engineering from UCLA. He currently leads the User Services group at SDSC. He has led the deployment and support of high performance computing and data applications software on several NSF and UC resources, including Comet and Gordon at SDSC. He has worked on many NSF-funded optimization and parallelization research projects, such as petascale computing for magnetosphere simulations, MPI performance tuning frameworks, hybrid programming models, topology-aware communication and scheduling, big data middleware, and application performance evaluation using next-generation communication mechanisms for emerging HPC systems.

2:40 - 3:15

Break

Abstract

The OSU InfiniBand Network Analysis and Monitoring (OSU INAM) tool has been running on OSC’s production systems for several months. In this talk we’ll give an overview of OSC’s HPC environment, IB fabric and our INAM deployment. It will include a discussion of OSC’s INAM configuration and improvements to the scalability of fabric discovery and optimization of database insertion/query rates resulting from our deployment. We'll also discuss integration with OSC's Torque/MOAB resource management and early experiences in analysis of job communication characteristics. At the end of the talk we'll give a short demo of INAM at OSC.


Bio

Karen Tomko and Heechang Na

Karen Tomko is the Director of Research Software Applications and serves as manager of the Scientific Applications group at the Ohio Supercomputer Center where she oversees deployment of software for data analytics, modeling and simulation. Her research interests are in the field of parallelization and performance improvement for High Performance Computing applications. She has been with OSC since 2007 and has been collaborating with DK Panda and the MVAPICH team for about 10 years.