MUG'19

(Final Program)

Conference Location: Ohio Supercomputer Center Bale Theater

Monday, August 19

8:30 - 9:00

Registration and Continental Breakfast

8:30 - 8:30

Shuttle Service to Conference Center (by the hotel)

Abstract

This hands-on tutorial will jumpstart developers with Arm's performance engineering tools for MVAPICH2 and Arm's Scalable Vector Extension (SVE). Arm Forge is a cross-platform performance engineering toolkit comprised of Arm DDT and Arm MAP. DDT is a parallel debugger supporting a wide range of parallel architectures and models including MPI, UPC, CUDA and OpenMP, and MAP is a low-overhead line-level profiler for MPI, OpenMP and scalar programs. We will present Arm Forge and demonstrate how performance problems in applications using MVAPICH2 can be identified and resolved. We will also explore custom metrics for MPI profiling and demonstrate how Arm Forge may be used on extreme-scale applications with extremely low overhead and little or no loss of capability. We will also introduce Arm's tools for performance investigation of MPI programs that use Arm's Scalable Vector Extension (SVE) and demonstrate how these tools may be used with MVAPICH2.


Bio

John Linford

John Linford is a principal applications engineer at Arm. He has extensive experience creating, using, supporting, and deploying high performance computing applications and technologies. His research interests include emerging computer architectures, compilers, code generation, performance analysis, and numerical simulation (particularly atmospheric chemistry). He has developed tools for chemical kinetic simulation, rotorcraft engineering, software performance analysis, and software environment management.

10:30 - 11:00

Break

Abstract

Increased system size and a greater reliance on system parallelism to meet computational needs require innovative system architectures to meet simulation challenges. As a step towards a new class of network co-processors, intelligent network devices that manipulate data traversing the data-center network, SHARP technology is designed to offload collective operation processing to the network. This talk will provide an overview of SHARP technology and its new features, including high-bandwidth Streaming Aggregation, along with live examples of accelerating MPI and DL use cases using SHARP.


Bio

Devendar Bureddy

Devendar Bureddy is a Sr. Staff Engineer at Mellanox Technologies. At Mellanox, Devendar was instrumental in building several key technologies such as SHARP and HCOLL. Previously, he was a software developer at The Ohio State University in the Network-Based Computing Laboratory led by Dr. D. K. Panda, where he was involved in the design and development of MVAPICH2, an open-source high-performance implementation of MPI over InfiniBand and 10GigE/iWARP. He received his Master’s degree in Computer Science and Engineering from the Indian Institute of Technology, Kanpur. His research interests include high-speed interconnects, parallel programming models, and HPC software.

12:30 - 1:30

Lunch

Abstract

The tutorial will start with an overview of the MVAPICH2 libraries and their features. Next, we will focus in depth on installation guidelines, runtime optimizations, and tuning flexibility. An overview of configuration and debugging support in MVAPICH2 libraries will be presented. High-performance support for GPU-enabled clusters in MVAPICH2-GDR and many-core systems in MVAPICH2-X will be presented. The impact on the performance of the various features and optimization techniques will be discussed in an integrated fashion. "Best Practices" for a set of common applications will be presented. A set of case studies drawn from HPC and AI applications will also be presented to demonstrate how one can effectively take advantage of MVAPICH2 in such applications using MPI and CUDA/OpenACC.


Bio

Hari Subramoni

Dr. Hari Subramoni has been a research scientist in the Department of Computer Science and Engineering at the Ohio State University, USA, since September 2015. His current research interests include high performance interconnects and protocols, parallel computer architecture, network-based computing, exascale computing, network topology aware computing, QoS, power-aware LAN-WAN communication, fault tolerance, virtualization, big data and cloud computing. He has published over 70 papers in international journals and conferences related to these research areas. He has been actively involved in various professional activities in academic journals and conferences. Dr. Subramoni is doing research on the design and development of MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) and MVAPICH2-X (Hybrid MPI and PGAS (OpenSHMEM, UPC and CAF)) software packages. He is a member of IEEE. More details about Dr. Subramoni are available at http://web.cse.ohio-state.edu/~subramoni.1/.

3:00 - 3:30

Break

5:00 - 5:00

Shuttle Service to SpringHill Suites (by the hotel)

6:00 - 9:30

Reception Dinner at Brazenhead

1027 W 5th Ave

Columbus, OH 43212

Tuesday, August 20

7:45 - 8:30

Registration and Continental Breakfast

8:00 - 8:00

Shuttle Service to Conference Center (by the hotel)

8:30 - 8:40

Opening Remarks

Dave Hudak, Executive Director, Ohio Supercomputer Center
Dhabaleswar K (DK) Panda, The Ohio State University

Abstract

Last summer, the Texas Advanced Computing Center (TACC) at the University of Texas at Austin was selected as the sole awardee of the National Science Foundation’s “Towards a Leadership Class Computing Facility” solicitation. The resulting machine, Frontera, became the #5 system in the world in the Top 500 list. In this talk, I will describe the main components of the award: the Phase 1 system, “Frontera”, the plans for facility operations and scientific support for the next five years, and the plans to design a Phase 2 system in the mid-2020s to be the NSF Leadership system for the latter half of the decade, with capabilities 10x beyond Frontera. The talk will also discuss the key role MVAPICH and InfiniBand play in the project, and why the workload for HPC still can't fit effectively on the cloud without advanced networking support.


Bio

Dan Stanzione

Dr. Dan Stanzione, Associate Vice President for Research at The University of Texas at Austin since 2018 and Executive Director of the Texas Advanced Computing Center (TACC) since 2014, is a nationally recognized leader in high performance computing. He is the principal investigator (PI) for a National Science Foundation (NSF) grant to acquire and deploy Frontera, which will be the fastest supercomputer at any U.S. university. Stanzione is also the PI of TACC's Stampede2 and Wrangler systems, supercomputers for high performance computing and for data-focused applications, respectively. For six years he was co-PI of CyVerse, a large-scale NSF life sciences cyberinfrastructure. Stanzione was also a co-PI for TACC's Ranger and Lonestar supercomputers, large-scale NSF systems previously deployed at UT Austin. Stanzione received his bachelor's degree in electrical engineering and his master's degree and doctorate in computer engineering from Clemson University.

Abstract

This talk will provide an overview of the MVAPICH project (past, present, and future). Future roadmap and features for upcoming releases of the MVAPICH2 software family (including MVAPICH2-X and MVAPICH2-GDR) for HPC and Deep Learning will be presented. Features and releases for Microsoft Azure and Amazon AWS will also be presented. Current status and future plans for OSU INAM, OMB, and Best Practices Page will also be presented.


Bio

Dhabaleswar K (DK) Panda

DK Panda is a Professor and University Distinguished Scholar of Computer Science and Engineering at the Ohio State University. He has published over 450 papers in the area of high-end computing and networking. The MVAPICH2 (High-Performance MPI and PGAS over InfiniBand, iWARP, and RoCE) libraries, designed and developed by his research group (http://mvapich.cse.ohio-state.edu), are currently being used by more than 3,000 organizations worldwide (in 89 countries). More than 555,000 downloads of this software have taken place from the project's site. This software is empowering several InfiniBand clusters (including the 3rd, 5th, 8th, 12th, 15th, 16th and 19th ranked ones) in the TOP500 list. The RDMA packages for Apache Spark, Apache Hadoop and Memcached together with OSU HiBD benchmarks from his group (http://hibd.cse.ohio-state.edu) are also publicly available. These libraries are currently being used by more than 315 organizations in 35 countries. More than 30,600 downloads of these libraries have taken place. He is an IEEE Fellow. The group has also been focusing on co-designing Deep Learning Frameworks and MPI Libraries. A high-performance and scalable version of the Caffe framework is available from High-Performance Deep Learning (HiDL) Project site. More details about Prof. Panda are available at http://www.cse.ohio-state.edu/~panda.

10:30 - 11:00

Break

Abstract

The Exascale era is upon us. In a few short years, multiple Exascale systems will be up and doing real science. At such scales, two aspects of the system are critical: scalable communication and efficient integration with compute accelerators. The OpenFabrics Interfaces (OFI) were envisioned and created to provide applications with high-level, application-oriented communication semantics. The interfaces are now being expanded to encompass collective offloads and compute accelerators. In this talk, we will highlight recent advances in libfabric and the usage of the new APIs.


Bio

Sayantan Sur

Sayantan Sur is an HPC Software Architect at Intel Corp. in Hillsboro, Oregon. His work involves high performance computing, specializing in scalable interconnection fabrics and message passing software (MPI). Before joining Intel Corp., Dr. Sur was a Research Scientist at the Department of Computer Science and Engineering at The Ohio State University. In the past, he has held a post-doctoral position at IBM T. J. Watson Research Center, NY. He has published more than 20 papers in major conferences and journals related to these research areas. Dr. Sur received his Ph.D. degree from The Ohio State University in 2007.

Abstract

In-Network Computing transforms the data center interconnect into a "distributed CPU" and "distributed memory", making it possible to overcome performance barriers and enabling faster and more scalable data analysis. HDR 200G InfiniBand In-Network Computing technology includes several elements: Scalable Hierarchical Aggregation and Reduction Protocol (SHARP), smart Tag Matching and rendezvous protocol, and more. These technologies are in use at some of the recent large-scale supercomputers around the world, including the top TOP500 platforms. The session will discuss the InfiniBand In-Network Computing technology and performance results, as well as a view of the future roadmap.


Bio

Gilad Shainer

Gilad Shainer has served as Mellanox's senior vice president of marketing since March 2019. Previously, Mr. Shainer was Mellanox's vice president of marketing from March 2013 to March 2019, and vice president of marketing development from March 2012 to March 2013. Mr. Shainer joined Mellanox in 2001 as a design engineer and later served in senior marketing management roles between July 2005 and February 2012. Mr. Shainer serves as the chairman of the HPC-AI Advisory Council organization and the president of the UCF consortium. Mr. Shainer holds multiple patents in the field of high-speed networking. He is also a recipient of a 2015 R&D100 award for his contribution to the CORE-Direct collective offload technology. Gilad Shainer holds MSc and BSc degrees in Electrical Engineering from the Technion Institute.

Abstract

The TAU Performance System is a powerful and highly versatile profiling and tracing tool ecosystem for performance analysis of parallel programs at all scales. TAU has evolved with each new generation of HPC systems and presently scales efficiently to hundreds of thousands of cores on the largest machines in the world. To meet the needs of computational scientists to evaluate and improve the performance of their applications, we present TAU's support for the key MVAPICH features including its support for the MPI Tools (MPI_T) interface with support for setting MPI_T control variables on a per MPI communicator basis. TAU's support for GPUs including CUDA, OpenCL, OpenACC, Kokkos, and ROCm improves performance evaluation of heterogeneous programming models. The talk will also describe TAU's support for MPI's performance and control variables exported by MVAPICH, its support for instrumentation of the OpenMP runtime, and its APIs for instrumentation of Python programs. TAU uses these interfaces on unmodified binaries without the need for recompilation. This talk will describe these new instrumentation techniques to simplify the usage of performance tools, including support for compiler-based instrumentation, rewriting binary files, and preloading shared objects. The talk will also highlight TAU's analysis tools including its 3D profile browser, ParaProf, and its cross-experiment analysis tool, PerfExplorer. http://tau.uoregon.edu


Bio

Sameer Shende

Dr. Sameer Shende has helped develop the TAU Performance System, the Program Database Toolkit (PDT), the Extreme-scale Scientific Software Stack (E4S) [https://e4s.io] and the HPCLinux distro. His research interests include tools and techniques for performance instrumentation, measurement, analysis, runtime systems, HPC container runtimes, and compiler optimizations. He serves as the Director of the Performance Research Laboratory at the University of Oregon, and as the President and Director of ParaTools, Inc., ParaTools, SAS, and ParaTools, Ltd.

12:30 - 12:45

Group Photo

12:45 - 1:30

Lunch

Abstract

'HPC in the Cloud' has always been a dream of HPC users, as it offers ever-ready, instant scaling compute resources and unlimited storage. Moreover, the ever-growing complexity of resource-hungry applications and their massive data requirements continue to compel a natural embracement of the cloud for HPC, Big Data and Deep Learning workloads. However, performance concerns in a cloud environment have traditionally discouraged adoption for HPC workloads. This talk focuses on how HPC offerings in Azure address these challenges and explains the design pillars that allow Microsoft to offer "bare-metal performance and scalability" on the Microsoft Azure Cloud. This talk also covers the features of latest Microsoft Azure HPC offerings and provides in-depth performance insights and recommendations for using MVAPICH2 and MVAPICH2-X on Microsoft Azure. Finally, we will also demonstrate how to quickly deploy an MVAPICH2 powered cluster on the Microsoft Azure HPC offerings.


Bio

Jithin Jose

Dr. Jithin Jose is a Senior Software Engineer at Microsoft. His work is focused on co-design of software and hardware building blocks for high performance computing platform, and designing communication runtimes that seamlessly expose hardware capabilities to programming models and middleware. His research interests include high performance interconnects and protocols, parallel programming models, virtualization, big data and cloud computing. Before joining Microsoft, he worked at Intel and IBM Research. He has published more than 25 papers in major conferences and journals related to these research areas. Dr. Jose received his Ph.D. degree from The Ohio State University in 2014.

Abstract

Driven by an exponential increase in the volume, diversity, and availability of data, Artificial Intelligence (AI), specifically Deep Learning (DL), is transforming many businesses around the globe by enabling them to drive operational efficiencies and build new products and services. AI has also begun to reshape the landscape of scientific computing and is enabling scientists to address large problems in ways that were not possible before. Intel collaborates with customers and partners worldwide to build, accelerate, scale and deploy their AI applications on Intel based HPC platforms. We share with you our insights on several customer AI use cases we have enabled, the orders of magnitude performance acceleration we have delivered via popular open-source software framework optimizations, and the best-known methods to advance the convergence of AI and High Performance Computing on Intel® Xeon™ Scalable Processor based servers. We will also demonstrate how large memory systems help real world AI applications run efficiently.


Bio

Vikram Saletore

Vikram Saletore is a Principal Engineer, Sr. IEEE Member, and Technical Manager focused on Deep Learning (DL) performance optimizations and acceleration for HPC-AI convergence. He collaborates with industry, Enterprise/Government, HPC, & OEM customers on DL Training and Inference. Vikram is also a Co-PI for DL research with customers including SURFsara B.V., CERN OpenLabs, Max Planck, Novartis, GENCI/CINES/INRIA, and others. Vikram has 25+ years of experience and has delivered optimized software to Oracle, Informix, and completed technical readiness for Intel's 3D-XPoint memory via performance modeling. As a Research Scientist with Intel Labs, he led collaboration with HP Labs, Palo Alto for TCP acceleration. Prior to Intel, as a tenure-track teaching faculty in Computer Science at Oregon State University, Corvallis, OR, Vikram led NSF funded research in parallel programming and distributed computing, directly supervising 8 students (PhD, MS). He also developed CPU and network products at DEC and AMD. Vikram received his MS from Berkeley & PhD in EE in Parallel & Distributed Computing from University of Illinois at Urbana-Champaign. He holds multiple patents (with 3 pending in DL) and has authored ~60 research papers and ~45 white papers and blogs, specifically in AI, Machine Learning Analytics, and Deep Learning.

Abstract

Checkpointing is the ability to save the state of a running process to stable storage, and later restart that process from the point at which it was checkpointed. Transparent checkpointing (also known as system-level checkpointing) refers to the ability to checkpoint a (possibly MPI-parallel or distributed) application, without modifying the binaries of that target application. Traditional wisdom has assumed that the transparent checkpointing approach has some natural restrictions. Examples of long-held restrictions are: (i) the need for a separate network-aware checkpoint-restart module for each network that will be targeted (e.g., one for TCP, one for InfiniBand, one for Intel Omni-Path, etc.); (ii) the impossibility of transparently checkpointing a CUDA-based GPU application that uses NVIDIA UVM (UVM is "unified virtual memory", which allows the host CPU and the GPU device to each access the same virtual address space at the same time); and (iii) the impossibility of transparently checkpointing an MPI application that was compiled for one MPI library implementation (e.g., for MPICH or for Open MPI), and then restarting under an MPI implementation with targeted optimizations (e.g., MVAPICH2-X or MVAPICH2-EA). This talk breaks free from the restrictions described above, and presents an efficient, new software architecture: split processes. The "MANA for MPI" software demonstrates this split-process architecture. The MPI application code resides in "upper-half memory", and the MPI/network libraries reside in "lower-half memory". The tight coupling of upper and lower half ensures low runtime overhead. And yet, when restarting from a checkpoint, "MANA for MPI" allows one to choose to replace the original lower half with a different MPI library implementation. This different MPI implementation may offer such specialized features as enhanced intra- and inter-node point-to-point performance and enhanced performance of collective communication (e.g., with MVAPICH2-X); or perhaps better energy awareness (e.g., with MVAPICH2-EA). Further, the new lower half MPI may be optimized to run on different hardware, including a different network interconnect, a different number of CPU cores, a different configuration of ranks-per-node, etc. This makes cross-cluster migration both efficient and practical. This talk represents joint work with Rohan Garg and Gregory Price.


Bio

Gene Cooperman

Professor Cooperman works in high-performance computing and scalable applications for computational algebra. He received his B.S. from the University of Michigan in 1974, and his Ph.D. from Brown University in 1978. He then spent six years in basic research at GTE Laboratories. He came to Northeastern University in 1986, and has been a full professor there since 1992. His visiting research positions include a 5-year IDEX Chair of Attractivity at the University of Toulouse/CNRS in France, and sabbaticals at Concordia University, at CERN, and at Inria. He is one of the more than 100 co-authors on the foundational Geant4 paper, whose current citation count is at 25,000. The extension of the million-line code of Geant4 to use multi-threading (Geant4-MT) was accomplished in 2014 on the basis of joint work with his PhD student, Xin Dong. Prof. Cooperman currently leads the DMTCP project (Distributed Multi-Threaded CheckPointing) for transparent checkpointing. The project began in 2004, and has benefited from a series of PhD theses. Over 100 refereed publications cite DMTCP as having contributed to their research project. Prof. Cooperman's current interests center on the frontiers of extending transparent checkpointing to new architectures. His work has been applied to VLSI circuit simulators, circuit verification (e.g., by Intel, Mentor Graphics, and others), formalization of mathematics, bioinformatics, network simulators, high energy physics, cyber-security, big data, middleware, mobile computing, cloud computing, virtualization of GPUs, and of course high performance computing (HPC).

3:00 - 3:45

Break and Student Poster Session

Programming Model and Architectural Needs for Graph Applications on Continuum Computing Architecture - Bibrak Chandio, Indiana University Bloomington
Using DMTCP to transparently checkpoint statically-linked executables. - Prashant Chouhan, Northeastern University
Towards Scalable Point-to-point Communication in MVAPICH2: Challenges and Solutions - Mahdieh Ghazimirsaeed, The Ohio State University
Designing MVAPICH2 for Scalable Deep Learning with TensorFlow - Arpan Jain, The Ohio State University
HPC storage system robustness - Baolin Li, Northeastern University
Exploration of Graph-Specific Algorithms For IBM Streams - Emin Ozturk, The Ohio State University
File I/O Optimizations for Large Scale Deep Learning - Sarunya Pumma, Virginia Tech
High Performance DWI Processing Pipeline Implementation in LONI Using MPI - Vineet Raichur, University of Michigan
A Plugin Infrastructure for TAU: Enabling Customization and Runtime Control for MPI_T Autotuning - Srinivasan Ramesh, University of Oregon
Support for Advanced Collective Communication in MVAPICH2 MPI library - Amit Ruhela, The Ohio State University
Performance Analysis of MPI Applications with TAU - Wyatt Spear, University of Oregon
The Impact of Intel Optane DC Persistent Memory on HPC - Ranjan Venkatesh, Georgia Institute of Technology
Variational system identification of the partial differential equations governing the physics of pattern-formation - Zhenlin Wang, University of Michigan
C++ based Distributed Object Abstraction in HPX - Weile Wei, Louisiana State University
MVAPICH Touches Cloud - New Frontiers for MPI in High Performance Clouds - Shulei Xu, The Ohio State University
A data-driven approach for effective material properties prediction - Xiaoxuan Zhang, University of Michigan

Abstract

As the number of cores integrated in a modern processor package increases, the scalability of MPI intra-node communication becomes more important. To reduce intra-node messaging overheads, MVAPICH2 provides intra-node communication channels such as shared memory and memory mapping. The memory mapping channel was designed particularly for large messages with kernel-level assistance (i.e., CMA and LiMIC2). In this talk, I will introduce new interfaces of the kernel-level assistance in MVAPICH2 to improve the performance of intra-node collective communications. Our preliminary results show that the new interfaces can reduce the latency of MPI_Bcast() by up to 84% on a 120-core machine by perfectly overlapping the data copy operations.


Bio

Hyun-Wook Jin and Joong-Yeon Cho

Hyun-Wook Jin is a Professor in the Department of Computer Science and Engineering at Konkuk University, Seoul, Korea. He is leading the System Software Research Laboratory (SSLab) at Konkuk University. Before joining Konkuk University in 2006, he was a Research Associate in the Department of Computer Science and Engineering at The Ohio State University. He received his Ph.D. degree from Korea University in 2003. His main research focus is on operating systems for high-end computing systems and cyber-physical systems.

Abstract

Spack is an open-source package manager for HPC. Its simple, templated Python DSL allows the same package to be built in many configurations, with different compilers, flags, dependencies, and dependency versions. Spack allows HPC end users to automatically build any of over 3,000 community-maintained packages, and it enables software developers to easily manage large applications with hundreds of dependencies. These capabilities also enable Spack to greatly simplify HPC container builds. This presentation will give an overview of Spack, including recent developments and a number of items on the near-term roadmap. We will focus on Spack features relevant to the MVAPICH community; these include Spack's virtual package abstraction, which is used for API-compatible libraries including MPI implementations, package-level compiler wrappers, and packages which modify other packages' build environments. We will also touch on Spack workflows and how individuals and teams can accelerate their software deployment with Spack.


Bio

Gregory Blum Becker

Gregory Becker is a computer scientist at Lawrence Livermore National Laboratory. His focus is on bridging the gap between research and production software at LLNL. His work in software productization has led him to work on Spack, a package manager for high performance computing, as well as scalable I/O formats for performance tools. Gregory has been at LLNL since 2015. He received his B.A. in Computer Science and Mathematics from Williams College in 2015.

Abstract

We will present LBANN’s unique capabilities that leverage large-scale platforms such as the Sierra supercomputer at LLNL for better strong scaling. Specifically, we will describe how our distributed convolution algorithms coupled with GPU-centric communication techniques realize both improved compute performance as well as more-capable models by exploiting fine-grained parallelism in large-scale convolutional neural networks. We will present recent performance results using the MVAPICH GDR library.


Bio

Naoya Maruyama

Naoya Maruyama is a researcher at Lawrence Livermore National Laboratory, where he studies cross-cutting domains of high performance computing and machine learning. Prior to joining LLNL, he was a Team Leader at RIKEN Advanced Institute for Computational Science, where he led research projects on high-level programming abstractions for heterogeneous architectures. He has won several awards, including a Gordon Bell Prize in 2011 and a Best Paper Award at SC16. He received his Ph.D. in Computer Science from Tokyo Institute of Technology in 2008.

5:15 - 5:30

Open MIC Session

5:30 - 6:30

OSC Facilities Tour

The names of the people going on this tour need to be provided to the State facility 24 hours in advance. If you are interested in this tour, please send a note to the MUG address by 4:00 pm on Monday (August 19th).
• It takes about an hour in total time from walking in the SOCC door to leaving. Transportation to and from the SOCC is additional time.
• The building is located at 1320 Arthur E. Adams Drive. Note this is NOT the same building as the main OSC offices.
• The building has TSA/Airport style security. Everyone must go through a metal detector/xray. As such, no weapons of any type are allowed in.
• Backpacks with 2 straps are also prohibited from being brought in. These need to be left at home.
• No photography of any sort is allowed outside the building, around security or in the hallways. Photography IS allowed within the actual room with the computers (i.e. leave your phones in your pockets/purses until we tell you it’s ok to take them out).
• Adults must bring some sort of government-issued photo ID. Minors don’t need ID.

5:30 - 5:30

Shuttle Service to SpringHill Suites (by the hotel)

6:30 - 9:30

Banquet Dinner at Bravo Restaurant

1803 Olentangy River RD

Columbus, OH 43212

Wednesday, August 21

7:45 - 8:30

Registration and Continental Breakfast

8:00 - 8:00

Shuttle Service to Conference Center (by the hotel)

Abstract

MADNESS, TESSE/EPEXA, and MolSSI are three quite different large and long-lived projects that provide different perspectives and driving needs for the future of message passing. MADNESS is a general purpose environment for fast and accurate numerical simulation. Its initial use was in chemistry but it rapidly expanded to include boundary-value problems, nuclear physics, solid state physics, and atomic physics in intense laser fields. Other projects such as TiledArray employ MADNESS. TESSE/EPEXA is a new C++ dataflow parallel programming interface that leverages the powerful PaRSEC parallel runtime, which has been extended to support irregular and dynamic computation. Both MADNESS and TiledArray will migrate to the TESSE interface. The Molecular Sciences Software Institute (MolSSI), with support from NSF, serves as a nexus for science, education, and cooperation serving the worldwide community of computational molecular scientists – a broad field that includes biomolecular simulation, quantum chemistry, and materials science. All three of these projects employ MPI and have a vested interest in computation at all scales, spanning the classroom to future exascale systems.


Bio

Robert Harrison

Professor Robert Harrison is a distinguished expert in high-performance computing and theoretical chemistry, and is the Endowed Chair and Director of the Institute for Advanced Computational Science at Stony Brook University. Harrison is jointly appointed with Brookhaven National Laboratory where he is a Chief Scientist for the Computational Science Initiative. Dr. Harrison came to Stony Brook from the University of Tennessee and Oak Ridge National Laboratory, where he was Professor of Chemistry and Corporate Fellow, and was also Director of the Joint Institute for Computational Sciences that is home to the NSF supercomputer center, the National Institute for Computational Science. He has an active career with over two hundred publications and with extensive service on national and international advisory committees. In 2002 he received the IEEE Computer Society Sidney Fernbach Award and has received two R&D 100 awards for the development of NWChem (1999) and MADNESS (2011).

Abstract

The EPFL Blue Brain Project (BBP) has been pushing the boundaries of the size, complexity and biological faithfulness of brain tissue simulations. A data-driven software pipeline is used to digitally reconstruct brain tissue that faithfully reproduces an array of laboratory experiments. To enable this, the Blue Brain Project operates a dedicated computing system (BB5) which consists of different computing and storage elements (Intel KNLs, NVIDIA GPUs, Intel CPUs, DDN IME). In this talk we present the role of MPI in different software pipelines including circuit building, simulation, 3-D visualization and large scale analysis. We especially focus on how the NEURON simulator is being optimised for large scale simulations using the latest compiler technologies and MPI stacks (from the vendor and the MVAPICH2 team).


Bio

Pramod Kumbhar

Pramod Kumbhar is an HPC Architect in the Computing Division at the Blue Brain Project. His focus is on the development of the NEURON/CoreNEURON simulator within the Blue Brain Project. Over the years Pramod has been working on parallelisation, performance optimisation and scaling of scientific codes on various supercomputing architectures. Pramod has strong hands-on experience with a variety of performance analysis tools at scale and with micro-architecture level performance tuning. He also has a keen interest in domain specific languages (DSL) and modern compiler technologies. Before joining the Blue Brain Project, Pramod worked at the Jülich Research Centre, Germany.

Abstract

With a variety of new hardware technologies becoming available for HPC, it is an exciting time for our community. One potentially very important future player in the HPC space will be ARM, and there are already a number of ARM CPU-based HPC systems available. In this talk I will describe work we have done in exploring the performance properties of MVAPICH, OpenMPI and MPT on one of these systems, Fulhame, which is an HPE Apollo 70-based system with 64 nodes of Cavium ThunderX2 ARM processors and Mellanox InfiniBand interconnect. In order to take advantage of these systems most effectively, it is very important to understand the performance that different MPI implementations can provide and any further opportunities to optimise these. Therefore, starting with the OSU benchmarks, I will explore the different performance properties of these technologies on Fulhame and other systems for comparison, before moving on to a number of more substantial applications.


Bio

Nicholas Brown

Dr Nick Brown is a Research Fellow at EPCC, the University of Edinburgh, with research interests in parallel programming language design, compilers and runtimes. He has worked on a number of large scale parallel codes including developing MONC, an atmospheric model used by the UK climate & weather communities which involves novel in-situ data analytics. He is also interested in micro-core architectures, developing ePython, a very small memory footprint Python interpreter with parallel extensions, for many core, low memory chips. Nick is a course organiser on EPCC's MSc in HPC, as well as supervising MSc and PhD students.

10:30 - 11:00

Break

Abstract

The talk will describe learning and workforce development (LWD) programs within the Office of Advanced Cyberinfrastructure (OAC) in the CISE directorate at the National Science Foundation. OAC's mission is to support advanced cyberinfrastructure to accelerate discovery and innovation across all science and engineering disciplines. The programs specifically addressed include the CAREER program for faculty early career development, the CISE Research Initiation Initiative (CRII) for early career faculty who have not yet been a PI on a Federal grant, the Cybertraining program for research workforce preparation, and the OAC Core Research Program that is now part of the CISE Core Research programs solicitation.


Bio

Alan Sussman

Alan Sussman is currently program director in the Office of Advanced Cyberinfrastructure at NSF in charge of learning and workforce development programs, and is also active in software and data related cyberinfrastructure programs. He is on leave from his permanent position as a Computer Science professor at the University of Maryland. His research interests have focused on systems software support for large-scale applications that require high performance parallel and distributed computing. In addition, since 2010, he has been helping coordinate a curriculum initiative in parallel and distributed computing with the premise that every undergraduate student in Computer Science or Computer Engineering must acquire basic parallel computing skills. This curriculum has had wide adoption including its direct impact on ACM’s CS2013 Curriculum.

Abstract

The University of Cambridge Research Computing service hosts some of the UK’s largest HPC systems, with Intel x86, Intel many-core, and Nvidia GPU machines. These machines serve very diverse communities of researchers working across all domains. As well as different compute hardware, different components of the system are built with different interconnect technologies including Mellanox IB, Intel OPA and RDMA accelerated Ethernet. In this talk I shall discuss some of the challenges of supporting research in such a diverse environment, and how we use MVAPICH2 across all of the different systems to deliver the best performance from each pool of hardware.


Bio

Jeffrey Salmond

Jeffrey Salmond has a background in Theoretical Physics and Scientific Computing. His postgraduate work in the Laboratory for Scientific Computing at the University of Cambridge primarily involved writing high-performance code for computational fluid dynamics problems. In his current role as a Research Software Engineer and High-Performance Computing Consultant, Jeffrey optimises the performance of complex scientific codes across a range of projects. These scientific codes are utilised for research within the departments of Physics, Chemistry, and Engineering.

Abstract

AI Bridging Cloud Infrastructure (ABCI) is the world's first large-scale Open AI Computing Infrastructure, constructed and operated by the National Institute of Advanced Industrial Science and Technology (AIST), Japan. It delivers 19.9 petaflops of HPL performance and the world's fastest training time of 1.17 minutes in ResNet-50 training on ImageNet datasets as of July 2019. ABCI consists of 1,088 compute nodes, each of which is equipped with two Intel Xeon Gold Scalable Processors, four NVIDIA Tesla V100 GPUs, two InfiniBand EDR HCAs and an NVMe SSD. ABCI offers a sophisticated high performance AI development environment realized by CUDA, Linux containers, on-demand parallel filesystem, MPI, including MVAPICH, etc. In this talk, we focus on ABCI’s network architecture and the communication libraries available on ABCI, and show their performance and recent research achievements.


Bio

Shinichiro Takizawa

Shinichiro Takizawa, Ph.D., is a senior research scientist in the AI Cloud Research Team, AI Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Japan. His research interests are data processing and resource management on large-scale parallel systems. He also works as a member of the AI Bridging Cloud Infrastructure (ABCI) operation team and designs future ABCI services. Shinichiro Takizawa received his Ph.D. in Science from Tokyo Institute of Technology in 2009.

12:15 - 1:15

Lunch

Abstract

KISTI's Nurion supercomputer features 8305 nodes with Intel Xeon Phi KNL (Knights Landing) processors (68 cores) and 132 nodes with Intel Skylake CPUs (2-socket, 40 cores). Nurion is a system consisting of compute nodes, CPU-only nodes, Omni-Path interconnect networks, burst buffer high-speed storage, a Lustre-based parallel file system, and a water cooling device based on Rear Door Heat Exchanger (RDHx). We will present microbenchmark and application performance results using MVAPICH on the KNL nodes.


Bio

Minsik Kim

Minsik Kim is a researcher in the Supercomputing Infrastructure Center of the Korea Institute of Science and Technology Information (KISTI). He received his Ph.D. degree in Electrical and Electronic Engineering from Yonsei University in 2019. His research interests include neural network optimization on GPUs, computer architecture, and high-performance computing. He is a member of IEEE. More details about Dr. Kim are available at http://minsik-kim.github.io.

Abstract

SDSC supports HPC and Deep Learning applications on systems featuring K80, P100, and V100 GPUs. On the NSF funded Comet cluster there are primarily two types of GPU nodes: 1) 36 nodes with Intel Haswell CPUs (2-socket, 24 cores) with 4 NVIDIA K80 GPUs (two accelerator cards) each, and 2) 36 nodes with Intel Broadwell CPUs (2-socket, 28 cores) with 4 NVIDIA P100 GPUs each. Additionally, one node with 4 V100 GPUs is available for benchmarking and testing. Some of the deep learning applications are supported via the Singularity containerization solution. Application testing and performance results using MVAPICH2-GDR on the various types of nodes and with containerization will be presented.


Bio

Mahidhar Tatineni

Mahidhar Tatineni received his M.S. & Ph.D. in Aerospace Engineering from UCLA. He currently leads the User Services group at SDSC. He has led the deployment and support of high performance computing and data applications software on several NSF and UC resources including Comet and Gordon at SDSC. He has worked on many NSF funded optimization and parallelization research projects such as petascale computing for magnetosphere simulations, MPI performance tuning frameworks, hybrid programming models, topology aware communication and scheduling, big data middleware, and application performance evaluation using next generation communication mechanisms for emerging HPC systems.

Abstract

The OSU InfiniBand Network Analysis and Monitoring (OSU INAM) tool has been running on OSC’s production systems for several months. In this talk we’ll give an overview of OSC’s HPC environment, IB fabric and our INAM deployment. It will include a discussion of OSC’s INAM configuration and improvements to the scalability of fabric discovery and optimization of database insertion/query rates resulting from our deployment. We'll also discuss integration with OSC's Torque/MOAB resource management and early experiences in analysis of job communication characteristics. At the end of the talk we'll give a short demo of INAM at OSC.


Bio

Karen Tomko and Heechang Na

Karen Tomko is the Director of Research Software Applications and serves as manager of the Scientific Applications group at the Ohio Supercomputer Center where she oversees deployment of software for data analytics, modeling and simulation. Her research interests are in the field of parallelization and performance improvement for High Performance Computing applications. She has been with OSC since 2007 and has been collaborating with DK Panda and the MVAPICH team for about 10 years.

2:45 - 3:15

Break

Abstract

The OpenPOWER Foundation is an open technical membership organization that will assist data centers in rethinking their approach to technology. OpenPOWER members are actively pursuing innovations and welcome all parties to join in moving OpenPOWER systems design forward. The presentation will cover OpenPOWER servers' capabilities and features. It will help attendees understand how combined OpenSoftware and OpenHardware architectures stimulate collaboration and are a key to success. The latest game-changing announcements made by the OpenPOWER Foundation will be presented.


Bio

Alexandre Castellane

Alexandre Castellane is a senior specialist engineer at IBM, France. He received a Master's degree in electronics engineering from the French engineering school ENSERG in 1992. After a long career in analog radio test and development for 2G/3G applications, he came to high speed digital logic through IBM's Packet Routing Switches as an application engineer. He came back to analog as a signal integrity specialist for a few years before joining IBM's CAPI team to help promote CAPI/OpenCAPI solutions. He still keeps in contact with hardware through FPGA coding for POC and testing purposes and is an open innovation enthusiast.

Abstract

This lecture will help you understand what FPGA hardware acceleration provides and when it can be used to complement or replace GPUs. The SNAP framework provides software engineers with a means to use this technology in a snap! The unique advantages of POWER technology, including CAPI/OpenCAPI technology coupled with the SNAP framework, will be presented. Details of what the memory coherency and low latency associated with FPGAs bring to you will be explored through very simple examples.


Bio

Alexandre Castellane

Alexandre Castellane is a senior specialist engineer at IBM, France. He received a Master's degree in electronics engineering from the French engineering school ENSERG in 1992. After a long career in analog radio test and development for 2G/3G applications, he came to high speed digital logic through IBM's Packet Routing Switches as an application engineer. He came back to analog as a signal integrity specialist for a few years before joining IBM's CAPI team to help promote CAPI/OpenCAPI solutions. He still keeps in contact with hardware through FPGA coding for POC and testing purposes and is an open innovation enthusiast.

4:00 - 5:00

OSC Facilities Tour

The names of the people going on this tour need to be provided to the State facility 24 hours in advance. If you are interested in this tour, please send a note to the MUG address by 4:00 pm on Tuesday (August 20th).
• It takes about an hour in total time from walking in the SOCC door to leaving. Transportation to and from the SOCC is additional time.
• The building is located at 1320 Arthur E. Adams Drive. Note this is NOT the same building as the main OSC offices.
• The building has TSA/Airport style security. Everyone must go through a metal detector/xray. As such, no weapons of any type are allowed in.
• Backpacks with 2 straps are also prohibited from being brought in. These need to be left at home.
• No photography of any sort is allowed outside the building, around security or in the hallways. Photography IS allowed within the actual room with the computers (i.e. leave your phones in your pockets/purses until we tell you it’s ok to take them out).
• Adults must bring some sort of government-issued photo ID. Minors don’t need ID.

5:30 - 5:30

Shuttle Service to SpringHill Suites (by the hotel)