MUG'18

Preliminary Advance Program

Conference Location: Ohio Supercomputer Center Bale Theater

Monday, August 06

7:45 - 8:45

Registration and Continental Breakfast

Abstract

Arm Forge is a cross-platform performance engineering toolkit comprised of Arm DDT and Arm MAP. DDT is a parallel debugger supporting a wide range of parallel architectures and models including MPI, UPC, CUDA and OpenMP, and MAP is a low-overhead line-level profiler for MPI, OpenMP and scalar programs. This tutorial will present Arm Forge and demonstrate how performance problems in applications using MVAPICH2 can be identified and resolved. An approach for identifying communication bottlenecks will be presented that builds from the basics up to advanced features in Arm Forge. We will explore custom metrics for MPI profiling and demonstrate how Arm Forge may be used on extreme-scale applications with extremely low overhead and little or no loss of capability.
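
To make the kind of problem MAP surfaces concrete, here is a minimal, hypothetical MPI kernel (not part of the tutorial materials) whose per-step gather through rank 0 creates exactly the sort of communication bottleneck a line-level profiler would attribute to the MPI_Gather/MPI_Bcast lines:

```c
/* bottleneck.c - illustrative kernel with a deliberate communication bottleneck.
 * Compile:  mpicc -g -O2 bottleneck.c -o bottleneck
 * Run it under a profiler (e.g., Arm MAP) to see MPI time concentrated on the
 * gather/broadcast lines below.  This is a toy example, not tutorial code. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000   /* elements per rank */
#define STEPS 50

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *local = malloc(N * sizeof(double));
    double *all   = (rank == 0) ? malloc((size_t)N * size * sizeof(double)) : NULL;

    for (int i = 0; i < N; i++)
        local[i] = rank + i * 1e-6;

    double sum = 0.0;
    for (int step = 0; step < STEPS; step++) {
        /* Local work: cheap update of the local array. */
        for (int i = 0; i < N; i++)
            local[i] = 0.5 * (local[i] + local[(i + 1) % N]);

        /* Bottleneck: everything is funnelled through rank 0 every step. */
        MPI_Gather(local, N, MPI_DOUBLE, all, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        if (rank == 0)
            for (int i = 0; i < N * size; i++)
                sum += all[i];
        MPI_Bcast(&sum, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    }

    if (rank == 0)
        printf("final sum = %g\n", sum);

    free(local);
    free(all);
    MPI_Finalize();
    return 0;
}
```

With Arm Forge, one would typically wrap the usual mpirun invocation with the profiler and inspect the per-line MPI time it reports; see the Forge documentation for the exact command on a given system.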


Bio

John Linford

John Linford is a principal applications engineer at Arm. He has extensive experience creating, using, supporting, and deploying high performance computing applications and technologies. His research interests include emerging computer architectures, compilers, code generation, performance analysis, and numerical simulation (particularly atmospheric chemistry). He has developed tools for chemical kinetic simulation, rotorcraft engineering, software performance analysis, and software environment management.

10:30 - 11:00

Break

Abstract

High performance computing has begun scaling beyond Petaflop performance towards the Exaflop mark. One of the major concerns throughout the development toward such performance capability is scalability - at the component level, the system level, the middleware, and the application level. Mellanox's Co-Design approach, which couples the development of the software libraries with the underlying hardware, can help to overcome those scalability issues and enable a more efficient design approach towards the Exascale goal. In the tutorial session we will review the latest development areas within the Co-Design architecture: SHArP technology, Unified Communication X (UCX), hierarchical collectives, GPU architecture support, etc.
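
As a rough illustration of the communication pattern these technologies target (a generic sketch, not Mellanox code), the micro-benchmark below times small-message MPI_Allreduce, the collective that in-network offloads such as SHArP are designed to accelerate; comparing its output with and without the offload enabled is a common way to see the benefit:

```c
/* allreduce_lat.c - minimal MPI_Allreduce latency loop (illustrative only).
 * Compile: mpicc -O2 allreduce_lat.c -o allreduce_lat
 * Small-message allreduce is the operation that in-network offloads target. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int iters = 10000, warmup = 100;
    double in = 1.0, out = 0.0;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < warmup; i++)
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg 8-byte MPI_Allreduce latency: %.2f us\n",
               (t1 - t0) * 1e6 / iters);

    MPI_Finalize();
    return 0;
}
```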


Bio

Devendar Bureddy

Devendar Bureddy is a Sr. Staff Engineer at Mellanox Technologies. At Mellanox, Devendar has been instrumental in building several key technologies such as SHArP, HCOLL, and GPU acceleration. Previously, he was a software developer at The Ohio State University in the Network-Based Computing Laboratory led by Dr. D. K. Panda. At NOWLAB, Devendar was involved in the design and development of MVAPICH2, an open-source high-performance implementation of MPI over InfiniBand and 10GigE/iWARP. He received his Master's degree in Computer Science and Engineering from the Indian Institute of Technology, Kanpur. His research interests include high-speed interconnects, parallel programming models, and HPC software.

12:30 - 1:30

Lunch

Abstract

Significant growth has been witnessed during the last few years in HPC clusters with multi-/many-core processors, accelerators, and high-performance interconnects (such as InfiniBand, Omni-Path, iWARP, and RoCE). To alleviate the cost burden, sharing HPC cluster resources with end users through virtualization is becoming more and more attractive. The recently introduced Single-Root I/O Virtualization (SR-IOV) technique for InfiniBand and High-Speed Ethernet on HPC clusters provides native I/O virtualization capabilities and opens up many opportunities to design efficient HPC clouds. However, SR-IOV also brings additional design challenges arising from the lack of support for locality-aware communication and virtual machine migration. This tutorial will first present an efficient approach to building HPC clouds based on MVAPICH2 over SR-IOV enabled HPC clusters. High-performance designs of the virtual machine (KVM) and container (Docker, Singularity) aware MVAPICH2 library (called MVAPICH2-Virt) will be introduced. This tutorial will also present a high-performance virtual machine migration framework for MPI applications on SR-IOV enabled InfiniBand clouds. The second part of the tutorial will present advanced designs with cloud resource managers such as OpenStack and SLURM that make it easier for users to deploy and run their applications with the MVAPICH2 library on HPC clouds. A demo will be provided to guide the usage of the MVAPICH2-Virt library.
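
As a small, generic placement check (illustrative only; not part of MVAPICH2-Virt), the program below prints the rank-to-host mapping and the number of node-local peers, which helps verify how MPI processes land on VMs or containers and whether locality-aware communication is possible:

```c
/* where_am_i.c - print the rank-to-host mapping (illustrative helper).
 * Compile: mpicc where_am_i.c -o where_am_i
 * Useful for checking how ranks are placed across VMs or containers. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len, local_rank, local_size;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm node_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    /* Group ranks that share a node (or a VM/container with a common shared
     * memory domain) to see how many peers are reachable via shared memory. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &local_rank);
    MPI_Comm_size(node_comm, &local_size);

    printf("world rank %d/%d on host %s (local rank %d of %d)\n",
           rank, size, host, local_rank, local_size);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```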


Bio

Xiaoyi Lu

Dr. Xiaoyi Lu is a Research Scientist in the Department of Computer Science and Engineering at the Ohio State University, USA. His current research interests include high performance interconnects and protocols, Big Data, the Hadoop/Spark/Memcached ecosystem, parallel computing models (MPI/PGAS), virtualization, cloud computing, and deep learning. He has published more than 100 papers in international journals and conferences related to these research areas. He has been actively involved in various professional activities (PC Co-Chair, PC Member, and Reviewer) in academic journals and conferences. Dr. Lu is currently leading the research and development of RDMA-based accelerations for Apache Hadoop, Spark, HBase, and Memcached, as well as the OSU HiBD micro-benchmarks, which are publicly available from http://hibd.cse.ohio-state.edu. These libraries are currently being used by more than 285 organizations from 34 countries. More than 26,950 downloads of these libraries have taken place from the project site. He is a core member of the MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) project and he is leading the research and development of MVAPICH2-Virt (high-performance and scalable MPI for hypervisor and container based HPC clouds). He is a member of IEEE and ACM. More details about Dr. Lu are available at http://web.cse.ohio-state.edu/~lu.932/.

3:00 - 3:30

Break

6:00 - 9:30

Reception Dinner

Tuesday, August 07

7:45 - 8:15

Registration and Continental Breakfast

8:15 - 8:30

Opening Remarks

Abstract

The MPI Forum is currently working towards version 4.0 of the MPI standard, which is likely to include major feature additions such as persistent collectives, improved error handling, large count extensions, and an events-based tools interface. In the first part of this talk I will highlight these promising directions and discuss their state with respect to standardization. I will also discuss the overall direction of the MPI Forum as well as the anticipated timeline for the next standard release. However, just defining new features in the standard alone is not sufficient — they must also be adopted in implementations. This can be a lengthy process requiring significant amounts of work. Open source distributions, like MVAPICH, play a critical role in these efforts and have, already in the past, been major drivers towards adoption. In the second part of my talk I will highlight this importance and use the adoption of the MPI_T interface, added in MPI 3.0, in MVAPICH as an example. The availability of a wide diversity of MPI_T variables offered by MVAPICH has not only demonstrated the value of the interface to the end user, but has also enabled matching developments of tools making use of MPI_T.
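
To make the MPI_T example concrete, the short program below uses only standard MPI_T calls to enumerate the control variables an MPI library exposes; the set and naming of the variables depend on the library (here, MVAPICH2) and its build options, and the buffer sizes are arbitrary illustrative choices:

```c
/* list_cvars.c - enumerate MPI_T control variables (standard MPI-3 calls only).
 * Compile: mpicc list_cvars.c -o list_cvars ; run with a single process. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, ncvar;
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Init(&argc, &argv);

    MPI_T_cvar_get_num(&ncvar);
    printf("This MPI library exposes %d control variables:\n", ncvar);

    for (int i = 0; i < ncvar; i++) {
        char name[256], desc[1024];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, binding, scope;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;

        if (MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                                &enumtype, desc, &desc_len, &binding,
                                &scope) == MPI_SUCCESS)
            printf("  [%4d] %s\n", i, name);
    }

    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}
```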


Bio

Martin Schulz

Martin Schulz is a Full Professor at the Technische Universität München (TUM), which he joined in 2017. Prior to that, he held positions at the Center for Applied Scientific Computing (CASC) at Lawrence Livermore National Laboratory (LLNL) and Cornell University. He earned his Doctorate in Computer Science in 2001 from TUM and a Master of Science in Computer Science from UIUC. Martin has published over 200 peer-reviewed papers and currently serves as the chair of the MPI Forum, the standardization body for the Message Passing Interface. His research interests include parallel and distributed architectures and applications; performance monitoring, modeling and analysis; memory system optimization; parallel programming paradigms; tool support for parallel programming; power-aware parallel computing; and fault tolerance at the application and system level. Martin was a recipient of the IEEE/ACM Gordon Bell Award in 2006 and an R&D 100 award in 2011.

Abstract

This talk will provide an overview of the MVAPICH project (past, present, and future). Future roadmap and features for upcoming releases of the MVAPICH2 software family (including MVAPICH2-X, MVAPICH2-GDR, and MVAPICH2-Virt) will be presented. Current status and future plans for OSU INAM, OEMT, and OMB will also be presented.


Bio

Dhabaleswar K (DK) Panda

DK Panda is a Professor and University Distinguished Scholar of Computer Science and Engineering at the Ohio State University. He has published over 450 papers in the area of high-end computing and networking. The MVAPICH2 (High-Performance MPI and PGAS over InfiniBand, iWARP, and RoCE) libraries, designed and developed by his research group (http://mvapich.cse.ohio-state.edu), are currently being used by more than 2,900 organizations worldwide (in 85 countries). More than 480,000 downloads of this software have taken place from the project's site. This software is empowering several InfiniBand clusters (including the 2nd, 12th, 15th, 24th and 62nd ranked ones) in the TOP500 list. The RDMA packages for Apache Spark, Apache Hadoop and Memcached together with OSU HiBD benchmarks from his group (http://hibd.cse.ohio-state.edu) are also publicly available. These libraries are currently being used by more than 285 organizations in 34 countries. More than 26,900 downloads of these libraries have taken place. He is an IEEE Fellow. The group has also been focusing on co-designing Deep Learning Frameworks and MPI Libraries. A high-performance and scalable version of the Caffe framework is available from High-Performance Deep Learning (HiDL) Project site. More details about Prof. Panda are available at http://www.cse.ohio-state.edu/~panda.

10:15 - 10:45

Break

Abstract


Bio

Adam Moody

Adam is a member of the Development Environment Group within Livermore Computing. His background is in MPI development, collective algorithms, networking, and parallel I/O. He is responsible for supporting MPI on Livermore's Linux clusters. He is a project lead for the Scalable Checkpoint / Restart library and mpiFileUtils -- two projects that use MPI to help users manage large data sets. He leads the CORAL burst buffer working group for Livermore. In recent work, he has been investigating how to employ MPI and fast storage in deep learning frameworks like LBANN.

Abstract

The Juelich Supercomputing Centre operates some of the largest supercomputers in Europe, with an installed capacity of over 17 petaflops. Like many other supercomputing centers, it has a variety of MPI runtimes available to its users. JSC is also the proponent of the Modular Supercomputing Architecture (MSA), which has certain implications for the development of MPI runtimes. This talk will present a brief analysis of the performance of the MPI runtimes presently installed on the different supercomputers/modules. Looking to the future, the particularities of JSC's roadmap and users, and how they affect its choice of MPI runtimes, will also be introduced.


Bio

Damian Alvarez

Dr. Damian Alvarez joined the Jülich Supercomputing Centre (JSC) in 2011. There he is the scientific software manager for the production systems. He is also part of the ExaCluster Laboratory, a collaboration between JSC, Intel, and ParTec that investigates novel technologies to reach Exascale, including the DEEP, DEEP-ER and DEEP-EST projects. His research interests include optimization on manycore processors, system architecture, novel programming models for high performance computing, PGAS languages, collectives optimization, and the management of scientific software on supercomputers.

12:15 - 12:30

Group Photo

12:30 - 1:30

Lunch

Abstract

The latest revolution in HPC and AI is the co-design approach: a collaborative effort among industry thought leaders, academia, and manufacturers to reach Exascale performance by taking a holistic system-level approach to fundamental performance improvements. Co-design recognizes that the CPU has reached the limits of its scalability, and offers an intelligent network as the new “co-processor” to share the responsibility for handling and accelerating application workloads. The session will describe the latest technology developments and performance results from recent large-scale deployments.


Bio

Gilad Shainer

Gilad Shainer has served as the vice president of marketing at Mellanox Technologies since March 2013. Previously, Mr. Shainer was Mellanox's vice president of marketing development from March 2012 to March 2013. Mr. Shainer joined Mellanox in 2001 as a design engineer and later served in senior marketing management roles between July 2005 and February 2012. Mr. Shainer holds several patents in the field of high-speed networking and contributed to the PCI-SIG PCI-X and PCIe specifications. Gilad Shainer holds an MSc degree (2001, Cum Laude) and a BSc degree (1998, Cum Laude) in Electrical Engineering from the Technion Institute of Technology in Israel.

Abstract

The Open Fabrics Interface (OFI) was envisioned and created to provide applications with high-level, application-oriented communication semantics. Several application domains, such as MPI, PGAS, streaming, and RPC models, are considered when deciding which semantics OFI should support. While OFI gives applications an interface that is simpler to use, it also gives fabric vendors a faster innovation cycle and reduced software maintenance costs. OFI is completely vendor agnostic and accommodates a variety of underlying fabric hardware design models. Of the many hardware design models available, the Verbs-based model has been around for roughly two decades, starting with the VIA interface. In this talk, we focus on the development status and performance of OFI over fabrics that were designed using the Verbs model, such as InfiniBand, iWARP and RoCE.
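
As an illustration of the OFI discovery path discussed in the talk (a sketch; provider names and capabilities depend on the installed libfabric and hardware), the program below asks libfabric for endpoints offered by the verbs provider, mirroring the first step an MPI library performs before opening a fabric:

```c
/* ofi_probe.c - query libfabric for verbs-provider endpoints (illustrative).
 * Compile: cc ofi_probe.c -lfabric -o ofi_probe
 * Lists the fabric/domain names the "verbs" provider exposes (InfiniBand,
 * RoCE, iWARP devices), mirroring the discovery step an MPI library performs. */
#include <rdma/fabric.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL;

    hints->caps = FI_MSG | FI_RMA;                     /* messaging + RMA capabilities */
    hints->ep_attr->type = FI_EP_RDM;                  /* reliable datagram endpoint   */
    hints->fabric_attr->prov_name = strdup("verbs");   /* restrict to verbs provider   */

    int ret = fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, hints, &info);
    if (ret) {
        fprintf(stderr, "fi_getinfo failed: %s\n", fi_strerror(-ret));
        fi_freeinfo(hints);
        return 1;
    }

    for (struct fi_info *cur = info; cur; cur = cur->next)
        printf("provider=%s fabric=%s domain=%s\n",
               cur->fabric_attr->prov_name,
               cur->fabric_attr->name,
               cur->domain_attr->name);

    fi_freeinfo(info);
    fi_freeinfo(hints);
    return 0;
}
```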


Bio

Sayantan Sur

Sayantan Sur is a Software Engineer at Intel Corp in Hillsboro, Oregon. His work involves high performance computing, specializing in scalable interconnection fabrics and message passing software (MPI). Before joining Intel Corp, Dr. Sur was a Research Scientist at the Department of Computer Science and Engineering at The Ohio State University. In the past, he held a post-doctoral position at IBM T. J. Watson Research Center, NY. He has published more than 20 papers in major conferences and journals related to these research areas. Dr. Sur received his Ph.D. degree from The Ohio State University in 2007.

Abstract


Bio

Pavel Shamis

Pavel is a Principal Research Engineer at ARM with over 16 years of experience in developing HPC solutions. His work is focused on co-designing software and hardware building blocks for high-performance interconnect technologies, developing communication middleware, and novel programming models. Prior to joining ARM, he spent five years at Oak Ridge National Laboratory (ORNL) as a research scientist in the Computer Science and Math Division (CSMD). In this role, Pavel was responsible for research and development on multiple projects in the high-performance communication domain, including Collective Communication Offload (CORE-Direct & Cheetah), OpenSHMEM, and OpenUCX. Before joining ORNL, Pavel spent ten years at Mellanox Technologies, where he led the Mellanox HPC team and was responsible for the development of the HPC software stack, including the OFA software stack, Open MPI, MVAPICH, OpenSHMEM, and others. Pavel is a recipient of the prestigious R&D 100 award for his contribution to the development of the CORE-Direct collective offload technology. In addition, Pavel has contributed to multiple open specifications (OpenSHMEM, MPI, UCX) and numerous open source projects (MVAPICH, Open MPI, OpenSHMEM-UH, etc.).

3:00 - 3:45

Break and Student Poster Session

Abstract


Bio

Mahidhar Tatineni

Mahidhar Tatineni received his M.S. and Ph.D. in Aerospace Engineering from UCLA. He currently leads the User Services group at SDSC. He has led the deployment and support of high performance computing and data applications software on several NSF and UC resources, including Comet and Gordon at SDSC. He has worked on many NSF-funded optimization and parallelization research projects, such as petascale computing for magnetosphere simulations, MPI performance tuning frameworks, hybrid programming models, topology-aware communication and scheduling, big data middleware, and application performance evaluation using next-generation communication mechanisms for emerging HPC systems.

Abstract

NVIDIA's CUDA unified virtual memory was motivated by the need to create an analog of uniform virtual memory across both the host and the GPU device. This relieves the programmer from the burden of memory management, since the CUDA device driver transparently copies memory between the GPU device and the CPU host. While several successes were previously reported in the literature, progress on checkpointing GPUs came to a halt with the need to support CUDA's new unified virtual memory model, analogous to traditional virtual memory. A new system, CRUM, built on top of DMTCP, addresses this challenge by using the hardware virtual memory subsystem of the host to decouple computation state from the GPU device driver context. We then briefly observe how a similar strategy can be used on the Cori supercomputer at NERSC to support transparent checkpointing over the Cray-proprietary GNI interconnection network for MPI. This approach also allows one to checkpoint over one implementation of MPI, and then restart under a second and different implementation of MPI.


Bio

Gene Cooperman and Rohan Garg

Professor Cooperman works in high-performance computing and scalable applications for computational algebra. He received his B.S. from the University of Michigan in 1974, and his Ph.D. from Brown University in 1978. He then spent six years in basic research at GTE Laboratories. He came to Northeastern University in 1986, and has been a full professor since 1992. In 2014, he was awarded a five-year IDEX Chair of Attractivity from the Université Fédérale Toulouse Midi-Pyrénées, France. Since 2004, he has led the DMTCP project (Distributed MultiThreaded CheckPointing). Prof. Cooperman also has a 15-year relationship with CERN, where his work on semi-automatic thread parallelization of task-oriented software is included in the million-line Geant4 high-energy physics simulator. His current research interests emphasize studying the limits of transparent checkpoint-restart. Some current domains of interest are: supercomputing, cloud computing, engineering desktops (license servers, etc.), GPU-accelerated graphics, GPGPU computing, and the Internet of Things.

4:45 - 5:15

Open MIC Session

6:00 - 9:30

Banquet Dinner

Wednesday, August 08

7:45 - 8:30

Registration and Continental Breakfast

Abstract

This talk will describe how supercomputers at SDSC are enabling scientific discovery over a decade: 2011 - 2021. Tens of thousands of researchers across a wide range of domains have performed simulations and data processing on the three NSF-funded supercomputers at SDSC - Trestles, Gordon, and Comet. Researchers from traditional fields of astrophysics, biochemistry, geosciences, and engineering; non-traditional fields of neuroscience, social science, humanities, and arts; data science fields of biomedical image processing, text processing, and genomics deep learning; and (high-throughput computing based) data processing fields of high energy physics (e.g. the ATLAS and CMS experiments), multi-messenger astronomy (e.g. LIGO), and high-precision neutrino measurements (e.g. the IceCube experiment) have all utilized these machines effectively. High performance interconnects and large-scale parallel I/O subsystems have played a tremendous role in enabling these simulations and data processing. The MVAPICH2 library and the RDMA-based High-Performance Big Data project have been integral parts of the system software stack to make optimal use of these high performance hardware components.


Bio

Amitava Majumdar

Amit Majumdar is the Division Director of the Data Enabled Scientific Computing (DESC) division at the San Diego Supercomputer Center (SDSC) and an Associate Professor in the Department of Radiation Medicine and Applied Sciences at the University of California San Diego. His research interests are in high performance computing, computational science, cyberinfrastructure, and science gateways. He has developed parallel algorithms and implemented them on various kinds of HPC machines, and is interested in understanding the performance and scalability of scientific applications on HPC machines. He is the PI of multiple research projects funded by NSF, NIH, DOD, AFOSR, and industry partners such as Intel and Microsoft. He received his bachelor's degree in Electronics and Telecommunication Engineering from Jadavpur University, Calcutta, India; his master's degree in Nuclear Engineering from Idaho State University, Pocatello, ID; and his doctoral degree in the interdisciplinary program of Nuclear Engineering and Scientific Computing from the University of Michigan, Ann Arbor, MI.

Abstract


Bio

Karl Schulz

Karl W. Schulz received his Ph.D. in Aerospace Engineering from the University of Texas in 1999. After completing a one-year post-doc, he transitioned to the commercial software industry, working for the CD-Adapco group as a Senior Project Engineer to develop and support engineering software in the field of computational fluid dynamics (CFD). After several years in industry, Karl returned to the University of Texas in 2003, joining the research staff at the Texas Advanced Computing Center (TACC), a leading research center for advanced computational science, engineering and technology. During his 10-year term at TACC, Karl was actively engaged in HPC research, scientific curriculum development and teaching, technology evaluation and integration, and strategic initiatives, serving on the Center's leadership team as an Associate Director and leading TACC's HPC group and Scientific Applications group during his tenure. He was a co-principal investigator on multiple Top-25 system deployments, serving as application scientist and principal architect for the cluster management software and HPC environment. Karl also served as the Chief Software Architect for the PECOS Center within the Institute for Computational Engineering and Sciences, a research group focusing on the development of next-generation software to support multi-physics simulations and uncertainty quantification. Karl joined the Technical Computing Group at Intel in January 2014 and is presently a Principal Engineer engaged in the architecture, development, and validation of HPC system software.

Abstract

High-performance computing systems have historically been designed to support applications comprised of mostly monolithic, single-job workloads. Pilot systems decouple workload specification, resource selection, and task execution via job place-holders and late-binding. Pilot systems help to satisfy the resource requirements of workloads comprised of multiple tasks. We discuss the importance of scalable and general-purpose Pilot systems to support high-performance workflows. We describe the formal properties of a Pilot system and discuss the design, architecture, and implementation of RADICAL-Pilot. We discuss how RADICAL-Pilot has been integrated with other application-level tools as a runtime system, and thus its value as an important “building block” for high-performance workflows.


Bio

Shantenu Jha

Shantenu Jha is an Associate Professor of Computer Engineering at Rutgers University, and Department Chair for Data Driven Discovery at Brookhaven National Laboratory. His research interests are at the intersection of high-performance distributed computing and computational science. Shantenu leads the RADICAL-Cybertools project, a suite of middleware building blocks used to support large-scale science and engineering applications. He collaborates extensively with scientists from multiple domains, including but not limited to Molecular Sciences, Earth Sciences and High-Energy Physics. He was appointed a Rutgers Chancellor's Scholar (2015-2020) and was the recipient of the inaugural Chancellor's Excellence in Research award (2016) for his cyberinfrastructure contributions to computational science. He is a recipient of the NSF CAREER Award (2013) and several prizes at SC'xy and ISC'xy. More details can be found at http://radical.rutgers.edu/shantenu

10:30 - 11:00

Break

Abstract


Bio

Sushil Prasad

Sushil K. Prasad is a Program Director at the National Science Foundation in its Office of Advanced Cyberinfrastructure (OAC) within the Computer and Information Science and Engineering (CISE) directorate. He is an ACM Distinguished Scientist and a Professor of Computer Science at Georgia State University. He is the director of the Distributed and Mobile Systems Lab, carrying out research in parallel, distributed, and data-intensive computing and systems. He has been twice elected chair of the IEEE-CS Technical Committee on Parallel Processing (TCPP), and leads the NSF-supported TCPP Curriculum Initiative on Parallel and Distributed Computing for undergraduate education.

Abstract

The TAU Performance System is a powerful and highly versatile profiling and tracing tool ecosystem for performance analysis of parallel programs at all scales. TAU has evolved with each new generation of HPC systems and presently scales efficiently to hundreds of thousands of cores on the largest machines in the world. To meet the needs of computational scientists to evaluate and improve the performance of their applications, we present TAU's new features including support for the MPI Tools (MPI_T) interface for interfacing with MPI's performance and control variables exported by MVAPICH, OMPT TR6 for OpenMP instrumentation, and APIs for instrumentation of Python, Kokkos, and CUDA applications. TAU uses these interfaces on unmodified binaries without the need for recompilation. This talk will describe these new instrumentation techniques to simplify the usage of performance tools including support for compiler-based instrumentation, rewriting binary files, preloading shared objects, automatic instrumentation at the source-code level, CUDA, OpenCL, and OpenACC instrumentation. The talk will also highlight TAU's analysis tools including its 3D Profile browser, ParaProf and cross-experiment analysis tool, PerfExplorer. http://tau.uoregon.edu
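
To make the MPI_T-based measurement concrete, here is a minimal sketch, using only standard MPI_T calls, of the kind of query a tool such as TAU performs internally: it enumerates the performance variables (pvars) the MPI library exports. The exact set of variables depends on the MVAPICH2 build; buffer sizes are arbitrary illustrative choices.

```c
/* list_pvars.c - enumerate MPI_T performance variables (standard MPI-3 calls).
 * Compile: mpicc list_pvars.c -o list_pvars
 * The set of pvars reported depends on the MPI library (e.g., MVAPICH2). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, npvar;
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Init(&argc, &argv);

    MPI_T_pvar_get_num(&npvar);
    printf("This MPI library exports %d performance variables:\n", npvar);

    for (int i = 0; i < npvar; i++) {
        char name[256], desc[1024];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, var_class, binding, readonly, continuous, atomic;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;

        if (MPI_T_pvar_get_info(i, name, &name_len, &verbosity, &var_class,
                                &dtype, &enumtype, desc, &desc_len, &binding,
                                &readonly, &continuous, &atomic) == MPI_SUCCESS)
            printf("  [%4d] %s\n", i, name);
    }

    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}
```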


Bio

Sameer Shende

Dr. Sameer Shende has helped develop the TAU Performance System, the Program Database Toolkit (PDT), and the HPCLinux distro. His research interests include tools and techniques for performance instrumentation, measurement, analysis, runtime systems, and compiler optimizations. He serves as the Director of the Performance Research Laboratory at the University of Oregon, and as the President and Director of ParaTools, Inc., ParaTools, SAS, and ParaTools, Ltd.

Abstract


Bio

Alessandro Fanfarillo

Dr. Alessandro Fanfarillo is a Senior Software Engineer at the National Center for Atmospheric Research. His work is mostly devoted to performance enhancement of parallel weather models, GPGPU computing, and software design/refactoring of scientific codes. His research focuses on how to exploit heterogeneous architectures (CPU + accelerators) and Partitioned Global Address Space (PGAS) languages (in particular coarray Fortran) for scientific purposes. He is also the lead developer of OpenCoarrays, the open-source library that implements coarray support in the GNU Fortran compiler.

12:45 - 1:30

Lunch

Abstract

Eulerian-Lagrangian couplings are now widely used to address engineering and technical problems. In particular, CFD-DEM couplings have been successfully applied to study several configurations ranging from mechanical, to chemical and environmental engineering. However, such simulations are normally very computationally intensive, and the execution time represents a major issue for the applicability of this numerical approach to complex scenarios. With this work, we introduce a novel coupling approach aimed at improving the performance of parallel CFD-DEM simulations. This strategy relies on two points. First, we propose a new partition-collocation strategy for the parallel execution of CFD-DEM couplings, which can considerably reduce the amount of inter-process communication between the CFD and DEM parts; however, this strategy imposes some alignment constraints on the CFD mesh. Second, we adopt a dual-grid multiscale scheme for the CFD-DEM coupling, which is known to offer better numerical properties and gives us more flexibility in the domain partitioning, overcoming the alignment constraints. We assess the correctness and performance of our approach on elementary benchmarks and at large scale with a realistic test case. The results show a significant performance improvement compared to other state-of-the-art CFD-DEM couplings presented in the literature.
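
To convey the intuition behind partition-collocation in a generic form (a toy MPI sketch under simplifying assumptions, not the coupling framework described in this talk), the snippet below pairs a "CFD" rank with a "DEM" rank on the same node so that their coupling exchange never leaves the node:

```c
/* collocate.c - toy illustration of co-locating coupled solvers (not the talk's code).
 * Even node-local ranks play the "CFD" side and odd ranks the "DEM" side of the
 * same sub-domain, so the coupling exchange stays intra-node.
 * Compile: mpicc collocate.c -o collocate ; run with >= 2 ranks per node. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_rank, node_rank, node_size;
    MPI_Comm node_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Ranks sharing a node form one communicator. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    int is_cfd  = (node_rank % 2 == 0);                     /* even local ranks: CFD  */
    int partner = is_cfd ? node_rank + 1 : node_rank - 1;   /* DEM partner, same node */

    if (partner < node_size) {
        double field_sent = is_cfd ? 1.0 * world_rank : -1.0 * world_rank;
        double field_recv = 0.0;
        /* Coupling exchange between co-located CFD and DEM ranks: intra-node only. */
        MPI_Sendrecv(&field_sent, 1, MPI_DOUBLE, partner, 0,
                     &field_recv, 1, MPI_DOUBLE, partner, 0,
                     node_comm, MPI_STATUS_IGNORE);
        printf("rank %d (%s, node-local %d) exchanged with node-local %d\n",
               world_rank, is_cfd ? "CFD" : "DEM", node_rank, partner);
    }

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```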


Bio

Xavier Besseron and Sébastien Varrette

Xavier Besseron is a permanent Research Scientist at the University of Luxembourg. He graduated in 2010 from Grenoble University (France) with a PhD in Computer Science. His PhD work was on fault tolerance and dynamic reconfiguration for large-scale distributed applications. From October 2010 to September 2011, he was a postdoctoral researcher at the Ohio State University (USA) under the supervision of Prof. D.K. Panda in the Network-Based Computing Lab (NOWLAB). During that time, he contributed to the MVAPICH project, a high-performance implementation of MPI for InfiniBand clusters. In October 2011, he joined the University of Luxembourg, first as a postdoctoral researcher in the Parallel Computing & Optimisation Group (PCOG) of Prof. Pascal Bouvry. Now he is part of the Luxembourg XDEM Research Centre (LuXDEM) and works under the supervision of Prof. Bernhard Peters on the optimisation and parallelization of the eXtended Discrete Element Method (XDEM). His research interests are High Performance Computing and Computational Sciences, and in particular the parallelization, optimization, debugging, and coupling of scientific HPC applications.

Abstract

Communication traces are increasingly important, both for parallel applications' performance analysis/optimization and for designing next-generation HPC systems. Meanwhile, the problem size and the execution scale on supercomputers keep growing, producing prohibitive volumes of communication traces. To reduce the size of communication traces, existing dynamic compression methods introduce large compression overhead as the job scale grows. We propose a hybrid static-dynamic method that leverages information acquired from static analysis to facilitate more effective and efficient dynamic trace compression. Our proposed scheme, CYPRESS, extracts a program communication structure tree at compile time using inter-procedural analysis. This tree naturally contains crucial iterative computing features such as the loop structure, allowing subsequent runtime compression to “fill in”, in a “top-down” manner, event details into the known communication template. Results show that CYPRESS reduces intra-process and inter-process compression overhead by up to 5X and 9X, respectively, over state-of-the-art dynamic methods, while introducing only very low compilation overhead.
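
As a toy illustration of the compression idea (not the CYPRESS implementation, which derives the communication structure statically at compile time), the sketch below collapses a repeated point-to-point pattern into a single template record plus a repeat count instead of one trace record per call:

```c
/* trace_compress.c - toy run-length compression of a communication trace.
 * Illustrative only: CYPRESS derives the loop structure statically; here we
 * simply collapse identical consecutive events at run time. */
#include <stdio.h>
#include <string.h>

typedef struct {
    char op[16];   /* e.g., "Send", "Recv"                           */
    int  peer;     /* partner rank                                   */
    int  bytes;    /* message size                                   */
    int  repeat;   /* how many consecutive times this event occurred */
} TraceRecord;

#define MAX_RECORDS 128

static TraceRecord trace[MAX_RECORDS];
static int nrecords = 0;

/* Append an event, merging it into the previous record when it matches. */
static void log_event(const char *op, int peer, int bytes)
{
    if (nrecords > 0) {
        TraceRecord *last = &trace[nrecords - 1];
        if (strcmp(last->op, op) == 0 && last->peer == peer && last->bytes == bytes) {
            last->repeat++;
            return;
        }
    }
    if (nrecords < MAX_RECORDS) {
        snprintf(trace[nrecords].op, sizeof(trace[nrecords].op), "%s", op);
        trace[nrecords].peer   = peer;
        trace[nrecords].bytes  = bytes;
        trace[nrecords].repeat = 1;
        nrecords++;
    }
}

int main(void)
{
    /* Phase 1: 1000 identical halo sends collapse into one template record. */
    for (int step = 0; step < 1000; step++)
        log_event("Send", 1, 8192);
    /* Phase 2: 1000 matching receives collapse into a second record. */
    for (int step = 0; step < 1000; step++)
        log_event("Recv", 1, 8192);

    printf("compressed trace: %d records instead of 2000 events\n", nrecords);
    for (int i = 0; i < nrecords; i++)
        printf("  %s to/from rank %d, %d bytes, x%d\n",
               trace[i].op, trace[i].peer, trace[i].bytes, trace[i].repeat);
    return 0;
}
```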


Bio

Jidong Zhai

Jidong Zhai is an associate professor in the Computer Science Department of Tsinghua University and was a visiting assistant professor at Stanford University (2015-2016). He received the Ph.D. degree in Computer Science from Tsinghua University in 2010, with the Excellent Ph.D. Graduate Student Award of Tsinghua University. His research focuses on high performance computing, compiler optimization, and performance analysis and optimization of large-scale parallel applications. His research received a Best Paper Finalist at SC'14. The team led by him has won 7 champion titles in student supercomputing challenges at SC, ISC, and ASC. He has served as a TPC member or reviewer for IEEE TPDS, IEEE TCC, SC, ICS, PPoPP, ICPP, NAS, LCPC, and HPCC. He was a co-chair of the ICPP PASA workshop in 2015 and a program co-chair of NPC 2018. He was named a 2010 Siebel Scholar and received the CCF outstanding doctoral dissertation award.

2:45 - 3:15

Break

Abstract

The tutorial will start with an overview of the MVAPICH2 libraries and their features. Next, we will focus on installation guidelines, runtime optimizations, and tuning flexibility in depth. An overview of configuration and debugging support in the MVAPICH2 libraries will be presented. High-performance support for GPU-enabled clusters in MVAPICH2-GDR and KNL-based systems in MVAPICH2-X will be presented. The impact on performance of the various features and optimization techniques will be discussed in an integrated fashion. 'Best Practices' for a set of common applications will be presented. A set of case studies related to application redesign will be presented to take advantage of hybrid MPI+PGAS programming models.


Bio

Hari Subramoni

Dr. Hari Subramoni is a research scientist in the Department of Computer Science and Engineering at the Ohio State University. His current research interests include high performance interconnects and protocols, parallel computer architecture, network-based computing, exascale computing, network topology aware computing, QoS, power-aware LAN-WAN communication, fault tolerance, virtualization, big data, and cloud computing. He has published over 50 papers in international journals and conferences related to these research areas. He has been actively involved in various professional activities in academic journals and conferences. He is a member of IEEE. More details about Dr. Subramoni are available at http://www.cse.ohio-state.edu/~subramon.