Extreme-scale Earthquake Simulation with MVAPICH

Yifeng Cui (yfcui@sdsc.edu)
San Diego Supercomputer Center
MUG’23, Aug 21-23, 2023
AWP-ODC Evolution &
Large-scale Earthquake Simulation
Why High Frequency Earthquake Modeling

Seismic band

- low-order free oscillations
- mantle waves
- crustal waves
- basin waves
- strongly scattered waves

Earthquake engineering band

- tall buildings
- houses
- stiff structures

SAF simulation 2023

- physics-based deterministic

SAF simulation 2026

- physics-based deterministic

Must validate new physics

- fault roughness
- near-fault plasticity
- frequency-dependent attenuation
- Topography
- small-scale near-surface heterogeneity
- multi-surface nonlinearity

CyberShake 1 Hz

- empirical stochastic

Snapshot from linear (left) and nonlinear (right) simulations using AWP-ODC showing wave propagation during a magnitude 7.7 SAF earthquake, Roten, SC’16

**AWP-ODC**

- Started as personal research code (Olsen 1994)
- 3D velocity-stress wave equations
  \[
  \partial_t v = \frac{1}{\rho} \nabla \cdot \sigma, \quad \partial_t \sigma = \lambda (\nabla \cdot v) I + \mu (\nabla v + \nabla v^T)
  \]
  solved by explicit staggered-grid 4th-order FD
- Memory variable formulation of inelastic relaxation
  \[
  \sigma(t) = M_u \left[ \varepsilon(t) - \sum_{i=1}^{N} \zeta_i(t) \right], \qquad \tau_i \frac{d\zeta_i(t)}{dt} + \zeta_i(t) = \frac{\lambda_i}{M_u} \delta M \, \varepsilon(t)
  \]
  using coarse-grained representation (Day 1998)
- Dynamic rupture by the staggered-grid split-node (SGSN) method (Dalguer and Day 2007)
  - Displacement nodes split at fault surface: explicitly discontinuous displacement & velocity
  - All interactions between sides occur through traction vector at displacement node
- Absorbing boundary conditions by perfectly matched layers (PML) (Marcinkovich and Olsen 2003) or the method of Cerjan et al. (1985)
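
To make the staggered-grid update concrete, here is a minimal 1D sketch in Python. It is not AWP-ODC code: it assumes a homogeneous medium, a single stress component, no attenuation, and crude rigid ends instead of absorbing boundaries. The names `staggered_diff`, `step` and `run_pulse` are hypothetical. The weights 9/8 and -1/24 are the standard fourth-order staggered-grid coefficients.

```python
import math

# Standard 4th-order staggered-grid finite-difference weights.
C1, C2 = 9.0 / 8.0, -1.0 / 24.0


def staggered_diff(f, i, h):
    """Approximate df/dx at the midpoint between samples f[i] and f[i+1]."""
    return (C1 * (f[i + 1] - f[i]) + C2 * (f[i + 2] - f[i - 1])) / h


def step(v, s, rho, mu, h, dt):
    """Advance velocity v (at nodes) and stress s (at half nodes) one time step.

    v[i] sits at x = i*h and s[i] at x = (i + 0.5)*h. Samples near the ends
    are never updated, which acts as a crude rigid boundary (an assumption
    of this sketch, not an absorbing boundary).
    """
    n = len(v)
    for i in range(2, n - 1):          # dv/dt = (1/rho) ds/dx
        v[i] += dt / rho * staggered_diff(s, i - 1, h)
    for i in range(1, n - 2):          # ds/dt = mu dv/dx, using the updated v
        s[i] += dt * mu * staggered_diff(v, i, h)


def run_pulse(n=200, steps=100):
    """Propagate a Gaussian velocity pulse and return the final velocity."""
    rho, mu, h = 1.0, 1.0, 1.0
    dt = 0.5 * h / math.sqrt(mu / rho)  # Courant number 0.5, below the 6/7 limit
    v = [math.exp(-((i - n / 2) / 5.0) ** 2) for i in range(n)]
    s = [0.0] * n
    for _ in range(steps):
        step(v, s, rho, mu, h, dt)
    return v
```

Updating velocity first and then stress from the new velocity is the usual leapfrog ordering for this kind of scheme; the stencil reaches two samples to each side, which is why two ghost planes are exchanged in the parallel code.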
The Earthquake System Science Challenges at Extreme-Scale

Evolution of AWP-ODC

AWP-ODC simulations have been allocated roughly 200-300M core-hours annually in recent years, supported by the DOE INCITE/ALCC and NSF LRAC programs.
Linear Earthquake Simulation, 2010

- 881,475 subfaults, 250s of rupture
- 436 billion grid points representing SCEC Community Velocity Model V4 of dimension 810x405x85 km (spatial resolution of 40 m)
- Minimum shear-wave velocity of 400 m/s
- 368 s of ground motions (160,000 time steps of 0.0023 s) representing seismic frequencies up to 2 Hz
- Wave propagation simulation performed on Jaguar, 24 hours using 223,074 cores (220 Tflop/s sustained)
- Magnitude 8.0 wall-to-wall scenario, worst-case for southern San Andreas Fault
- Fault length: 545 km, minimum wavelength: 200 m, NW→SE rupture propagation
- Dynamic rupture simulation performed on Kraken, 7.5 hours using 2160 cores
- DK Panda team was part of the M8 effort

(Cui et al., SC’10, Gordon Bell finalist)
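
The figures on this slide are mutually consistent, which a few lines of arithmetic can confirm. The sketch below uses only numbers quoted above; the variable names are illustrative, and "points per wavelength" is a derived quantity that the slide does not state.

```python
# Domain size and grid spacing quoted on the slide.
lx_km, ly_km, lz_km = 810, 405, 85
dx_m = 40

# Number of grid points along each axis, then in total.
nx, ny, nz = (length_km * 1000 // dx_m for length_km in (lx_km, ly_km, lz_km))
grid_points = nx * ny * nz

# Simulated duration from the time step count and step size.
steps, dt_s = 160_000, 0.0023
duration_s = steps * dt_s

# Shortest wavelength from minimum shear-wave velocity and maximum frequency.
vs_min_m_s, f_max_hz = 400.0, 2.0
min_wavelength_m = vs_min_m_s / f_max_hz
points_per_wavelength = min_wavelength_m / dx_m

print(f"grid: {nx} x {ny} x {nz} = {grid_points / 1e9:.0f} billion points")
print(f"duration: {duration_s:.0f} s")
print(f"min wavelength: {min_wavelength_m:.0f} m, "
      f"{points_per_wavelength:.0f} points per wavelength")
```

The result reproduces the 436 billion grid points, 368 s of ground motion and 200 m minimum wavelength listed on the slide.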
0-4 Hz Single-surface J2 Nonlinear ShakeOut Simulation, 2016

- First 4-Hz nonlinear M7.7 earthquake simulation on the southern San Andreas Fault
- Nonlinear dynamic rupture simulation was conducted using 24,000 CPU-cores on Blue Waters, running 37 hrs
- Nonlinear wave propagation simulation was conducted using 4,200 GPUs on Titan, running 12 hours
- The nonlinear code initially required 400% of the computing time of the linear code; optimized yield factor interpolation reduced this to 165%

(Video: https://www.youtube.com/watch?v=qOH0Oj3t6QM)

- Inside the Whittier Narrows corridor, spectral accelerations at 3 seconds (3s-SAs) are reduced from 1g in the linear case to 0.3-0.6g in the nonlinear case, depending on the choice of reference strain.
- Plastic simulations obtained with a single von Mises yield surface predict 3s-SAs that are higher than those obtained with the multi-surface Iwan model, but lower than the linear values.

(Roten et al., SC’16)

(Roten et al., 2016)
0-4 Hz Multi-surface Iwan Nonlinear ShakeOut Simulation, 2023

- A multi-surface Iwan-type plasticity model in AWP-CPU, verified against established codes for 1D and 2D SH-wave benchmarks, has been applied to predict the impact of realistic soil nonlinearity on long-period surface waves during large earthquakes on the southern San Andreas fault.

- While ShakeOut simulations with a single yield surface reduce long-period ground motion amplitudes by ~25% inside a waveguide in greater LA, Iwan nonlinearity further reduces the values by a factor of two.

- The Iwan model is 20-30x more expensive computationally and uses 5-13x more memory than the linear solution.

- Ran for 22.5 hrs on 7,680 TACC Frontera nodes.

Linear

Iwan (Darendeli)

Max. shear modulus reduction at the surface

(Roten et al., BSSA, 2023, accepted)
The ShakeOut Scenario
M7.8 Earthquake on Southern San Andreas Fault

Scenario Results
- M7.8 mainshock
  - Broadband ground motion simulation (0-10 Hz)
- Large aftershocks
  - M7.2, M7.0, M6.0, M5.7...
- 10,000-100,000 landslides
- 1,600 fire ignitions
- $213 billion in direct economic losses
  - 300,000 buildings significantly damaged
  - Widespread infrastructure damage
  - 270,000 displaced persons
  - 50,000 injuries
  - 1,800 deaths
- Long recovery time

Exercise Results
- Largest emergency response exercise in US history, 45M people worldwide participating in 2022
  - Golden Guardian exercise
  - Public events involving multi-million registered participants
- Demonstrated that existing disaster plans are inadequate for an event of this scale
  - Motivated reformulation of system preparedness and emergency response
- Scientific basis for the LA Seismic Safety Task Force report, Resilience by Design

Great Southern California ShakeOut
November 13, 2008

Waveguide amplification in LA Basin
- Caused by string of contiguous sedimentary basins (Olsen et al, 2006, 2009)
- ShakeOut scenario predicts strong long-period ground motions in Los Angeles region
- Hazard to pre-Northridge high-rise buildings
- All these approaches assume a linear stress-strain relationship in the fault damage zone and shallow sediments
- Simulations with DP-plasticity predict 30-70% lower ground motions than linear solutions (Roten et al., 2014, 2017)
AWP-ODC I/O features have been converted to a separate library called SEISM-IO, supported by NSF SI2 program

(Cui et al., SC’10)
Porting to GPUs – 2012

- Two-layer 3D domain decomposition on CPU-GPU based heterogeneous supercomputers
  - first step X&Y decomposition for CPUs
  - second step Y&Z decomposition for GPU SMs

(Zhou et al., ICCS 2012, Cui et al., SC’13)
Single-GPU Optimizations - 2012

✔ **Step 2: GPU 2D Decomposition in y/z vs x/y**
✔ **Step 3: Global memory Optimization**
  Global memory coalesced, texture memory for six 3D constant variables, constant memory for scalar constants
✔ **Step 4: Register Optimization**
  Pipelined register copy to reduce memory access
✔ **Step 5: L1/L2 cache vs shared memory**
  Rely on L1/L2 cache rather than on-chip shared memory

(Zhou, J et al., ICCS 2012)
Computing and Communication Overlapping on GPUs - 2013

(Zhou et al. ICCS 2013, Cui et al. SC’13)
Porting DP-Plasticity on GPUs – 2016

Communication / computation schedule

Nonlinear case

Linear case

(Roten et al., SC’16)
**Porting Iwan Model on CPUs and GPUs – 2021**

- **Computational challenges:**
  - Computationally expensive: separate stress and plasticity update required for each yield surface.
  - Memory requirements: each yield surface requires a separate copy of stress tensor $\tau_{xx}$, $\tau_{yy}$, $\tau_{zz}$, $\tau_{xz}$, $\tau_{yz}$, $\tau_{xy}$, Lamé parameters $\mu$, $\lambda$, and yield factor $r$.
  - MPI communication overhead: stress tensor and yield factor of each yield surface need to be swapped during each time step (reduced scalability).
  - Shear modulus reduction reduces max. resolvable frequency.
  - 10-20x more expensive than our 2016 nonlinear simulation, which used a simple J2 nonlinear material model, or 20-30x more expensive than the linear solution.
  - Memory increases by a factor of $(1 + 0.4 \times Nspr)$ relative to the linear simulation ($Nspr$ = number of yield surfaces).

- **Iwan Concept**
  - Hysteretic yielding behavior of the material is represented by a collection of perfectly elasto-plastic spring-slider elements. Each element has different constants, shares the same strain, and carries a fraction of the stress. The model is generalized to 3D using a collection of Drucker-Prager yield surfaces.
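
The memory estimate above is easy to evaluate for a given number of yield surfaces. In the sketch below, the function name `iwan_memory_factor` and the sample values of `nspr` are illustrative assumptions; the slide does not state which values were used in production. With 10 and 30 surfaces the formula gives factors of about 5 and 13, which matches the 5-13x memory range quoted elsewhere in this talk.

```python
def iwan_memory_factor(nspr):
    """Memory use relative to the linear code, per the slide's 1 + 0.4 * Nspr."""
    return 1.0 + 0.4 * nspr


# Sample yield-surface counts (assumed for illustration only).
for nspr in (10, 20, 30):
    print(f"Nspr = {nspr:2d}: about {iwan_memory_factor(nspr):.0f}x linear memory")
```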
Porting to Intel Xeon Phi – 2017

- Stencil generation and vector folding through YASK tool: [https://github.com/01org/yask](https://github.com/01org/yask)
- Hybrid placement of grids in DDR and MCDRAM
- Normalized cross architecture evaluation in Mega Lattice Updates per Second (MLUPS): Xeon Phi KNL 7290 achieves 2x speedup over NVIDIA K20X, 97% of NVIDIA Tesla P100 performance
- Performance on 9,000 nodes of Cori-II equivalent to performance of over 20,000 K20X GPUs at 100% scaling
- Memory bandwidth accurately predicts performance of architectures (as measured by STREAM and HPCG-SpMv)

(Tobin et al., ISC’17)
Porting Discontinuous Mesh on GPUs – 2018

Kim Olsen, SDSU

• Let the interpolation be expressed as: \( u = W \ast U \), where \( U \) is the field value on the coarse grid, \( u \) is the missing point on the fine grid and \( W \) is the interpolation operator matrix

• Corresponding downsampling method: \( U' = W^T \ast u' \), where \( u' \) is the field in the fine grid region, \( U' \) is located in the coarse grid, and we set \( W^T \) as downsampling matrix

• Significant performance improvement with respect to a uniform grid solution
  ❖ A factor of 4 achieved for simulating the M9 megathrust earthquake in Cascadia, 650x1000x60 km\(^3\), 100/300m mesh sizes
  ❖ A factor of 8 achieved for simulating the Mw 5.1 La Habra earthquake up to 4 Hz, using a grid spacing of 20 m in the fine grid and a minimum shear-wave velocity of 500 m/s

Using a DM with \( dx^{\text{fine}} = 100 \text{ m} \) in upper 1 km, \( dx^{\text{coarse}} = 300 \text{ m} \) in bottom 39 km, resulting in 0.28B grids or 72% reduction in grid points

(Nie et al., BSSA 2017, Roten et al., 2018)
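
The interpolation and downsampling operators can be illustrated with a small 1D sketch. It assumes plain linear interpolation and a refinement ratio of 3 (as in the 100/300 m mesh); the operator actually used in the discontinuous-mesh code may differ. The helper names `interp_matrix`, `matvec` and `transpose` are hypothetical, and here `W` also carries the fine points that coincide with coarse points.

```python
def interp_matrix(n_coarse, ratio):
    """Build W for 1D linear interpolation from a coarse to a fine grid."""
    n_fine = (n_coarse - 1) * ratio + 1
    W = [[0.0] * n_coarse for _ in range(n_fine)]
    for j in range(n_fine):
        k, r = divmod(j, ratio)
        if r == 0:
            W[j][k] = 1.0                      # fine point sits on a coarse point
        else:
            t = r / ratio                      # fractional position between k and k+1
            W[j][k], W[j][k + 1] = 1.0 - t, t
    return W


def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


W = interp_matrix(3, 3)
U = [0.0, 3.0, 6.0]                 # coarse-grid field
u = matvec(W, U)                    # u = W * U on the fine grid

u_fine = [1.0] * len(u)             # a constant fine-grid field
U_down = matvec(transpose(W), u_fine)   # U' = W^T * u'
```

In this sketch the transpose accumulates fine-grid contributions rather than averaging them, so a constant fine field maps to the column sums of `W`. Whether and how the result is scaled is not specified on the slide.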
Topography has been added to AWP-ODC in GPU code, a separate version using curvilinear grids.

Comparable accuracy to the code on a Cartesian grid, with negligible extra memory requirements, longer simulation times due to small timesteps for complex topography.

Perfectly recover a forward simulation using reciprocity – a key result needed for CyberShake-related work.

94% weak scaling efficiency tested up to 1024 GPUs.

Future plan is to let this curvilinear grid rest on top of layers of Cartesian grids that extend downward with depth.

Figure 1: (a) Curvilinear grid, used for discretizing topography, overlaying cartesian grids with decreasing grid resolution with depth. (b) Arrangement of velocity and stresses in a curvilinear staggered grid.

(O’Reilly et al., BSSA 2022)
Porting to Microsoft Azure – 2022
Co-PI: Hari Subramoni

Challenges
• Digesting the wide breadth of options and configurations
• Higher threshold of initial setup needed
• Lack of comprehensive forums for debugging errors

Benefits
• Wide flexibility and options of hardware and software allows infrastructure to be tailored to specific workload
• Spin up large VM instances instantly without waiting in a queue/system quotas
• We demonstrated AWP performance with a ground motion simulation benchmark on various GPU-based cloud instances, and compared the cloud solution to on-premises bare-metal systems.

Microsoft Internet2/Azure Accelerator for Research Fall 2022 program, $7k credits awarded through Cloudbank
Future plan is to compare performance with MVAPICH2-AZURE

Accelerating Earthquake Simulation on Microsoft Azure

<table>
<thead>
<tr>
<th>Specs</th>
<th>Azure (NC24rs)</th>
<th>Expanse</th>
<th>Summit</th>
</tr>
</thead>
<tbody>
<tr>
<td>GPUs/Node</td>
<td>4 x V100</td>
<td>4 x V100</td>
<td>6 x V100</td>
</tr>
<tr>
<td>CPU</td>
<td>Xeon E5-2690 v4</td>
<td>Xeon Gold 6248</td>
<td>IBM Power 9</td>
</tr>
<tr>
<td>Memory/Node (GB)</td>
<td>480</td>
<td>384</td>
<td>512</td>
</tr>
<tr>
<td>Compiler / MPI</td>
<td>OpenMPI</td>
<td>OpenMPI</td>
<td>IBM XL Compiler</td>
</tr>
<tr>
<td>File System</td>
<td>NFS</td>
<td>Lustre</td>
<td>GPFS</td>
</tr>
<tr>
<td>InfiniBand (Gbps)</td>
<td>FDR(56)</td>
<td>HDR(200)</td>
<td>EDR(100)</td>
</tr>
</tbody>
</table>

(Palla, SCEC'23)
Porting CUDA Linear Code to HIP – 2023

AWP-ODC Weak Scaling on DOE and NSF LCCFs (Linear version vs nonlinear versions)

(Cui et al., SCEC’23)
Accelerate AWP-ODC Performance with MVAPICH
Communication Enhancement on CPUs - 2010

- **Performance challenge**
  - Large variation in communication latencies among neighbors
  - System/user memory overhead

- **Scalability challenge**
  - Increased latency for larger simulation

(Cui et al., SC’10)

- Asynchronous communication
- Rank re-placement
- Message pre-posting without data reorders
- Computation and communication overlapping, 2-sided and 1-sided
Communication Enhancement on CPUs - 2010

• Asynchronous communication
  – Significantly reduced latency through local communication
  – Reduced system buffer requirement through pre-post receives

• Computation/communication overlap
  – Effectively hide communication times
  – Effective when $T_{compute\_hide} > T_{compute\_overhead}$
  – MPI-1 non-blocking two-sided Communications

Velocity Exchange
  s2n(u1, north-mpirank, south-mpirank)
  ! recv from south, send to north
  n2s(u1, south-mpirank, north-mpirank)
  ! send to south, recv from north
  ... repeat for east-west, up-down directions
  ... repeat for other velocity components v1, w1
  wait_onedirection()
  s2nfill(u1, recvbuffer, south-mpirank)
  n2sfill(u1, recvbuffer, north-mpirank)
  ... repeat for east-west, up-down directions
  ... repeat for other velocity components v1, w1

S2N
  Copy 2 planes of data from variable to sendbuffer
  ! copy north boundary excluding ghost cells
  MPI_send(sendbuffer, north-mpirank)
  MPI_recv(recvbuffer, south-mpirank)

WAIT_ONEDIRECTION
  MPI_Waitall(list of receive requests)

S2NFILL
  Copy 2 planes of data from recvbuffer to variable
  ! copy to south ghost cells

(Potluri, S., et al., ICS’10)
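
The pack, send and fill pattern of the velocity exchange can be emulated serially to show which planes move where. The sketch below is a 1D Python illustration with no MPI calls: list copies stand in for the messages, rank r+1 is assumed to be the northern neighbor of rank r, and the names `make_subdomains` and `exchange` are hypothetical.

```python
GHOST = 2  # two ghost planes per side, as in the pseudocode above


def make_subdomains(global_field, nranks):
    """Split a 1D field into equal chunks and pad each with ghost cells."""
    n = len(global_field) // nranks
    return [
        [0.0] * GHOST + list(global_field[r * n:(r + 1) * n]) + [0.0] * GHOST
        for r in range(nranks)
    ]


def exchange(subs):
    """Emulate s2n/n2s followed by s2nfill/n2sfill for all ranks."""
    # Pack phase: copy boundary planes, excluding ghost cells, into buffers.
    to_north = [s[-2 * GHOST:-GHOST] for s in subs]
    to_south = [s[GHOST:2 * GHOST] for s in subs]
    # Fill phase: copy received buffers into the ghost cells.
    for r, s in enumerate(subs):
        if r > 0:
            s[:GHOST] = to_north[r - 1]      # data from the south neighbor
        if r < len(subs) - 1:
            s[-GHOST:] = to_south[r + 1]     # data from the north neighbor


subs = make_subdomains([float(i) for i in range(12)], 3)
exchange(subs)
print(subs[1])  # interior rank now sees its neighbors' boundary values
```

Separating the pack phase from the fill phase mirrors the split between posting the transfers and filling after the wait, which is what leaves room for computation in between.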
Communication Enhancement on CPUs - 2010

Sreeram Potluri of DK Panda Team

• Computation/communication overlap
  – MPI-2 one-sided Communications (on Ranger)

```
MPI_Win_post(group, 0, window) ! pre-posting the window to all neighbors

Main Loop in AWM-Olsen
  Compute velocity component u
  Start exchanging velocity component u
  Compute velocity component v
  Start exchanging velocity component v
  Compute velocity component w
  Start exchanging velocity component w
  Complete Exchanges of u,v and w
  MPI_Win_post(group, 0, window)
  ! For the next iteration

Start exchange
  MPI_Win_start(group, 0, window)
  s2n(u1, north-mpirank, south-mpirank)
  ! recv from south, send to north
  n2s(u1, south-mpirank, north-mpirank)
  ! send to south, recv from north
  ... repeat for east-west and up-down

Complete exchange
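  MPI_Win_complete(window)   ! end this rank's access epoch (its MPI_Put calls)
  MPI_Win_wait(window)       ! then wait for neighbors' puts into the local window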
  s2nfill(u1, window buffer, south-mpirank)
  n2sfill(u1, window buffer, north-mpirank)
  ... repeat for east-west and up-down

S2N
  Copy 2 planes of data from variable to sendbuffer
  ! copy north boundary excluding ghost cells
  MPI_Put(sendbuffer, north-mpirank)

S2NFILL
  Copy 2 planes of data from window buffer to variable
  ! copy into south ghost cells
```

(Potluri, S., et al., ICS’10)
Iwan Code Performance on TACC Frontera

module load intel/18.0.5 mvapich2-x/2.3
export MV2_USE_MCAST=0
export MV2_USE_RDMA_CM_MCAST=0
export MV2_SMP_EAGERSIZE=28673
export MV2_SMP_NUM_SEND_BUFFER=8192

module load intel/18.0.5 mvapich2-x/3.0a2
export MV2_USE_MCAST=0
export MV2_USE_RDMA_CM_MCAST=0
export MV2_SMP_EAGERSIZE=28673
export MV2_SMP_NUM_SEND_BUFFER=8192

Figure: AWP-CPU-Iwan Weak Scaling on TACC Frontera (legend: Intel MPI 18.0.5, MPICH-3.0a2, MVAPICH-2.3, Intel Parallel Eff.)
CUDA-aware Support Enhances AWP-ODC Performance

- MVAPICH2 improves performance by 20% over OpenMPI on Expanse, connected via NVLink
- MVAPICH2 improves performance by 20% over IMPI on Lonestar-6, connected via HDR
- CUDA-aware code gains an additional 14% with MVAPICH2/2.3.7-gdr over MVAPICH2

<table>
<thead>
<tr>
<th>Expanse A100s</th>
<th>Teraflop/s</th>
<th>Time (sec/step)</th>
</tr>
</thead>
<tbody>
<tr>
<td>gcc10.2.0+openmpi4.1.3 (2x2)</td>
<td>2.22</td>
<td>0.0294</td>
</tr>
<tr>
<td>nvhpc21.9 (openmpi4.1.1) (2x2)</td>
<td>2.21</td>
<td>0.0295</td>
</tr>
<tr>
<td>intel19.0.5+mvapich2/2.3.4 (2x2)</td>
<td>2.70</td>
<td>0.0243</td>
</tr>
<tr>
<td>intel19.0.5+mvapich2/2.3.7 (4x2)</td>
<td>3.55</td>
<td>0.0370</td>
</tr>
<tr>
<td>intel19.0.5+mvapich2/2.3.7-gdr (4x2)</td>
<td>4.03</td>
<td>0.0326</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Lonestar 6 A100s</th>
<th>Teraflop/s</th>
<th>Time (sec/step)</th>
</tr>
</thead>
<tbody>
<tr>
<td>gcc11.2.0+impi19.0.9 (2x3)</td>
<td>1.68</td>
<td>0.0585</td>
</tr>
<tr>
<td>gcc11.2.0+mvapich2/2.3.7 (2x3)</td>
<td>2.03</td>
<td>0.0488</td>
</tr>
<tr>
<td>gcc11.2.0+mvapich2/2.3.7 gdr (2x3)</td>
<td>2.30</td>
<td>0.0399</td>
</tr>
<tr>
<td>gcc11.2.0+mvapich2/latest gdr (2x3)</td>
<td>3.15</td>
<td>0.0311</td>
</tr>
</tbody>
</table>
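
The percentage gains quoted on this slide can be recomputed from the Teraflop/s columns of the two tables. The sketch below does so for rows that share the same process layout; the helper name `gain_percent` and the dictionary keys are illustrative.

```python
def gain_percent(new, base):
    """Relative throughput gain of `new` over `base`, in percent."""
    return (new / base - 1.0) * 100.0


# Teraflop/s values copied from the tables above.
expanse = {"openmpi_2x2": 2.22, "mvapich2_2x2": 2.70,
           "mvapich2_4x2": 3.55, "mvapich2_gdr_4x2": 4.03}
lonestar6 = {"impi_2x3": 1.68, "mvapich2_2x3": 2.03, "mvapich2_gdr_2x3": 2.30}

gains = {
    "Expanse: MVAPICH2 vs OpenMPI": gain_percent(expanse["mvapich2_2x2"], expanse["openmpi_2x2"]),
    "Expanse: GDR vs MVAPICH2": gain_percent(expanse["mvapich2_gdr_4x2"], expanse["mvapich2_4x2"]),
    "Lonestar-6: MVAPICH2 vs IMPI": gain_percent(lonestar6["mvapich2_2x3"], lonestar6["impi_2x3"]),
    "Lonestar-6: GDR vs MVAPICH2": gain_percent(lonestar6["mvapich2_gdr_2x3"], lonestar6["mvapich2_2x3"]),
}
for label, value in gains.items():
    print(f"{label}: {value:.1f}%")
```

The recomputed values land close to the rounded 20% and 14% figures in the bullets.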
On-the-fly Compression on GPUs – 2021
Qinghua Zhou of DK Panda team, IPDPS’21 Best Paper finalist

- **Motivation**
  - AWP-ODC spends significant time in communication at large scale
  - Disparity between intra-node and inter-node GPU communication bandwidths prevents efficient scaling

- **Implementation**
  - Designed on-the-fly message compression schemes in MVAPICH2-GDR
  - Accelerated point-to-point communication performance of transferring large GPU-to-GPU data
  - Compression algorithms for floating-point data, integrated into MVAPICH2-GDR
    - MPC: Lossless, high throughput
    - ZFP: Lossy, high throughput
  - Weak scaling of AWP-ODC on V100 nodes with IB EDR
    - MPC-OPT achieved +18% flops, or -15% runtime
    - ZFP-OPT achieved +35% flops, or -26% runtime

(Q. Zhou et al., IPDPS’21)
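
The paired flops and runtime figures for MPC-OPT and ZFP-OPT follow from each other if the amount of work per run is fixed, so that runtime scales as the inverse of throughput. That fixed-work assumption and the function name `runtime_change_percent` are introduced here for illustration only.

```python
def runtime_change_percent(flops_gain_percent):
    """Runtime change implied by a throughput gain, assuming fixed total work."""
    return (1.0 / (1.0 + flops_gain_percent / 100.0) - 1.0) * 100.0


for name, gain in (("MPC-OPT", 18.0), ("ZFP-OPT", 35.0)):
    print(f"{name}: +{gain:.0f}% flops -> {runtime_change_percent(gain):.0f}% runtime")
```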
Performance Evaluation on Lonestar-6

- 48%-64% benefit from on-the-fly MPC compression over GDR alone
- Combined MVAPICH2-GDR enhancement over IMPI, including both CUDA-aware support and on-the-fly compression, improves application performance by 125%, 97%, 137% and 154% on 2, 4, 8 and 16 nodes, respectively
AWP-ODC Software Engineering Challenges and Opportunities
Challenges for United and Continued Software Development

- Inexact computing is required for reducing energy consumption
- The application can tolerate some loss of accuracy, e.g. through discontinuous mesh, error tolerance and precision reduction
- AWP-ODC is highly efficient for regional earthquake simulation and physics-based seismic hazard analysis

AWP-ODC power consumption study on Perlmutter (Govind, 2023)
Summary and Outlook

- AWP-ODC is accelerated with enhanced MVAPICH library on both CPU and GPU architectures
- We see a 154% benefit over IMPI from MVAPICH2-GDR with CUDA-aware support and on-the-fly compression for AWP-ODC on 16 Lonestar6 A100 nodes. The future plan is to apply these benefits to the Iwan and CyberShake SGT codes
- The Iwan model introduces 20-30x more computation and 5-13x more memory consumption when compared to linear solution, a major challenge for software engineering
- A joint project with NOWLAB will address load-aware design for MPI asynchronous communication, application-aware neighborhood collective communication, and partitioned point-to-point primitives for efficient communication and cross runtime coordination for MPI+X
- Ongoing NSF CSA project is preparing AWP-ODC for NSF next generation LCCF Horizon to be deployed at TACC – with a hybrid approach using CPUs for dynamic rupture simulation, and GPUs for Iwan-DM wave propagation simulation
- 3D ground motion at 8 Hz or higher is required to realistically capture the full dynamics of a potential Big One on the San Andreas fault
Acknowledgments

Daniel Roten
Akash Palla
Anish Govind
Philip Maechling
Scott Callaghan
Kim Olsen
Lars Koesterke
DK Panda
Hari Subramoni
Sreeram Potluri
Qinghua Zhou

Computing Allocation
OLCF DD, TACC LSCP and CSA, ACCESS Delta, SDSC Expanse, AMD AAC, DOE INCITE & ALCC

Funding
NSF LCCF/CSA, NSF CSSI, NSF/USGS SCEC Core, SDSC Core
Lonestar-6 Network

<table>
<thead>
<tr>
<th></th>
<th>GPU0</th>
<th>GPU1</th>
<th>GPU2</th>
<th>NIC0</th>
<th>CPU Affinity</th>
<th>NUMA Affinity</th>
</tr>
</thead>
<tbody>
<tr>
<td>GPU0</td>
<td>X</td>
<td>SYS</td>
<td>SYS</td>
<td>NODE</td>
<td>0-63</td>
<td>0</td>
</tr>
<tr>
<td>GPU1</td>
<td>SYS</td>
<td>X</td>
<td>NODE</td>
<td>SYS</td>
<td>64-127</td>
<td>1</td>
</tr>
<tr>
<td>GPU2</td>
<td>SYS</td>
<td>NODE</td>
<td>X</td>
<td>SYS</td>
<td>64-127</td>
<td>1</td>
</tr>
<tr>
<td>NIC0</td>
<td>NODE</td>
<td>SYS</td>
<td>SYS</td>
<td>X</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

**Legend:**

- **X** = Self
- **SYS** = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
- **NODE** = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
- **PHB** = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
- **PXB** = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
- **PIX** = Connection traversing at most a single PCIe bridge
- **NV#** = Connection traversing a bonded set of # NVLinks

**NIC Legend:**

- **NIC0:** mlx5_0
Expanse Network

```
[yfcui@exp-16-57 ~]$ nvidia-smi topo -m

<table>
<thead>
<tr>
<th></th>
<th>GPU0</th>
<th>GPU1</th>
<th>GPU2</th>
<th>GPU3</th>
<th>NIC0</th>
<th>NIC1</th>
<th>NIC2</th>
<th>CPU Affinity</th>
<th>NUMA Affinity</th>
</tr>
</thead>
<tbody>
<tr>
<td>GPU0</td>
<td>X</td>
<td>NV12</td>
<td>SYS</td>
<td>SYS</td>
<td>NODE</td>
<td>NODE</td>
<td>SYS</td>
<td>0-15</td>
<td>0</td>
</tr>
<tr>
<td>GPU1</td>
<td>NV12</td>
<td>X</td>
<td>SYS</td>
<td>SYS</td>
<td>SYS</td>
<td>SYS</td>
<td>NODE</td>
<td>16-31</td>
<td>1</td>
</tr>
<tr>
<td>GPU2</td>
<td>SYS</td>
<td>SYS</td>
<td>X</td>
<td>NV12</td>
<td>SYS</td>
<td>SYS</td>
<td>SYS</td>
<td>48-63</td>
<td>3</td>
</tr>
<tr>
<td>GPU3</td>
<td>SYS</td>
<td>SYS</td>
<td>NV12</td>
<td>X</td>
<td>SYS</td>
<td>SYS</td>
<td>SYS</td>
<td>48-63</td>
<td>3</td>
</tr>
<tr>
<td>NIC0</td>
<td>NODE</td>
<td>SYS</td>
<td>SYS</td>
<td>SYS</td>
<td>X</td>
<td>PIX</td>
<td>SYS</td>
<td></td>
<td></td>
</tr>
<tr>
<td>NIC1</td>
<td>NODE</td>
<td>SYS</td>
<td>SYS</td>
<td>SYS</td>
<td>PIX</td>
<td>X</td>
<td>SYS</td>
<td></td>
<td></td>
</tr>
<tr>
<td>NIC2</td>
<td>SYS</td>
<td>NODE</td>
<td>SYS</td>
<td>SYS</td>
<td>SYS</td>
<td>SYS</td>
<td>X</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Legend:

X      = Self
SYS    = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE   = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB    = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB    = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX    = Connection traversing at most a single PCIe bridge
NV#    = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
```