#### MVAPICH2 with Dual Rail 3D Torus Support on SDSC Data Intensive Gordon Cluster

Mahidhar Tatineni (mahidhar@sdsc.edu)

Amit Majumdar (majumdar@sdsc.edu)

MVAPICH2 User Group Meeting

August 26, 2013

Ref: Work by Pietro Cicotti, Robert Sinkovits, and Mahidhar Tatineni for Gordon acceptance benchmarking.





## Gordon – Data Intensive Supercomputer

- Designed to accelerate access to massive amounts of data in areas of genomics, earth science, engineering, medicine, and others
- Emphasizes memory and I/O over FLOPS
- Appro-integrated 1,024-node Sandy Bridge cluster
- 300 TB of high performance Intel flash
- Large memory supernodes via vSMP Foundation from ScaleMP
- 3D torus interconnect from Mellanox
- In production operation since February 2012
- Funded by the NSF and available through the NSF Extreme Science and Engineering Discovery Environment program (XSEDE)













# The Memory Hierarchy of a Typical Supercomputer







SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

# The Memory Hierarchy of Gordon








# **Gordon Design Highlights**






#### **Gordon Architecture: 3D Torus of Switches**



- Switches are connected in 4x4x4 3D torus
- Linearly expandable
- Short cables: fiber optic cables generally not required
- Lower cost: 40% as many switches, 25% to 50% fewer cables
- Works well for localized communication
- Fault tolerant within the mesh with 2QoS alternate routing
- Fault tolerant with dual rails for all routing algorithms
- Two rails, i.e., two complete tori with 64 switch nodes in each torus
- Maximum of 6 hops

Switches are interconnected by 3 links in each +/- x, y, z direction
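The 6-hop maximum follows from the 4x4x4 geometry: with wraparound links, the farthest switch in each dimension is 2 hops away. A minimal sketch, assuming shortest-path routing (the function names are illustrative and not part of any Gordon or MVAPICH2 tool):

```python
from itertools import product

DIMS = (4, 4, 4)  # switch torus dimensions quoted on the slide


def ring_distance(a: int, b: int, size: int) -> int:
    """Shortest hop count between two positions on a ring of `size` switches."""
    d = abs(a - b) % size
    return min(d, size - d)


def torus_hops(src, dst, dims=DIMS) -> int:
    """Minimal switch-to-switch hops, assuming shortest-path routing."""
    return sum(ring_distance(s, t, n) for s, t, n in zip(src, dst, dims))


# Enumerate every switch coordinate in one rail of the torus.
switches = list(product(*(range(n) for n in DIMS)))

# The torus is symmetric, so the farthest switch from the origin gives the diameter.
diameter = max(torus_hops((0, 0, 0), s) for s in switches)

print(len(switches))  # 64 switches per rail
print(diameter)       # 6
```
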





#### **Torus Node IB Networking**




# Data Oasis Heterogeneous Architecture Lustre-based Parallel File System







# Data Oasis Performance








# **MVAPICH2 on Gordon**

- MVAPICH2 [Current version is 1.9] is the default MPI implementation on Gordon.
- Compiled with --enable-3dtorus-support flag. Multi-rail support is also in place.
- LIMIC2 [Current version on system is 0.5.6]
- SSDs on Gordon are located in the I/O nodes and exported to the compute nodes via iSER. Rail 1 (mlx4\_1) carries this traffic.
- I/O nodes also serve as Lustre routers, so Lustre I/O traffic also goes over rail 1 (mlx4\_1).
- Because I/O traffic to both Lustre and the SSDs (local scratch) can saturate rail 1, the default recommendation is to run MVAPICH2 with one rail [MV2\_IBA\_HCA=mlx4\_0, MV2\_NUM\_HCAS=1].
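One way to apply the single-rail recommendation from a job wrapper is to set the two variables in the environment passed to the MPI launcher. A minimal sketch: the variable names and values come from the slide, while the helper name `single_rail_env` is hypothetical.

```python
import os


def single_rail_env(base=None, hca="mlx4_0"):
    """Return a copy of the environment that restricts MVAPICH2 to one HCA.

    `base` defaults to the current process environment; rail 0 (mlx4_0) is
    the default because rail 1 carries the I/O traffic.
    """
    env = dict(os.environ if base is None else base)
    env["MV2_IBA_HCA"] = hca
    env["MV2_NUM_HCAS"] = "1"
    return env


# Starting from an empty base makes the added settings easy to inspect.
print(single_rail_env(base={}))
```

The resulting dictionary can be passed as `env=` to `subprocess.run` together with whatever launcher command line the site uses; that command line is site-specific and is not shown here.
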





#### InfiniBand Bandwidth Performance

Half-duplex I/O bandwidth for each of a compute node's QDR InfiniBand channels.

|           | IB half-duplex speed (MB/s) |                 |
|-----------|-----------------------------|-----------------|
|           | Rail 0 (on-board)           | Rail 1 (add-on) |
| Min       | 3,830                       | 3,250           |
| Max       | 3,883                       | 3,380           |
| Avg.      | 3,867                       | 3,376           |
| Std. Dev. | 7.971                       | 9.508           |

Full-duplex bandwidth between compute nodes using a single QDR InfiniBand channel.

|           | IB full-duplex speed (MB/s) |                 |
|-----------|-----------------------------|-----------------|
|           | Rail 0 (on-board)           | Rail 1 (add-on) |
| Min       | 6,613                       | 5,746           |
| Max       | 7,515                       | 6,457           |
| Avg.      | 7,505                       | 6,388           |
| Std. Dev. | 35.19                       | 71.47           |

The add-on IB bandwidth performance is limited by the PCIe riser card design, which is based on the Gen2 spec.



#### Latency Performance

- The average (ping-pong/2) latency between pairs of compute nodes (total of 1024) was measured with the Intel ping-pong benchmark.
- This includes the software latency, driver latency, HCA firmware latency, and up to a maximum of five switches.
- The test was run from nodes on all switches to four random nodes throughout the torus ensuring that the maximum number of switch hops was included.

|           | Latency (µs) |        |
|-----------|--------------|--------|
|           | Rail 0       | Rail 1 |
| Min       | 1.03         | 1.67   |
| Max       | 1.85         | 2.57   |
| Avg.      | 1.44         | 2.16   |
| Std. Dev. | 0.168        | 0.177  |





#### Full-duplex Any-Any Compute Node IB Bandwidth

- Full-duplex bandwidth between Compute nodes using two QDR InfiniBand channels operating in parallel.
- No switch congestion, since only two nodes run the test at any given time.

| testNode#1:testNode#2 | IB rate, MB/s |
|-----------------------|---------------|
| gcn-19-22:gcn3-47     | 10,251        |
| gcn-19-64:gcn-5-54    | 10,400        |
| gcn-2-71:gcn-4-21     | 10,424        |
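For context, the dual-rail rates can be compared with the sum of the single-rail full-duplex averages reported earlier (7,505 MB/s on rail 0 and 6,388 MB/s on rail 1). A rough sketch of that arithmetic; treating the simple sum as the ideal is an assumption made here, not a claim from the slides:

```python
# Single-rail full-duplex averages (MB/s) from the InfiniBand bandwidth slide.
single_rail_avg = {"rail0": 7505, "rail1": 6388}

# Dual-rail full-duplex rates (MB/s) for the three node pairs in the table above.
dual_rail_measured = [10251, 10400, 10424]

# Assumption: the ideal dual-rail rate is the plain sum of the two rails.
ideal = sum(single_rail_avg.values())
mean_measured = sum(dual_rail_measured) / len(dual_rail_measured)
fraction = mean_measured / ideal

print(f"ideal {ideal} MB/s, measured mean {mean_measured:.0f} MB/s, "
      f"fraction of ideal {fraction:.1%}")
```
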





#### Inter-switch Link Performance

- Performance of inter-switch links was measured.
- With 3 links between each pair of switches and 2 rails, the expected half-duplex performance is approximately three times the sum of the rail 0 and rail 1 half-duplex rates: 3 × (3.4 + 3.8) GB/s ≈ 21.6 GB/s.

|           | IB rate, MB/s |
|-----------|---------------|
| Min       | 19,224        |
| Max       | 20,737        |
| Avg.      | 20,456        |
| Std. Dev. | 160.7         |

Table: Aggregated measured bandwidth of inter-switch links.
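The expected aggregate rate and the measured average can be compared directly. A small sketch of the arithmetic, assuming 1 GB/s = 1,000 MB/s for the unit conversion:

```python
LINKS_PER_SWITCH_PAIR = 3

# Approximate per-rail half-duplex rates (GB/s) used in the estimate.
rail_half_duplex_gbs = {"rail0": 3.8, "rail1": 3.4}

# Expected aggregate: three links, each carrying both rails.
expected_gbs = LINKS_PER_SWITCH_PAIR * sum(rail_half_duplex_gbs.values())

# Measured average from the table, converted from MB/s (assumes 1 GB = 1000 MB).
measured_avg_gbs = 20456 / 1000

efficiency = measured_avg_gbs / expected_gbs

print(f"expected {expected_gbs:.1f} GB/s, measured {measured_avg_gbs:.1f} GB/s, "
      f"ratio {efficiency:.1%}")
```
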





#### Dual Rail QDR vs FDR OSU Bandwidth Test

- MVAPICH2 out of the box, without any tuning



\*Tests done by Glenn Lockwood (SDSC)






#### Dual Rail QDR vs FDR OSU Bandwidth Test

- MV2\_RAIL\_SHARING\_LARGE\_MSG\_THRESHOLD=8k



\*Tests done by Glenn Lockwood (SDSC)






#### Dual Rail QDR vs FDR OSU Bandwidth Test

- MV2\_SM\_SCHEDULING=ROUND\_ROBIN
- In new version this is MV2\_RAIL\_SHARING\_POLICY, default



\*Tests done by Glenn Lockwood (SDSC)
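The tuning variables from these tests can be collected in one place for a job script. A minimal sketch: the variable names and the values `8k` and `ROUND_ROBIN` are taken from the slides, the helper `dual_rail_tuning_env` is hypothetical, and the slides do not say whether the two settings were combined in a single run.

```python
def dual_rail_tuning_env(threshold="8k", policy="ROUND_ROBIN", legacy_name=False):
    """Build the rail-sharing settings as an environment-variable mapping.

    `legacy_name=True` uses MV2_SM_SCHEDULING, the older name for the policy
    variable; otherwise the newer MV2_RAIL_SHARING_POLICY is used.
    """
    policy_key = "MV2_SM_SCHEDULING" if legacy_name else "MV2_RAIL_SHARING_POLICY"
    return {
        "MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD": threshold,
        policy_key: policy,
    }


# Print the settings in KEY=VALUE form, e.g. for pasting into a job script.
for key, value in dual_rail_tuning_env().items():
    print(f"{key}={value}")
```
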





## **HPCC** performance

A comprehensive set of HPCC benchmarks was run under four different conditions to determine the impact of core count and interconnect properties on performance.

1. Single rail torus + 16 cores/node
2. Double rail torus + 16 cores/node
3. Single rail torus + 12 cores/node
4. Double rail torus + 12 cores/node

A 512-core benchmark was also run on a non-contiguous set of nodes to investigate the impact of communication contention on performance.





## **HPCC** performance

| Cores  | G-HPL   | G-PTRANS | G-FFTE  | G-RandomAccess | G-Triad | EP-Triad | EP-DGEMM | Random Ring BW | Random Ring Lat | HPL % peak |
|--------|---------|----------|---------|----------------|---------|----------|----------|----------------|-----------------|------------|
|        | Tflop/s | GB/s     | Gflop/s | Gup/s          | GB/s    | GB/s     | Gflop/s  | GB/s           | μs              | %          |
| 128    | 2.25    | 27.00    | 68.3    | 0.904          | 598     | 4.67     | 19.34    | 0.374          | 4.3             | 84.5       |
| 256    | 4.50    | 57.00    | 131.4   | 1.595          | 1178    | 4.60     | 19.30    | 0.345          | 5.6             | 84.5       |
| 512    | 8.77    | 28.2     | 161.4   | 2.706          | 2350    | 4.59     | 19.41    | 0.156          | 7.1             | 82.4       |
| 1,024  | 17.02   | 83.9     | 254.6   | 4.316          | 4690    | 4.58     | 19.21    | 0.091          | 8.1             | 79.9       |
| 2,048  | 29.05   | 177.0    | 465.3   | 7.073          | 9359    | 4.57     | 19.38    | 0.097          | 8.6             | 68.2       |
| 16,160 | 284.5   | 219.7    | 679.0   | 15.751         | 55590   | 3.44     | 18.86    | 0.021          | 18.1            | 84.6       |

- 128 to 2,048 cores: dual rail, 12 cores/node
- 16,160 cores: single rail, 16 cores/node
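As a sanity check on the table, the G-HPL and HPL % peak columns together imply a per-core peak rate, which should be roughly the same in every row. A short sketch using only the values in the table:

```python
# (cores, G-HPL in Tflop/s, HPL % of peak) copied from the table above.
rows = [
    (128, 2.25, 84.5),
    (256, 4.50, 84.5),
    (512, 8.77, 82.4),
    (1024, 17.02, 79.9),
    (2048, 29.05, 68.2),
    (16160, 284.5, 84.6),
]

implied_peaks = []
for cores, tflops, pct_peak in rows:
    per_core_gflops = tflops * 1000 / cores                # achieved Gflop/s per core
    implied_peak = per_core_gflops / (pct_peak / 100)      # peak Gflop/s per core
    implied_peaks.append(implied_peak)
    print(f"{cores:>6} cores: {per_core_gflops:6.2f} Gflop/s per core achieved, "
          f"{implied_peak:5.2f} Gflop/s per core implied peak")
```

All six rows imply a peak of about 20.8 Gflop/s per core, so the two columns are mutually consistent.
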





# HPCC performance – bandwidth sensitive tests



#### **Random ring bandwidth**



Bandwidth sensitive tests show better performance when using both rails.

Also see improvements from using 12 rather than 16 cores/node





#### **HPCC** performance – latency sensitive tests



**Random Access** 

Plot series: R-Access-12c-dr, R-Access-16c-sr, R-Access-16c-dr.

Latency sensitive tests show better performance when using both rails.

Also see improvements from using 12 rather than 16 cores/node



**Random ring latency** 




## **HPCC performance – contention**

To assess the impact of communication contention, two 512-core HPCC runs were made in which the cores were:

1. located on contiguous nodes within 2 neighboring subracks
2. spread across the cluster using 3 compute nodes per switch (equal to the number of inter-switch links)



#### **Contention reduction**

Note: The Gordon scheduler allows users to specify the number of switch hops to control job placement.
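To get a feel for what a hop limit means on a 4x4x4 torus, the sketch below counts how many switches lie within a given number of hops of a starting switch. It assumes shortest-path hop counts; how the scheduler itself selects nodes is not described in the slides, and the function names are illustrative.

```python
from itertools import product

DIMS = (4, 4, 4)  # switch torus dimensions on Gordon


def torus_hops(src, dst, dims=DIMS):
    """Shortest-path hop count between two switches, with wraparound links."""
    total = 0
    for s, t, n in zip(src, dst, dims):
        d = abs(s - t) % n
        total += min(d, n - d)
    return total


def switches_within(max_hops, origin=(0, 0, 0), dims=DIMS):
    """Count switches reachable from `origin` in at most `max_hops` hops."""
    all_switches = product(*(range(n) for n in dims))
    return sum(1 for s in all_switches if torus_hops(origin, s, dims) <= max_hops)


for h in range(7):
    print(h, switches_within(h))
```

A limit of 0 hops confines a job to a single switch, while a limit of 6 admits all 64 switches in a rail.
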




#### **HPCC** performance – summary

- Achieved 82-85% of peak for Linpack runs at smallest and largest core counts.
- Benchmarks dependent on memory bandwidth (STREAM, DGEMM) do best when running with 12 cores/node.
- Benchmarks dependent on network performance (PTRANS, FFT, random ring/access) do best when using both rails.
- Minimizing contention results in 40% better performance on bandwidth sensitive HPCC benchmarks.
- Current SDSC scheduler allows users to specify maximum switch hops allowed for a job. Modifications for topology aware scheduling are being planned.





## Summary

- Production Gordon stack features MVAPICH2 with the --enable-3dtorus-support flag and dual rail support.
- Dual rail QDR performance competitive with FDR performance. MVAPICH2 environment variables such as MV2\_RAIL\_SHARING\_LARGE\_MSG\_THRESHOLD and MV2\_RAIL\_SHARING\_POLICY (earlier MV2\_SM\_SCHEDULING) can be used to tune performance.
- Gordon's switch-to-switch links are oversubscribed. Spreading tasks to reduce contention can improve performance.
- Research ongoing to use topology aware scheduling to improve application performance on Gordon.
- Big Thank You to Dr. Panda's group! Gordon was the first production dual rail InfiniBand 3-D torus machine and the MVAPICH2 deployment was flawless out of the box.
- Acknowledgements: NSF Grant #0910847 (Gordon), #1147926 (SI2), and #0926692 (STCI).

