

# Use of BlueField SmartNICs in Offloading One-Sided Communication Primitives

Benjamin Michalowicz<sup>1</sup>, Kaushik Kandadi Suresh<sup>1</sup>, Hari Subramoni<sup>1</sup>, Dhabaleswar K. Panda<sup>1</sup>, Steve Poole<sup>2</sup>

<sup>1</sup>The Ohio State University, <sup>2</sup>Los Alamos National Laboratory

{michalowicz.2, kandadisuresh.1, subramoni.1, panda.2}@osu.edu, swpoole@lanl.gov



## MOTIVATION

- Two-Sided Communication has been successfully offloaded to SmartNICs such as NVIDIA's BlueField-2 and BlueField-3 (BF-2/3)
- One-Sided Communication (1SC) is inherently nonblocking, so it can leverage SmartNICs to gain more overlap between communication and computation.
- 1SC is used across multiple programming libraries/implementations (MPI, PGAS/OpenSHMEM, etc.)
  - We focus on MPI and OpenSHMEM

## RESEARCH CHALLENGES

- Designing a library-agnostic framework for offloading 1SC (Put and Get operations)
- Accounting for the different ways in which MPI and OpenSHMEM execute 1SC
- Use of low-level, advanced network primitives for efficient, scalable designs on both BF-2 and BF-3 DPUs
  - Emphasis on the network and the memory subsystem, less on advanced systems

## HIGH-LEVEL CHANGES IN HPC LIBRARIES



## SUMMARY OF CONTRIBUTIONS

- Design of a standard/library-agnostic framework for offloading 1SC
- Scalable design at both the benchmark and the application level (BSPMM kernel)
- Design of different synchronization types to account for, e.g., flushing of MPI windows or completing nonblocking 1SC in OpenSHMEM
- Demonstration of efficiency up to 512 processes with 4 BF-3 DPUs and AMD EPYC CPUs (128-core)
- Demonstration of scalability up to 256 processes with 8 BF-2 DPUs and Intel Broadwell CPUs (32-core)
- Results: Up to 24x speedup compared to a blocking kernel (System 1) and up to 99x speedup compared to a non-blocking kernel (System 2)

# DESIGNS AND OMB BENCHMARK PERFORMANCE (VIA BF-2)

# DESIGN: EMPHASIS ON "GET" AND "QUIET/FLUSH", API "INTEGRATION"



- Step 1: Host1 issues a nonblocking get, sends metadata to the DPU, and increments a counter (per-process, per-window)
- Step 2: The DPU utilizes GVMI firmware in BF-2/3 to perform RDMA operations
- Step 3: "Return Data" (in Flush/synchronization)
- Step 4: [...] used on host side; the DPU-proxy counter is used on the worker side to issue fetch-add operations
- Step 5: Each proxy process "returns" to its matched host process
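The counter-based completion scheme in the steps above can be illustrated with a minimal single-process sketch. Everything here is an assumption made for illustration: the names `offload_window`, `get_request`, `offload_get_nb`, `proxy_progress_one`, and `offload_flush` are hypothetical, `memcpy` stands in for the RDMA transfer that the DPU would perform through GVMI, and a plain increment stands in for the network fetch-add. In the actual design the proxy runs on the DPU concurrently with the host; here the flush drives it directly so the example stays self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PENDING 16

/* Hypothetical metadata the host hands to its DPU proxy for one get. */
typedef struct {
    const void *remote_src; /* stands in for an address in a remote window */
    void *local_dst;
    size_t bytes;
} get_request;

/* Hypothetical per-process, per-window offload state. */
typedef struct {
    get_request queue[MAX_PENDING];
    unsigned issued;    /* host-side counter, bumped when a get is issued */
    unsigned completed; /* proxy-side counter, bumped when data has landed */
} offload_window;

/* Host side: record the metadata and increment the host counter.
 * No data moves here. Returns 0 on success, -1 if the queue is full. */
static int offload_get_nb(offload_window *w, void *dst,
                          const void *src, size_t bytes)
{
    assert(w != NULL);
    if (w->issued - w->completed >= MAX_PENDING)
        return -1;
    get_request *r = &w->queue[w->issued % MAX_PENDING];
    r->remote_src = src;
    r->local_dst = dst;
    r->bytes = bytes;
    w->issued++;
    return 0;
}

/* Proxy side: service one pending request. memcpy is a stand-in for the
 * RDMA read, and the increment is a stand-in for a fetch-add.
 * Returns 1 if a request was serviced, 0 if nothing was pending. */
static int proxy_progress_one(offload_window *w)
{
    assert(w != NULL);
    if (w->completed == w->issued)
        return 0;
    get_request *r = &w->queue[w->completed % MAX_PENDING];
    memcpy(r->local_dst, r->remote_src, r->bytes);
    w->completed++;
    return 1;
}

/* Flush/quiet: return only once every issued get has completed.
 * Returns the number of requests that were completed by this call. */
static unsigned offload_flush(offload_window *w)
{
    unsigned serviced = 0;
    while (w->completed != w->issued)
        serviced += (unsigned)proxy_progress_one(w);
    return serviced;
}
```

The point of the two counters is that the host never has to track individual requests: a flush or quiet only compares how many operations were issued with how many the proxy reports as complete.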





Proxy\_exchange in shmem\_malloc() is there to "show" that the DPUs are aware of how space in the symmetric heap is used.

#### Intra-Node OSU OpenSHMEM Benchmarks





API integration (listing fragments):

- bytes = count \* get\_size(datatype);
- addr2 = buf\_of(window) + disp;
- // Other metadata setup ...
- MPI\_Win\_flush\_all(window) { return Offload\_flush(window); }

### Inter-Node OpenSHMEM Benchmarks

Inter-Node Get Latency and Overlap (OSHMEM Heap)



- Overhead largely comes from the effort of offloading small messages. Larger messages take advantage of overlap, which is the goal of offloading rather than improved transfer time.
- DPU-aware designs are more sensitive to cache effects than pure-host designs, which leads to a spike at 128KB on BF-2.
- Intra-node comparison: the designs still go over the network, and while shared memory is fast, minimal overlap occurs beyond 64KB messages. Host-based progression will be impacted in denser executions.

# BLOCK SPARSE MM KERNEL: BF-2 AND BF-3 PERFORMANCE ("BABY" VARIANT OF NWCHEM)

#### Blocking and Non-blocking variants via OpenSHMEM

- "Get-Compute-Update" Pattern
- Mesh: X and Y parameters, where Y adds another "dimension" inside each block (X rows, X blocks per row, Y cells per block) → buffer size = X \* X \* Y \* sizeof(double)
- Blocking variant:
  - while (work\_unit != max\_unit\_count) { blocking\_get(); dgemm(); update(); }
- Nonblocking variant:
  - blocking\_get(cur\_bufs); while (work\_unit != max\_unit\_count) { nb\_get(next\_bufs); dgemm(cur\_bufs); sync(next\_bufs); update(); cur\_bufs = next\_bufs; }
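The buffer-size formula and the two loop shapes above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the BSPMM kernel itself: `mesh_buffer_bytes`, `get_tile`, `compute_tile`, `blocking_kernel`, and `nonblocking_kernel` are hypothetical names, `memcpy` stands in for the one-sided get, and a simple sum stands in for `dgemm()` plus `update()`. The sketch swaps the two buffers explicitly and skips the prefetch on the last work unit, details the one-line pseudocode leaves implicit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { UNITS = 4, TILE = 3 }; /* hypothetical work-unit count and tile size */

/* Buffer size for X rows, X blocks per row, Y cells per block. */
static size_t mesh_buffer_bytes(size_t x, size_t y)
{
    return x * x * y * sizeof(double);
}

/* Stand-in for dgemm() + update(): reduce a tile to the sum of its cells. */
static double compute_tile(const double *tile)
{
    double sum = 0.0;
    for (int i = 0; i < TILE; i++)
        sum += tile[i];
    return sum;
}

/* Stand-in for a one-sided get of one tile from a flat "remote" array. */
static void get_tile(double *dst, const double *remote, int unit)
{
    assert(remote != NULL);
    memcpy(dst, remote + (size_t)unit * TILE, TILE * sizeof(double));
}

/* Blocking variant: fetch, then compute, one work unit at a time. */
static double blocking_kernel(const double *remote)
{
    double buf[TILE], result = 0.0;
    for (int unit = 0; unit < UNITS; unit++) {
        get_tile(buf, remote, unit);   /* blocking_get() */
        result += compute_tile(buf);   /* dgemm(); update() */
    }
    return result;
}

/* Nonblocking variant: prefetch the next tile while computing on the
 * current one, then swap the two buffers. */
static double nonblocking_kernel(const double *remote)
{
    double a[TILE], b[TILE], result = 0.0;
    double *cur = a, *next = b;

    get_tile(cur, remote, 0);                  /* blocking_get(cur_bufs) */
    for (int unit = 0; unit < UNITS; unit++) {
        if (unit + 1 < UNITS)
            get_tile(next, remote, unit + 1);  /* nb_get(next_bufs) */
        result += compute_tile(cur);           /* dgemm(cur_bufs); update() */
        /* sync(next_bufs) would complete the outstanding get here. */
        double *tmp = cur;                     /* cur_bufs = next_bufs */
        cur = next;
        next = tmp;
    }
    return result;
}
```

Both variants compute the same result; the nonblocking shape only changes when the fetch is issued, which is what gives an offloaded get the chance to overlap with the computation on the current buffers.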

#### Experimental Setup

- Experimental Systems:
  - System 1: Intel Skylake (20 cores x 2 sockets) w/ BF-2s and HDR100 IB
  - System 2: AMD EPYC (64 cores x 2 sockets) w/ BF-3s and HDR200 IB
- Libraries:
  - MVAPICH2-2.3.7 and OSHMPI Framework
  - Offshoot of MVAPICH with 1SC designs and OSHMPI Framework
  - OSHMPI: a standard-compliant framework that emulates OpenSHMEM with MPI primitives
    - Symmetric Heap → One MPI Window
    - shmem\_put → MPI\_Put + immediate MPI\_Win\_flush\_all()
    - Etc.

#### Performance on System 1 (Comm and Compute Offload)





- Get-Compute-Update lends itself nicely to "naïve" compute offload (utilizing BF-2 cores)
- Dense communication → lack of progress resources available on the host, which makes it an ideal use case for DPUs (smaller scales and PPN see some benefit, but not as much)
- Up to 91% improvement with BF-2s for both compute and communication offload

#### Performance on System 2 (Comm Offload)

1-node, 128-PPN Results (Nonblocking Kernel)





4-node, 128-PPN Results (Nonblocking Kernel, AMD + BF3). Legend: Pure-Host-Nonblocking, Offload



at 60x60 mesh (16x16 mesh could not be run at larger scales, hence lack of log-scale y-axis)

#### References

- BSPMM Kernel, R. Zambre and S. Bhattacharya, https://github.com/rzambre/bspmn
- NVIDIA, "NVIDIA BlueField Networking Platform," https://www.nvidia.com/en-us/networking/products/data-processing-unit/
- B. Michalowicz, K. K. Suresh, H. Subramoni, M. Abduljabbar, D. K. Panda and S. Poole, "Effective and Efficient Offloading Designs for One-Sided Communication to SmartNICs," 2024 IEEE 31st International Conference on High Performance Computing, Data, and Analytics (HiPC), Bangalore, India, 2024, pp. 23-33, doi:

#### Acknowledgements

1. HPCA-AI Advisory Council
2. LANL SOW #19537, NSF Grants #231927 and #2007991