M.Sc Thesis

M.Sc Student: Daoud Feras
Subject: High Performance Low Latency Network Over GPU
Department: Department of Electrical and Computer Engineering
Supervisor: Associate Prof. Mark Silberstein


We present GPUrdma, a GPU-side library for performing Remote Direct Memory Accesses (RDMA) across the network directly from GPU kernels. The library executes no code on the CPU and accesses the InfiniBand Host Channel Adapter (HCA) hardware directly for both control and data. Slow single-thread GPU performance and the intricacies of the interaction between the GPU and the network adapter pose a significant challenge. We describe several design options and analyze their performance implications in detail. We achieve 5 µs one-way communication latency and up to 50 Gbit/s transfer bandwidth for messages of 16 KB and larger between NVIDIA K40c GPUs across the network. Moreover, GPUrdma outperforms CPU RDMA by a factor of 4.5× for smaller packets, ranging from 2 to 1024 bytes, thanks to the greater parallelism of transfer requests enabled by highly parallel GPU hardware. We use GPUrdma to implement a subset of the global address space programming interface (GPI) for point-to-point asynchronous RDMA messaging. We demonstrate our preliminary results using two simple applications, ping-pong and a multi-matrix-vector product with a constant matrix and multiple vectors, each running on two different machines connected by InfiniBand. Our basic ping-pong implementation achieves 5% higher performance than the baseline using GPI-2. The improved ping-pong implementation, which overlaps communication per threadblock, yields a further 20% improvement. The multi-matrix-vector product is up to 4.5× faster thanks to higher throughput for small messages and the ability to keep the matrix in fast GPU shared memory while receiving new inputs. The GPUrdma prototype is not yet suitable for production systems due to hardware constraints in the current generation of NVIDIA GPUs, which we discuss in detail. However, our results highlight the great potential of GPU-side native networking and encourage further research toward scalable, high-performance, heterogeneous networking infrastructure.
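To illustrate why the parallelism of transfer requests matters for small messages, the following is a minimal sketch of a textbook latency/bandwidth cost model. It is not taken from the thesis. It plugs in only the two figures quoted above (5 µs one-way latency, 50 Gbit/s peak bandwidth); the function names, the `in_flight` parameter, and the assumption that overlapped requests scale linearly up to the peak are all hypothetical simplifications.

```python
# Illustrative latency/bandwidth model; an assumption, not the thesis's method.
LATENCY_S = 5e-6          # one-way latency quoted in the abstract
PEAK_BITS_PER_S = 50e9    # peak bandwidth quoted in the abstract


def transfer_time(size_bytes: int) -> float:
    """Time to deliver one message when nothing is overlapped."""
    return LATENCY_S + size_bytes * 8 / PEAK_BITS_PER_S


def effective_gbit_per_s(size_bytes: int, in_flight: int = 1) -> float:
    """Throughput with `in_flight` overlapped messages, capped at the peak.

    Linear scaling with the number of in-flight requests is an idealisation.
    """
    per_message = size_bytes * 8 / transfer_time(size_bytes)
    return min(in_flight * per_message, PEAK_BITS_PER_S) / 1e9


if __name__ == "__main__":
    for size in (64, 1024, 16 * 1024):
        single = effective_gbit_per_s(size)
        overlapped = effective_gbit_per_s(size, in_flight=32)
        print(f"{size:6d} B  single={single:6.2f}  32 in flight={overlapped:6.2f} Gbit/s")
```

Under this simplified model a single, unoverlapped message stays well below the peak rate even at 16 KB, because the fixed latency dominates the transfer time. Keeping many requests in flight is what closes the gap, which is consistent with the abstract's attribution of the small-message advantage to GPU parallelism.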