M.Sc Thesis


M.Sc Student: Maudlej Lina
Subject: Accelerators Management In Disaggregated Computing System
Department: Department of Electrical and Computer Engineering
Supervisor: Associate Prof. Mark Silberstein


Abstract

It is likely that next-generation disaggregated heterogeneous data centers will be composed of a wide range of computing and input/output accelerators, including graphics processing units (GPUs), tensor processing units (TPUs), field-programmable gate arrays (FPGAs), visual compute accelerators (VCAs), SmartSSDs, and SmartNICs, grouped together in shared resource pools. These pools will serve as the basis for new computer system designs, which will combine fine-grained components on demand to build complex applications. However, existing systems lack the operating system (OS) support for this new design. Specifically, although central processing unit (CPU) cycles have become a precious commodity, existing deployments use traditional CPU-centric designs in which accelerators are managed solely by the CPU, causing significant CPU resource consumption.

To mitigate these limitations, we first describe a SmartNIC-offloaded solution, Lynx, in which accelerator management is performed by a SmartNIC. Lynx extends SmartNICs to implement a model wherein accelerators interact directly with each other, bypassing the host CPU. We show how Lynx can manage a VCA, and that this design improves system latency by 4.3x compared to the traditional approach.

Second, we present a GPU service built with FractOS. FractOS is a disaggregated OS that exposes high-level secure interfaces for remote operations, replacing low-level hardware-oriented device application programming interfaces (APIs). In FractOS, we adopt Lynx's approach: FractOS's interfaces interact directly with devices, without CPU involvement, by offloading the critical path to SmartNICs. FractOS's invocation interface enables any device to invoke control operations on any remote system resource without running a low-level driver.

We evaluate the FractOS GPU service with the FractOS controller running either on host CPUs or on SmartNICs. We show that the overheads of our GPU service prototype are relatively small, and that even the SmartNIC deployment of the FractOS controller is faster than the state-of-the-art rCUDA remote GPU service. We also show that we achieve near-optimal throughput with more than one in-flight request.

These results pave the way to using SmartNICs for accelerator management in modern disaggregated data centers.