|M.Sc. Student||Watad Amir|
|Subject||Scalable Multi-GPU Network Servers|
|Department||Department of Electrical Engineering||Supervisor||Professor Mark Silberstein|
In this work we explore the design space of network servers on GPUs. First, we demonstrate a network server with simple, single-stage logic running on a single GPU (or on multiple independent GPUs), connected over the network to a memcached key-value store.
We then address a more complex server with multi-stage logic, multiple GPUs, and a sharded (partitioned) dataset. We present GPUpipe, an abstraction and runtime for implementing such servers.
GPUpipe is a framework for building low-latency multi-GPU network servers for memory-demanding applications. GPUpipe introduces a CPU-less server design which eliminates GPU management overheads, making the server's throughput and response time largely agnostic to CPU load, CPU speed, or the number of dedicated CPU cores.
GPUpipe builds on the concept of a data parallel pipeline, providing convenient GPU programming abstractions which hide the complexity of building multi-GPU systems.
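To illustrate the data-parallel-pipeline concept only (this is not GPUpipe's actual API, and all names below are illustrative assumptions), one can think of a server as an ordered list of stages applied to each request, with the dataset sharded so that each request is routed to the shard holding its partition; here plain Python threads stand in for GPUs:

```python
# Conceptual sketch of a sharded, multi-stage data-parallel pipeline.
# NOT GPUpipe's API: Pipeline, submit, and the stage functions are
# hypothetical stand-ins, and thread workers stand in for GPUs.
from concurrent.futures import ThreadPoolExecutor


class Pipeline:
    def __init__(self, stages, num_shards):
        self.stages = stages          # callables run in order per request
        self.num_shards = num_shards  # analogous to the number of GPUs
        self.pool = ThreadPoolExecutor(max_workers=num_shards)

    def submit(self, request):
        # Route the request to the shard owning its key (a deterministic
        # toy hash), then run all stages on that shard's worker.
        shard = sum(map(ord, request["key"])) % self.num_shards

        def run():
            data = request["data"]
            for stage in self.stages:
                data = stage(data, shard)
            return data

        return self.pool.submit(run)


# Two toy stages: a per-element transform, then a shard-local reduction.
def extract(x, shard):
    return [v * 2 for v in x]


def lookup(x, shard):
    return sum(x) + shard


pipe = Pipeline([extract, lookup], num_shards=4)
fut = pipe.submit({"key": "img-1", "data": [1, 2, 3]})
print(fut.result())  # → 15 (shard 3; [2, 4, 6] sums to 12, plus shard id)
```

The point of the abstraction is that application code only defines the stages; routing requests across shards and chaining stage outputs is handled by the runtime.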
We evaluate GPUpipe by implementing a fully functional Image Similarity Search server. Our experiments on local and Amazon EC2 systems show that the server achieves near-perfect scaling up to 16 GPUs, beating the throughput of a highly optimized CPU-driven server by 35% while maintaining about 2 msec average request latency. Our experiments confirm that the server's performance is insensitive to CPU characteristics: it maintains the same throughput with a single CPU core, achieving over an order of magnitude higher throughput than the CPU-driven server in that scenario.