|M.Sc. Student||Shahar Sagi|
|Subject||Efficient I/O operations on GPGPU devices|
|Department||Department of Electrical Engineering||Supervisor||Professor Mark Silberstein|
|Full Thesis text|
Modern discrete GPUs have become the processors of choice primarily for compute-intensive applications, but using them for large-scale data processing is extremely challenging, especially when an algorithm exhibits unpredictable, data-driven access patterns.
One of the challenges of implementing such algorithms on GPUs is that they lack important I/O abstractions long established in the CPU context, such as memory-mapped files, which shield programmers from the complexity of buffer and I/O device management. Implementing these abstractions on GPUs, however, is problematic: the limited GPU virtual memory hardware does not support page faults and cannot modify memory mappings for a running GPU kernel.
In this work, we implement ActivePointers, a software address translation layer that introduces native support for page faults and virtual address space management to GPU programs.
We integrate our system with the GPUfs paging system to provide a fully functional memory-mapped files abstraction on commodity GPUs. To access a file mapped into GPU memory, developers use active pointers, which behave like regular pointers but, under the hood, access the GPU page cache and trigger page faults that are handled on the GPU.
To make the implementation efficient, we design and evaluate a number of novel mechanisms, such as a translation cache held in hardware registers and translation aggregation for deadlock-free page-fault handling of threads in a single warp. In addition, we propose several modifications to the GPUfs paging system implementation that yield a 5.6X average performance increase over the original GPUfs implementation for a real-world application.
We extensively evaluate ActivePointers on commodity NVIDIA GPUs using microbenchmarks, and also implement a complex image processing application that constructs a photo collage from a subset of 10 million images stored in a 40GB file. The GPU implementation maps the whole file into GPU memory and accesses it via active pointers. The use of active pointers adds at most 1% to the application's runtime, while enabling speedups of up to 3X over a combined CPU+GPU implementation and 3.5X over a 12-core CPU-only run.
In this work, we show the feasibility of a GPU-centric virtual memory management design with page fault handling and address space modification from GPU programs. We demonstrate both the system's ease of use for the programmer and the low overhead of address translation. This low overhead is achieved thanks to (1) a co-design of the page cache and the translation mechanism, which enables safe caching of virtual-to-physical mappings in per-thread hardware registers, and (2) the GPU's inherent latency-hiding capabilities, which mask the translation overheads.