Technion - Israel Institute of Technology, Graduate School
M.Sc Thesis
M.Sc Student: Brokhman Tanya
Subject: An OS Page Cache for Heterogeneous Systems
Department: Department of Computer Science
Supervisor: Professor Mark Silberstein


Abstract

Efficient access to files from GPUs is of growing importance in data-intensive applications. Unfortunately, current OS designs cannot provide core system services to GPU kernels, such as efficient access to memory-mapped files, nor can they optimize I/O performance for CPU applications that share files with GPUs. Mitigating these limitations requires much tighter integration of GPU memory into the OS page cache and file I/O mechanisms. Achieving such integration is one of the primary goals of this thesis. We propose a principled approach to integrating GPU memory with an OS page cache. Our system, GAIA, extends the CPU OS page cache to the physical memory of accelerators, so that the CPU OS can seamlessly manage a distributed page cache spanning CPU and GPU memories. We adopt a variation of a CPU-managed lazy relaxed consistency shared memory model while maintaining compatibility with unmodified CPU programs. We highlight the main hardware and software interfaces needed to support this architecture, and present a number of optimizations, such as tight integration with the OS prefetcher, that achieve efficient peer-to-peer caching of file contents. GAIA enables the standard mmap system call to map files into the GPU address space, thereby enabling data-dependent GPU accesses to large files and efficient write-sharing between the CPU and GPUs. Under the hood, GAIA:

1. Integrates lazy release consistency among physical memories into the OS page cache while maintaining backward compatibility with CPU processes and unmodified GPU kernels.

2. Improves CPU I/O performance by using data cached in GPU memory.

3. Optimizes the readahead prefetcher to support accesses to caches in GPUs.

We prototype GAIA in Linux and evaluate it on NVIDIA Pascal GPUs. We show up to 3× speedup in CPU file I/O and up to 8× in unmodified realistic workloads such as Gunrock GPU-accelerated graph processing, image collage, and microscopy image stitching.
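
The idea of lazy release consistency mentioned in the abstract can be illustrated with a toy model. The Python sketch below is not GAIA's implementation and says nothing about its kernel interfaces. The class `LazyPageCache`, its methods, and the device names are hypothetical, chosen only for illustration. It assumes one replica of the cached pages per device and shows a single property: a write stays local until the writer releases it, and another device copies the page only when it later performs an acquire.

```python
class LazyPageCache:
    """Toy model of a page cache replicated across devices (hypothetical names)."""

    def __init__(self, devices):
        self.pages = {d: {} for d in devices}     # device -> {page number: bytes}
        self.seen = {d: {} for d in devices}      # device -> {page number: version held}
        self.dirty = {d: set() for d in devices}  # device -> pages written since last release
        self.latest = {}                          # page number -> (version, owner device)
        self.transfers = 0                        # number of page copies between devices

    def write(self, device, page, data):
        # A write touches only the local replica; nothing is sent to other devices.
        self.pages[device][page] = data
        self.dirty[device].add(page)

    def read(self, device, page):
        # Reads are served from the local replica, which may be stale.
        return self.pages[device].get(page)

    def release(self, device):
        # Publish only metadata (new version and owner), not the page contents.
        for page in self.dirty[device]:
            version = self.latest.get(page, (0, None))[0] + 1
            self.latest[page] = (version, device)
            self.seen[device][page] = version
        self.dirty[device].clear()

    def acquire(self, device):
        # Pull every page for which another device has released a newer version.
        # Simplification: assumes the owner has not rewritten the page since its
        # release, and does not handle concurrent writers to the same page.
        for page, (version, owner) in self.latest.items():
            if owner != device and self.seen[device].get(page, 0) < version:
                self.pages[device][page] = self.pages[owner][page]
                self.seen[device][page] = version
                self.transfers += 1


cache = LazyPageCache(["cpu", "gpu"])
cache.write("gpu", 0, b"result")  # GPU kernel updates page 0 locally
cache.release("gpu")              # GPU publishes the update (metadata only)
stale = cache.read("cpu", 0)      # CPU has not acquired yet, so it sees no data
cache.acquire("cpu")              # CPU synchronizes and copies the newer page
fresh = cache.read("cpu", 0)
```

In this model the copy happens at acquire time rather than at release time, which is what makes the scheme lazy: a device that never acquires never pays for the transfer.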