|M.Sc Student||Dimitsas Vasileios|
|Subject||A Readahead Prefetcher for GPU File System Layer|
|Department||Department of Electrical Engineering|
|Supervisor||Professor Mark Silberstein|
|Full Thesis text|
The growing popularity of GPUs in data-intensive workloads, such as deep neural networks, has motivated the research community to optimize file I/O for GPUs. A step in this direction is GPUfs, which provides POSIX-like abstractions for GPU applications, enables GPU threads to issue I/O instructions without CPU intervention, and introduces a GPU page cache.
However, streaming applications that spend a significant portion of their execution time in I/O, and whose I/O access patterns do not exhibit a high degree of temporal locality, do not achieve their performance potential when executed on GPUs, for numerous reasons. The architectural characteristics of a heterogeneous CPU-discrete GPU system hinder applications from utilizing the available SSD bandwidth. Moreover, the Linux Readahead Prefetcher, which traditionally enhances the performance of CPU applications that perform sequential read I/O, is not sufficient for enhancing GPU I/O performance.
We present a GPU I/O readahead prefetcher, which is integrated with GPUfs and operates synergistically with the Linux Readahead Prefetcher, in order to improve the performance of streaming applications that perform sequential read I/O accesses. We analyze in depth the architectural features of the heterogeneous CPU-discrete GPU system that have an impact on GPU I/O performance, and we characterize the I/O access patterns observed during the execution of streaming GPU programs. Finally, we propose a new page cache replacement policy to deal with GPU memory limitations.
We evaluate the GPU I/O readahead prefetcher on an NVIDIA GPU using a series of I/O microbenchmarks that exhibit the I/O access pattern observed in a diverse set of streaming applications. The GPU I/O readahead prefetcher achieves more than 2x (geometric mean) higher bandwidth than the default case. Furthermore, we use 15 applications for our evaluation, derived from the RODINIA, PARBOIL and POLYBENCH benchmark suites. Moderate use of our prefetching mechanism improves their execution time by almost 50% and their I/O bandwidth by 82% on average (geometric mean), compared to CPU I/O.