טכניון מכון טכנולוגי לישראל
הטכניון מכון טכנולוגי לישראל - בית הספר ללימודי מוסמכים  
M.Sc Thesis
M.Sc StudentMalits Roman
SubjectThe Potential of Global Scheduling to Improve Utilization
in Wide SIMD GPGPU Architectures
DepartmentDepartment of Electrical Engineering
Supervisors Professor Avi Mendelson
Professor Emeritus Avinoam Kolodny
Mr. Evgeny Bolotin
Full Thesis textFull thesis text - English Version


Abstract

Modern Graphic Processing Units (GPUs) replaced much of the early generations fixed function rendering pipelines with programmable shader processors that run a relatively small and simple shader program. The shader processors of the latest generations of GPUs are Turing-complete which prompted the appearance of more generic and highly programmable General Purpose GPUs. New programming models such as CUDA and OpenCL emerged opening GPGPU usage for use by general purpose applications. Under this new programming model GPGPU is used as a co-processor to CPU, serial portions of the applications run on the CPU while the data parallel portions of the code are executed by the GPGPU.

Many general purpose data parallel applications are characterized as having intensive control flow and unpredictable memory access patterns, for example parallel graph algorithms. Optimizing the code in such problems for current GPGPU hardware is often impractical resulting in a low hardware utilization and consequently low performance. 

This work tracks the root causes of execution inefficacies in NVIDIA GPGPU hardware running control flow intensive CUDA applications.  Modern GPGPUs use multiple streaming multiprocessors (SMs) for parallel execution with each SM responsible for a portion of the work. SMs use SIMD instruction scheduling which allows to efficiently handle applications with extensive data level parallelism by amortizing data-independent control hardware across multiple pipelines. Under this approach, multiple data elements are grouped and processed in “lock-step” in parallel with a single instruction. Such thread groups are referred to as warps by NVIDIA.

Current hardware employs scheduling optimizations which are local to each SM.  This work unveils the potential of alternative approach based on global warp scheduling and reconstruction. For accessing this hidden potential we implement and evaluate a novel global warp scheduling algorithm, termed DGS (Dynamic Global Scheduler) designed to maximize the amount of effective parallelism in the system.

We show by simulations of synthetic and real CUDA benchmarks that DGS can significantly improve the machine utilization without the need for time consuming software optimizations. For that reason DGS have a significant potential for substantial performance improvement in control flow bound applications compared to current and previously proposed scheduling mechanisms, when no fine grained optimizations are employed.

The paper shows that for MUMmerGPU and BFS parallel graph algorithms DGS achieves performance gain of up to x4.4 and x2.6 relative to Immediate Post Dominator (POD) scheduling method which represents the way scheduling works in current hardware.