Ph.D. Thesis


Ph.D. Student: Mark Silberstein
Subject: A Distributed System for Genetic Linkage Analysis
Department: Department of Computer Science
Supervisors: Full Professor Dan Geiger
             Full Professor Assaf Schuster


Abstract

In this work we consider the problem of accelerating scientific computations via parallel execution on large-scale non-dedicated distributed environments (also known as grids) and on Graphics Processing Units (GPUs). Our primary motivation has been the acceleration of parametric genetic linkage analysis computations. The main practical outcome is the design and implementation of a distributed system for genetic linkage analysis, called Superlink-online. The system serves hundreds of geneticists worldwide, allowing faster analysis of genetic data via automatic parallelization and execution on thousands of non-dedicated computers.


First, we design a parallel algorithm for computing the probability of evidence in large Bayesian networks, which is the computational problem underlying genetic linkage analysis. The algorithm splits the problem into independent sub-tasks, enabling parallel execution over non-dedicated large-scale grids. Next, we present a number of general mechanisms for scheduling and efficiently executing a workload comprising massively parallel and short Bags of Tasks (BOTs) on throughput-optimized non-dedicated resource pools. We demonstrate that a handful of generic policy-driven mechanisms, such as resource matching, task replication, and host-specific ranking, enable the implementation of a variety of scheduling algorithms optimizing different target functions, such as resource cost, makespan, and relative BOT slowdown. We devise an approximate algorithm for run-time policy evaluation that allows handling millions of jobs, and implement it as part of a working system called GridBot. We experimentally demonstrate GridBot's scalability and efficiency in running linkage analysis workloads over nine different grids, including a community grid with thousands of non-dedicated CPUs, achieving an effective throughput equivalent to that of 8,000 dedicated CPU cores.
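
To illustrate how such policy-driven mechanisms can be composed, the following minimal sketch expresses resource matching, host ranking, and task replication as user-supplied predicates consulted by a simple scheduling step. This is an illustration only: the types Host, Task, and SchedulingPolicy and the functions schedule and example_policy are hypothetical and do not reflect GridBot's actual interfaces or policy language.

    // Illustrative sketch only: expressing resource matching, host ranking,
    // and task replication as user-supplied predicates. All type and
    // function names are hypothetical and are not GridBot's actual API.
    #include <functional>
    #include <string>
    #include <vector>

    struct Host {
        std::string grid;        // e.g. a community grid or a dedicated cluster
        double      reliability; // observed fraction of successfully completed tasks
        bool        idle;
    };

    struct Task {
        int id;
        int replicas_running;
    };

    struct SchedulingPolicy {
        // Resource matching: may this task run on this host at all?
        std::function<bool(const Task&, const Host&)> match;
        // Host ranking: a higher value means the host is preferred.
        std::function<double(const Task&, const Host&)> rank;
        // Replication: should another replica of this task be launched?
        std::function<bool(const Task&)> replicate;
    };

    // Assign one task to the best-ranked matching idle host under the policy.
    void schedule(Task& t, std::vector<Host>& hosts, const SchedulingPolicy& p) {
        Host* best = nullptr;
        for (auto& h : hosts) {
            if (!h.idle || !p.match(t, h)) continue;
            if (!best || p.rank(t, h) > p.rank(t, *best)) best = &h;
        }
        if (best && (t.replicas_running == 0 || p.replicate(t))) {
            best->idle = false;      // launch (or replicate) the task on 'best'
            ++t.replicas_running;
        }
    }

    // Example policy: match any host, prefer more reliable hosts, and keep
    // at most three concurrent replicas of a task.
    SchedulingPolicy example_policy() {
        return {
            [](const Task&, const Host&)   { return true; },
            [](const Task&, const Host& h) { return h.reliability; },
            [](const Task& t)              { return t.replicas_running < 3; },
        };
    }

Under such a decomposition, optimizing for different target functions amounts to supplying different rank and replicate predicates, for example ranking hosts by observed reliability to reduce BOT slowdown, or disallowing replication to minimize resource cost.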


Finally, we demonstrate a complementary approach to accelerating the computations using GPUs. We introduce a formal approach for handling memory-intensive workloads with complex memory reuse patterns. We show that by programming the GPU scratchpad (close-to-ALU) memory as a software-managed cache, in a structured way and coupled with a low-overhead policy-driven runtime cache mechanism, one can create efficient computational algorithms. We demonstrate that for workloads whose reuse pattern can be determined at run time on the CPU, the cache policy can be represented as a lookup table, thus avoiding the costly execution of cache maintenance logic on the GPU. This approach improves application performance by up to an order of magnitude over an implementation without the cache. Furthermore, it enables computation of an analytical upper bound on the expected GPU performance of such applications.
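
The following CUDA sketch illustrates the general idea of a software-managed scratchpad cache whose contents are dictated by a lookup table precomputed on the CPU. The kernel, array names, and sizes are hypothetical and are not taken from the thesis implementation; the sketch only shows the mechanism of staging CPU-selected elements into shared memory and consuming them without any on-GPU cache maintenance logic.

    // Illustrative CUDA sketch of a software-managed scratchpad cache whose
    // replacement decisions are precomputed on the CPU as a lookup table.
    // All names and sizes are hypothetical; this is not the thesis code.
    #define CACHE_SLOTS 256   // number of cached elements per thread block

    // cache_plan[block * CACHE_SLOTS + slot] holds the global index of the
    // data element that this block should stage into shared memory; it is
    // filled on the CPU, where the reuse pattern is known at run time.
    __global__ void compute_with_cache(const float* __restrict__ data,
                                       const int*   __restrict__ cache_plan,
                                       float* __restrict__ out, int n)
    {
        __shared__ float cache[CACHE_SLOTS];   // scratchpad acting as a cache

        // Stage the elements chosen by the precomputed policy; no tags,
        // eviction, or other cache maintenance logic runs on the GPU.
        const int* plan = cache_plan + blockIdx.x * CACHE_SLOTS;
        for (int s = threadIdx.x; s < CACHE_SLOTS; s += blockDim.x)
            cache[s] = data[plan[s]];
        __syncthreads();

        // Consume the cached values; here each thread merely sums the slots
        // it owns, standing in for the real probability computation.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float acc = 0.0f;
            for (int s = threadIdx.x; s < CACHE_SLOTS; s += blockDim.x)
                acc += cache[s];
            out[i] = acc;
        }
    }

Because the staging loop simply follows the precomputed plan, the per-access cost of the cache is a single indexed load, which is what makes an analytical performance bound tractable for such kernels.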