Ph.D Thesis

Ph.D Student: Steinfeld Israel
Subject: Data Analysis in Studies Combining Multiple High-Throughput Measurement Technologies
Department: Department of Computer Science
Supervisor: Associate Prof. Zohar Yakhini


In the studies presented here, we demonstrate the use of rigorous statistics and efficient algorithmics to analyze integrated high-throughput molecular measurement data by translating results into ranked lists and by using flexible approaches to statistically assess properties of the latter. 

Over recent years, modern biology has undergone an information revolution, which is evident in a shift of thinking and practice. Whereas typical biological studies focus on specific pathways, such as the p53 signaling pathway, the emergence of novel high-throughput technologies now enables the quantification of biological features on a genome-wide scale. The rapid development of array technology, in particular, has enabled its use in measuring mRNA expression levels, miRNA expression levels, DNA methylation state, DNA copy number aberration in cancer, etc. With the recent revolution in second-generation sequencing technologies, the accuracy and scope of the different high-throughput applications are constantly being improved.

The naïve approach to analyzing such rich high-throughput data is to cluster the samples and the genes separately, usually using hierarchical clustering, and to try to characterize each cluster, commonly with the help of gene annotation repositories (e.g. Gene Ontology). With exhaustive expert examination, this methodology can yield meaningful biological results, though intricate responses are difficult to uncover. Various methodologies have been developed in recent years to handle integrated analysis of functional genomics data, mainly by studying the transcriptional programs and global organization of biological processes. Still, only a few studies report the joint analysis of sample cohorts that include multiple genomic measurements.

In this thesis, we explore the notion of gene set enrichment, which is broadly used in analyzing genomic high-throughput results. We generalize statistical enrichment approaches to the current needs of molecular genomics, introducing the notion of mutual enrichment. In addition, we study the joint analysis of integrated high-throughput data in the context of matched samples. In such data, each sample (e.g. cell line, subject or patient) is associated with multiple genome-wide profiles and other high-throughput information. Applying our newly developed methodologies to measurement data, we are able to characterize biological response at the system level, capturing not only the dominant primary responses but also finer and less easily tractable processes.
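
As a rough illustration of what gene set enrichment in a ranked list can mean, the sketch below scores how over-represented a gene set is among the top n genes of a ranking, using a standard hypergeometric tail probability. This is a generic textbook formulation and not necessarily the statistic developed in the thesis; the function names, the fixed cutoff n and the toy gene identifiers are all assumptions made for the example.

```python
from math import comb


def hypergeom_tail(N, B, n, b):
    """Return P(X >= b) for X ~ Hypergeometric(N, B, n).

    N: total number of genes, B: genes in the set,
    n: size of the top of the list, b: set genes observed in the top.
    """
    total = comb(N, n)
    upper = min(n, B)
    hits = sum(comb(B, k) * comb(N - B, n - k) for k in range(b, upper + 1))
    return hits / total


def top_enrichment(ranked_genes, gene_set, n):
    """Enrichment p-value of gene_set among the first n ranked genes.

    Hypothetical helper: assumes a fixed cutoff n chosen in advance.
    """
    N = len(ranked_genes)
    B = sum(1 for g in ranked_genes if g in gene_set)
    b = sum(1 for g in ranked_genes[:n] if g in gene_set)
    return hypergeom_tail(N, B, n, b)


# Toy data (illustrative only): 20 ranked genes, a 5-gene set,
# 3 of which fall within the top 5 positions.
ranked = ["g%d" % i for i in range(20)]
gene_set = {"g0", "g1", "g2", "g10", "g15"}
p_value = top_enrichment(ranked, gene_set, n=5)
print(p_value)
```

A fixed cutoff is the simplest choice; methods designed for ranked lists typically avoid committing to a single n, which is beyond the scope of this sketch.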