Technion - Israel Institute of Technology, Graduate School
M.Sc Thesis
M.Sc Student: Erez Ben-Yaacov
Subject: Analysis of Microarray Data
Department: Department of Electrical Engineering
Supervisor: Full Professor Eldar Yonina
Full Thesis Text: English Version


Abstract

Genomic tiling microarrays have become a popular tool for interrogating the transcriptional activity of large regions of the genome in a single experiment. In the absence of noise, a microarray experiment would yield a piecewise constant signal whose breakpoints indicate the start and end locations of genes or of copy number variations. Since microarray measurements are noisy, segmenting the measurements of a given genomic profile into a piecewise constant signal is a key step in the analysis of tiling microarray data. Segmentation requires reliable identification of locations of copy number transitions, or breakpoints, by making efficient use of the physical dependency between adjacent probes. In this work we review leading 1-D segmentation approaches used in microarray analysis and present HaarSeg, a new segmentation method based on well-known wavelet denoising principles. HaarSeg identifies statistically significant breakpoints in the data using the maxima of the Haar wavelet transform, and segments the data accordingly. Leading 1-D segmentation methods suffer from very long running times, which prevent interactive data analysis. HaarSeg is over 1,000 times faster than these leading approaches, with similar performance. Another key advantage of the proposed method is its simplicity and flexibility: owing to its intuitive structure, it can easily be generalized to incorporate several types of side information. We consider several extensions, such as including side information that indicates the reliability of each measurement and compensating for changing variability in the measurement noise. The resulting algorithm outperforms existing methods, in terms of both speed and performance, when applied to real high-density aCGH data.
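
The idea of locating breakpoints at the maxima of the Haar wavelet transform can be illustrated with a minimal Python sketch. It is not the thesis implementation. It uses a single scale, and the function names, the window half-width, and the simple k-sigma threshold are all assumptions made for illustration, since the abstract does not specify the scales or the significance test used by HaarSeg.

```python
from statistics import mean, median


def haar_coefficients(signal, half_width):
    """Undecimated Haar detail coefficients at one scale (illustrative).

    The coefficient at index i compares the mean of the half_width samples
    starting at i with the mean of the half_width samples just before i.
    The scale factor gives the wavelet unit norm.
    """
    n = len(signal)
    scale = (half_width / 2) ** 0.5
    coeffs = [0.0] * n
    for i in range(half_width, n - half_width + 1):
        left = mean(signal[i - half_width:i])
        right = mean(signal[i:i + half_width])
        coeffs[i] = (right - left) * scale
    return coeffs


def estimate_noise_sigma(signal):
    """Robust noise estimate from the median absolute first difference."""
    diffs = [abs(b - a) for a, b in zip(signal, signal[1:])]
    return median(diffs) / (0.6745 * 2 ** 0.5)


def find_breakpoints(signal, half_width=4, k=3.0):
    """Indices where |coefficient| is a local maximum above k * sigma.

    The k-sigma rule is an assumed stand-in for a proper significance test.
    """
    coeffs = [abs(c) for c in haar_coefficients(signal, half_width)]
    threshold = k * estimate_noise_sigma(signal)
    breakpoints = []
    for i in range(1, len(coeffs) - 1):
        is_peak = coeffs[i] > coeffs[i - 1] and coeffs[i] >= coeffs[i + 1]
        if is_peak and coeffs[i] > threshold:
            breakpoints.append(i)
    return breakpoints


def segment(signal, breakpoints):
    """Piecewise constant fit: each segment is replaced by its mean."""
    edges = [0, *breakpoints, len(signal)]
    fitted = []
    for start, end in zip(edges, edges[1:]):
        fitted.extend([mean(signal[start:end])] * (end - start))
    return fitted


# Toy profile: a step from 0 to 1 at index 20 plus a small alternating
# perturbation standing in for measurement noise.
signal = [(0.0 if i < 20 else 1.0) + 0.05 * (-1) ** i for i in range(40)]
breakpoints = find_breakpoints(signal)
print(breakpoints)  # [20]
fitted = segment(signal, breakpoints)
```

On this toy profile the absolute coefficients form a triangle that peaks exactly at the step, so the single detected breakpoint splits the data into two segments with means of approximately 0 and 1. Real array data would call for several scales and a principled threshold rather than the fixed values assumed here.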