|M.Sc Student||Ghanayim Alaa|
|Subject||Iterative Referencing for Improving the Interpretation of DNA Sequence Data|
|Department||Department of Computer Science||Supervisor||Professor Dan Geiger|
Next-Generation Sequencing (NGS) enables genetic studies to discover SNPs and indels associated with Mendelian and complex diseases. The measurement process, which generates millions of short reads, creates various data processing and interpretation challenges for which a multitude of software tools are being developed. A common framework used to date to discover variations in sequenced data consists of the following pipeline steps: first, mapping the sequenced reads to a reference genome; second, local realignment to account for indels; third, recalibration of the base quality scores and discovery of variations using the GATK or SAMtools packages. Improving the accuracy of discovering true SNPs and indels in NGS data requires improved tools and capabilities in each step of this pipeline. Mapping accuracy is the bottleneck of this process, because incorrectly mapped reads dramatically increase the false positive rate of discovered variations and lower the rate of detected true positive variations. We present a revised approach, Iterative Referencing (IR), that increases the accuracy of mapping the sequence data by iteratively improving the reference genome via the Expectation-Maximization (EM) algorithm. The idea is that if the sequenced data contain, at sufficient coverage, a specific homozygous SNP that is not present in the reference genome, then the reference genome should be altered to contain that SNP. Such a situation occurs when the reference genome in use is not close enough to the population under study. In each iteration, the EM algorithm discovers the new homozygous variations across the whole genome, then alters the reference genome by replacing the reference bases with the alternative bases, building a new reference genome that is more appropriate to the population under study.
The results demonstrate that using the updated reference genome improves the alignment process by up to 6%, increases the rate of true variations by up to 3%, and decreases the rate of false variations by up to 3%, all measured with respect to the original reference model.
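
The reference-update loop described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the EM-based discovery of homozygous variations is replaced here by a simple coverage and allele-fraction threshold, and the names `update_reference`, `iterative_referencing`, `pileup_for`, `min_depth`, and `min_alt_fraction` are all hypothetical. The `pileup_for` callback stands in for the mapping and pileup steps, which would be performed by external tools in a real pipeline.

```python
from collections import Counter


def update_reference(reference, pileup, min_depth=10, min_alt_fraction=0.9):
    """Replace reference bases with well-supported homozygous alternatives.

    reference: the reference sequence as a string.
    pileup: dict mapping a 0-based position to a Counter of observed bases.
    The thresholds are illustrative assumptions, not values from the thesis.
    Returns the updated sequence and the sorted list of changed positions.
    """
    bases = list(reference)
    changed = []
    for pos, counts in pileup.items():
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # insufficient coverage to trust a call at this site
        alt, alt_count = counts.most_common(1)[0]
        # Treat the site as a homozygous SNP only if nearly all reads agree
        # on a base that differs from the current reference base.
        if alt != bases[pos] and alt_count / depth >= min_alt_fraction:
            bases[pos] = alt
            changed.append(pos)
    return "".join(bases), sorted(changed)


def iterative_referencing(reference, pileup_for, max_iterations=5, **thresholds):
    """Re-map against the updated reference until no new substitutions appear.

    pileup_for: hypothetical callback that maps the reads to the given
    reference and returns per-position base counts.
    """
    for _ in range(max_iterations):
        pileup = pileup_for(reference)
        reference, changed = update_reference(reference, pileup, **thresholds)
        if not changed:
            break  # the reference has stabilised
    return reference
```

In this sketch, a site with 19 of 20 reads supporting a non-reference base would be substituted, while a site with an even split between two bases (a likely heterozygous call) or with low coverage would be left unchanged.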