Technion - Israel Institute of Technology, Graduate School
M.Sc Thesis
M.Sc Student: Shor Tal
Subject: SciLMM: Computing Heritability with Millions of Individuals
Department: Department of Computer Science
Supervisor: Professor Dan Geiger


Abstract

The rapid digitization of genealogical and medical records enables the assembly of extremely large pedigrees spanning millions of individuals. Such pedigrees provide the opportunity to answer genetic and epidemiological questions at scales much larger than previously possible. Linear mixed models (LMMs) are often used to analyze pedigree data because they yield more precise estimates than simple regressions. However, LMMs do not naturally scale to pedigrees spanning millions of individuals, owing to their steep computational and storage requirements. Here we propose a novel modeling framework called Sparse Cholesky factorIzation LMM (SciLMM), which alleviates these difficulties by exploiting the sparsity patterns found in large pedigree data. The proposed framework can construct a matrix of genetic relationships between billions of pairs of individuals in several hours, creating robust features for Haseman-Elston regression (a simple and efficient regression), and can fit the corresponding LMM in several days, yielding precise estimators and their confidence intervals. We demonstrate the capabilities of SciLMM via simulations of large pedigrees and by estimating the heritability of longevity in a very large pedigree spanning millions of individuals and over five centuries of human history (published by GENI). The SciLMM framework enables analyses of extremely large pedigrees that were not previously possible.
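
To illustrate why sparsity helps the Haseman-Elston step, here is a minimal sketch of a no-intercept Haseman-Elston estimator. It is not the SciLMM implementation. It assumes standardized phenotypes and a relationship matrix stored as a dictionary of its nonzero off-diagonal entries. The function name `haseman_elston`, the dictionary layout, and the toy data are all hypothetical. In this form, pairs with a zero relationship coefficient contribute nothing to either sum, so the cost depends on the number of related pairs and not on the number of all pairs.

```python
from statistics import fmean, pstdev


def haseman_elston(phenotypes, sparse_relationships):
    """No-intercept Haseman-Elston estimate of heritability.

    phenotypes: list of floats, one per individual.
    sparse_relationships: dict mapping (i, j) with i != j to a nonzero
        relationship coefficient; each unordered pair is listed once.
        (Hypothetical layout chosen for this illustration.)
    """
    # Standardize phenotypes to zero mean and unit variance.
    mu = fmean(phenotypes)
    sd = pstdev(phenotypes)
    if sd == 0:
        raise ValueError("phenotypes have zero variance")
    z = [(y - mu) / sd for y in phenotypes]

    # Regress phenotype products z_i * z_j on the coefficients K_ij.
    # Without an intercept the slope is sum(K * z_i * z_j) / sum(K^2),
    # so only the stored nonzero entries have to be visited.
    numerator = 0.0
    denominator = 0.0
    for (i, j), k in sparse_relationships.items():
        if i == j:
            continue  # diagonal entries carry no information about the slope
        numerator += k * z[i] * z[j]
        denominator += k * k
    if denominator == 0:
        raise ValueError("no related pairs supplied")
    return numerator / denominator


# Toy data, made up for illustration only: four individuals and three
# related pairs, each with coefficient 0.5.
toy_phenotypes = [2.0, 0.0, 2.0, 0.0]
toy_pairs = {(0, 2): 0.5, (1, 3): 0.5, (0, 1): 0.5}
toy_estimate = haseman_elston(toy_phenotypes, toy_pairs)
```

With millions of individuals, the same loop would run over the nonzero entries of a sparse matrix rather than a Python dictionary. The point of the sketch is only that unrelated pairs never need to be stored or visited.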