Ph.D Thesis | |
---|---|
Ph.D Student | Weissbrod Omer |
Subject | Linear and Generalized Linear Mixed Models for Genetic Case Control Studies |
Department | Department of Computer Science |
Supervisors | PROF. Dan Geiger |
 | PROF. Rosset Saharon |

Contemporary statistical genetics deals with three major problems: searching for genetic variants associated with a trait of interest, predicting an individual's risk of being affected by a genetic trait from genotypic data, and estimating parameters that describe trait etiology, such as heritability (roughly defined as the fraction of trait variance attributed to genetics) or the genetic correlation between traits. Genetic diseases are arguably the most important focus of genetic studies, and are typically studied via case-control designs, which overrepresent cases relative to their population prevalence. Much of the theory, methodology and empirical evidence behind existing solutions to the above problems is motivated by, and addresses, the modeling of quantitative traits in cohort studies. In this thesis we examine the validity of existing solutions in the presence of case-control ascertainment, and propose novel methodologies to solve all three problems in such settings.

A modeling assumption shared by all of the works presented is the liability threshold model, which postulates that every individual carries a latent, normally distributed variable called the liability, such that cases are the individuals whose liability exceeds a given cutoff. This assumption gives rise to a class of models called linear and generalized linear mixed models, which can model complex dependency patterns in large, high-dimensional data. Our work makes use of these models to solve the three major problems of statistical genetics in the presence of case-control ascertainment.
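As a rough illustration of the liability threshold model and of case-control ascertainment, the following Python sketch simulates a population and then draws a study sample that overrepresents cases. It is not taken from the thesis: the additive split of the liability into a genetic and an environmental component with variances `h2` and `1 - h2`, the function names, and all parameter values are assumptions made for this example.

```python
import random
from statistics import NormalDist


def simulate_liability_threshold(n, prevalence, h2, seed=0):
    """Simulate n individuals under a liability threshold model.

    Assumption for illustration: liability = genetic + environmental,
    with independent normal components of variance h2 and 1 - h2,
    so the liability itself is standard normal.
    """
    rng = random.Random(seed)
    # Cutoff chosen so that P(liability > cutoff) equals the prevalence.
    cutoff = NormalDist().inv_cdf(1.0 - prevalence)
    people = []
    for _ in range(n):
        genetic = rng.gauss(0.0, h2 ** 0.5)
        environmental = rng.gauss(0.0, (1.0 - h2) ** 0.5)
        liability = genetic + environmental
        is_case = liability > cutoff
        people.append((genetic, liability, is_case))
    return cutoff, people


def ascertain(people, n_cases, n_controls, seed=0):
    """Draw a case-control sample with fixed numbers of cases and controls."""
    rng = random.Random(seed)
    cases = [p for p in people if p[2]]
    controls = [p for p in people if not p[2]]
    return rng.sample(cases, n_cases) + rng.sample(controls, n_controls)


# Hypothetical parameter values: 1% prevalence, heritability 0.5.
cutoff, population = simulate_liability_threshold(100_000, 0.01, 0.5)
study = ascertain(population, n_cases=200, n_controls=200)

population_case_fraction = sum(p[2] for p in population) / len(population)
study_case_fraction = sum(p[2] for p in study) / len(study)
print(f"cutoff = {cutoff:.3f}")
print(f"case fraction in population: {population_case_fraction:.4f}")
print(f"case fraction in study:      {study_case_fraction:.4f}")
```

In this sketch the study sample contains equal numbers of cases and controls even though cases are rare in the simulated population, which is the kind of ascertainment the thesis is concerned with.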

To solve the association testing problem, we devise an approximate method that first estimates the liability and then tests for association between genetic variants and the estimated liability. To solve the risk prediction problem, we devise a data-adaptive statistical learning mechanism that can capture complex non-linear statistical patterns in the data to improve prediction performance. Finally, to estimate parameters of genetic etiology, we extend an existing moment-estimation technique to estimate these quantities either directly or via summary statistics; the latter alleviates privacy and data-sharing concerns. In all cases, we demonstrate via extensive computer simulations and analysis of real genetic data that our approaches substantially and consistently improve over existing state-of-the-art methods.