Technion - Israel Institute of Technology, Graduate School
Ph.D. Thesis
Ph.D. Student: Normand Rachelly
Subject: Harnessing Machine Learning to Bridge the Cross-Species Gap between Mice and Humans
Department: Department of Medicine
Supervisor: Professor Shai Shen-Orr
Full Thesis Text: English version


Abstract

Emerging technologies allow various biological molecules to be measured at unprecedented depth and scale, enabling the study of organisms as whole systems. The mass of publicly available data offers an opportunity to develop algorithms that can leverage it for new discoveries. Machine learning algorithms are particularly appropriate for this task, as they can predict and classify desired traits in the data. In this work, we applied a machine learning methodology that uses public gene expression data from mouse and human samples to improve the translation of mouse model experiments into conclusions relevant to human pathology.

Mice are the most widely used and cost-effective model for studying human diseases, owing to unparalleled advantages in speed, flexibility, wealth of reagents and possible genetic manipulations. However, mice differ substantially from humans even in a healthy state. Although mouse studies are usually a mandatory step before clinical trials, therapies that succeed in mice often fail in human clinical trials. Because not all experiments can be performed in humans, mice are expected to remain fundamental to basic and translational research for many years to come. Hence, there is an urgent need for methodologies that improve cross-species translational research.

We developed FIT (Found In Translation), a data-driven machine learning model that, given a mouse gene expression experiment, predicts the genes relevant to the analogous human condition. FIT leverages a comprehensive collection of public mouse and human gene expression data that was manually assembled, quality checked, uniformly pre-processed and carefully paired between the species. Based on this compendium and given a new mouse experiment, FIT predicts the human value for every gene, thus providing a new expression vector for the experiment that can be sorted by relevance to the human condition, without requiring prior human data. We validated FIT by applying it to 170 mouse gene expression datasets from 28 different diseases and comparing its predictions to the equivalent human expression data. We show that FIT can identify more human-relevant differentially expressed genes than direct analysis of the mouse data and may rescue genes that are missed when using the mouse data alone. In addition, we validated a FIT prediction at the protein level, discovering a novel gene that is associated with inflammatory bowel disease.
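
The idea of learning a per-gene mapping from paired mouse and human data, and then applying it to a new mouse experiment, can be illustrated with a small sketch. This is not the FIT implementation: the abstract does not specify the model form, so the ordinary least-squares fit per gene, the function names, the gene labels and the toy numbers below are all assumptions chosen only to make the workflow concrete.

```python
# Hypothetical sketch of a per-gene mouse-to-human predictor.
# ASSUMPTION: one least-squares line per gene; the real FIT model may differ.
# All names (fit_gene_model, train, predict_human, GENE_A, GENE_B) are illustrative.

def fit_gene_model(mouse_effects, human_effects):
    """Return (slope, intercept) of a least-squares line for one gene."""
    n = len(mouse_effects)
    mean_m = sum(mouse_effects) / n
    mean_h = sum(human_effects) / n
    var_m = sum((m - mean_m) ** 2 for m in mouse_effects)
    if var_m == 0:
        # No variation in the mouse values: fall back to the human mean.
        return 0.0, mean_h
    cov = sum((m - mean_m) * (h - mean_h)
              for m, h in zip(mouse_effects, human_effects))
    slope = cov / var_m
    return slope, mean_h - slope * mean_m


def train(compendium):
    """compendium maps gene -> list of paired (mouse_effect, human_effect)."""
    models = {}
    for gene, pairs in compendium.items():
        mouse = [m for m, _ in pairs]
        human = [h for _, h in pairs]
        models[gene] = fit_gene_model(mouse, human)
    return models


def predict_human(models, mouse_experiment):
    """Predict a human value per gene, sorted by absolute predicted effect."""
    predicted = {}
    for gene, mouse_effect in mouse_experiment.items():
        if gene in models:
            slope, intercept = models[gene]
            predicted[gene] = slope * mouse_effect + intercept
    return sorted(predicted.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Toy compendium of paired mouse/human effect sizes (made-up numbers).
compendium = {
    "GENE_A": [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)],  # human tracks mouse
    "GENE_B": [(1.0, 0.5), (2.0, 0.5), (3.0, 0.5)],  # human flat
}
models = train(compendium)

# A new mouse experiment with no matching human data.
ranking = predict_human(models, {"GENE_A": 1.5, "GENE_B": 3.0})
```

In this toy case GENE_B shows the larger effect in the mouse data, yet it ranks below GENE_A after prediction, because the paired training data indicate that its mouse signal does not carry over to humans.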

FIT is novel because it allows researchers to gain insight into what is relevant to the human condition paralleling their own experiment, regardless of the disease, rather than relying on a general knowledge base with querying options. FIT runs quickly and is available as an R package as well as a web tool. Born out of the realization that no new experiment should be analyzed without taking the accumulated wealth of prior information into account, we envision FIT being used for any new mouse model gene expression dataset in order to focus researchers on the genes relevant to the human condition and to reduce false leads.