Ph.D Thesis

Ph.D Student: Sharon Itai
Subject: Computational Methods for Metagenomic Analysis
Department: Department of Computer Science
Supervisors: Prof. Oded Beja, Prof. Ron Pinter
Full Thesis Text: English version


Metagenomics is a new field in which genetic material is extracted directly from the environment and subsequently analyzed by a variety of biological and computational methods. Using metagenomics, it is possible to study the vast majority of microbes on Earth that cannot be cultured in the laboratory, more than 99% of all microbial species according to some estimates. Metagenomic data usually consists of many short (100-1,000 bp) DNA sequences, potentially originating from all organisms in the examined environment. Many metagenomic projects have been carried out in recent years, and they have broadened our understanding of biological processes in ways that were previously impossible. Ongoing and new projects, such as the Global Ocean Sampling (GOS) expedition, promise that the flux of discoveries will increase in the coming years.

In my PhD I chose to focus on two aspects of metagenomic analysis: (i) the statistics of functional analysis of metagenomes, and (ii) the study of genes and gene organizations from metagenomic data. The viewpoint of the first part is global: given a metagenome, we are interested in studying functional characteristics of the organisms living in the examined environment, which may hint at the conditions that are most important in that environment. Based on the Lander-Waterman model for whole-genome shotgun sequencing projects, we were able to provide a statistical model that accurately estimates the expected number of sequences containing some part of a gene in a metagenome. The statistics of pathways is also discussed: in this case a different model was required, one that takes into account the possibility that a gene participates in more than one pathway.
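As a rough illustration of the kind of quantity involved, the sketch below computes the expected number of reads that share at least a minimum number of bases with a single gene, under Lander-Waterman-style assumptions: read start positions are uniform and independent, and the genome is treated as circular. This is not the model developed in the thesis; the function name, its parameters, and the `min_overlap` threshold are hypothetical and chosen only for the example.

```python
def expected_gene_reads(num_reads, read_len, gene_len, genome_len, min_overlap=1):
    """Expected number of reads overlapping a gene by at least min_overlap bases.

    Illustrative assumptions (not taken from the thesis): reads start
    uniformly at random on a circular genome of genome_len bases, all reads
    have the same length, and min_overlap <= min(read_len, gene_len).
    """
    if min_overlap < 1 or min_overlap > min(read_len, gene_len):
        raise ValueError("min_overlap must be between 1 and min(read_len, gene_len)")
    # Number of read start positions that give an overlap of at least
    # min_overlap bases with the gene.
    window = read_len + gene_len - 2 * min_overlap + 1
    # Probability that one uniformly placed read hits the gene.
    p_hit = min(1.0, window / genome_len)
    # By linearity of expectation, the hits add up over all reads.
    return num_reads * p_hit


if __name__ == "__main__":
    # Hypothetical numbers: 10,000 reads of 800 bp, a 1,000 bp gene,
    # and a 2 Mbp genome.
    print(expected_gene_reads(10_000, 800, 1_000, 2_000_000))
    print(expected_gene_reads(10_000, 800, 1_000, 2_000_000, min_overlap=50))
```

In a metagenome, `num_reads` would stand for the portion of reads drawn from the genome carrying the gene, which depends on the abundance and genome size of that organism; the sketch leaves that weighting out.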

The second part of this work takes a "local" view: rather than looking at microbial communities in general, we are interested in answering specific questions about specific genes or systems. This part begins with a description of our discovery of Photosystem-I (PSI) gene cassettes on viral genomes. Using metagenomic data from the Global Ocean Sampling (GOS) expedition and the Northern Line Islands, we were able to show that a gene cassette of eight PSI genes, potentially sufficient for coding all the proteins necessary for a fully functional PSI, is present on DNA sequences of viral origin. In this work we used several computational tools that I developed, some of them novel to this work and others also used in other works.