Technion - Israel Institute of Technology, Graduate School
M.Sc Thesis
M.Sc Student: Tzahor Shani
Subject: Computational Approaches for Classifying and Predicting Proteins from the Environment
Department: Department of Biotechnology
Supervisors: Professor Yael Mandel-Gutfreund and Professor Oded Beja


Abstract

Cyanobacteria of the genera Synechococcus and Prochlorococcus play a key role in marine photosynthesis, which contributes to the global carbon cycle and to the world's oxygen supply. Recently, genes encoding the photosystem II (PSII) reaction center (psbA, encoding the D1 protein; psbD, encoding the D2 protein; and others) were found in cyanophage genomes. This finding suggests that horizontal transfer, and possibly recombination, of psbA genes between Prochlorococcus and Synechococcus occurs via phage intermediates. Acquiring the psbA gene could be very valuable for phage survival: the transcription and activation of the phage psbA gene during infection help to maintain photosynthetic activity in the host after host protein synthesis shuts down. Linking genomic data extracted directly from the environment to its species of origin is necessary for a better understanding of the phage-host relationship and co-evolution.

Within the framework of my thesis, I developed a computational approach that combines a Support Vector Machine (SVM) with a Position-Specific Scoring Matrix (PSSM) and successfully classifies psbA fragments sampled directly from the ocean. The method combines two basic feature levels: DNA oligonucleotide composition and amino acid codon usage. Initially, the method was trained on psbA fragments from cultured bacteria and their phages and tested on psbA fragments from the Global Ocean Sampling (GOS) expedition. Later, to broaden the representation of psbA sequences from different sources in the training set for the purpose of classifying unknown metagenomic fragments, we added a set of fragments from the GOS expedition to the training data. The enlarged dataset could be divided into the following seven taxonomic groups:

1. Synechococcus.
2. Synechococcus-like Myovirus.
3. Synechococcus-like Podovirus.
4. HL Prochlorococcus.
5. LL Prochlorococcus.
6. Prochlorococcus-like Myovirus.
7. Prochlorococcus-like Podovirus.
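
The two feature levels named above can be illustrated with a minimal Python sketch. It is not the thesis implementation. The oligonucleotide length (k = 4), the use of plain codon frequencies as a proxy for codon usage, the assumption that the reading frame is known, and all function names are assumptions made here for illustration only.

```python
from collections import Counter
from itertools import product

def kmer_frequencies(seq, k=4):
    """Relative frequency of every DNA k-mer, counted over overlapping windows.
    k=4 is an illustrative choice, not a value taken from the thesis."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing ambiguous bases (e.g. N) are ignored.
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]

def codon_frequencies(seq, frame=0):
    """Relative frequency of each of the 64 codons in the given reading frame.
    The frame is assumed to be known for the fragment."""
    seq = seq.upper()
    codons = ["".join(p) for p in product("ACGT", repeat=3)]
    counts = Counter(seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    total = sum(counts[c] for c in codons)
    return [counts[c] / total if total else 0.0 for c in codons]

def feature_vector(seq, k=4, frame=0):
    """Concatenate both feature levels into one numeric vector."""
    return kmer_frequencies(seq, k) + codon_frequencies(seq, frame)
```

A fragment such as `feature_vector("ATGACCGCAATT")` yields a vector of 256 tetranucleotide frequencies followed by 64 codon frequencies, which could then be passed to a classifier.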

We then extended the original binary method to the multiclass problem using a one-versus-all approach. Finally, the method was applied to classify psbA fragments collected from the eastern Mediterranean Sea. Overall, we showed that bacteria differ from their phages in oligonucleotide composition. Our results suggest that evolutionary pressures, and the benefit to the phage, lead phages to use a different oligonucleotide composition in the psbA gene. These characteristics can be applied successfully to classify short fragments sampled directly from the environment.
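
The one-versus-all extension can be sketched as follows. This is a hedged illustration, not the thesis code. It trains one binary linear SVM per taxonomic group, fitted here by plain subgradient descent on the regularized hinge loss, and assigns a fragment to the group whose classifier returns the highest decision value. The kernel, the hyperparameters, the toy two-dimensional data and all function names are assumptions, and the PSSM component is left out.

```python
def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Binary linear SVM: subgradient descent on the L2-regularized hinge loss.
    y holds +1 / -1 labels; hyperparameters are illustrative assumptions."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # Regularization step: shrink the weights slightly.
            w = [wj * (1 - lr * lam) for wj in w]
            if margin < 1:
                # Hinge-loss step: move the boundary away from this example.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def decision(model, x):
    """Signed decision value of one binary classifier for feature vector x."""
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_one_vs_all(X, labels):
    """Train one 'this group versus all others' classifier per group."""
    models = {}
    for cls in sorted(set(labels)):
        y = [1 if lab == cls else -1 for lab in labels]
        models[cls] = train_linear_svm(X, y)
    return models

def predict(models, x):
    """Assign x to the group whose classifier gives the largest decision value."""
    return max(models, key=lambda cls: decision(models[cls], x))

# Toy data standing in for composition features of three groups (hypothetical).
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, -1.0], [-0.9, -1.1]]
labels = ["A", "A", "B", "B", "C", "C"]
models = train_one_vs_all(X, labels)
```

With seven taxonomic groups, the same scheme would train seven binary classifiers. Taking the maximum decision value is one common way to resolve fragments that are claimed by several classifiers or by none.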