Technion - Israel Institute of Technology
Technion - Israel Institute of Technology - Graduate School
M.Sc Thesis
M.Sc Student: Admon Yasmin
Subject: Computational Methods for Addressing Obstacles In Single Cell Data Analysis Caused by High Heterogeneity
Department: Department of Medicine
Supervisor: Shai Shen-Orr


Abstract

Single-cell high-throughput measurements have become predominant tools for investigating complex biological systems. A variety of single-cell technologies are available for measuring the genome, transcriptome, proteome and epigenome. However, this abundance of high-dimensional data is becoming challenging to interpret and requires computational tools to characterize, analyze and visualize the results. One main difficulty in analyzing high-throughput single-cell data arises from its heterogeneity: the data consist of numerous distinct cell subsets that exhibit large variability in the measured parameters across single cells.

Several computational tools, as well as common practices, for analyzing high-dimensional single-cell data have been established. Two main steps in the analysis pipeline are cell population identification and differential expression analysis. While existing tools provide solutions for carrying out these steps, they often do not reach their full discovery potential because they cannot overcome the hurdles imposed by the high heterogeneity of the data.

In this work, we propose two new computational methods to tackle the obstacles posed by the high heterogeneity of single-cell data. The first, Single-cell Samples Populations Mapping (SSPM), aims to identify similar cell populations across different samples. SSPM reduces the need to pool data from several samples for analysis, which may lead to a loss of resolution, and enables combining separately analyzed samples, thus increasing resolution while preserving subtle changes across samples. SSPM uses graph modelling and lightest-path calculation to identify similar cell populations.
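
To illustrate what a lightest-path calculation over a graph of cell populations can look like, here is a minimal sketch. It is not the SSPM algorithm, whose graph construction is not described in this abstract. The node labels (`s1:pop1` and so on), the edge weights (hypothetical dissimilarities between populations of neighbouring samples) and the helper names `add_edge`, `lightest_paths` and `best_match` are all assumptions made for illustration.

```python
import heapq


def add_edge(graph, a, b, weight):
    """Add an undirected weighted edge between two population nodes."""
    graph.setdefault(a, {})[b] = weight
    graph.setdefault(b, {})[a] = weight


def lightest_paths(graph, source):
    """Dijkstra's algorithm: weight of the lightest path from source to each reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry, a lighter path was already found
        for neighbour, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    return dist


def best_match(graph, source, target_sample):
    """Population of target_sample reached from source by the lightest path."""
    dist = lightest_paths(graph, source)
    candidates = {n: d for n, d in dist.items() if n.startswith(target_sample + ":")}
    return min(candidates, key=candidates.get)


# Hypothetical toy graph: two populations in each of three samples.
# Weights are made-up dissimilarities (lower means more similar).
graph = {}
add_edge(graph, "s1:pop1", "s2:pop1", 0.2)
add_edge(graph, "s1:pop1", "s2:pop2", 1.5)
add_edge(graph, "s1:pop2", "s2:pop2", 0.3)
add_edge(graph, "s1:pop2", "s2:pop1", 1.4)
add_edge(graph, "s2:pop1", "s3:pop1", 0.1)
add_edge(graph, "s2:pop1", "s3:pop2", 1.6)
add_edge(graph, "s2:pop2", "s3:pop2", 0.25)
add_edge(graph, "s2:pop2", "s3:pop1", 1.3)

print(best_match(graph, "s1:pop1", "s3"))
print(best_match(graph, "s1:pop2", "s3"))
```

In this toy graph, samples 1 and 3 share no direct edge, so populations are matched through the intermediate sample by accumulating edge weights along the lightest path.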

The second method, called single-cell independent Expression Filtering (sci-EF), implements an advanced data-driven filtering method that identifies experiment features that are not expressed in a given cell population. Differential expression analysis in single-cell datasets usually tests each feature-cell population pair for significant changes in expression; hence the number of hypotheses tested is the number of identified cell populations times the number of features. Sci-EF filtering reduces the number of hypotheses tested, thereby lowering the false discovery rate and increasing the power of detection when applying differential expression analysis.
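
The effect of testing fewer hypotheses can be seen in a small worked example. The sketch below is not sci-EF itself: the `expressed` flag is a hypothetical stand-in for the output of a filter, the p-values are made up, and the Benjamini-Hochberg procedure is assumed as the multiple-testing correction, which the abstract does not specify. Such filtering is only valid when the filter criterion is independent of the test statistic under the null hypothesis.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected by the Benjamini-Hochberg procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    return sorted(order[:cutoff])


# Hypothetical tests: (population, feature, p-value, expressed flag).
# 3 populations x 4 features = 12 hypotheses before filtering.
tests = [
    ("pop_A", "m1", 0.001, True),
    ("pop_A", "m2", 0.012, True),
    ("pop_A", "m3", 0.40, False),
    ("pop_A", "m4", 0.75, False),
    ("pop_B", "m1", 0.020, True),
    ("pop_B", "m2", 0.30, True),
    ("pop_B", "m3", 0.55, False),
    ("pop_B", "m4", 0.90, False),
    ("pop_C", "m1", 0.35, True),
    ("pop_C", "m2", 0.60, False),
    ("pop_C", "m3", 0.80, True),
    ("pop_C", "m4", 0.95, False),
]

# Without filtering: correct over all population-feature pairs.
all_p = [p for _, _, p, _ in tests]
unfiltered_hits = benjamini_hochberg(all_p)

# With filtering: drop pairs flagged as not expressed, then correct.
kept = [t for t in tests if t[3]]
kept_p = [p for _, _, p, _ in kept]
filtered_hits = [kept[i][:2] for i in benjamini_hochberg(kept_p)]

print(len(all_p), "hypotheses,", len(unfiltered_hits), "discoveries")
print(len(kept_p), "hypotheses,", len(filtered_hits), "discoveries")
```

With 12 hypotheses only the smallest p-value passes the correction. After the toy filter removes six pairs, the same three small p-values all pass, because each is compared against a less stringent threshold.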

We show that both methods scale up and perform successfully on large and technically complex datasets. Moreover, we demonstrate that combining both methods uncovers previously unattained discoveries on a dataset designed to investigate correlations between treatment success and the characteristics of Tumor Infiltrating Lymphocytes (TIL) populations from donors who underwent adoptive cell transplant (TIL-ACT). By using SSPM to combine all samples while maintaining high resolution, and applying sci-EF to lower the number of hypotheses tested and increase detection power, we were able to identify a subset of T cells, designated by the markers CD8CD69, which over-expresses the CD33 protein in patients with a positive response to treatment compared to patients who responded poorly. This finding may serve as an indicator of treatment success, the rate of which currently stands at 52%.

SSPM and sci-EF are intended to be incorporated into the common single-cell data analysis pipeline, to increase discoveries by reducing the harmful effect of high heterogeneity.