Technion - Israel Institute of Technology - Graduate School
M.Sc Thesis
M.Sc Student: Lin Ya-Wei
Subject: Graph Analysis for Multiplexed Data with Application to Image Mass Cytometry
Department: Department of Electrical Engineering
Supervisor: Professor Ronen Talmon
Full Thesis Text: English Version


Abstract

Hyperspectral imaging, sensor networks, and spatial multiplexed proteomics or transcriptomics assays are a representative subset of distinct technologies from diverse domains of science and engineering that share a common data structure. The data in all these modalities consist of high-dimensional multivariate observations (an $m$-dimensional feature space) collected at different spatial positions and can therefore be analyzed using similar computational methodologies. Furthermore, in many studies practitioners collect datasets consisting of multiple spatial assays of this type, each capturing such data from a single biological sample, patient, or hyperspectral image. Each of these spatial assays can be characterized by several regions of interest (ROIs).

To extract meaningful information from the multi-dimensional observations recorded at different ROIs across different assays, we propose to analyze such datasets using a two-step graph-based approach, thereby constructing a graph of graphs. We first construct for each ROI a graph representing the interactions between the $m$ covariates and compute an $m$-dimensional vector characterizing the steady-state distribution (SSD) among features. We then use all these $m$-dimensional vectors of SSDs to construct a graph between the ROIs from all assays. This second graph is subjected to a nonlinear dimension reduction analysis, retrieving the intrinsic geometric representation of the ROIs. Such a representation provides the foundation for efficient and accurate organization of the different ROIs based on their functional groups.
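The two-step construction described above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the thesis: the Gaussian kernel between feature columns, the median bandwidth heuristic, the use of a diffusion-map-style embedding as the nonlinear dimension reduction, and the function names `feature_graph_ssd` and `diffusion_map`. The only fact it relies on is that a random walk on an undirected graph with symmetric affinity matrix $W$ has a stationary distribution proportional to the node degrees.

```python
import numpy as np

def _sq_dists(Z):
    """Pairwise squared Euclidean distances between the rows of Z."""
    sq = np.sum(Z ** 2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)

def _median_bandwidth(D2):
    """Median of the positive squared distances (a heuristic, assumed here)."""
    pos = D2[D2 > 0]
    return np.median(pos) if pos.size else 1.0

def feature_graph_ssd(X, eps=None):
    """Step 1: X has shape (n_positions, m), one ROI.

    Builds an m-node graph whose nodes are the covariates and returns the
    stationary distribution of the random walk on it (an m-vector).
    """
    D2 = _sq_dists(X.T)                      # distances between feature columns
    eps = _median_bandwidth(D2) if eps is None else eps
    W = np.exp(-D2 / eps)                    # symmetric affinities (assumed kernel)
    d = W.sum(axis=1)                        # node degrees
    return d / d.sum()                       # stationary distribution of D^{-1} W

def diffusion_map(S, n_components=2, eps=None):
    """Step 2: S has shape (n_rois, m), one SSD vector per ROI.

    Builds a graph between ROIs and returns a low-dimensional embedding
    from the leading non-trivial eigenvectors of its random-walk operator.
    """
    D2 = _sq_dists(S)
    eps = _median_bandwidth(D2) if eps is None else eps
    K = np.exp(-D2 / eps)
    d = K.sum(axis=1)
    A = K / np.sqrt(np.outer(d, d))          # symmetric conjugate of D^{-1} K
    vals, vecs = np.linalg.eigh(A)
    order = np.argsort(vals)[::-1]           # sort eigenvalues in descending order
    vals, vecs = vals[order], vecs[:, order]
    psi = vecs / np.sqrt(d)[:, None]         # right eigenvectors of D^{-1} K
    idx = slice(1, n_components + 1)         # skip the trivial constant eigenvector
    return psi[:, idx] * vals[idx]

# Hypothetical usage on synthetic data: 8 ROIs, 50 positions each, m = 5 features.
rng = np.random.default_rng(0)
rois = [rng.normal(size=(50, 5)) for _ in range(8)]
ssds = np.vstack([feature_graph_ssd(X) for X in rois])
embedding = diffusion_map(ssds, n_components=2)
```

In this sketch each ROI is summarized by a single $m$-dimensional vector regardless of how many spatial positions it contains, which is what makes ROIs of different sizes directly comparable in the second graph.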

We showcase the advantage of the proposed method both theoretically and in practice. More concretely, we show theoretically that our method is beneficial for binary hypothesis testing. In particular, in a Gaussian setting, it outperforms the MAP estimator based on a naive concatenation of all the covariates, implying that the mutual relations between the covariates and spatial coordinates are well captured by the SSDs. This analysis suggests that representing the underlying relationship between multivariate observations based on the intrinsic geometry of our graph of graphs outperforms a direct application of a standard estimator to the measurements or to their concatenation.

We apply our approach to image mass cytometry (IMC) and demonstrate that it enables accurate prediction of sensitivity to treatment from the IMC data alone. We compare our approach to the well-known heat kernel signature (HKS) and to other standard baselines and show that it achieves a significant improvement. Importantly, our method is general and can be applied to other multiplexed datasets.

In addition, we study the problem of batch effect removal, which arises in a broad range of applications, and in particular in biology and bioinformatics. Here, we present an unsupervised approach based on the Riemannian geometry of symmetric positive definite (SPD) matrices. The proposed batch effect removal is based on parallel transport (PT) on the SPD manifold, which we show can be applied directly to the data. We show that this method provides a useful and efficient way to compare samples from different batches. Experimental results demonstrate the removal of batch effects from high-dimensional mass cytometry (CyTOF) data collected from two different subjects before and after treatment.
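As a rough illustration of how parallel transport on the SPD manifold can act directly on data, the sketch below uses the closed-form map $E = (BA^{-1})^{1/2}$, which transports along the geodesic from $A$ to $B$ under the affine-invariant metric and satisfies $EAE^\top = B$. The surrounding pipeline is an assumption made for illustration, not the procedure of the thesis: each batch is represented by its sample covariance, the data are centered per batch, and the names `spd_power`, `transport_map`, and `align_batch` are hypothetical.

```python
import numpy as np

def spd_power(C, p):
    """Matrix power of a symmetric positive definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(C)
    return (vecs * vals ** p) @ vecs.T

def transport_map(A, B):
    """Return E = (B A^{-1})^{1/2}, computed as A^{1/2} M^{1/2} A^{-1/2}
    with M = A^{-1/2} B A^{-1/2}, so that E A E^T = B."""
    A_half = spd_power(A, 0.5)
    A_inv_half = spd_power(A, -0.5)
    M = A_inv_half @ B @ A_inv_half
    M = 0.5 * (M + M.T)                      # symmetrize against round-off
    return A_half @ spd_power(M, 0.5) @ A_inv_half

def align_batch(X, B):
    """Map a batch X of shape (n_samples, m) so its covariance becomes B.

    Assumption: the batch is summarized by its sample covariance and is
    centered before the linear map is applied to every sample.
    """
    Xc = X - X.mean(axis=0)
    A = np.cov(Xc, rowvar=False)
    E = transport_map(A, B)
    return Xc @ E.T                          # apply x -> E x to each row

# Hypothetical usage: align batch 1 to the covariance of batch 2.
rng = np.random.default_rng(0)
X1 = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4)) + 3.0
X2 = rng.normal(size=(500, 4))
B_ref = np.cov(X2, rowvar=False)
Y1 = align_batch(X1, B_ref)
```

Because the map is linear and acts on individual samples, the transported batch stays in the original feature space, so samples from different batches can be compared coordinate by coordinate after alignment.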