|Ph.D Student||Granovsky Lena|
|Subject||Statistical Analysis of DNA Microarray Experiments|
|Department||Department of Industrial Engineering and Management|
|Supervisor||Professor Paul Feigin|
Microarray experiments are widely applied in many areas of biomedical research. Although many powerful analysis algorithms have been developed, the results of microarray experiments remain inaccurate. The goal of this dissertation is to present new algorithms that produce a more accurate list of differentially expressed genes (DEGs).
The first part of the dissertation investigates the influence of applying different normalization techniques to microarray data. We compare two methods and present statistical analyses that may be used to validate them. We also propose an extension to the invariant set normalization algorithm that adds genes to the invariant set when it is not sufficiently large. Our results show that the normalization procedure influences which genes are detected as differentially expressed.
The second part of the dissertation concerns the problem of identifying a set of DEGs. Permutation methods are commonly used to estimate the distribution of the test statistic for non-differentially expressed genes (the null distribution). However, different permutation methods lead to different estimates of the null distribution and consequently result in substantially different lists of DEGs. We extend the nonparametric empirical Bayes approach proposed by Efron by suggesting a number of permutation procedures, which are used for an empirical test of the appropriate null hypothesis. We examine the difference between applying these methods and applying a theoretical null distribution. The results of the study demonstrate that the choice of permutation method should depend on the experimental design. We also show that in many applications the usual assumption about the null distribution is incorrect, and that relying on it in such cases may produce an incorrect list of differentially expressed genes.
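To illustrate the general idea of a permutation-based null distribution, here is a minimal Python sketch. It is not the procedure developed in the dissertation: it assumes a simple two-group design, a Welch t-statistic per gene, plain label shuffling, and simulated data. All function names and parameters (`t_statistics`, `permutation_null`, `n_perm`) are hypothetical.

```python
import numpy as np

def t_statistics(data, labels):
    """Welch t-statistic per gene; data is genes x samples, labels is 0/1 per sample."""
    a = data[:, labels == 0]
    b = data[:, labels == 1]
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1] + b.var(axis=1, ddof=1) / b.shape[1])
    return (b.mean(axis=1) - a.mean(axis=1)) / se

def permutation_null(data, labels, n_perm=200, seed=0):
    """Pool t-statistics over random label shuffles to estimate a null distribution."""
    rng = np.random.default_rng(seed)
    null = np.empty((n_perm, data.shape[0]))
    for i in range(n_perm):
        null[i] = t_statistics(data, rng.permutation(labels))
    return null.ravel()

# Simulated data (an assumption for illustration): 500 genes, 6 samples per group,
# with the first 25 genes shifted upward in the treatment group.
rng = np.random.default_rng(1)
n_genes, n_per_group = 500, 6
labels = np.array([0] * n_per_group + [1] * n_per_group)
data = rng.normal(size=(n_genes, 2 * n_per_group))
data[:25, labels == 1] += 3.0

observed = t_statistics(data, labels)
null = permutation_null(data, labels)

# Two-sided empirical p-value per gene against the pooled permutation null.
p_values = np.array(
    [(np.sum(np.abs(null) >= abs(t)) + 1) / (null.size + 1) for t in observed]
)
```

Note that shuffling labels over all samples is only one possible scheme. In a paired or blocked design, for example, permutations would normally be restricted to within pairs or blocks, which is one way the experimental design can affect the estimated null.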
Recently, gene set enrichment analysis (GSEA) has drawn much attention. Rather than assessing the significance of individual genes, GSEA algorithms assess the significance of pre-defined gene sets, which are groups of genes believed to participate in a common biological function. In the third part of the dissertation we propose a new K-means based approach to check whether the joint distribution of expression levels for the genes in a gene set differs between the control and treatment groups. We evaluate the performance of the method on simulated data sets and compare it with a number of other nonparametric methods, some of which are used for the first time in GSEA. We show that the performance of the methods depends on the number of DEGs in a gene set and on the strength of the correlation between the genes.
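The abstract does not specify the test statistic of the K-means based approach, so the following Python sketch shows only one possible way a clustering-based gene-set test could look. It assumes that samples are clustered into two groups using the genes of one gene set, that the statistic is the agreement between cluster membership and the control/treatment labels, and that significance is assessed by permuting the labels. The functions `two_means`, `agreement`, and `kmeans_gene_set_test` are hypothetical, and the data are simulated.

```python
import numpy as np

def two_means(x, n_iter=50, seed=0):
    """Plain Lloyd's algorithm with K=2; x is samples x genes."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=2, replace=False)]
    for _ in range(n_iter):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        # Keep the old center if a cluster happens to become empty.
        new_centers = np.array([
            x[assign == k].mean(axis=0) if np.any(assign == k) else centers[k]
            for k in range(2)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign

def agreement(assign, labels):
    """Fraction of samples whose cluster matches their group, up to cluster relabeling."""
    match = np.mean(assign == labels)
    return max(match, 1 - match)

def kmeans_gene_set_test(x, labels, n_perm=999, seed=0):
    """Permutation p-value for the cluster/group agreement statistic."""
    assign = two_means(x, seed=seed)
    observed = agreement(assign, labels)
    rng = np.random.default_rng(seed)
    # The clustering ignores the labels, so only the labels need to be permuted.
    perm = np.array([agreement(assign, rng.permutation(labels)) for _ in range(n_perm)])
    p_value = (np.sum(perm >= observed) + 1) / (n_perm + 1)
    return observed, p_value

# Simulated gene set (an assumption): 8 genes, 10 control and 10 treatment samples,
# with every gene shifted in the treatment group.
rng = np.random.default_rng(2)
labels = np.array([0] * 10 + [1] * 10)
x = rng.normal(size=(20, 8))
x[labels == 1] += 3.0

observed, p_value = kmeans_gene_set_test(x, labels)
```

Because the statistic uses all genes of the set jointly, a sketch like this can be rerun with fewer shifted genes or with correlated noise to explore how those two factors change the outcome.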