Technion - Israel Institute of Technology, Graduate School
M.Sc Thesis
M.Sc Student: Yechiel Lamash
Subject: A Novel Feature Selection Method
Department: Department of Biomedical Engineering
Supervisors: Professor Emeritus Gath Isak, Professor Gur Moshe
Full thesis text: in Hebrew


Abstract

Multivariate analysis of gene expression profiles, obtained with recently developed high-throughput techniques, helps clarify the pathophysiological mechanisms of diseases and may promote their cure. Pharmacogenetic research relies on multivariate analysis to fit a pharmacological treatment to a patient, given the patient's gene expression profile. Finding a decision rule that maps each patient to the optimal remedy (when there are several options) is a classification problem: each patient is assigned to one of several classes, each representing a different remedy. Since there are tens of thousands of genes, the gene profile that discriminates between the patient classes must first be identified. In the field of pattern recognition, identifying such gene profiles is known as feature selection.


Many feature selection methods are available in the literature. However, no general method produces optimal performance for all kinds of feature selection problems. Different methods usually make assumptions about the data distribution. The considerations involved in choosing a feature selection method include dimensionality, sample size, complexity, prior knowledge, assumptions about the data structure, and empirical results.


In the present study, which deals with genetic data, a new feature selection method was developed. The method combines unsupervised and supervised exploration of the data and has a built-in ability to operate either on a purely labeled set or on a combination of labeled and unlabeled samples. It is built around an objective function that produces a separation score based on the homogeneity of the labels of samples located in close proximity to one another ("spatial entropy").
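
To make the idea of a neighborhood-based separation score concrete, the following is a minimal illustrative sketch and not the thesis's actual objective function, which the abstract does not specify. It assumes Euclidean distance, a fixed number k of nearest neighbors, Shannon entropy of the labels in each neighborhood, and a plain average over samples. The names label_entropy and spatial_entropy, the parameter k, and the toy data are all hypothetical. Only the fully labeled case is sketched, since the abstract does not describe how unlabeled samples enter the score.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def spatial_entropy(samples, labels, features, k=3):
    """Average label entropy over each sample's neighborhood (assumed form).

    The neighborhood is the sample itself plus its k nearest neighbors,
    using Euclidean distance restricted to the given feature indices.
    Lower scores mean nearby samples tend to share the same label.
    """
    def dist(a, b):
        return math.sqrt(sum((a[f] - b[f]) ** 2 for f in features))

    scores = []
    for i, x in enumerate(samples):
        others = [j for j in range(len(samples)) if j != i]
        nearest = sorted(others, key=lambda j: dist(x, samples[j]))[:k]
        neighborhood = [labels[i]] + [labels[j] for j in nearest]
        scores.append(label_entropy(neighborhood))
    return sum(scores) / len(scores)


# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
samples = [(0.0, 5.0), (0.1, 1.0), (0.2, 9.0), (5.0, 5.1), (5.1, 1.1), (5.2, 9.1)]
labels = ["A", "A", "A", "B", "B", "B"]

score_informative = spatial_entropy(samples, labels, features=[0], k=2)
score_noisy = spatial_entropy(samples, labels, features=[1], k=2)
```

Under these assumptions, a feature subset whose neighborhoods are label-homogeneous receives a lower score than one whose neighborhoods mix classes, so a search over feature subsets would prefer the former.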

The method was tested on simulations and on real-world data and shows good performance.