M.Sc. Thesis — Department of Computer Science
Supervisors: Prof. Dan Geiger, Dr. Zohar Yakhini
Feature selection is used to reduce the dimensionality of the data in order to both improve the accuracy and reduce the computational effort involved in the task of classification. We present a Branch and Bound algorithm for finding a subset of k independent Gaussian features that minimizes the classification error of naive Bayesian classifiers. Our algorithm uses additive monotonic distance measures, such as the Bhattacharyya and divergence distances, to produce bounds for the Bayesian classification error in order to exclude many feature subsets from evaluation. The search for the optimal subset performed by the algorithm is complete, i.e., every feature subset excluded is one that could not possibly have the lowest Bayesian classification error. In many cases the algorithm needs to evaluate only a small fraction of the possible feature subsets, thus enabling a very efficient search.
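The bounding idea can be illustrated with a minimal sketch (this is an illustration of the general technique, not the thesis algorithm itself; all function names are ours). For independent Gaussian features, the two-class Bhattacharyya distance is additive across features, and the Bayes error is bounded above by sqrt(p1*p2)*exp(-B), where B is the total distance. A branch and bound search over k-subsets can therefore maximize the additive distance, pruning any partial subset whose optimistic completion (adding the largest remaining per-feature distances) cannot beat the incumbent:

```python
import math

def bhattacharyya_1d(m1, v1, m2, v2):
    # Bhattacharyya distance between 1-D Gaussians N(m1, v1) and N(m2, v2)
    return (0.25 * (m1 - m2) ** 2 / (v1 + v2)
            + 0.5 * math.log((v1 + v2) / (2.0 * math.sqrt(v1 * v2))))

def error_upper_bound(total_b, p1=0.5, p2=0.5):
    # Bhattacharyya upper bound on the two-class Bayes error
    return math.sqrt(p1 * p2) * math.exp(-total_b)

def best_k_subset(per_feature_b, k):
    """Branch and bound over k-subsets of features, maximizing the additive
    Bhattacharyya distance (equivalently, minimizing its error bound).
    Prunes branches whose optimistic completion cannot beat the incumbent."""
    n = len(per_feature_b)
    # Sort features by decreasing distance so the optimistic bound is tight.
    order = sorted(range(n), key=lambda i: -per_feature_b[i])
    b = [per_feature_b[i] for i in order]
    best = (-1.0, None)  # (best total distance, feature indices)

    def recurse(idx, chosen, total):
        nonlocal best
        if len(chosen) == k:
            if total > best[0]:
                best = (total, sorted(order[i] for i in chosen))
            return
        if idx == n or n - idx < k - len(chosen):
            return  # not enough features left to complete a k-subset
        # Optimistic bound: add the largest remaining per-feature distances.
        need = k - len(chosen)
        if total + sum(b[idx:idx + need]) <= best[0]:
            return  # prune: this branch cannot beat the incumbent
        recurse(idx + 1, chosen + [idx], total + b[idx])  # include feature
        recurse(idx + 1, chosen, total)                   # exclude feature

    recurse(0, [], 0.0)
    return best
```

Because the additive measure is monotone, most branches are cut at the optimistic-bound test, which is the mechanism that lets the search skip the bulk of the feature subsets while remaining complete.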
Our algorithm, when augmented with some heuristic adjustments, is shown to perform well on gene expression datasets. Selecting 5 features from training data typically suffices to correctly classify the test data, and this selection is typically achieved by examining less than 10% of the possible feature subsets. Small subsets of genes that classify tissues with high success rates are a potential basis for classification assays that use small-scale, low-cost gene expression measurement methods, such as RT-PCR.