M.Sc Thesis | |

M.Sc Student | Frank Ari |
---|---|

Subject | A New Branch and Bound Feature Selection Algorithm |

Department | Department of Computer Science |

Supervisors | PROF. Dan Geiger |

ASSOCIATE PROF. Zohar Yakhini |

Feature selection is
used to reduce the dimensionality of the data in order to both improve the
accuracy and reduce the computational effort involved in the task of
classification. We present a *Branch and Bound* algorithm for finding a
subset of *k* independent Gaussian features that minimizes the
classification error of naive Bayesian classifiers. Our algorithm uses additive
monotonic distance measures, such as the Bhattacharyya and divergence
distances, to produce bounds for the Bayesian classification error in order to
exclude many feature subsets from evaluation. The search for the optimal subset
performed by the algorithm is complete, i.e., every feature subset excluded is
one that could not possibly have the lowest Bayesian classification error. In
many cases the algorithm needs to evaluate only a small fraction of the
possible feature subsets, thus enabling a very efficient search.

Our algorithm, when augmented with some heuristic adjustments, is shown to perform well on gene expression datasets. Selecting 5 features from training data typically suffices to correctly classify the test data. This selection is typically done by examining less than 10% of the feature subsets. Small subsets of genes that classify tissues with high success rates are potential basis for classification assays that use small-scale low-cost gene expression measurement methods, such as RT-PCR.