Ph.D Thesis | |

Ph.D Student | Wiener Yair |
---|---|

Subject | Theoretical Foundations of Selective Prediction |

Department | Department of Computer Science |

Supervisor | PROF. Ran El-Yaniv |

Full Thesis text |

In selective prediction, a predictor is allowed to abstain on part of the domain. The objective is to reduce prediction error by compromising coverage. This research is concerned with the study of the theoretical foundations of selective prediction and its applications for selective classification, selective regression, and active learning.

We present a new family of selective
classification strategies called *LESS (low error selective strategies)*.
The labels predicted by LESS for the accepted domain are guaranteed to be
identical to the labels predicted by the best hypothesis in the class, chosen
in hindsight. Therefore, the estimation error of the predictor chosen by LESS
is zero. Extending the idea to regression we also present a strategy called *ε-LESS*
whose predictions are ε-close to the values predicted by the best (in
hindsight) regressor in the class.

We study the coverage rates of LESS
(and ε-LESS) for classification and regression. Relying on a novel
complexity measure termed *characterizing set complexity*, we derive both
data-dependent and distribution-dependent guarantees on the coverage of LESS
for both realizable and agnostic classification settings. These results are
interesting because they allow for training selective predictors with
substantial coverage whose estimation error is essentially zero.

Moreover, we prove an equivalence between selective (realizable) classification and stream-based active learning, with respect to learning rates. One of the main consequences of this equivalence is an entirely novel technique to bound the label complexity in active learning for numerous interesting hypothesis classes and distributions. In particular, using classical results from probabilistic geometry, we prove exponential label complexity speedup for actively learning general (non-homogeneous) linear classifiers when the data distribution is an arbitrary high-dimensional mixture of Gaussians.

While direct implementations of the
LESS (and ε-LESS) strategies appear to be intractable, we show how to
reduce LESS to a procedure involving few calculations of constrained empirical
risk minimization (ERM). Using this reduction, we develop a new principle for
rejection, termed the *disbelief principle*, and show an efficient
implementation for ε-LESS for the case of linear least squared regression
(LLSR).