Ph.D Thesis

Ph.D StudentWiener Yair
SubjectTheoretical Foundations of Selective Prediction
DepartmentDepartment of Computer Science
Supervisor PROF. Ran El-Yaniv
Full Thesis textFull thesis text - English Version


In selective prediction, a predictor is allowed to abstain on part of the domain. The objective is to reduce prediction error by compromising coverage. This research is concerned with the study of the theoretical foundations of selective prediction and its applications for selective classification, selective regression, and active learning.

We present a new family of selective classification strategies called LESS (low error selective strategies). The labels predicted by LESS for the accepted domain are guaranteed to be identical to the labels predicted by the best hypothesis in the class, chosen in hindsight. Therefore, the estimation error of the predictor chosen by LESS is zero. Extending the idea to regression we also present a strategy called ε-LESS whose predictions are ε-close to the values predicted by the best (in hindsight) regressor in the class.

We study the coverage rates of LESS (and ε-LESS) for classification and regression. Relying on a novel complexity measure termed characterizing set complexity, we derive both data-dependent and distribution-dependent guarantees on the coverage of LESS for both realizable and agnostic classification settings. These results are interesting because they allow for training selective predictors with substantial coverage whose estimation error is essentially zero.

Moreover, we prove an equivalence between selective (realizable) classification and stream-based active learning, with respect to learning rates. One of the main consequences of this equivalence is an entirely novel technique to bound the label complexity in active learning for numerous interesting hypothesis classes and distributions. In particular, using classical results from probabilistic geometry, we prove exponential label complexity speedup for actively learning general (non-homogeneous) linear classifiers when the data distribution is an arbitrary high-dimensional mixture of Gaussians.

While direct implementations of the LESS (and ε-LESS) strategies appear to be intractable, we show how to reduce LESS to a procedure involving few calculations of constrained empirical risk minimization (ERM). Using this reduction, we develop a new principle for rejection, termed the disbelief principle, and show an efficient implementation for ε-LESS for the case of linear least squared regression (LLSR).