Ph.D. Thesis

| Ph.D. Student | Moran Shay |
|---|---|
| Subject | Generalization and Simplification in Machine Learning |
| Department | Department of Computer Science |
| Supervisors | Prof. Amir Yehudayoff |
| | Associate Prof. Amir Shpilka |

Full Thesis Text

Generalization and simplification are deeply related: simpler explanations often reveal principles that apply more generally, and predictive scientific theories often boil down to a few simple (and deep) laws. This work studies the “generalization–simplification” link within the formal setting of machine learning. It focuses on sample compression schemes, a family of learning algorithms that possess a certain “simplification” property. Roughly speaking, a sample compression scheme of size *k* means that given an arbitrary list of labeled examples, one can retain only *k* of them in a way that allows one to recover the labels of all the other examples in the list.
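To make the definition concrete, here is a minimal sketch in Python of a size-1 compression scheme, assuming the simple hypothesis class of one-dimensional thresholds h_t(x) = 1 iff x ≥ t; the function names `compress` and `reconstruct` are illustrative, not from the thesis.

```python
def compress(sample):
    """Retain a single labeled example from which all labels can be recovered."""
    positives = [x for x, y in sample if y == 1]
    if positives:
        return (min(positives), 1)              # smallest positive example
    return (max(x for x, _ in sample), 0)       # all negative: keep the largest

def reconstruct(kept):
    """Rebuild a consistent hypothesis from the single retained example."""
    x0, y0 = kept
    if y0 == 1:
        return lambda x: 1 if x >= x0 else 0
    return lambda x: 1 if x > x0 else 0

# A sample consistent with some threshold t in (0.4, 0.7]:
sample = [(0.1, 0), (0.4, 0), (0.7, 1), (0.9, 1)]
h = reconstruct(compress(sample))
assert all(h(x) == y for x, y in sample)        # labels of all examples recovered
```

The single retained example plays the role of the *k* examples in the definition: it suffices to recover the labels of the entire list.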

Littlestone and Warmuth (1986) defined sample compression schemes in the context of binary classification. They showed that compression implies generalization, and asked whether the converse holds. The first part of this work gives an affirmative answer to this question by showing that every hypothesis class of VC-dimension *d* has a sample compression scheme of size exponential in *d*.

The second part of this work extends the “compressibility–learnability” connection beyond binary classification. We first consider the setting of multiclass categorization: we prove that learnability is equivalent to compression of logarithmic sample size, and that the uniform convergence property implies compression of constant size. We use the compressibility–learnability equivalence (*i*) to show that for multiclass categorization, PAC and agnostic PAC learnability are equivalent, and (*ii*) to derive a compactness theorem for learnability. We then consider supervised learning under general loss functions: we show that in this case, in order to maintain the compressibility–learnability equivalence, it is necessary to consider an approximate variant of compression. We use it to show that PAC and agnostic PAC learnability are not equivalent, even when the loss function has only three values.

The last part of this work addresses the power of SVM and kernel machines, a well-studied family of sample compression learning algorithms that is widely used in practice. We study the question “Which learnable problems can be represented in a way that enables learning by SVM?” Technically, this question boils down to relating two quantities: the VC-dimension and the sign rank. The connection we establish between these parameters has numerous implications for other fields such as complexity theory, combinatorics, and geometry; for example, it allows us to answer a question of Frankl from 1989 and a question of Ben-David et al. from 2002 about maximum classes.