Ph.D Thesis

Ph.D StudentMoran Shay
SubjectGeneralization and Simplification in Machine Learning
DepartmentDepartment of Computer Science
Supervisors PROF. Amir Yehudayoff
Full Thesis textFull thesis text - English Version


Generalization and simplifcation are deeply related: simpler explanations often reveal

principles that apply more generally, and predictive scientific theories often boil down

to a few simple (and deep) laws. This work studies the “generalization - simplification” link within the formal setting of machine learning. It focuses on sample compression schemes - a type of learning algorithms that possess a certain “simplification” property. Roughly speaking, a sample compression scheme of size k means that given an arbitrary list of labeled examples, one can retain only k of them in a way that allows to recover the labels of all other examples in the list.

Littlestone and Warmuth (1986) defined sample compression schemes in the context

of binary classification. They showed that compression implies generalization, and asked whether the other direction holds. The first part of this work gives an affirmative answer to this question, by showing that every hypothesis class with VC-dimension d has a sample compression scheme of size exponential in d.

The second part of this work extends the “compressibility - learnability” connection

beyond binary classification. We first consider the setting of multiclass categorization: we prove that learnability is equivalent to compression of logarithmic sample size and that the uniform convergence property implies compression of constant size. We use the compressibility - learnability equivalence to show that (i) for multiclass categorization, PAC and agnostic PAC learnability are equivalent, and (ii) to derive a compactness theorem for learnability. We then consider supervised learning under general loss functions: we show that in this case, in order to maintain the compressibility - learnability equivalence, it is necessary to consider an approximate variant of compression. We use it to show that PAC and agnostic PAC are not equivalent, even when the loss function has only three values.

The last part of this work addresses the power of SVM and kernel machines - a

well-studied family of sample compression learning algorithms that is highly used in

practice. We study the question “Which learnable problems can be represented in a way that enables learning by SVM?". Technically, this question boils down to relating two quantities: the VC-dimension and the sign rank. The connection we establish between these parameters has numerous implications to other fields such as complexity theory, combinatorics, and geometry; for example, it allows us to answer a question of Frankl from '89 and a question of Ben-David et al. from '02 about maximum classes.