Ph.D Thesis

Ph.D StudentNachum Ido
SubjectOn the Information Complexity of Learning
DepartmentDepartment of Mathematics
Supervisor PROF. Amir Yehudayoff


Machine learning and information theory tasks involve identifying patterns and regularities in data.  Information theory algorithms find regularities to compress data for storage, and learning algorithms utilize observable patterns to compress their dataset to achieve generalization. The thesis formally studies connections between the two theories.
In the common setting of learning theory, an algorithm receives an input or training data which is an i.i.d. sample of a function and returns an output, a hypothesis that should fit the sample. One can measure the compression of a learning algorithm using information theory.  We use the quantity I(input; output), the mutual information between the training data and the output hypothesis of the learning algorithm, measured in bits. 
In broad strokes, we investigate how many bits of information must a learning algorithm retain from its original input when learning a pre-specified class of functions H. This quantity is called the information complexity of H.
Compression implies learning. A learning algorithm that retains a small amount of information from its input, i.e. I(input; output) is small,  is guaranteed to generalize well. Specifically, if the learning algorithm has zero empirical error, then the true error is small.
Learning does not imply compression. Under a worst-case analysis, there exist PAC learnable classes of functions for which there is no empirical risk minimizer (ERM) algorithm that compresses its input. For example, for any ERM for the class of thresholds, there exists a threshold function and a distribution such that the information revealed by the algorithm I(input; output) is arbitrarily large. The information complexity of learning thresholds is unbounded.
Learning implies compression. Under an average-case analysis, for all PAC learnable classes of functions there exists an ERM learning algorithm that compresses its input well. Specifically, for every distribution over inputs the information revealed by the algorithm is O(d) on average over the entire class, where d is the VC dimension of the class. The average information complexity of learning VC classes is bounded.

Direct sum result. The information complexity of a direct sum of learning problems roughly equals the sum of the individual complexities.