|Ph.D Student||Moroshko Edward|
|Subject||On Implicit Bias in Deep Models and Constrained Feedback in Online Learning|
|Department||Department of Electrical and Computer Engineering|
|Supervisor||Associate Prof. Daniel Soudry|
In recent years, deep neural networks have come to be considered among the best learning models in many domains, such as computer vision, natural language processing, speech recognition and robotics. These models are non-convex and highly overparameterized, so their optimization and generalization properties cannot be explained by classical machine learning tools. It was recently suggested that, despite this overparameterization, an implicit regularization effect limits the capacity of the models obtained along the optimization path. This effect leads to simple solutions and may thus explain generalization on realistic datasets.
We study how the initialization scale affects this implicit regularization. For regression with the square loss, it was recently shown that the scale of initialization controls the implicit regularization, interpolating between two extreme regimes. In the first regime, when the initialization is large, the model behaves like a kernelized linear predictor, and gradient descent is implicitly biased towards the minimum-norm solution in the reproducing kernel Hilbert space (RKHS). In the second regime, when the initialization is small, gradient descent induces a rich inductive bias that minimizes a non-Hilbertian norm. We focus on classification with the exponential loss, analyze the gradient descent trajectories, and show that the implicit regularization is controlled by the relationship between the initialization scale and how accurately the training loss is minimized. Furthermore, we show that for reasonable initialization scales, the training loss must be extremely small in order to reach the non-Hilbertian implicit bias. Nevertheless, larger depth implicitly biases the solution further away from Hilbert-norm solutions, and this usually results in better generalization.
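The two regimes described above for square-loss regression can be illustrated with a small numerical sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the model is a two-layer "diagonal" linear network with weights w = u*u - v*v, the data are synthetic, and the names `train_diagonal_net`, `w_kernel` and `w_rich` are hypothetical. With a large initialization scale `alpha`, gradient descent should end close to the minimum Euclidean-norm interpolator; with a small `alpha`, it should favor a solution with a much smaller L1 norm.

```python
import numpy as np

def train_diagonal_net(X, y, alpha, lr=1e-3, steps=30000):
    """Gradient descent on the square loss for f(x) = <u*u - v*v, x>.

    Both u and v start at alpha, so the predictor starts at w = 0 and
    alpha only sets the scale of the initialization.
    """
    n, d = X.shape
    u = np.full(d, alpha)
    v = np.full(d, alpha)
    for _ in range(steps):
        w = u * u - v * v
        grad_w = X.T @ (X @ w - y) / n  # gradient of (1/2n)||Xw - y||^2 w.r.t. w
        # Chain rule: dL/du = 2u * grad_w and dL/dv = -2v * grad_w.
        u, v = u - lr * 2.0 * u * grad_w, v + lr * 2.0 * v * grad_w
    return u * u - v * v

# Synthetic underdetermined problem (assumed): 20 samples, 50 features,
# noiseless labels generated by a 3-sparse weight vector.
rng = np.random.default_rng(0)
n, d = 20, 50
X = rng.standard_normal((n, d))
w_star = np.zeros(d)
w_star[:3] = [1.0, -2.0, 1.5]
y = X @ w_star

w_kernel = train_diagonal_net(X, y, alpha=3.0)   # large initialization
w_rich = train_diagonal_net(X, y, alpha=1e-3)    # small initialization
w_min_l2 = np.linalg.pinv(X) @ y                 # minimum L2-norm interpolator

print("L1 norm, large init:", np.abs(w_kernel).sum())
print("L1 norm, small init:", np.abs(w_rich).sum())
print("relative distance of large-init solution to min-L2 solution:",
      np.linalg.norm(w_kernel - w_min_l2) / np.linalg.norm(w_min_l2))
```

This sketch covers only the regression setting. The classification results summarized above concern the exponential loss, where the outcome also depends on how far the training loss is driven down.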
In a separate line of work, we examine online learning with constrained feedback. We address two variants of the classical online learning framework. In the first problem, called selective sampling classification, a label is revealed only when the learner queries it, and the goal is to achieve good classification performance while using few labels. We propose an algorithm for this setting that is designed to work in a non-stationary environment, derive a mistake bound for it, and empirically demonstrate its merits. The second problem is online regression with label noise. We derive an algorithm that adapts the learning rate based on a per-sample estimate of the noise, and demonstrate its performance on synthetic and real-world datasets.
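To make the selective sampling protocol concrete, here is a minimal sketch of a generic margin-based selective sampling perceptron. It is not the algorithm proposed in the thesis: it assumes a stationary, linearly separable synthetic stream, and the query rule (ask for the label with probability b / (b + |margin|)), the parameter `b` and the function name `selective_sampling_perceptron` are assumptions chosen for illustration.

```python
import numpy as np

def selective_sampling_perceptron(X, y, b=1.0, seed=0):
    """Perceptron that asks for a label only on (randomly chosen) low-margin rounds."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    queries = 0
    mistakes = 0
    for x_t, y_t in zip(X, y):
        margin = float(w @ x_t)
        y_hat = 1.0 if margin >= 0 else -1.0
        # The true label is used here for evaluation only; the learner
        # sees it only on rounds where it issues a query.
        mistakes += int(y_hat != y_t)
        # Query with high probability when the margin is small (uncertain prediction).
        if rng.random() < b / (b + abs(margin)):
            queries += 1
            if y_hat != y_t:
                w = w + y_t * x_t  # standard perceptron update on a queried mistake
    return w, queries, mistakes

# Synthetic stream (assumed): Gaussian inputs labeled by a fixed linear rule.
data_rng = np.random.default_rng(1)
T, d = 2000, 5
X = data_rng.standard_normal((T, d))
w_true = data_rng.standard_normal(d)
y = np.where(X @ w_true >= 0, 1.0, -1.0)

w, queries, mistakes = selective_sampling_perceptron(X, y)
print(f"rounds: {T}, labels queried: {queries}, mistakes: {mistakes}")
```

A larger `b` makes the learner query more often. Handling a non-stationary environment, as the thesis does, would require a mechanism for tracking a drifting target, which this sketch omits.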