|M.Sc Student||Cohen Indelman Hedda|
|Subject||Learning Randomly Perturbed Structured|
Predictors for Direct Loss Minimization
|Department||Department of Industrial Engineering and Management||Supervisor||ASSOCIATE PROF. Tamir Hazan|
|Full Thesis text|
Learning to predict structured labels of data instances covers a wide
range of problems. The structure is incorporated into the label which may refer
to matchings, permutations, sequences, or other high-dimensional objects.
For any data instance x, its different structures y = (y1, ? ,yn) are scored by a
parametrized score function. Discriminative learning aims to find a mapping from
training data to parameters of the score function for which high scores are assigned
to structures that describe the data instance well. The parameters of the score function are fitted to minimize a loss(•,•) of the instance-label training pairs between the label and the highest scoring structure. While gradient methods are the most popular methods to learn the scoring function parameters, they are notoriously inefficient for learning discrete predictions. When considering discrete labels, the maximal argument of the scoring function is a piecewise constant function of its parameters, and its gradient with respect to the parameters is zero almost anywhere. Consequently, various smoothing techniques were proposed to propagate gradients while learning to fit discrete structures.
Direct loss minimization is a popular approach for learning predictors over structured
label spaces. This approach allows to propagate gradients in a deep net using loss-perturbed prediction. In other words, the gradient estimator is obtained by performing pairs of maximization operations, one over the original objective and one over a loss perturbed objective. Recently, this technique was extended to generative models, by introducing a randomized predictor that samples a structure from a score function randomly perturbed with Gumbel random noise with zero expectancy.
In this work, we interpolate between these techniques by learning the variance of
randomized structured predictors, as well as their mean, in order to balance between
the learned score function and the randomized noise. We do so by introducing a learnable variance scaling parameter. We argue that the learnable variance scaling parameter determines the degree of smoothness enforced on the probability distribution of the max predictor. As it increases, the expected max predictor value converges to a uniform distribution over the discrete structures. And vice versa, as it decreases, the expected max predictor value approaches the expected value of a categorical random variable.
We then make the connection between the variance of Gumbel random variables and the temperature of Gibbs models.
Our proof technique adds a complete treatment to the structured setting. It adds
to the direct loss minimization the condition that the maximizing structure may be
non-unique only in a set of noise perturbations with measure zero. This contributes
a mathematical condition that replaces the “general position assumption” which was
needed since if all weights are zero, all structures maximize the score function equally. This condition can be enforced using the explicit random perturbation that appears in our work, and need not rely on implicit conditions on the data generation distribution.
We demonstrate empirically the effectiveness of learning this balance in two popular
structured prediction problems: bipartite matching and k-nearest neighbors.