|M.Sc Student||Deutsch Amichay|
|Subject||A Perturbation Approach to Structured Prediction in|
Natural Language Processing
|Department||Department of Applied Mathematics||Supervisor||Professor Roi Reichart|
Structured prediction problems are ubiquitous in Natural Language Processing (NLP). While in most cases models for such problems are designed to predict the highest quality structure of the input example (e.g. a sentence or a document), in many cases a diverse list of meaningful structures is of fundamental importance.
This can stem from several reasons.
First, it can be a defining property of the task. For example, in extractive summarization good summaries are defined to be those that consist of a high quality and diverse list of sentences extracted from the text.
In other cases it is a first step towards learning a high quality structure, that cannot be learned by the model through standard argmax inference. For example, in the well-studied reranking setup, a K-best list of solutions is first extracted from a baseline learner, that typically has a limited feature space, and is transferred to another feature-rich model that chooses the best solution from this list. Other examples include bagging and boosting as well as other ensemble methods that are often applied when the data available for model training is limited, in cases where exact argmax inference in the model is infeasible or when training is not deterministic. In such cases, an ensemble of approximated solutions is fed into another model that extracts a final high quality solution.
Unfortunately, both alternatives suffer from inherent limitations.
K-best lists can be extracted by extensions of the argmax inference algorithm for many models: the K-best Viterbi algorithm for Hidden Markov Models and Conditional Random Fields, K-best Maximum Spanning Tree (MST) algorithms for graph-based dependency parsing etc.
However, the members of K-best lists are typically quite similar to each other and do not substantially deviate from the argmax solution of the model.
Ensemble techniques, in contrast, are often designed to encourage diversity of the K-list members, but they require the training of multiple models (often one model per solution in the K-list) which is prohibitive for large K values.
In this paper we propose a new method for learning K-lists from machine learning models, focusing on structured prediction models in NLP. Our method is based on the theory of perturbations , with a novel variance learning component. A particularly appealing property of the perturbations framework is that it supports computationally tractable sampling from the perturbated model, although this comes at the cost of the argmax operation often being intractable. This property allows us to sample high quality and diverse K-lists of solutions, while training only the base (non-perturbated) learner and a smooth noise function. To overcome the intractability of the argmax operation we employ an approximation and experimentally demonstrate its effectiveness.