|Ph.D. Student||Dror Rotem|
|Subject||Structured Solutions in Natural Language Processing: Algorithms and Statistically Sound Evaluation|
|Department||Department of Industrial Engineering and Management||Supervisor||Associate Prof. Roi Reichart|
|Full Thesis text|
Data-driven experimental analysis has become the primary evaluation tool of Natural Language Processing (NLP) algorithms. In fact, in the last decade it has become rare to see an NLP paper, particularly one that proposes a new algorithm, that does not include extensive experimental analysis, and the number of involved tasks, datasets, domains, and languages is continually growing. In this thesis, we explore the following observation: learning algorithms for structured prediction, the prominent learning framework in many NLP tasks and applications (e.g., syntactic parsing and machine translation), can benefit from partially optimal or inexact solutions. This observation allowed us to develop two new algorithms that are suitable for the data-driven experimental settings prevalent in NLP. We then recognized that the structured nature of NLP tasks, and the fact that correct sub-structures of an imperfect solution are valuable, require an appropriate framework for algorithm comparison. Hence, we devised a complete framework for comparing two algorithms and determining, in a statistically sound manner, which one is superior in terms of performance.

Our guiding assumption throughout this comparison framework is that the fundamental question NLP researchers and engineers deal with is whether one algorithm can be considered better than another. This question drives the field forward, as it allows constant progress in developing better technology for language processing challenges. In practice, researchers and engineers would like to draw the right conclusion from a limited set of experiments and sometimes noisy or inexact data, and this conclusion should hold for other experiments, with datasets they do not have at their disposal or that they cannot perform due to limited time and resources. Hence, in the second part of this thesis, we deal with the challenges of statistical hypothesis testing in contemporary NLP research.
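To make the comparison question concrete, a common statistically sound procedure for deciding whether algorithm A outperforms algorithm B on a single dataset is the paired bootstrap test over per-example evaluation scores. The sketch below is an illustrative, minimal version of that general procedure (the function name and parameters are ours, not taken from the thesis):

```python
import numpy as np

def paired_bootstrap_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Paired bootstrap significance test (illustrative sketch).

    scores_a, scores_b: per-example scores of the two algorithms on the
    SAME test examples (higher is better). Returns the observed mean
    advantage of A and an estimated p-value: the fraction of bootstrap
    resamples of the test set in which A's advantage disappears.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = diffs.mean()
    n = len(diffs)
    # Resample test examples with replacement; count resamples where the
    # mean difference is not positive, i.e. A's advantage does not hold.
    count = 0
    for _ in range(n_resamples):
        sample = diffs[rng.integers(0, n, size=n)]
        if sample.mean() <= 0:
            count += 1
    return observed, count / n_resamples
```

A small p-value suggests that A's observed advantage is unlikely to be an artifact of the particular test sample, which is exactly the kind of guarantee a researcher needs before claiming superiority.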
We first relax the single-solution assumption, providing a hypothesis testing framework for the comparison of non-deterministic algorithms such as Deep Neural Networks. Afterward, we relax the single-dataset assumption and discuss the desired statistical procedures that account for comparisons between algorithms across multiple datasets. Finally, we present a list of open questions and challenges that remain to be addressed in future research.
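When the algorithms are non-deterministic, a single score per algorithm is no longer meaningful; instead, each algorithm yields a distribution of scores across runs (e.g., random seeds), and the comparison becomes one between distributions. As a rough intuition for this setting, the sketch below checks first-order stochastic dominance between two empirical score distributions. This is only a coarse stand-in for illustration, not the thesis's actual testing procedure, and the function name is ours:

```python
import numpy as np

def stochastic_dominance(scores_a, scores_b):
    """Check whether A's empirical score distribution (first-order)
    stochastically dominates B's, i.e. F_A(x) <= F_B(x) at every observed
    score, with higher scores being better. Illustrative sketch only.
    """
    a = np.sort(np.asarray(scores_a, dtype=float))
    b = np.sort(np.asarray(scores_b, dtype=float))
    grid = np.union1d(a, b)
    # Empirical CDFs of both run-score samples on the joint grid.
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return bool(np.all(cdf_a <= cdf_b))
```

In practice strict dominance rarely holds exactly between empirical distributions, which is part of what makes a statistically sound comparison of non-deterministic algorithms a non-trivial problem.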