|M.Sc Student||Makovetsky Ruslana|
|Subject||Adopting Machine Learning Approaches to the Novelty|
|Department||Department of Industrial Engineering and Management||Supervisor||Professor Carmel Domshlak|
|Full Thesis text|
This work is devoted to exploiting a discriminative machine learning approach in dealing with the information retrieval task of novelty detection. On the one hand, novelty detection naturally constitutes a task of binary classification. On the other hand, the paradigm of interest is hard to learn as it is not clear why the knowledge acquired on the basis of some past experience should be effective when applied on a truly novel document. We introduce a novel machine learning approach to the novelty detection task, called RPN, which aims at modeling the paradigm of information novelty in a learning-oriented manner. Specifically, we reduce the novelty detection task to the task of binary classification of ordered pairs of documents, where the first document is a previously seen one, and the second document is a new document in question. For this classification, such ordered document pair is represented by a combination of three types of information, namely running information, passed information, and new information.
We suggest a concrete instance of the RPN approach, called term-based PRN, and evaluate it on a publicly available dataset. First, we specify two new measures for evaluating the performance of systems for novelty detection, called combined recall and error rate, aiming at exploiting advantages and fixing the shortcomings of the previously suggested measures. Next, we use the specified measures in evaluating the empirical performance of the term-based PRN framework on our dataset. On the one hand, at this point our setting of the approach still fails to overtake the performance of the (most effective so far) cosine-based baseline. On the other hand, the results show that our approach to modeling novelty detection problem is meaningful, and that the performance of the system improves with the training experience way beyond random performance. The latter suggests that future efforts in our direction are promising.