Technion - Israel Institute of Technology
Technion - Israel Institute of Technology, Graduate School
M.Sc. Thesis
M.Sc. Student: Eshel Yotam
Subject: An Attention-RNN Based Approach for Named Entity Disambiguation with Noisy Texts
Department: Department of Computer Science
Supervisor: Professor Shaul Markovitch


Abstract

Named entity disambiguation (NED) is the task of linking mentions of entities in text to a knowledge base, such as Freebase or Wikipedia. Research on NED is currently driven by a number of standard datasets. These datasets are based on news and encyclopedic texts that are naturally coherent, well-structured, and rich. However, texts in other settings, such as web fragments, social media, or search queries, are shorter, less coherent, and generally more challenging.


To address NED in noisy text, we design a novel neural model based on RNNs and attention. Our algorithm can utilize large amounts of training data and learn to capture the limited and noisy local context surrounding entity mentions. We train our model with a novel method for sampling informative negative examples. In addition, we describe a new way of initializing word and entity embeddings that significantly improves performance.
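To make the general idea concrete, the following is a minimal sketch in PyTorch of scoring a candidate entity against the attended local context of a mention. It is an illustration only, not the thesis architecture: the class name, layer choices, dimensions, and the bilinear scoring function are assumptions.

```python
import torch
import torch.nn as nn

class AttentionRNNScorer(nn.Module):
    """Illustrative sketch: RNN over mention context + attention conditioned on a candidate entity."""
    def __init__(self, vocab_size, num_entities, dim=300):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)      # could be initialized from pre-trained word vectors
        self.entity_emb = nn.Embedding(num_entities, dim)   # could be initialized jointly with the word vectors
        self.context_rnn = nn.GRU(dim, dim, batch_first=True)
        self.score = nn.Bilinear(dim, dim, 1)                # mention-context vs. entity compatibility

    def forward(self, context_ids, candidate_ids):
        # context_ids: (batch, seq_len) word ids surrounding the mention
        # candidate_ids: (batch,) candidate entity ids
        h, _ = self.context_rnn(self.word_emb(context_ids))         # (batch, seq_len, dim)
        e = self.entity_emb(candidate_ids)                           # (batch, dim)
        attn = torch.softmax((h * e.unsqueeze(1)).sum(-1), dim=-1)   # attention weights over context positions
        ctx = (attn.unsqueeze(-1) * h).sum(1)                        # attended context representation
        return self.score(ctx, e).squeeze(-1)                        # higher score = better candidate
```

A model of this form would typically be trained with a ranking objective, e.g. a margin loss pushing the gold entity's score above the scores of sampled negative candidates, which is where an informative negative-sampling scheme matters.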


To facilitate research on NED with noisy text, we present WikilinksNED, a large-scale NED dataset of web text fragments derived from the Wikilinks dataset. Our dataset is orders of magnitude larger, significantly noisier, and more challenging than existing news-based datasets.


We evaluate our model both on WikilinksNED and on a smaller newswire dataset, and find that it significantly outperforms existing state-of-the-art methods on WikilinksNED while achieving comparable performance on the smaller dataset.