Technion - Israel Institute of Technology
Graduate School
M.Sc Thesis
M.Sc Student: Egozi Ofer
Subject: Concept-Based Information Retrieval Using Explicit Semantic Analysis
Department: Department of Computer Science
Supervisor: Professor Shaul Markovitch
Full Thesis text: English Version


Abstract

Information Retrieval systems traditionally rely on textual keywords to index and retrieve documents. Keyword-based retrieval may return inaccurate and incomplete results when different keywords are used to describe the same human concept in the documents and in the queries. Furthermore, the relationship between those keywords may be semantic rather than syntactic, and capturing it thus requires access to comprehensive human world knowledge. Concept-based retrieval methods have attempted to tackle these difficulties by using manually-built thesauri, by relying on term co-occurrence data, or by extracting latent word relationships and concepts from a corpus. In this paper we introduce a new concept-based retrieval method that is based on Explicit Semantic Analysis (ESA). ESA is a recently proposed representation method that can augment the keyword-based representation with concept-based features, automatically extracted from massive human knowledge resources such as Wikipedia. We have found that high-quality feature selection is required to make the retrieval more focused. However, due to the lack of labeled data, traditional statistical filtering methods cannot be used. We introduce several selection methods that use self-generated labeled training data. The resulting system is evaluated on TREC data, showing superior performance over previous state-of-the-art results.
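
To make the idea of concept-based features more concrete, the following is a minimal sketch of an ESA-style mapping from text to weighted concepts. It is not the system described in the thesis: the three-concept knowledge base, the function names, and the plain TF-IDF weighting are all assumptions made for illustration, and the top-k cutoff is only a crude stand-in for the feature selection methods mentioned above, which rely on self-generated labeled training data.

```python
import math
from collections import Counter

# Hypothetical toy knowledge base: concept name -> short article text.
# ESA proper draws on Wikipedia articles; these entries are invented.
CONCEPTS = {
    "Automobile": "car engine wheel road driver vehicle",
    "Aviation": "aircraft pilot wing airport flight engine",
    "Cooking": "recipe oven kitchen flour sugar",
}


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; no stemming or stop words)."""
    return text.lower().split()


def build_inverted_index(concepts):
    """Map each term to {concept name: TF-IDF weight of the term in that concept}."""
    n = len(concepts)
    counts = {name: Counter(tokenize(body)) for name, body in concepts.items()}
    # Document frequency: in how many concept articles each term occurs.
    df = Counter(term for c in counts.values() for term in c)
    index = {}
    for name, c in counts.items():
        for term, tf in c.items():
            idf = math.log(n / df[term])
            index.setdefault(term, {})[name] = tf * idf
    return index


def concept_vector(text, index):
    """Sum the concept weights of every known term in the text."""
    vec = Counter()
    for term in tokenize(text):
        for name, weight in index.get(term, {}).items():
            vec[name] += weight
    return vec


def top_concepts(vec, k):
    """Keep the k highest-weighted concepts (stand-in for real feature selection)."""
    return [name for name, _ in vec.most_common(k)]


index = build_inverted_index(CONCEPTS)
query_concepts = top_concepts(concept_vector("pilot flight", index), 1)
```

In this sketch a query and a document that share no keywords can still match if their texts map to the same concepts, which is the motivation for adding concept features to the keyword representation.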