Technion - Israel Institute of Technology, Graduate School
M.Sc. Thesis
M.Sc. Student: Liberman Sofia
Subject: Wikipedia-Based Compact Hierarchical Semantics for Natural Language Processing
Department: Department of Computer Science
Supervisor: Professor Shaul Markovitch
Full thesis text: English version


Abstract

A correct semantic representation of words and texts underlies many text processing tasks such as text categorization, word sense disambiguation, and semantic relatedness assessment. It has long been recognized that computers require access to common-sense and domain-specific world knowledge in order to process textual data at a deeper level. In this thesis, we present a novel representation of semantics that is based on the structured encyclopedic knowledge encoded within Wikipedia articles and categories and on the conceptual hierarchy inferred from this knowledge base. Our method, called Compact Hierarchical Explicit Semantic Analysis (CHESA), generates hierarchical semantic representations of unrestricted natural language texts. It represents semantics as a compact hierarchical structure of predefined natural concepts, capturing semantics at different abstraction levels and constructing representations of any given size, depending on the task at hand. In comparison to previous methods, CHESA generates intuitive and comprehensible representations that allow deep semantic reasoning and understanding. We present a methodology for computing semantic relatedness using CHESA representations and evaluate CHESA on the task of semantic relatedness assessment of words and texts. Empirical results show that, for compact representations, CHESA outperforms the previous state of the art.
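
The abstract does not spell out the CHESA algorithm, so the following is only a toy illustration of the general idea it describes: concept weights are propagated up a concept hierarchy, the representation is truncated to a chosen size, and two representations are compared. The hierarchy in `PARENT`, the weights, the helper names (`roll_up`, `compact`, `cosine`) and the use of cosine similarity are all assumptions made for this sketch and are not taken from the thesis.

```python
import math

# Hypothetical toy hierarchy (child -> parent); not real Wikipedia category data.
PARENT = {
    "Jaguar (animal)": "Felines",
    "Lion": "Felines",
    "Felines": "Animals",
    "Sports car": "Cars",
    "Cars": "Vehicles",
}

def roll_up(weights):
    """Add each concept's weight to all of its ancestors in PARENT."""
    out = dict(weights)
    for concept, w in weights.items():
        node = concept
        while node in PARENT:
            node = PARENT[node]
            out[node] = out.get(node, 0.0) + w
    return out

def compact(weights, k):
    """Keep the k highest-weighted concepts (ties broken alphabetically)."""
    top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    return dict(top)

def cosine(a, b):
    """Cosine similarity between two sparse concept-weight dictionaries."""
    dot = sum(w * b.get(c, 0.0) for c, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Assumed flat concept weights for two short texts.
text_a = {"Jaguar (animal)": 1.0}
text_b = {"Lion": 1.0}

flat_score = cosine(text_a, text_b)                      # no shared concepts
hier_score = cosine(roll_up(text_a), roll_up(text_b))    # shared ancestors
compact_score = cosine(compact(roll_up(text_a), 2),
                       compact(roll_up(text_b), 2))      # size-2 representations
```

In this toy setting the two texts share no leaf concept, so the flat comparison finds no relatedness, while the rolled-up representations overlap on the ancestor concepts. This is meant only to show why a hierarchy can help when representations are kept small, not to reproduce the thesis's results.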