M.Sc Thesis
M.Sc Student Toledano Haggai
Coverage-Driven Refinement of Conceptual Representations
Department of Computer Science
Professor Shaul Markovitch

Abstract

Many text processing tasks rely on estimating the semantic relatedness between texts. For example, in information retrieval, the relevance of a document can be determined from its semantic distance to the query. Recently, many algorithms have been developed that evaluate semantic relatedness using a conceptual representation of the input texts. The concept spaces for these algorithms are based, in most cases, on large knowledge repositories such as Wikipedia and WordNet. Through these repositories, such representations can use more natural concepts and semantic relations than earlier methods based on statistical corpus analysis. However, the large concept spaces often yield representations that consist of very large collections of concepts. In many cases this harms the performance of semantic tasks, because redundancy gives an artificially large weight to less relevant concepts and thus hides important semantic aspects of the texts. In this work we present a new algorithm that produces semantic interpretations of texts in the form of conceptual representations based on hierarchical concept spaces. The algorithm incrementally adds strongly associated concepts to the representation, while using the hierarchical structure of the semantic database to maximize coverage. Inherent to this algorithm is the problem of finding an acceptable trade-off between concept coverage, which enables a more detailed semantic interpretation of the texts, and concept redundancy, which degrades the performance of semantic tasks. We propose a solution to this problem that uses the hierarchical structure of the semantic database to compute a stopping condition for the algorithm. We evaluate the new algorithm on text relatedness tasks and show its advantage over existing approaches.