Technion - Israel Institute of Technology - Graduate School
M.Sc Thesis
M.Sc Student: Nahshon Yoav
Subject: Relational Framework for Information Extraction
Department: Department of Computer Science
Supervisor: Professor Benny Kimelfeld


Abstract

Unstructured textual data conceals structured data within it, and is often accompanied by metadata. Relational databases, although highly suitable for storing structured data, typically treat text as a black box because they lack adequate means for handling it. An essential component of many text-analytics applications is Information Extraction (IE), the task of extracting valuable knowledge from textual data in a structured format. Modern IE pipelines are typically constructed by (1) loading textual data from a database into a special-purpose application, (2) applying to the text a myriad of text-analytics functions that produce a structured relational table, and (3) storing this table in a database. This approach is prone to laborious development processes, complex and tangled programs, and inefficient control flows. These deficiencies have given rise to declarative solutions that automate significant parts of the manual work. Such frameworks, however, typically stitch together various programming components and technologies, and may lack a unifying theory. In this thesis we embark on an effort to lay the foundations of general-purpose, text-centric database management systems. Concretely, we introduce a novel formal framework, called Spannerlog, in which we extend the relational model by incorporating the theory of document spanners, and we define a Datalog-like query language for this model. Our main contribution is a uniform framework for textual data management that covers unstructured data (text), structured data (extracted information and metadata such as identifiers and timestamps), and the functions that transform the former into the latter.
The formal foundations on which we built our framework provide new capabilities and opportunities to be explored: (1) A better understanding of the system through theoretical studies; here we report initial results concerning the expressive power of Spannerlog programs. (2) Diminished software complexity; within a single framework, developers can write IE programs and query the extracted information in a concise and readable manner. (3) New optimization opportunities due to static program analysis on top of Spannerlog's formalism; to illustrate these opportunities we present the notion of split correctness, which enables the construction of parallel execution plans based on data splitting, with provable correctness.
We believe that the formalism of Spannerlog will have a substantial impact on the way systems manage and query textual data.
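To make the idea behind split correctness concrete, the following is a minimal Python sketch. It is not Spannerlog syntax and is not taken from the thesis. It assumes a toy extractor that maps a document to a set of (start, end) spans, and a toy splitter that cuts the document into lines. The names `extract_emails`, `split_lines`, and `extract_by_splitting`, the regular expression, and the sample text are all hypothetical. The sketch only checks on one input that extracting from each chunk and shifting the spans gives the same relation as extracting from the whole document; the thesis is concerned with establishing this property provably, for all inputs.

```python
import re

# Hypothetical pattern for illustration only; not a full e-mail grammar.
EMAIL = re.compile(r"[A-Za-z0-9._]+@[A-Za-z0-9.]+[A-Za-z]")


def extract_emails(doc, offset=0):
    """Toy extractor: map a string to a set of (start, end) spans.

    `offset` shifts the spans so that spans found in a chunk can be
    expressed relative to the original document.
    """
    return {(m.start() + offset, m.end() + offset) for m in EMAIL.finditer(doc)}


def split_lines(doc):
    """Toy splitter: yield (offset, chunk) pairs, one per line of the document."""
    pos = 0
    for line in doc.split("\n"):
        yield pos, line
        pos += len(line) + 1  # +1 accounts for the newline that was removed


def extract_by_splitting(doc):
    """Apply the extractor to each chunk independently and union the results."""
    result = set()
    for offset, chunk in split_lines(doc):
        result |= extract_emails(chunk, offset)
    return result


doc = "contact: alice@example.org\nbackup: bob@example.net"
whole = extract_emails(doc)
split = extract_by_splitting(doc)
print(whole == split)
```

Because each chunk is processed independently in `extract_by_splitting`, the loop could be distributed across workers. The equality holds on this input because no match of the assumed pattern can cross a line boundary; an extractor whose matches may span several lines would not be safe to split this way.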