טכניון מכון טכנולוגי לישראל
הטכניון מכון טכנולוגי לישראל - בית הספר ללימודי מוסמכים  
Ph.D Thesis
Ph.D StudentPeterfreund Liat
SubjectThe Complexity of Relational Queries over Extractions from
Text
DepartmentDepartment of Computer Science
Supervisor Professor Benny Kimelfeld


Abstract

Large amounts of textual data with high potential value are prevalent nowadays.

Incorporating textual data within traditional database systems usually consists

of two stages (a) extracting relations from the text using primitive extractors and

(b) manipulating the extracted relations with relational algebra operations. Document

spanners is a formal framework proposed by Fagin, Kimelfeld, Reiss, and

Vansummeren, that captures the relational philosophy of commercial database

systems for text analytics.

A document spanner (or spanner, for short) is a function that maps a string

(document) into a relation over spans which are interval positions of the string.

Evaluating spanners is a computational problem that involves both the atomic extractors, the relational algebra expression, and the document. We investigate the

complexity of this problem from various angles, each regarding some components

as fixed and regarding the rest as input. We focus on regular atomic extractor

specified by means of either regular expressions or finite state automata. We

present lower bounds and devise different restrictions and approaches to establish

efficient upper bounds. We study the expressive power of spanners that involve

recursion and the implications of extensions that support partial extractions in

the flavor of SPARQL.