|Ph.D Student||Peterfreund Liat|
|Subject||The Complexity of Relational Queries over Extractions from|
|Department||Department of Computer Science||Supervisor||Professor Benny Kimelfeld|
Large amounts of textual data with high potential value are prevalent nowadays.
Incorporating textual data within traditional database systems usually consists
of two stages (a) extracting relations from the text using primitive extractors and
(b) manipulating the extracted relations with relational algebra operations. Document
spanners is a formal framework proposed by Fagin, Kimelfeld, Reiss, and
Vansummeren, that captures the relational philosophy of commercial database
systems for text analytics.
A document spanner (or spanner, for short) is a function that maps a string
(document) into a relation over spans which are interval positions of the string.
Evaluating spanners is a computational problem that involves both the atomic extractors, the relational algebra expression, and the document. We investigate the
complexity of this problem from various angles, each regarding some components
as fixed and regarding the rest as input. We focus on regular atomic extractor
specified by means of either regular expressions or finite state automata. We
present lower bounds and devise different restrictions and approaches to establish
efficient upper bounds. We study the expressive power of spanners that involve
recursion and the implications of extensions that support partial extractions in
the flavor of SPARQL.