Ph.D Thesis

Ph.D StudentOmari Adi
SubjectScalable Data Extraction via Program Synthesis
DepartmentDepartment of Computer Science
Supervisors PROF. Eran Yahav
ASSOCIATE PROFESSOR Sharon Shoham-Buchbind
Full Thesis textFull thesis text - English Version


Web extraction is an important research topic that has been studied extensively, receiving a lot of attention and focus. Large amounts of data are produced and consumed online in a continuous and growing rate. The ability to collect and analyze these data has become essential for enabling a wide range of applications and improving the effectiveness of modern businesses. Web extraction methods facilitate the collection and analysis of these data by transforming the human friendly data available online into structured information that can be automatically manipulated and analyzed.

In this work we address the data extraction problem from a software synthesis perspective. Our goal is not only to extract data from web-sites, but also to synthesize programs that extract the data from these sites. The growing popularity of data extraction query languages and their increasing use in a wide variety of applications make them a natural target for automatic synthesis methods. Another motivation for using synthesis for web extraction related applications is the fact that web applications are often generated dynamically using template code. Reverse engineering web pages at the page and site level may facilitate - among other applications - unsupervised web extraction.

 First, we focus on the problem of automatic synthesis of web-crawlers for a family of web-sites that contain the same kind of information but differ on layout and formatting. We propose a method that uses the data shared among sites from the same category in order to decrease or eliminate the manual tagging needed for generating extraction schemes for these sites.  We use the data on one site to identify data on another site. The identified data is then used to learn the website structure and synthesize an appropriate extraction scheme. This process iterates, as synthesized extraction schemes result in additional data to be used for re-learning the website structure.

 In our second work, we address the problem of synthesizing robust data extractors from a family of web-sites that contain the same kind of information. Robust data extractors are more likely to withstand structural changes in the data-source, and can therefore reduce the data extractor's maintenance cost. We introduce and implement the idea of forgiving extractors that dynamically adjust their precision to handle structural changes, without the need to sacrifice precision upfront.

Finally, to address the unsupervised data extraction problem, we propose a solution to the more general problem of separation of web-pages into template-code and data.  Web pages are often served by running layout code on data, producing an HTML document that formats the data into a human readable and elegant presentation.  We considered the opposite task: separating a given web page into a data component and a layout program.  This separation has various important applications including unsupervised data extraction, traffic compression, data migration and template-code simplification.  In our last work, we generalized our separation approach to address the problem of site-level separation.