Technion - Israel Institute of Technology - Graduate School
M.Sc Thesis
M.Sc Student: Marie Anan
Subject: Second Line Schema Matchers
Department: Department of Computer Science
Supervisors: Professor Carmel Domshlak, Professor Avigdor Gal
Full Thesis Text: English version


Abstract

Schema matching is the task of matching concepts that describe the meaning of data in heterogeneous, distributed data sources. It is recognized as one of the basic operations required by data and schema integration, and thus has a great impact on the outcome of that process. Over the years, a significant body of work has been devoted to identifying schema matchers, that is, heuristics for schema matching. This approach has not yet delivered satisfactory results, perhaps because the right "silver bullet" is yet to be found among the existing approaches, or perhaps because we have been searching in the wrong place all along. In this work we introduce the notion of second line schema matchers: matchers that operate on the outcome of other matchers in order to improve it. The motivation for second line schema matchers is the lack of industrial strength of existing matchers. We use examples to classify existing matchers into first line and second line matchers, and provide an analytical comparison of two common second line matchers. We then introduce five new second line heuristics and compare their performance through a thorough empirical analysis of 230 schemata. Two of the heuristics use machine learning techniques, namely Naive Bayes and boosting. The former has been used in the context of schema matching before, but not for second line matchers. We empirically analyze the properties of the Naive Bayes heuristic using both real-world and synthetic data. Our empirical analysis shows that the proposed heuristic performs well, given an accurate modeling of uncertainty in matcher decision making. We also discuss the current limitations of the heuristic, in particular its naive assumption of matcher independence. The latter, the boosting approach, has to the best of our knowledge never been used for schema matching. Our comparative analysis shows that boosting works best for improving the outcome of first line schema matchers.
We also show that the schema matching boosting algorithm can be improved by filtering the learning dataset to eliminate “disturbing” examples.
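To make the notion of a second line matcher concrete, the following minimal Python sketch takes the similarity scores produced by several first line matchers and derives a single set of matches from them. It is an illustration only: the averaging rule, the threshold, the function name `average_second_line_matcher`, and the attribute names and scores are all assumptions made for this example, and they are not one of the five heuristics introduced in the thesis.

```python
from typing import Dict, List, Tuple

# An attribute pair (source attribute, target attribute) and its similarity score.
Pair = Tuple[str, str]
SimilarityMatrix = Dict[Pair, float]


def average_second_line_matcher(
    first_line_outputs: List[SimilarityMatrix],
    threshold: float = 0.5,
) -> List[Pair]:
    """Average the scores of several first line matchers and keep the pairs
    whose mean score reaches the threshold (an illustrative rule only)."""
    if not first_line_outputs:
        return []

    # Collect every attribute pair scored by at least one first line matcher.
    pairs = set()
    for output in first_line_outputs:
        pairs.update(output)

    selected = []
    for pair in sorted(pairs):
        # A matcher that did not score a pair is treated as assigning it 0.0.
        total = sum(output.get(pair, 0.0) for output in first_line_outputs)
        if total / len(first_line_outputs) >= threshold:
            selected.append(pair)
    return selected


# Hypothetical outputs of two first line matchers (made-up scores).
name_matcher = {
    ("cust_name", "customerName"): 0.9,
    ("cust_name", "city"): 0.1,
    ("town", "city"): 0.3,
}
type_matcher = {
    ("cust_name", "customerName"): 0.8,
    ("cust_name", "city"): 0.6,
    ("town", "city"): 0.9,
}

matches = average_second_line_matcher([name_matcher, type_matcher])
print(matches)
```

In this toy input the pair ("town", "city") is kept because the second matcher compensates for the low score of the first, while ("cust_name", "city") is dropped. A learned second line matcher, such as the Naive Bayes or boosting heuristics named in the abstract, would replace the fixed averaging rule with a model trained on examples.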