M.Sc Thesis

M.Sc StudentAnaby-Tavor Ateret
SubjectEnhancing the Formal Similarity Based Matching Model
DepartmentDepartment of Industrial Engineering and Management
Supervisor PROF. Avigdor Gal


The ambiguous interpretation of concepts, describing the meaning of data in heterogeneous data sources is commonly known as semantic heterogeneity.

Semantic heterogeneity is a well-known obstacle to data source integration. Semantic heterogeneity is resolved through a process of semantic reconciliation, which matches concepts from heterogeneous data sources. The shift towards machine understandable Web resources has unearthed the importance of automatic semantic reconciliation. However, automatic matching carries with it a degree of uncertainty, as it is based on syntactic, rather than semantic means. Moreover, it is commonly acknowledged that no automatic semantic reconciliation algorithm can be assumed to identify the exact mapping (as analyzed by human) in all scenarios.

This work addresses the issue of the inherent uncertainty in the automatic matching process outcome, a key problem in the schema integration process.

The thesis is twofold: We first present the monotonicity principle, which is a desired attribution of a matching algorithm. We demonstrate through theoretical and empirical analysis that for a certain family of ''well behaved'' mappings (we term monotonic), one can safely interpret a high similarity measure as a good semantic mapping. We assert that this formal model offers a solid foundation for analyzing the quality of existing matching algorithms (a joint work with A. Trombetta, Università dell’Insubria, Italy). Next, we propose three efficient algorithms for identifying the exact mapping among the K mappings with highest rank (a joint work with A. Moss, Netanya Academic College). Given two schemata with n1 and n2 concepts, the first algorithm identifies the second best mapping in a time complexity of O(n) C(Abest), where n=max(n1,n2), Abest is the best (complexity-wise) algorithm for finding a maximum weight mapping in a bipartite graph, and C(Abest) denotes the complexity of the algorithm Abest. The second algorithm identifies the second best mapping in a time complexity of O(n3), regardless of Abest. We then show that these algorithms do not scale for a general K. Therefore, we suggest a polynomial algorithm (O(Kn) C(Abest)) for identifying top-K mappings.

We assert that an algorithm for mapping schemata that obeys the monotonicity principle would rank the exact mapping close to the best one (as generated by a matching algorithm). Thus, a system (such as a software agent) that is provided with a monotonic algorithm and equipped with our top-K search algorithms and a user query, can rewrite K queries, each with a slightly different mapping in its quest for the exact mapping.