Technion - Israel Institute of Technology - Graduate School
M.Sc Thesis
M.Sc Student: Jonathan Svirsky
Subject: A Generic Model for Incremental Entity Resolution
Department: Department of Industrial Engineering and Management
Supervisor: Full Professor Gal Avigdor
Full Thesis text: English Version


Abstract

Big data is commonly characterized by a set of “V”s, three of which have become the most prominent: volume, velocity, and variety. Volume refers to the amount of data to be gathered, managed, and analyzed. Large volumes of data are not foreign to data integration; still, many contemporary data integration systems perform poorly when the amount of integrated data is large. Velocity is also a concern for data integration, since integration tasks are often considered to be performed on-line. Variety is, in fact, the bread and butter of data integration: a massive body of work in the field aims at homogenizing heterogeneous data sources. The combination of the three “V”s challenges our ability to perform data integration tasks. In this work, we are concerned with the specific task of Entity Resolution (ER), a data integration task that aims at “cleaning” noisy data collections by identifying entity profiles, or simply entities, that represent the same real-world object. Exhaustive ER methods cannot scale to large volumes of data because of their inherently quadratic complexity: in principle, each entity has to be compared with all others in order to find its matches. We propose a generic algorithm for managing ER in an incremental manner, along with two instantiations of it built on two existing ER tools. We test their ability to handle varying benchmark datasets in a satisfactory manner. Motivated by the needs of big data, we show the usefulness of incremental ER in overcoming the volume, velocity, and variety challenges.
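To make the quadratic cost and the incremental idea concrete, the following is a minimal Python sketch and not the algorithm proposed in the thesis. The names `exhaustive_comparisons`, `IncrementalResolver`, `add_batch`, and `same_name` are hypothetical, and the matcher is a toy stand-in for an existing ER tool. The sketch only illustrates the interface of processing data batch by batch while keeping earlier results. It still compares each new profile with every stored one, so it does not by itself reduce the number of comparisons.

```python
from typing import Callable, Dict, List

Profile = Dict[str, str]
Matcher = Callable[[Profile, Profile], bool]


def exhaustive_comparisons(n: int) -> int:
    """Pairwise comparisons needed when each of n profiles meets all others."""
    return n * (n - 1) // 2


class IncrementalResolver:
    """Hypothetical sketch: keeps clusters of matching profiles across batches."""

    def __init__(self, matcher: Matcher) -> None:
        # The matcher is pluggable; here it stands in for an existing ER tool.
        self.matcher = matcher
        self.clusters: List[List[Profile]] = []

    def add_batch(self, batch: List[Profile]) -> None:
        """Resolve a new batch against the stored clusters without a full rerun."""
        for profile in batch:
            # Every stored cluster that contains at least one match.
            matched = [
                cluster for cluster in self.clusters
                if any(self.matcher(profile, other) for other in cluster)
            ]
            # Merge the new profile with all clusters it matches.
            merged = [profile]
            for cluster in matched:
                merged.extend(cluster)
            # Drop the merged clusters (compared by identity) and keep the rest.
            self.clusters = [
                cluster for cluster in self.clusters
                if not any(cluster is m for m in matched)
            ]
            self.clusters.append(merged)


def same_name(a: Profile, b: Profile) -> bool:
    """Toy matcher (an assumption): case-insensitive equality on 'name'."""
    return a["name"].lower() == b["name"].lower()


resolver = IncrementalResolver(same_name)
resolver.add_batch([{"name": "Ann Lee"}, {"name": "Bob Roy"}])
resolver.add_batch([{"name": "ann lee"}, {"name": "Cy Poe"}])
cluster_sizes = sorted(len(cluster) for cluster in resolver.clusters)
```

After the second batch, `resolver.clusters` holds three clusters, one of which groups the two spellings of the same name. For comparison, `exhaustive_comparisons(1000)` gives 499500 pairwise comparisons for only a thousand profiles, which is why exhaustive methods do not scale.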