Technion - Israel Institute of Technology - Graduate School
M.Sc Thesis
M.Sc Student: Golan Sapir
Subject: Duplicate Representative Problem in Probabilistic Entity Resolution
Department: Department of Industrial Engineering and Management
Supervisors: Professor Avigdor Gal, Dr. Tamir Hazan
Full Thesis text: English Version


Abstract

Entity resolution is a fundamental problem in data integration, which deals with combining data from different sources to create a unified view of the data. Entity resolution is an inherently uncertain process because the decision to map a set of records to the same block cannot be made with certainty unless the records are identical in all of their attributes or share a common key attribute. One of the challenges of creating a unified view in such a setting is the selection of a representative record from each block. This representative should capture as many as possible of the characteristics and as much as possible of the information stored in the block's tuples. Contemporary blocking algorithms choose block representatives based solely on the similarity of tuples within the block and ignore other information, such as the number of unique representatives. In this work, we introduce a representative selection approach that considers the presence of each tuple in all blocks and minimizes the likelihood that the same tuple will be selected as the representative of several blocks. This approach uses the ConvexBP algorithm to select the most appropriate representative per block. We report on a thorough empirical analysis using synthetic datasets, which demonstrates the efficiency of our approach.