Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Computer Sciences

Brigham Young University

Series

2008

Data deduplication

Full-Text Articles in Physical Sciences and Mathematics

Learning-Based Fusion For Data Deduplication, Sabra Dinerstein, Parris K. Egbert, Stephen W. Clyde, Jared Dinerstein (Dec 2008)

Faculty Publications

Rule-based deduplication utilizes expert domain knowledge to identify and remove duplicate data records. Achieving high accuracy in a rule-based system requires the creation of rules containing a good combination of discriminatory clues. Unfortunately, accurate rule-based deduplication often requires significant manual tuning of both the rules and the corresponding thresholds. This need for manual tuning reduces the efficacy of rule-based deduplication and its applicability to real-world data sets. No adequate solution exists for this problem. We propose a novel technique for rule-based deduplication. We apply individual deduplication rules, and combine the resultant match scores via learning-based information fusion. We show empirically …
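
The general idea in the abstract (score each record pair with several independent rules, then let a learned model combine the scores instead of hand-tuning thresholds) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the abstract does not say which rules or which learner the authors use, so the three rules (`name_rule`, `email_rule`, `zip_rule`), the record fields, the toy labelled pairs, and the choice of logistic regression as the fusion model are all hypothetical stand-ins, not the paper's method.

```python
import math
from difflib import SequenceMatcher

# Hypothetical rules: each maps a pair of records to a match score in [0, 1].
def name_rule(a, b):
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

def email_rule(a, b):
    return 1.0 if a["email"].lower() == b["email"].lower() else 0.0

def zip_rule(a, b):
    return 1.0 if a["zip"] == b["zip"] else 0.0

RULES = [name_rule, email_rule, zip_rule]

def rule_scores(a, b):
    """Apply every rule independently and collect the match scores."""
    return [rule(a, b) for rule in RULES]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_fusion(pairs, labels, epochs=2000, lr=0.5):
    """Fit logistic-regression weights over the rule scores with plain SGD.

    Logistic regression is an assumed stand-in for the learned fusion step.
    """
    feats = [rule_scores(a, b) for a, b in pairs]
    weights = [0.0] * len(RULES)
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of the log loss with respect to the logit
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

def fused_score(model, a, b):
    """Combine the individual rule scores into one duplicate probability."""
    weights, bias = model
    x = rule_scores(a, b)
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Made-up records and labels, purely for demonstration.
r1 = {"name": "Jon Smith", "email": "jon@example.com", "zip": "84602"}
r2 = {"name": "John Smith", "email": "jon@example.com", "zip": "84602"}
r3 = {"name": "Mary Jones", "email": "mary@example.com", "zip": "84602"}
r4 = {"name": "Mary Jones", "email": "mjones@example.org", "zip": "84321"}

pairs = [(r1, r2), (r1, r3), (r3, r4), (r2, r4)]
labels = [1, 0, 1, 0]  # 1 = duplicate, 0 = distinct

model = train_fusion(pairs, labels)
for a, b in pairs:
    print(a["name"], "|", b["name"], "->", round(fused_score(model, a, b), 3))
```

In this sketch the learner, rather than a person, decides how much each rule counts: the shared ZIP code in the toy data appears in both a duplicate and a non-duplicate pair, so the fitted weights have to lean on the other rules to separate them. A real system would need far more labelled pairs and held-out evaluation.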