Open Access. Powered by Scholars. Published by Universities.®

Digital Commons Network

Computer Sciences

Utah State University

Theses/Dissertations

2010

Active learning

Learning-Based Fusion For Data Deduplication: A Robust And Automated Solution, Jared Dinerstein Dec 2010

All Graduate Theses and Dissertations, Spring 1920 to Summer 2023

This thesis presents two deduplication techniques that overcome two critical and long-standing weaknesses of rule-based deduplication: (1) traditional rule-based deduplication requires significant manual tuning of the individual rules, including the selection of appropriate thresholds; and (2) the accuracy of rule-based deduplication degrades when data values are missing, which significantly reduces the efficacy of the expert-defined deduplication rules.

The first technique is a novel rule-level match-score fusion algorithm that employs kernel-machine-based learning to automatically discover the decision threshold for the overall system. The second is a novel clue-level match-score fusion algorithm that addresses both weaknesses (1) and (2). This unique solution …
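
To make the idea of rule-level match-score fusion concrete, here is a minimal sketch in Python. It is not the thesis's algorithm: the abstract does not say which kernel machine, features, or training data are used, so this example substitutes a simple kernel perceptron with a Gaussian kernel, and the two rules ("name" and "address"), their scores, and all function names are hypothetical. Each record pair is represented by the vector of scores that the individual rules return, and the learner finds the decision boundary over those vectors from labeled examples, so no per-rule threshold is tuned by hand.

```python
import math


def rbf_kernel(a, b, gamma=2.0):
    """Gaussian (RBF) kernel between two match-score vectors."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def decision(x, samples, labels, alpha):
    """Signed fused score for x: positive means 'duplicate'."""
    return sum(
        a * y * rbf_kernel(s, x)
        for a, y, s in zip(alpha, labels, samples)
        if a  # only samples that caused a training mistake contribute
    )


def train_kernel_perceptron(samples, labels, epochs=50):
    """Learn dual coefficients; labels are +1 (duplicate) or -1 (distinct)."""
    alpha = [0] * len(samples)
    for _ in range(epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(samples, labels)):
            if y * decision(x, samples, labels, alpha) <= 0:
                alpha[i] += 1  # misclassified: give this sample more weight
                mistakes += 1
        if mistakes == 0:  # every training pair is classified correctly
            break
    return alpha


def is_duplicate(x, samples, labels, alpha):
    """Fused decision for one record pair's rule-score vector."""
    return decision(x, samples, labels, alpha) > 0


# Hypothetical labeled pairs: (name_rule_score, address_rule_score).
SAMPLES = [
    (0.90, 0.80), (0.95, 0.90), (0.85, 0.95), (0.80, 0.85),  # duplicates
    (0.20, 0.10), (0.10, 0.30), (0.30, 0.20), (0.15, 0.15),  # distinct
]
LABELS = [1, 1, 1, 1, -1, -1, -1, -1]
ALPHA = train_kernel_perceptron(SAMPLES, LABELS)

if __name__ == "__main__":
    for pair in [(0.9, 0.9), (0.1, 0.1)]:
        print(pair, is_duplicate(pair, SAMPLES, LABELS, ALPHA))
```

The point of the sketch is the design choice rather than the particular learner: the rules only produce scores, and the single fused decision is learned from examples instead of being set rule by rule. It does not attempt to model the clue-level technique or the handling of missing values, since the abstract does not describe them.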