Open Access. Powered by Scholars. Published by Universities.®

Computer Sciences Commons

Open Access. Powered by Scholars. Published by Universities.®

Andrew McCallum

2007

Information extraction

Articles 1 - 1 of 1

Full-Text Articles in Computer Sciences

Canonicalization Of Database Records Using Adaptive Similarity Measures, Aron Culotta, Michael Wick, Robert Hall, Matthew Marzilli, Andrew Mccallum Jan 2007

Canonicalization Of Database Records Using Adaptive Similarity Measures, Aron Culotta, Michael Wick, Robert Hall, Matthew Marzilli, Andrew Mccallum

Andrew McCallum

It is becoming increasingly common to construct databases from information automatically culled from many heterogeneous sources. For example, a research publication database can be constructed by automatically extracting titles, authors, and conference information from papers and their references. A common difficulty in consolidating data from multiple sources is that records are referenced in a variety of ways (e.g. abbreviations, aliases, and misspellings). Therefore, it can be difficult to construct a single, standard representation to present to the user. We refer to the task of constructing this representation as canonicalization. Despite its importance, there is very little existing work on canonicalization. …