Articles 1 - 6 of 6
Distantly Labeling Data For Large Scale Cross-Document Coreference, Sameer Singh, Michael Wick, Andrew McCallum
Cross-document coreference, the problem of resolving entity mentions across multi-document collections, is crucial to automated knowledge base construction and data mining tasks. However, the scarcity of large labeled data sets has hindered supervised machine learning research for this task. In this paper we develop and demonstrate an approach based on “distantly-labeling” a data set from which we can train a discriminative cross-document coreference model. In particular we build a dataset of more than a million people mentions extracted from 3.5 years of New York Times articles, leverage Wikipedia for distant labeling with a generative model (and measure the reliability of …
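The distant-labeling idea in this abstract can be illustrated with a toy sketch: mentions are linked to a Wikipedia-style inventory of entries, and two mentions are labeled coreferent when they resolve to the same entry. All entries, aliases, and the disambiguation-by-context heuristic below are invented for illustration; this is not the paper's generative model.

```python
# Toy distant labeling for cross-document coreference.
# Entries and aliases are invented; a real system would use Wikipedia.

WIKI_TITLES = {
    "john smith (explorer)": {"john smith", "captain john smith"},
    "john smith (economist)": {"john smith", "j. smith", "professor smith"},
}

def distant_label(mention, context_keywords, entries):
    """Link a mention to the entry whose aliases match it, using the
    document's context words to break ties between ambiguous entries."""
    candidates = [
        title for title, aliases in entries.items()
        if mention.lower() in aliases
    ]
    if len(candidates) == 1:
        return candidates[0]
    # Disambiguate with context words (e.g. "explorer" vs. "economist").
    for title in candidates:
        if any(kw in title for kw in context_keywords):
            return title
    return None

def coreferent(m1, ctx1, m2, ctx2, entries=WIKI_TITLES):
    """Two mentions get a distant coreference label iff they link to
    the same entry."""
    e1 = distant_label(m1, ctx1, entries)
    e2 = distant_label(m2, ctx2, entries)
    return e1 is not None and e1 == e2
```

Pairs labeled this way can then serve as (noisy) training data for a discriminative coreference model.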
Unsupervised Deduplication Using Cross-Field Dependencies, Robert Hall, Charles Sutton, Andrew McCallum
Recent work in deduplication has shown that collective deduplication of different attribute types can improve performance. But although these techniques cluster the attributes collectively, they do not model them collectively. For example, in citations in the research literature, canonical venue strings and title strings are dependent---because venues tend to focus on a few research areas---but this dependence is not modeled by current unsupervised techniques. We call this dependence between fields in a record a cross-field dependence. In this paper, we present an unsupervised generative model for the deduplication problem that explicitly models cross-field dependence. Our model uses a single set …
Canonicalization Of Database Records Using Adaptive Similarity Measures, Aron Culotta, Michael Wick, Robert Hall, Matthew Marzilli, Andrew McCallum
It is becoming increasingly common to construct databases from information automatically culled from many heterogeneous sources. For example, a research publication database can be constructed by automatically extracting titles, authors, and conference information from papers and their references. A common difficulty in consolidating data from multiple sources is that records are referenced in a variety of ways (e.g. abbreviations, aliases, and misspellings). Therefore, it can be difficult to construct a single, standard representation to present to the user. We refer to the task of constructing this representation as canonicalization. Despite its importance, there is very little existing work on canonicalization. …
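One simple baseline for the canonicalization task described here is to pick, among the observed variants of a record, the string closest on average to all the others (a medoid under edit distance). This sketch uses plain Levenshtein distance; the paper's adaptive, learned similarity measures would take its place.

```python
def edit_distance(a, b):
    """Standard single-row dynamic-programming Levenshtein distance."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[len(b)]

def canonicalize(variants):
    """Return the variant minimizing total edit distance to the others:
    a medoid, serving as the canonical representation."""
    return min(variants,
               key=lambda v: sum(edit_distance(v, w) for w in variants))
```

The medoid criterion favors a representation that no observed variant is far from, which is one natural reading of "a single, standard representation to present to the user."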
Corrective Feedback And Persistent Learning For Information Extraction, Aron Culotta, Trausti Kristjansson, Andrew McCallum, Paul Viola
To successfully embed statistical machine learning models in real world applications, two post-deployment capabilities must be provided: (1) the ability to solicit user corrections and (2) the ability to update the model from these corrections. We refer to the former capability as corrective feedback and the latter as persistent learning. While these capabilities have a natural implementation for simple classification tasks such as spam filtering, we argue that a more careful design is required for structured classification tasks. One example of a structured classification task is information extraction, in which raw text is analyzed to automatically populate a database. In …
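Why structured tasks need more careful corrective-feedback design than spam filtering can be seen in a toy constrained decoder: clamping one user-corrected label and re-decoding lets the correction propagate to neighboring tokens through the sequence model. All scores, labels, and the exhaustive decoder below are invented for illustration.

```python
import itertools

LABELS = ["AUTHOR", "TITLE"]

def decode(emit_scores, transition_bonus=0.5, clamps=None):
    """Exhaustive search over label sequences (fine for short examples).

    emit_scores: per-token dicts {label: score} from a hypothetical model.
    clamps: {position: label} of user corrections that must be respected.
    A bonus for adjacent equal labels stands in for learned transitions.
    """
    clamps = clamps or {}
    best, best_score = None, float("-inf")
    for seq in itertools.product(LABELS, repeat=len(emit_scores)):
        if any(seq[i] != lab for i, lab in clamps.items()):
            continue  # violates a user correction
        score = sum(emit_scores[i][lab] for i, lab in enumerate(seq))
        score += sum(transition_bonus for a, b in zip(seq, seq[1:]) if a == b)
        if score > best_score:
            best, best_score = list(seq), score
    return best
```

In a simple classifier, a correction fixes only the corrected item; here, re-decoding under the clamp can also flip uncertain neighboring labels, which is the structured-task subtlety the abstract points to.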
Table Extraction For Answer Retrieval, Xing Wei, Bruce Croft, Andrew McCallum
The ability to find tables and extract information from them is a necessary component of question answering and other information retrieval tasks. Documents often contain tables in order to communicate densely packed, multidimensional information. Tables do this by employing layout patterns to efficiently indicate fields and records in two-dimensional form. Their rich combination of formatting and content presents difficulties for traditional retrieval techniques. This paper describes techniques for extracting tables from text and retrieving answers from the extracted information. We compare machine learning (especially conditional random fields) and heuristic methods for table extraction. Our approach creates a cell document, which …
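The cell-document idea mentioned at the end of the abstract can be sketched as follows: each data cell becomes a small retrievable document pairing its value with its row and column headers, so a keyword query can retrieve the answer cell directly. The table layout assumed here (first row = column headers, first column = row headers) and the overlap-based ranker are illustrative simplifications, not the paper's exact method.

```python
def cell_documents(table):
    """Build one small document per data cell, combining the cell value
    with its row and column headers so it is retrievable by keywords.

    table: list of rows; first row holds column headers, first column
    holds row headers (a simplifying assumption for this sketch)."""
    headers = table[0][1:]
    docs = []
    for row in table[1:]:
        row_header, cells = row[0], row[1:]
        for col_header, value in zip(headers, cells):
            docs.append({"text": f"{row_header} {col_header} {value}",
                         "answer": value})
    return docs

def retrieve(query, docs):
    """Rank cell documents by simple term overlap with the query."""
    q = set(query.lower().split())
    return max(docs, key=lambda d: len(q & set(d["text"].lower().split())))
```

Because each cell document carries its headers as context, a question's keywords match the answer cell without the retriever needing to understand two-dimensional layout at query time.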
Table Extraction Using Conditional Random Fields, David Pinto, Andrew McCallum, Xing Wei, W. Bruce Croft
The ability to find tables and extract information from them is a necessary component of data mining, question answering, and other information retrieval tasks. Documents often contain tables in order to communicate densely packed, multi-dimensional information. Tables do this by employing layout patterns to efficiently indicate fields and records in two-dimensional form. Their rich combination of formatting and content presents difficulties for traditional language modeling techniques, however. This paper presents the use of conditional random fields (CRFs) for table extraction, and compares them with hidden Markov models (HMMs). Unlike HMMs, CRFs support the use of many rich and overlapping layout …
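The key CRF advantage the abstract names — conditioning on many rich, overlapping features of the observed text, which an HMM's generative independence assumptions forbid — can be illustrated with a toy per-line feature extractor and a hand-weighted linear scorer. Feature names and weights are invented for illustration and are not the paper's actual feature set or trained model.

```python
import re

def line_features(line, prev_line):
    """Rich, overlapping layout features of a line and its neighbor,
    the kind a CRF can condition on jointly."""
    return {
        "has_digits": bool(re.search(r"\d", line)),
        "many_spaces": line.count("  ") >= 2,      # column-like gaps
        "starts_indented": line.startswith(" "),
        "prev_had_digits": bool(re.search(r"\d", prev_line)),
        "alpha_ratio_high": sum(c.isalpha() for c in line)
                            > 0.7 * max(len(line), 1),  # prose-like
    }

# Illustrative hand-set weights standing in for learned CRF parameters.
WEIGHTS = {
    "has_digits": 1.0, "many_spaces": 2.0, "starts_indented": 0.5,
    "prev_had_digits": 0.5, "alpha_ratio_high": -1.5,
}

def is_table_line(line, prev_line=""):
    """Score active features linearly; positive total -> table line."""
    feats = line_features(line, prev_line)
    return sum(WEIGHTS[f] for f, v in feats.items() if v) > 0
```

Note that `many_spaces`, `has_digits`, and `prev_had_digits` overlap and depend on neighboring observations; a discriminative model can weight them freely, whereas a generative HMM would have to model their joint distribution.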