Open Access. Powered by Scholars. Published by Universities.®

Digital Commons Network

Computer Sciences

University of Massachusetts Amherst

Information extraction

Articles 1 - 6 of 6

Full-Text Articles in Entire DC Network

Distantly Labeling Data For Large Scale Cross-Document Coreference, Sameer Singh, Michael Wick, Andrew McCallum, May 2010

Andrew McCallum

Cross-document coreference, the problem of resolving entity mentions across multi-document collections, is crucial to automated knowledge base construction and data mining tasks. However, the scarcity of large labeled data sets has hindered supervised machine learning research for this task. In this paper we develop and demonstrate an approach based on “distantly-labeling” a data set from which we can train a discriminative cross-document coreference model. In particular we build a dataset of more than a million people mentions extracted from 3.5 years of New York Times articles, leverage Wikipedia for distant labeling with a generative model (and measure the reliability of …


Unsupervised Deduplication Using Cross-Field Dependencies, Robert Hall, Charles Sutton, Andrew McCallum, Jan 2008

Andrew McCallum

Recent work in deduplication has shown that collective deduplication of different attribute types can improve performance. But although these techniques cluster the attributes collectively, they do not model them collectively. For example, in citations in the research literature, canonical venue strings and title strings are dependent (because venues tend to focus on a few research areas), but this dependence is not modeled by current unsupervised techniques. We call this dependence between fields in a record a cross-field dependence. In this paper, we present an unsupervised generative model for the deduplication problem that explicitly models cross-field dependence. Our model uses a single set …


Canonicalization Of Database Records Using Adaptive Similarity Measures, Aron Culotta, Michael Wick, Robert Hall, Matthew Marzilli, Andrew McCallum, Jan 2007

Andrew McCallum

It is becoming increasingly common to construct databases from information automatically culled from many heterogeneous sources. For example, a research publication database can be constructed by automatically extracting titles, authors, and conference information from papers and their references. A common difficulty in consolidating data from multiple sources is that records are referenced in a variety of ways (e.g. abbreviations, aliases, and misspellings). Therefore, it can be difficult to construct a single, standard representation to present to the user. We refer to the task of constructing this representation as canonicalization. Despite its importance, there is very little existing work on canonicalization. …


Corrective Feedback And Persistent Learning For Information Extraction, Aron Culotta, Trausti Kristjansson, Andrew McCallum, Paul Viola, Jan 2006

Andrew McCallum

To successfully embed statistical machine learning models in real world applications, two post-deployment capabilities must be provided: (1) the ability to solicit user corrections and (2) the ability to update the model from these corrections. We refer to the former capability as corrective feedback and the latter as persistent learning. While these capabilities have a natural implementation for simple classification tasks such as spam filtering, we argue that a more careful design is required for structured classification tasks. One example of a structured classification task is information extraction, in which raw text is analyzed to automatically populate a database. In …


Table Extraction For Answer Retrieval, Xing Wei, Bruce Croft, Andrew McCallum, Jan 2004

Andrew McCallum

The ability to find tables and extract information from them is a necessary component of question answering and other information retrieval tasks. Documents often contain tables in order to communicate densely packed, multidimensional information. Tables do this by employing layout patterns to efficiently indicate fields and records in two-dimensional form. Their rich combination of formatting and content presents difficulties for traditional retrieval techniques. This paper describes techniques for extracting tables from text and retrieving answers from the extracted information. We compare machine learning (especially conditional random fields) and heuristic methods for table extraction. Our approach creates a cell document, which …


Table Extraction Using Conditional Random Fields, David Pinto, Andrew McCallum, Xing Wei, W. Bruce Croft, Jan 2003

Andrew McCallum

The ability to find tables and extract information from them is a necessary component of data mining, question answering, and other information retrieval tasks. Documents often contain tables in order to communicate densely packed, multi-dimensional information. Tables do this by employing layout patterns to efficiently indicate fields and records in two-dimensional form. Their rich combination of formatting and content presents difficulties for traditional language modeling techniques, however. This paper presents the use of conditional random fields (CRFs) for table extraction, and compares them with hidden Markov models (HMMs). Unlike HMMs, CRFs support the use of many rich and overlapping layout …