Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Full-Text Articles in Physical Sciences and Mathematics

Confidence Estimation For Information Extraction, Aron Culotta, Andrew McCallum Jan 2004

Andrew McCallum

Information extraction techniques automatically create structured databases from unstructured data sources, such as the Web or newswire documents. Despite the successes of these systems, accuracy will always be imperfect. For many reasons, it is highly desirable to accurately estimate the confidence the system has in the correctness of each extracted field. The information extraction system we evaluate is based on a linear-chain conditional random field (CRF), a probabilistic model which has performed well on information extraction tasks because of its ability to capture arbitrary, overlapping features of the input in a Markov model. We implement several techniques to estimate the …


Sign Detection In Natural Images With Conditional Random Fields, Jerod Weinman, Allen Hanson, Andrew McCallum Jan 2004

Andrew McCallum

Traditional generative Markov random fields for segmenting images model the image data and corresponding labels jointly, which requires extensive independence assumptions for tractability. We present the conditional random field for an application in sign detection, using typical scale and orientation selective texture filters and a nonlinear texture operator based on the grating cell. The resulting model captures dependencies between neighboring image region labels in a data-dependent way that escapes the difficult problem of modeling image formation, instead focusing effort and computation on the labeling task. We compare the results of training the model with pseudo-likelihood against an approximation of the …


Table Extraction For Answer Retrieval, Xing Wei, Bruce Croft, Andrew McCallum Jan 2004

Andrew McCallum

The ability to find tables and extract information from them is a necessary component of question answering and other information retrieval tasks. Documents often contain tables in order to communicate densely packed, multidimensional information. Tables do this by employing layout patterns to efficiently indicate fields and records in two-dimensional form. Their rich combination of formatting and content presents difficulties for traditional retrieval techniques. This paper describes techniques for extracting tables from text and retrieving answers from the extracted information. We compare machine learning (especially conditional random fields) and heuristic methods for table extraction. Our approach creates a cell document, which …


Interactive Information Extraction With Constrained Conditional Random Fields, Trausti Kristjansson, Aron Culotta, Paul Viola, Andrew McCallum Jan 2004

Andrew McCallum

Information Extraction methods can be used to automatically "fill in" database forms from unstructured data such as Web documents or email. State-of-the-art methods have achieved low error rates but invariably make a number of errors. The goal of an Interactive Information Extraction system is to assist the user in filling in database fields while giving the user confidence in the integrity of the data. The user is presented with an interactive interface that allows both the rapid verification of automatic field assignments and the correction of errors. In cases where there are multiple errors, our system takes into account user corrections, …


Piecewise Training With Parameter Independence Diagrams: Comparing Globally- And Locally-Trained Linear-Chain CRFs, Andrew McCallum, Charles Sutton Jan 2004

Andrew McCallum

We present a diagrammatic formalism and practical methods for introducing additional independence assumptions into parameter estimation, enabling efficient training of undirected graphical models in locally-normalized pieces. On two real-world data sets we demonstrate our locally-trained linear-chain CRFs outperforming traditional CRFs, training in less than one-fifth the time and providing a statistically significant gain in accuracy.


An Exploration Of Entity Models, Collective Classification And Relation Description, Hema Raghavan, James Allan, Andrew McCallum Jan 2004

Andrew McCallum

Traditional information retrieval typically represents data using a bag of words; data mining typically uses a highly structured database ontology. This paper explores a middle ground we term entity models, in which questions about structured data may be posed and answered, but the complexities and task-specific restrictions of ontologies are avoided. An entity model is a language model or word distribution associated with an entity, such as a person, place or organization. Using these per-entity language models, entities may be clustered, links may be detected or described with a short summary, entities may be collectively classified, and question answering …


Collective Segmentation And Labeling Of Distant Entities In Information Extraction, Charles Sutton, Andrew McCallum Jan 2004

Andrew McCallum

In information extraction, we often wish to identify all mentions of an entity, such as a person or organization. Traditionally, a group of words is labeled as an entity based only on local information. But information from throughout a document can be useful; for example, if the same word is used multiple times, it is likely to have the same label each time. We present a CRF that explicitly represents dependencies between the labels of pairs of similar words in a document. On a standard information extraction data set, we show that learning these dependencies leads to a 13.7% reduction …


Classification Models For New Event Detection, Girdhar Kumaran, James Allan, Andrew McCallum Jan 2004

Andrew McCallum

New event detection (NED) involves monitoring news streams to detect the stories that report on new events. In this paper we explore the application of machine learning classification techniques to this task. We introduce the concept of triangulation with illustrative examples. We develop new features that build on this concept and on the named entities present in a document. The classifiers we developed showed significant and consistent improvement over the baseline vector space model system on all the collections we tested. Analysis of the performance of our classifiers suggests the utility of named entities, and the applicability of machine learning …


Accurate Information Extraction From Research Papers Using Conditional Random Fields, Fuchun Peng, Andrew McCallum Jan 2004

Andrew McCallum

With the increasing use of research paper search engines, such as CiteSeer, for both literature search and hiring decisions, the accuracy of such systems is of paramount importance. This paper employs Conditional Random Fields (CRFs) for the task of extracting various common fields from the headers and citations of research papers. The basic theory of CRFs is becoming well understood, but best practices for applying them to real-world data require additional exploration. This paper makes an empirical exploration of several factors, including variations on Gaussian, exponential and hyperbolic-L1 priors for improved regularization, and several classes of features and Markov order. On a …


A Note On Semi-Supervised Learning Using Markov Random Fields, Wei Li, Andrew McCallum Jan 2004

Andrew McCallum

This paper describes conditional-probability training of Markov random fields using combinations of labeled and unlabeled data. We capture the similarities between instances by learning the appropriate distance metric from the data. The likelihood model and several training procedures are presented.


Chinese Segmentation And New Word Detection Using Conditional Random Fields, Fuchun Peng, Fangfang Feng, Andrew McCallum Jan 2004

Andrew McCallum

Chinese word segmentation is a difficult, important and widely-studied sequence modeling problem. This paper demonstrates the ability of linear-chain conditional random fields (CRFs) to perform robust and accurate Chinese word segmentation by providing a principled framework that easily supports the integration of domain knowledge in the form of multiple lexicons of characters and words. We also present a probabilistic new word detection method, which further improves performance. Our system is evaluated on four datasets used in a recent comprehensive Chinese word segmentation competition. State-of-the-art performance is obtained.