Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Articles 1 - 9 of 9

Full-Text Articles in Physical Sciences and Mathematics

Discovering Issue-Based Voting Groups Within The US Senate, Rachel Shorey, Andrew McCallum, Hanna Wallach Dec 2010

Andrew McCallum

Members of the US Senate cast votes on a wide array of issues. Understanding a senator's position on an issue is important to constituents, sources of campaign funding, and groups seeking to persuade senators or build consensus. Classifying senators' positions often falls into the hands of interest groups. Many lobbyists and issue-based organizations give senators scores based on the number of times senators vote in accordance with the organization's ideals. Organization staff must choose which bills to consider and then investigate their content manually. To produce more objective and replicable rankings, political scientists have developed statistical models to group and …
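The interest-group scoring described in this abstract can be illustrated with a minimal sketch. It assumes a score is simply the fraction of an organization's chosen bills on which a senator voted the organization's preferred way. The function name `agreement_score`, the bill identifiers, and the vote labels are hypothetical, and this is not the statistical model the paper goes on to develop.

```python
from typing import Dict


def agreement_score(votes: Dict[str, str], preferred: Dict[str, str]) -> float:
    """Fraction of the organization's scored bills on which the senator's
    vote matches the organization's preferred position."""
    # Only bills the organization chose to score, and on which the senator voted.
    scored = [bill for bill in preferred if bill in votes]
    if not scored:
        return 0.0
    matches = sum(1 for bill in scored if votes[bill] == preferred[bill])
    return matches / len(scored)


# Hypothetical data: the organization scores three bills.
preferred = {"bill_a": "yea", "bill_b": "nay", "bill_c": "yea"}
senator_votes = {"bill_a": "yea", "bill_b": "yea", "bill_c": "yea", "bill_d": "nay"}

score = agreement_score(senator_votes, preferred)
print(f"agreement score: {score:.2f}")
```

The choice of which bills go into `preferred` is made by hand, which is the source of subjectivity the abstract points to.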


Distantly Labeling Data For Large Scale Cross-Document Coreference, Sameer Singh, Michael Wick, Andrew McCallum May 2010

Andrew McCallum

Cross-document coreference, the problem of resolving entity mentions across multi-document collections, is crucial to automated knowledge base construction and data mining tasks. However, the scarcity of large labeled data sets has hindered supervised machine learning research for this task. In this paper we develop and demonstrate an approach based on “distantly-labeling” a data set from which we can train a discriminative cross-document coreference model. In particular we build a dataset of more than a million people mentions extracted from 3.5 years of New York Times articles, leverage Wikipedia for distant labeling with a generative model (and measure the reliability of …


Scalable Probabilistic Databases With Factor Graphs And MCMC, Michael Wick, Andrew McCallum, Gerome Miklau May 2010

Andrew McCallum

Probabilistic databases play a crucial role in the management and understanding of uncertain data. However, incorporating probabilities into the semantics of incomplete databases has posed many challenges, forcing systems to sacrifice modeling power or scalability, or to restrict the class of relational algebra formulae under which they are closed. We propose an alternative approach where the underlying relational database always represents a single world, and an external factor graph encodes a distribution over possible worlds; Markov chain Monte Carlo (MCMC) inference is then used to recover this uncertainty to a desired level of fidelity. Our approach allows the efficient evaluation of arbitrary …
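The single-world idea in this abstract can be sketched in a few lines. This is a toy under loud assumptions, not the authors' system: the "database" is a dictionary marking which tuples are present, the "factor graph" is reduced to one independent log-weight per tuple, and the sampler is a plain Metropolis-Hastings chain that flips one tuple at a time. The names `estimate_query_probability`, `weights`, and `query` are hypothetical.

```python
import math
import random
from typing import Callable, Dict


def estimate_query_probability(
    weights: Dict[str, float],
    query: Callable[[Dict[str, bool]], bool],
    num_samples: int = 20000,
    seed: int = 0,
) -> float:
    """Estimate P(query is true) by sampling possible worlds with MCMC.

    The unnormalized log-probability of a world is the sum of the weights
    of the tuples present in it (unary factors only, an assumption made
    to keep the sketch small).
    """
    rng = random.Random(seed)
    # The database always holds exactly one concrete world.
    world = {t: False for t in weights}
    tuples = list(weights)
    hits = 0
    for _ in range(num_samples):
        # Propose flipping the presence of one randomly chosen tuple.
        t = rng.choice(tuples)
        delta = -weights[t] if world[t] else weights[t]
        # Metropolis-Hastings acceptance for a symmetric proposal.
        if delta >= 0 or rng.random() < math.exp(delta):
            world[t] = not world[t]
        # Evaluate the query on the current single world.
        if query(world):
            hits += 1
    return hits / num_samples


# Hypothetical example: two uncertain tuples and an "is either present?" query.
weights = {"tuple_a": 1.0, "tuple_b": -1.0}
estimate = estimate_query_probability(weights, lambda w: w["tuple_a"] or w["tuple_b"])
print(f"estimated query probability: {estimate:.3f}")
```

Because the query only ever runs against one ordinary world at a time, it can be any function of that world; drawing more samples trades time for a tighter estimate.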


High-Performance Semi-Supervised Learning Using Discriminatively Constrained Generative Models, Gregory Druck, Andrew McCallum Jan 2010

Andrew McCallum

We develop a semi-supervised learning algorithm that encourages generative models to discover latent structure that is relevant to a prediction task. The method constrains the posterior distribution of latent variables under a generative model to satisfy a rich set of feature expectation constraints from labeled data. We focus on the application of this method to sequence labeling and estimate parameters with a modified EM algorithm. The E-step involves estimating the parameters of a log-linear model with an HMM as the base distribution. This HMM-CRF can be used for test time prediction. The approach is related to other semi-supervised methods, but …


Collective Cross-Document Relation Extraction Without Labelled Data, Limin Yao, Sebastian Riedel, Andrew McCallum Jan 2010

Andrew McCallum

We present a novel approach to relation extraction that integrates information across documents, performs global inference and requires no labelled text. In particular, we tackle relation extraction and entity identification jointly. We use distant supervision to train a factor graph model for relation extraction based on an existing knowledge base (Freebase, derived in parts from Wikipedia). For inference we run an efficient Gibbs sampler that leads to linear time joint inference. We evaluate our approach both for an in-domain (Wikipedia) and a more realistic out-of-domain (New York Times Corpus) setting. For the in-domain setting, our joint model leads to 4% …


Modeling Relations And Their Mentions Without Labeled Text, Sebastian Riedel, Limin Yao, Andrew McCallum Jan 2010

Andrew McCallum

Several recent works on relation extraction have been applying the distant supervision paradigm: instead of relying on annotated text to learn how to predict relations, they employ existing knowledge bases (KBs) as a source of supervision. Crucially, these approaches are trained based on the assumption that each sentence which mentions the two related entities is an expression of the given relation. Here we argue that this leads to noisy patterns that hurt precision, in particular if the knowledge base is not directly related to the text we are working with. We present a novel approach to distant supervision that can alleviate …


Constraint-Driven Rank-Based Learning For Information Extraction, Sameer Singh, Limin Yao, Sebastian Riedel, Andrew McCallum Jan 2010

Andrew McCallum

Most learning algorithms for factor graphs require complete inference over the dataset or an instance before making an update to the parameters. SampleRank is a rank-based learning framework that alleviates this problem by updating the parameters during inference. Most semi-supervised learning algorithms also rely on complete inference, i.e., calculating expectations or MAP configurations. We extend the SampleRank framework to semi-supervised learning, avoiding these inference bottlenecks. Different approaches for incorporating unlabeled data and prior knowledge into this framework are explored. We evaluated our method on a standard information extraction dataset. Our approach outperforms the supervised method significantly and matches …


Resource-Bounded Information Extraction: Acquiring Missing Feature Values On Demand, Pallika Kanani, Andrew McCallum, Shaohan Hu Jan 2010

Andrew McCallum

We present a general framework for the task of extracting specific information “on demand” from a large corpus such as the Web under resource constraints. Given a database with missing or uncertain information, the proposed system automatically formulates queries, issues them to a search interface, selects a subset of the documents, extracts the required information from them, and fills the missing values in the original database. We also exploit inherent dependency within the data to obtain useful information with fewer computational resources. We build such a system in the citation database domain that extracts the missing publication years using limited resources …


Optimizing Semantic Coherence In Topic Models, D. Mimno, H. Wallach, E. Talley, M. Leenders, Andrew McCallum Jan 2010

Andrew McCallum

Large organizations often face the critical challenge of sharing information and maintaining connections between disparate subunits. Tools for automated analysis of document collections, such as topic models, can provide an important means for communication. The value of topic modeling is in its ability to discover interpretable, coherent themes from unstructured document sets, yet it is not unusual to find semantic mismatches that substantially reduce user confidence. In this paper, we first present an expert-driven topic annotation study, undertaken in order to obtain an annotated set of baseline topics and their distinguishing characteristics. We then present a metric for detecting poor-quality …