Computer Sciences Commons

Articles 1 - 16 of 16

Full-Text Articles in Computer Sciences

A Conditional Model Of Deduplication For Multi-Type Relational Data, Aron Culotta, Andrew Mccallum Jan 2005

Record deduplication is the task of merging database records that refer to the same underlying entity. In relational databases, accurate deduplication for records of one type is often dependent on the merge decisions made for records of other types. Whereas nearly all previous approaches have merged records of different types independently, this work models these inter-dependencies explicitly to collectively deduplicate records of multiple types. We construct a conditional random field model of deduplication that captures these relational dependencies, and then employ a novel relational partitioning algorithm to jointly deduplicate records. We evaluate the system on two citation matching datasets, for …
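
The collective flavor of this work is easiest to see in the partitioning step: pairwise "same entity" scores must be turned into a clustering of records that is transitively consistent. The sketch below illustrates only that step with a greedy union-find pass; it is not the paper's conditional random field or its relational partitioning algorithm, and the `pair_scores` and threshold are hypothetical stand-ins for model output.

```python
# Minimal sketch: turn pairwise "same entity" scores into a transitive
# partition of records. Illustration of the partitioning step only; the
# paper uses a CRF with a dedicated relational partitioning algorithm.

def deduplicate(records, pair_scores, threshold=0.5):
    """records: list of ids; pair_scores: {(i, j): score in [0, 1]}."""
    parent = {r: r for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    # Merge the most confident pairs first so weak evidence cannot
    # override strong evidence through transitivity.
    for (i, j), score in sorted(pair_scores.items(), key=lambda kv: -kv[1]):
        if score >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for r in records:
        clusters.setdefault(find(r), []).append(r)
    return list(clusters.values())


if __name__ == "__main__":
    recs = ["a1", "a2", "a3", "b1"]
    scores = {("a1", "a2"): 0.9, ("a2", "a3"): 0.8, ("a1", "b1"): 0.1}
    print(deduplicate(recs, scores))   # [['a1', 'a2', 'a3'], ['b1']]
```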


Group And Topic Discovery From Relations And Text, Xuerui Wang, Natasha Mohanty, Andrew Mccallum Jan 2005

We present a probabilistic generative model of entity relationships and textual attributes that simultaneously discovers groups among the entities and topics among the corresponding text. Block-models of relationship data have been studied in social network analysis for some time. Here we simultaneously cluster in several modalities at once, incorporating the words associated with certain relationships. Significantly, joint inference allows the discovery of groups to be guided by the emerging topics, and vice-versa. We present experimental results on two large data sets: sixteen years of bills put before the U.S. Senate, comprising their corresponding text and voting records, and 43 years …
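
A toy generative sketch of the kind of coupling the abstract describes: each bill draws a topic, the topic generates the bill's words, and how an entity votes depends on the group it belongs to under that topic. This is an illustrative simplification, not the paper's model or its inference procedure; every distribution, size, and variable name below is made up.

```python
# Toy generative sketch (not the paper's model): topics generate a bill's
# words, and the vote an entity casts depends on its group *under that
# topic*, which is what couples the text and the relations.
import numpy as np

rng = np.random.default_rng(0)
n_topics, n_groups, n_entities, vocab = 3, 2, 10, 50

topic_word = rng.dirichlet(np.ones(vocab), size=n_topics)           # word dist per topic
entity_group = rng.integers(n_groups, size=(n_topics, n_entities))  # group per (topic, entity)
group_vote = rng.beta(1, 1, size=(n_topics, n_groups))              # P(vote "yes")

def generate_bill(n_words=20):
    topic = rng.integers(n_topics)
    words = rng.choice(vocab, size=n_words, p=topic_word[topic])
    votes = rng.random(n_entities) < group_vote[topic, entity_group[topic]]
    return topic, words, votes

topic, words, votes = generate_bill()
print(topic, words[:5], votes.astype(int))
```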


Multi-Way Distributional Clustering Via Pairwise Interactions, Ron Bekkerman, Ran El-Yaniv, Andrew Mccallum Jan 2005

We present a novel unsupervised learning scheme that simultaneously clusters variables of several types (e.g., documents, words and authors) based on pairwise interactions between the types, as observed in co-occurrence data. In this scheme, multiple clustering systems are generated aiming at maximizing an objective function that measures multiple pairwise mutual information between cluster variables. To implement this idea, we propose an algorithm that interleaves top-down clustering of some variables and bottom-up clustering of the other variables, with a local optimization correction routine. Focusing on document clustering we present an extensive empirical study of two-way, three-way and four-way applications of our …
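
The quantity being maximized, mutual information between cluster variables, can be computed directly from a co-occurrence matrix once each variable's elements are assigned to clusters. The helper below computes that quantity alone (the interleaved top-down/bottom-up optimization is not shown); the matrix and cluster assignments are hypothetical.

```python
# Mutual information between two clustered variables, estimated from a
# raw co-occurrence count matrix. Illustration of the objective only.
import numpy as np

def clustered_mutual_info(cooc, row_clusters, col_clusters):
    """I(X~; Y~) for clustered rows (e.g. documents) and columns (e.g. words)."""
    n_rc, n_cc = max(row_clusters) + 1, max(col_clusters) + 1
    joint = np.zeros((n_rc, n_cc))
    for i, ci in enumerate(row_clusters):
        for j, cj in enumerate(col_clusters):
            joint[ci, cj] += cooc[i, j]
    joint /= joint.sum()
    px, py = joint.sum(axis=1, keepdims=True), joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (px @ py)[mask])).sum())

# Hypothetical example: 4 documents x 4 words, clustered into 2 x 2 clusters.
cooc = np.array([[5, 4, 0, 1],
                 [6, 3, 1, 0],
                 [0, 1, 7, 5],
                 [1, 0, 4, 6]])
print(clustered_mutual_info(cooc, [0, 0, 1, 1], [0, 0, 1, 1]))
```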


A Note On Topical N-Grams, Xuerui Wang, Andrew Mccallum Jan 2005

Most popular topic models (such as Latent Dirichlet Allocation) rely on an underlying bag-of-words assumption. However, text is in fact a sequence of discrete word tokens, and without considering word order (in other words, the nearby context in which a word occurs), the meaning of language cannot be captured accurately by word co-occurrences alone. In this sense, collocations of words (phrases) have to be considered. However, like individual words, phrases can be polysemous depending on the context. More noticeably, a composition of two (or more) words is a phrase in some contexts, but …
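
To make the motivation concrete, the sketch below scores adjacent word pairs by pointwise mutual information (PMI), a simple way to surface collocations that a bag-of-words model would treat as independent tokens. It is only a motivating illustration, not the topical n-gram model itself; the example text is made up.

```python
# Score adjacent word pairs by PMI to find candidate collocations.
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=2):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), c in bigrams.items():
        if c >= min_count:
            scores[(w1, w2)] = math.log((c / (n - 1)) /
                                        ((unigrams[w1] / n) * (unigrams[w2] / n)))
    return sorted(scores.items(), key=lambda kv: -kv[1])

text = ("the white house said the white house will respond "
        "while the house debated").split()
print(pmi_bigrams(text))   # recurring pairs such as "white house" score well above independence
```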


Automatic Categorization Of Email Into Folders: Benchmark Experiments On Enron And Sri Corpora, Ron Bekkerman, Andrew Mccallum, Gary Huang Jan 2005

Office workers everywhere are drowning in email---not only spam, but also large quantities of legitimate email to be read and organized for browsing. Although there have been extensive investigations of automatic document categorization, email gives rise to a number of unique challenges, and there has been relatively little study of classifying email into folders. This paper presents an extensive benchmark study of email foldering using two large corpora of real-world email messages and foldering schemes: one from former Enron employees, another from participants in an SRI research project. We discuss the challenges that arise from differences between email foldering and …


Composition Of Conditional Random Fields For Transfer Learning, Charles Sutton, Andrew Mccallum Jan 2005

Many learning tasks have subtasks for which much training data exists. Therefore, we want to transfer learning from the old, general-purpose subtask to a more specific new task, for which there is often less data. While work in transfer learning often considers how the old task should affect learning on the new task, in this paper we show that it helps to take into account how the new task affects the old. Specifically, we perform joint decoding of separately-trained sequence models, preserving uncertainty between the tasks and allowing information from the new task to affect predictions on the old task. On …
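
One simple way to picture joint decoding of two separately-trained chain models is a single Viterbi pass over the product of their label sets, adding both models' log-scores plus a per-position compatibility score that links the two label sets. The sketch below does exactly that; the `compat` coupling term and all score arrays are hypothetical stand-ins, not the paper's composed CRF.

```python
# Joint Viterbi over the product label space of two linear-chain models,
# with a per-position compatibility score coupling the two tasks.
import itertools
import numpy as np

def joint_viterbi(emit_a, trans_a, emit_b, trans_b, compat):
    """emit_*: (T, L*) log-scores; trans_*: (L*, L*); compat: (La, Lb) coupling."""
    T, La = emit_a.shape
    Lb = emit_b.shape[1]
    states = list(itertools.product(range(La), range(Lb)))

    def node(t, s):
        return emit_a[t, s[0]] + emit_b[t, s[1]] + compat[s[0], s[1]]

    delta = {s: node(0, s) for s in states}
    back = []
    for t in range(1, T):
        new_delta, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: delta[p]
                       + trans_a[p[0], s[0]] + trans_b[p[1], s[1]])
            new_delta[s] = (delta[prev] + trans_a[prev[0], s[0]]
                            + trans_b[prev[1], s[1]] + node(t, s))
            pointers[s] = prev
        delta = new_delta
        back.append(pointers)
    path = [max(states, key=lambda s: delta[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))        # one (label_a, label_b) pair per position

rng = np.random.default_rng(1)
T, La, Lb = 5, 3, 2
print(joint_viterbi(rng.normal(size=(T, La)), rng.normal(size=(La, La)),
                    rng.normal(size=(T, Lb)), rng.normal(size=(Lb, Lb)),
                    rng.normal(size=(La, Lb))))
```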


Feature Bagging: Preventing Weight Undertraining In Structured Discriminative Learning, Charles Sutton, Michael Sindelar, Andrew Mccallum Jan 2005

Discriminatively-trained probabilistic models are widely useful because of the latitude they afford in designing features. But training involves complex trade-offs among weights, which can be dangerous: a few highly-indicative features can swamp the contribution of many individually weaker features, causing their weights to be undertrained. Such a model is less robust, for the highly-indicative features may be noisy or missing in the test data. To ameliorate this weight undertraining, we propose a new training method, called feature bagging, in which separate models are trained on subsets of the original features, and combined using a mixture model or a product of …
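
The core recipe is simple enough to sketch with off-the-shelf classifiers: train one model per feature subset, then combine them so that no single strong feature can dominate training. Below is a minimal sketch using logistic regressions and a geometric-mean (product-of-experts) combination; the data and the two-way feature split are synthetic and hypothetical, and this is not the structured models used in the paper.

```python
# Feature bagging sketch: separate classifiers on feature subsets, combined
# as a product of experts (average of log-probabilities, renormalized).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + 0.3 * X[:, 1:].sum(axis=1) + 0.5 * rng.normal(size=200) > 0).astype(int)

feature_subsets = [[0, 1, 2], [3, 4, 5]]          # hypothetical bags of features
experts = [LogisticRegression().fit(X[:, s], y) for s in feature_subsets]

def predict_proba(X_new):
    # Average the experts' log-probabilities, then renormalize.
    log_p = np.mean([np.log(e.predict_proba(X_new[:, s]))
                     for e, s in zip(experts, feature_subsets)], axis=0)
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

print(predict_proba(X[:5]))
```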


Sparse Forward-Backward For Fast Training Of Conditional Random Fields, Charles Sutton, Chris Pal, Andrew Mccallum Jan 2005

Complex tasks in speech and language processing often include random variables with large state spaces, both in speech tasks that involve predicting words and phonemes, and in joint processing of pipelined systems, in which the state space can be the labeling of an entire sequence. In large state spaces, however, discriminative training can be expensive, because it often requires many calls to forward-backward. Beam search is a standard heuristic for controlling complexity during Viterbi decoding, but during forward-backward, standard beam heuristics can be dangerous, as they can make training unstable. We introduce sparse forward-backward, a variational perspective on beam methods …
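
A pruned forward pass gives the flavor of the problem: at every position only the highest-mass states are kept and renormalized, so the per-step cost depends on the beam size rather than the full state space. The sketch below is a plain beam heuristic for illustration, not the paper's variational sparse forward-backward; the emission and transition matrices are random placeholders.

```python
# Pruned ("sparse") forward pass: keep a small beam of states per step.
import numpy as np

def sparse_forward(emit, trans, beam=5):
    """emit: (T, S) state likelihoods; trans: (S, S) row-normalized transitions."""
    T, S = emit.shape
    alpha = emit[0] / emit[0].sum()                  # uniform prior assumed
    for t in range(1, T):
        keep = np.argsort(alpha)[-beam:]             # states kept in the beam
        w = alpha[keep] / alpha[keep].sum()          # renormalize the kept mass
        alpha = (w @ trans[keep]) * emit[t]          # O(beam * S) per step
        alpha /= alpha.sum()
    return alpha                                     # approximate filtering distribution

rng = np.random.default_rng(0)
T, S = 10, 1000
emit = rng.random((T, S))
trans = rng.random((S, S))
trans /= trans.sum(axis=1, keepdims=True)
print(sparse_forward(emit, trans, beam=20).argmax())
```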


Direct Maximization Of Rank-Based Metrics For Information Retrieval, Donald A. Metzler, W. Bruce Croft, Andrew Mccallum Jan 2005

Ranking is an essential component for a number of tasks, such as information retrieval and collaborative filtering. It is often the case that the underlying task attempts to maximize some evaluation metric, such as mean average precision, over rankings. Most past work on learning how to rank has focused on likelihood- or margin-based approaches. In this work we explore directly maximizing rank-based metrics, which are a family of metrics that only depend on the order of ranked items. This allows us to maximize different metrics for the same training data. We show how the parameter space of linear scoring functions …
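
Because rank-based metrics depend only on the order of the scored items, one can evaluate a candidate weight vector for a linear scoring function by ranking the training items and computing, say, average precision, then search weight space for the best value. The sketch below uses naive random search as a stand-in for the paper's optimization procedure; the features and relevance labels are synthetic.

```python
# Directly evaluating and searching for weights that maximize average precision.
import numpy as np

def average_precision(scores, relevant):
    if relevant.sum() == 0:
        return 0.0
    order = np.argsort(-scores)
    rel = relevant[order]
    hits = np.cumsum(rel)
    ranks_of_hits = np.where(rel == 1)[0] + 1
    return float((hits[rel == 1] / ranks_of_hits).mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                        # feature vectors for ranked items
relevant = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(int)

best_w, best_ap = None, -1.0
for _ in range(2000):                                # random search over linear weights
    w = rng.normal(size=4)
    ap = average_precision(X @ w, relevant)
    if ap > best_ap:
        best_w, best_ap = w, ap
print(best_ap, best_w)
```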


Reducing Labeling Effort For Structured Prediction Tasks, Aron Culotta, Andrew Mccallum Jan 2005

A common obstacle preventing the rapid deployment of supervised machine learning algorithms is the lack of labeled training data. This is particularly expensive to obtain for structured prediction tasks, where each training instance may have multiple, interacting labels, all of which must be correctly annotated for the instance to be of use to the learner. Traditional active learning addresses this problem by optimizing the order in which the examples are labeled to increase learning efficiency. However, this approach does not consider the difficulty of labeling each example, which can vary widely in structured prediction tasks. For example, the labeling predicted …
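
One simple way to account for labeling difficulty is to estimate, from the model's per-token confidences, how many tokens an annotator would have to correct if shown the model's predicted labeling, and to trade that effort off against the example's uncertainty when choosing what to label next. The sketch below is an illustrative heuristic of that idea, not the paper's exact selection criterion; the pool and marginals are made up.

```python
# Effort-aware selection sketch: prefer examples with high model uncertainty
# per unit of expected correction effort.
import numpy as np

def token_entropy(p):
    """Entropy of a per-token marginal distribution over labels."""
    p = np.asarray(p)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

def select_batch(pool, batch_size=2):
    """pool: list of (example_id, list of per-token marginal distributions)."""
    def utility(item):
        _, marginals = item
        uncertainty = sum(token_entropy(m) for m in marginals)
        expected_corrections = sum(1.0 - max(m) for m in marginals)
        return uncertainty / (expected_corrections + 1.0)
    return sorted(pool, key=utility, reverse=True)[:batch_size]

# Hypothetical pool: two short sequences with per-token label marginals.
pool = [
    ("ex1", [[0.9, 0.1], [0.6, 0.4], [0.55, 0.45]]),
    ("ex2", [[0.99, 0.01], [0.98, 0.02], [0.5, 0.5]]),
]
print([ex_id for ex_id, _ in select_batch(pool, batch_size=1)])
```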


Joint Parsing And Semantic Role Labeling, Charles Sutton, Andrew Mccallum Jan 2005

A striking feature of human syntactic processing is that it is context-dependent, that is, it seems to take into account semantic information from the discourse context and world knowledge. In this paper, we attempt to use this insight to bridge the gap between SRL results from gold parses and from automatically-generated parses. To do this, we jointly perform parsing and semantic role labeling, using a probabilistic SRL system to rerank the results of a probabilistic parser. Our current results are negative, because a locally-trained SRL model can return inaccurate probability estimates.
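
The reranking setup itself is compact: take the parser's n-best list and choose the parse that maximizes a combination of the parser's log-probability and the SRL model's log-probability given that parse. The tiny sketch below shows only that combination step; the candidate scores and the weighting parameter `alpha` are hypothetical.

```python
# Rerank the parser's n-best list using the SRL model's score.

def rerank(candidates, alpha=1.0):
    """candidates: list of (parse, parser_logprob, srl_logprob)."""
    return max(candidates, key=lambda c: c[1] + alpha * c[2])[0]

candidates = [
    ("parse_A", -10.2, -7.5),   # parser's top parse, weaker SRL fit
    ("parse_B", -10.9, -5.1),   # slightly worse parse, better SRL fit
]
print(rerank(candidates))       # "parse_B" wins once SRL evidence is added
```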


Gene Prediction With Conditional Random Fields, Aron Culotta, David Kulp, Andrew Mccallum Jan 2005

Given a sequence of DNA nucleotide bases, the task of gene prediction is to find subsequences of bases that encode proteins. Reasonable performance on this task has been achieved using generatively trained sequence models, such as hidden Markov models. We propose instead the use of a discriminitively trained sequence model, the conditional random field (CRF). CRFs can naturally incorporate arbitrary, non-independent features of the input without making conditional independence assumptions among the features. This can be particularly important for gene finding, where including evidence from protein databases, EST data, or tiling arrays may improve accuracy. We evaluate our model on …


Disambiguating Web Appearances Of People In A Social Network, Ron Bekkerman, Andrew Mccallum Jan 2005

You are looking for information about a particular person. A search engine returns many pages for that person's name, but which pages are about the person you care about, and which are about other people who happen to have the same name? Furthermore, if we are looking for multiple people who are related in some way, how can we best leverage this social network? This paper presents two unsupervised frameworks for solving this problem: one based on the link structure of the web pages, another using the recently introduced Bootstrapped Information Bottleneck (BIB) clustering method. To evaluate our methods, we collected …
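
The link-structure intuition is that pages mentioning the same name and sharing many outgoing links probably refer to the same person. The sketch below clusters pages greedily by Jaccard overlap of their link sets; it is an illustrative simplification, not either of the paper's frameworks, and the URLs, link sets, and threshold are made up.

```python
# Cluster same-name web pages by overlap of their outgoing links.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_pages(pages, threshold=0.3):
    """pages: {url: set of outgoing links}. Greedy single-link clustering."""
    clusters = []
    for url, links in pages.items():
        for cluster in clusters:
            if any(jaccard(links, pages[other]) >= threshold for other in cluster):
                cluster.append(url)
                break
        else:
            clusters.append([url])
    return clusters

pages = {
    "site1/jsmith": {"umass.edu", "acl.org", "dblp.org"},
    "site2/j-smith": {"umass.edu", "dblp.org", "github.com"},
    "band-fans/jsmith": {"lastfm.com", "ticketmaster.com"},
}
print(cluster_pages(pages))   # the two researcher pages cluster together
```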


Reducing Weight Undertraining In Structured Discriminative Learning, Charles Sutton, Michael Sindelar, Andrew Mccallum Jan 2005

Discriminative probabilistic models are very popular in NLP because of the latitude they afford in designing features. But training involves complex trade-offs among weights, which can be dangerous: a few highly-indicative features can swamp the contribution of many individually weaker features, causing their weights to be undertrained. Such a model is less robust, for the highly-indicative features may be noisy or missing in the test data. To ameliorate this weight undertraining, we introduce several new feature bagging methods, in which separate models are trained on subsets of the original features, and combined using a mixture model or a product of …


Fast, Piecewise Training For Discriminative Finite-State And Parsing Models, Charles Sutton, Andrew Mccallum Jan 2005

Discriminative models for sequences and trees---such as linear-chain conditional random fields (CRFs) and max-margin parsing---have shown great promise because they combine the ability to incorporate arbitrary input features and the benefits of principled global inference over their structured outputs. However, since parameter estimation in these models involves repeatedly performing this global inference, training can be very slow. We present piecewise training, a new training method that combines the speed of local training with the accuracy of global training by incorporating a limited amount of global information derived from previous errors of the model. On named-entity and part-of-speech data, …


Piecewise Training For Undirected Models, Charles Sutton, Andrew Mccallum Jan 2005

For many large undirected models that arise in real-world applications, exact maximum-likelihood training is intractable, because it requires computing marginal distributions of the model. Conditional training is even more difficult, because the partition function depends not only on the parameters, but also on the observed input, requiring repeated inference over each training example. An appealing idea for such models is to independently train a local undirected classifier over each clique, afterwards combining the learned weights into a single global model. In this paper, we show that this piecewise method can be justified as minimizing a new family of upper bounds …
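
The "train each piece locally, combine globally" idea can be sketched on a linear chain: train the node factor as an ordinary local classifier, estimate the edge factor from label-bigram counts, each in isolation, and only add their log-scores together at decoding time. This is an illustrative sketch with synthetic data, not the paper's bound-based derivation; the sticky label process and feature construction below are made up.

```python
# Piecewise-style sketch: independently trained node and edge pieces,
# combined in a single global Viterbi decode.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
L, T, D = 3, 400, 5
labels = [rng.integers(L)]
for _ in range(T - 1):                               # sticky synthetic label sequence
    labels.append(labels[-1] if rng.random() < 0.7 else rng.integers(L))
labels = np.array(labels)
X = rng.normal(size=(T, D)) + labels[:, None]        # features informative of the label

# Piece 1: node factor = local classifier trained ignoring the chain.
node = LogisticRegression(max_iter=1000).fit(X, labels)

# Piece 2: edge factor = smoothed log label-bigram frequencies.
bigram = np.ones((L, L))
for a, b in zip(labels[:-1], labels[1:]):
    bigram[a, b] += 1
edge = np.log(bigram / bigram.sum(axis=1, keepdims=True))

def viterbi(X_new):
    """Combine the independently trained pieces in one global decode."""
    emit = np.log(node.predict_proba(X_new))
    delta, back = emit[0].copy(), []
    for t in range(1, len(X_new)):
        scores = delta[:, None] + edge + emit[t]
        back.append(scores.argmax(axis=0))
        delta = scores.max(axis=0)
    path = [int(delta.argmax())]
    for ptr in reversed(back):
        path.append(int(ptr[path[-1]]))
    return list(reversed(path))

print(viterbi(X[:10]), labels[:10].tolist())
```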