Open Access. Powered by Scholars. Published by Universities.®

Computer Sciences Commons

Articles 1 - 22 of 22

Full-Text Articles in Computer Sciences

Mining A Digital Library For Influential Authors, David Mimno, Andrew McCallum Jan 2007


When browsing a digital library of research papers, it is natural to ask which authors are most influential in a particular topic. We present a probabilistic model that ranks authors based on their influence in particular areas of scientific research. This model combines several sources of information: citation information between documents as represented by PageRank scores, authorship data gathered through automatic information extraction, and the words in paper abstracts. We propose a topic model over the words, and compare its performance against a smoothed language model by assessing the number of major award winners in the resulting ranked list of researchers.
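
As a concrete illustration of how such evidence might be combined, here is a minimal sketch assuming papers already carry PageRank scores and per-topic proportions; the dictionary keys and the additive scoring rule are illustrative, not the paper's actual probabilistic model:

```python
from collections import defaultdict

def rank_authors(papers, topic):
    """Rank authors by summed (PageRank x topic proportion) over their papers.
    `papers` is a list of dicts with hypothetical keys 'authors', 'pagerank',
    and 'topics' (a topic -> proportion mapping)."""
    influence = defaultdict(float)
    for p in papers:
        weight = p["pagerank"] * p["topics"].get(topic, 0.0)
        for author in p["authors"]:
            influence[author] += weight
    return sorted(influence.items(), key=lambda kv: -kv[1])

papers = [
    {"authors": ["A", "B"], "pagerank": 0.9, "topics": {"ir": 0.7, "ml": 0.3}},
    {"authors": ["B"],      "pagerank": 0.4, "topics": {"ir": 0.2, "ml": 0.8}},
]
print(rank_authors(papers, "ir"))
```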


Resource-Bounded Information Gathering For Correlation Clustering, Pallika Kanani, Andrew McCallum Jan 2007


We present a new class of problems, called Resource-bounded Information Gathering for Correlation Clustering. Our goal is to perform correlation clustering on a graph with incomplete information. The missing information can be obtained by querying an external source under constrained resources. The problem is to develop the most effective strategy for querying to achieve optimal clustering. We describe the problem using entity resolution as an example task.
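
A hedged sketch of the setting: the querying strategy shown (spending the budget on the least-confident edges) is illustrative, not the authors' method, and `query_source` stands in for an abstract external oracle:

```python
def cluster(nodes, edges):
    """Greedy stand-in for correlation clustering: union nodes joined by any
    positive-score edge (real solvers minimize disagreements globally)."""
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for (i, j), score in edges.items():
        if score > 0:
            parent[find(i)] = find(j)
    return {n: find(n) for n in nodes}

def gather_then_cluster(nodes, edges, query_source, budget):
    # Spend the budget on the edges whose scores are closest to zero,
    # i.e. where the clustering decision is least certain.
    for e in sorted(edges, key=lambda e: abs(edges[e]))[:budget]:
        edges[e] = query_source(*e)  # hypothetical oracle returning +/- score
    return cluster(nodes, edges)
```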


Expertise Modeling For Matching Papers With Reviewers, David Mimno, Andrew McCallum Jan 2007


An essential part of an expert-finding task, such as matching reviewers to submitted papers, is the ability to model the expertise of a person based on documents. We evaluate several measures of the association between an author in an existing collection of research papers and a previously unseen document. We compare two language model based approaches with a novel topic model, Author-Persona-Topic (APT). In this model, each author can write under one or more ``personas,'' which are represented as independent distributions over hidden topics. Examples of previous papers written by prospective reviewers are gathered from the Rexa database, which extracts …
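
One plausible way to score an author against an unseen document under a persona-style model, assuming trained distributions are given; `personas` and `topic_word` are hypothetical containers, not the paper's actual inference:

```python
import math

def apt_score(author, doc_words, personas, topic_word, floor=1e-6):
    """Log-likelihood of the document under the author's best persona.
    `personas[author]` is a list of topic mixtures (topic -> weight);
    `topic_word[t]` is a word distribution for topic t."""
    best = float("-inf")
    for mixture in personas[author]:
        ll = 0.0
        for w in doc_words:
            p = sum(weight * topic_word[t].get(w, floor)
                    for t, weight in mixture.items())
            ll += math.log(p)
        best = max(best, ll)  # the author matches through their best persona
    return best
```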


Sparse Message Passing Algorithms For Weighted Maximum Satisfiability, Aron Culotta, Andrew McCallum, Bart Selman, Ashish Sabharwal Jan 2007


Weighted maximum satisfiability is a well-studied problem that has important applicability to artificial intelligence (for instance, MPE inference in Bayesian networks). General-purpose stochastic search algorithms have proven to be accurate and efficient for large problem instances; however, these algorithms largely ignore structural properties of the input. For example, many problems are highly clustered, in that they contain a collection of loosely coupled subproblems (e.g. pipelines of NLP tasks). In this paper, we propose a message passing algorithm to solve weighted maximum satisfiability problems that exhibit this clustering property. Our algorithm fuses local solutions to each subproblem into a global solution …
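
For reference, the underlying objective in standard notation (generic weighted MAX-SAT, not specific to this paper): over truth assignments x of the n variables, maximize the total weight of satisfied clauses c in C.

```latex
% Generic weighted MAX-SAT: choose the assignment x maximizing the
% total weight of satisfied clauses.
\max_{x \in \{0,1\}^n} \; \sum_{c \in C} w_c \, \mathbf{1}\!\left[\, x \models c \,\right]
```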


Semi-Supervised Classification With Hybrid Generative/Discriminative Methods, Gregory Druck, Chris Pal, Xiaojin Zhu, Andrew McCallum Jan 2007


In this paper, we study semi-supervised learning using hybrid generative/discriminative methods. Specifically, we compare two recently proposed frameworks for combining generative and discriminative classifiers and apply them to semi-supervised classification. In both cases we explore the tradeoff between maximizing a discriminative likelihood of labeled data and a generative likelihood of unlabeled data. While prominent semi-supervised learning methods assume low density regions between classes or are subject to generative modeling assumptions, hybrid generative/discriminative methods allow semi-supervised learning in the presence of strongly overlapping classes and reduce the risk of modeling structure in the unlabeled data that is irrelevant for the specific …
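
In generic form, the tradeoff can be written as a single weighted objective (illustrative; the two frameworks compared in the paper instantiate this differently):

```latex
% Illustrative combined objective: a discriminative term on labeled pairs
% and a generative term on unlabeled inputs, traded off by \lambda.
\max_{\theta} \;
  \sum_{i \in \mathcal{L}} \log p_{\theta}(y_i \mid x_i)
  \;+\; \lambda \sum_{j \in \mathcal{U}} \log p_{\theta}(x_j)
```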


Improving Author Coreference By Resource-Bounded Information Gathering From The Web, Pallika Kanani, Andrew McCallum, Chris Pal Jan 2007


Accurate entity resolution is sometimes impossible simply due to insufficient information. For example, in research paper author name resolution, even clever use of venue, title, and co-authorship relations is often not enough to make a confident coreference decision. This paper presents several methods for increasing accuracy by gathering and integrating additional evidence from the web. We formulate the coreference problem as one of graph partitioning with discriminatively-trained edge weights, and then incorporate web information either as additional features or as additional nodes in the graph. Since the web is too large to incorporate all its data, we need an efficient …
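
A minimal sketch of how web evidence can enter as an extra pairwise feature; the feature names, record fields, and linear scoring form are illustrative (the paper trains the edge weights discriminatively):

```python
def edge_weight(rec1, rec2, weights, web_cooccurrence):
    """Score a candidate coreference edge; all feature names and the
    `web_cooccurrence` callable (e.g. 'do both papers appear on one
    author homepage?') are hypothetical."""
    feats = {
        "same_venue": float(rec1["venue"] == rec2["venue"]),
        "shared_coauthor": float(bool(set(rec1["coauthors"])
                                      & set(rec2["coauthors"]))),
        "web_cooccur": web_cooccurrence(rec1, rec2),
    }
    return sum(weights[k] * v for k, v in feats.items())
```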


Topical N-Grams: Phrase And Topic Discovery, With An Application To Information Retrieval, Xuerui Wang, Andrew McCallum, Xing Wei Jan 2007


Most topic models, such as latent Dirichlet allocation, rely on the bag of words assumption. However, word order and phrases are often critical to capturing the meaning of text. This paper presents Topical N-grams, a topic model that discovers topics as well as the individual words and phrases that define their meaning. The probabilistic model generates words in their textual order by, for each word, first sampling a topic, then sampling its status as a unigram or bigram, then sampling the word from a topic-specific unigram or bigram distribution. Thus our model can represent that the phrase ``white house'' has …
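
The generative story in the abstract translates almost line-for-line into code; the sketch below assumes trained parameter containers (`theta_d`, `unigram`, `bigram`, and `bi_prob` are illustrative names, not the paper's notation):

```python
import random

def draw(dist):
    """Sample a key from a dict mapping items to probabilities."""
    r, acc = random.random(), 0.0
    for item, p in dist.items():
        acc += p
        if r <= acc:
            return item
    return item  # guard against floating-point shortfall

def generate(n_words, theta_d, unigram, bigram, bi_prob, start="<s>"):
    words, prev = [], start
    for _ in range(n_words):
        t = draw(theta_d)                                        # 1. sample a topic
        as_bigram = random.random() < bi_prob[t].get(prev, 0.0)  # 2. unigram/bigram status
        if as_bigram and prev in bigram[t]:
            w = draw(bigram[t][prev])    # 3a. topic- and context-specific distribution
        else:
            w = draw(unigram[t])         # 3b. topic-specific unigram distribution
        words.append(w)
        prev = w
    return words
```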


Lightly-Supervised Attribute Extraction, Kedar Bellare, Partha Pratim Talukdar, Giridhar Kumaran, Fernando Pereira, Mark Liberman, Andrew McCallum, Mark Dredze Jan 2007


Web search engines can greatly benefit from knowledge about attributes of entities present in search queries. In this paper, we introduce lightly-supervised methods for extracting entity attributes from natural language text. Using these methods, we are able to extract large numbers of attributes of different entities at fairly high precision from a large natural language corpus. We compare our methods against a previously proposed pattern-based relation extractor, showing that the new methods give considerable improvements over that baseline. We also demonstrate that query expansion using extracted attributes improves retrieval performance on underspecified information-seeking queries.
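
For intuition, a toy version of the kind of pattern-based extractor used as the baseline; the specific "the X of Y" pattern is illustrative, not one of the paper's patterns:

```python
import re

# "the X of Y" often signals that X is an attribute of entity Y.
PATTERN = re.compile(r"\bthe (\w+) of (?:the )?([A-Z]\w+)")

def extract_attributes(text):
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(text)]

print(extract_attributes("We checked the population of France "
                         "and the altitude of Everest."))
# -> [('population', 'France'), ('altitude', 'Everest')]
```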


Generalized Component Analysis For Text With Heterogeneous Attributes, Xuerui Wang, Chris Pal, Andrew McCallum Jan 2007


We present a class of richly structured, undirected hidden variable models suitable for simultaneously modeling text along with other attributes encoded in different modalities. Our model generalizes techniques such as Principal Component Analysis to heterogeneous data types. In contrast to other approaches, this framework allows modalities such as words, authors and timestamps to be captured in their natural, probabilistic encodings. We demonstrate the effectiveness of our framework on the task of author prediction from 13 years of the NIPS conference proceedings and for a recipient prediction task using a 10-month academic email archive of a researcher. Our approach should be …


Penn/UMass/CHOP BioCreative II Systems, Kuzman Ganchev, Koby Crammer, Fernando Pereira, Gideon Mann, Kedar Bellare, Andrew McCallum, Steve Carroll, Yang Jin, Peter White Jan 2007


Our team participated in the entity tagging and normalization tasks of Biocreative II. For the entity tagging task, we used a k-best MIRA learning algorithm with lexicons and automatically derived word clusters. MIRA accommodates different training loss functions, which allowed us to exploit gene alternatives in training. We also performed a greedy search over feature templates and the development data, achieving a final F-measure of 86.28%. For the normalization task, we proposed a new specialized on-line learning algorithm and applied it for filtering out false positives from a high recall list of candidates. For normalization we received an F-measure of …
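
The core of a MIRA-style update, shown 1-best for brevity (the system uses k-best MIRA; the names and hinge form below are the standard formulation, not the team's exact code):

```python
def mira_update(w, feats_gold, feats_pred, loss, C=0.1):
    """Scale the update so the gold tagging outscores the prediction by at
    least `loss`, moving the weights as little as possible (capped by C)."""
    diff = {k: feats_gold.get(k, 0.0) - feats_pred.get(k, 0.0)
            for k in set(feats_gold) | set(feats_pred)}
    margin = sum(w.get(k, 0.0) * v for k, v in diff.items())
    norm2 = sum(v * v for v in diff.values())
    if norm2 > 0.0:
        tau = min(C, max(0.0, loss - margin) / norm2)
        for k, v in diff.items():
            w[k] = w.get(k, 0.0) + tau * v
    return w
```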


Efficient Strategies For Improving Partitioning-Based Author Coreference By Incorporating Web Pages As Graph Nodes, Pallika Kanani, Andrew McCallum Jan 2007


Entity resolution in the research paper domain is an important but difficult problem. It suffers from insufficient contextual information, so using information from the web significantly improves performance. We formulate the author coreference problem as one of graph partitioning with discriminatively-trained edge weights. Building on our previous work, we present improved and more comprehensive results for the method in which we incorporate web documents as additional nodes in the graph. We also propose efficient strategies to select a subset of nodes to add to the graph and to select a subset of queries to gather additional nodes, without significant loss …


Mixtures Of Hierarchical Topics With Pachinko Allocation, David Mimno, Wei Li, Andrew McCallum Jan 2007


The four-level Pachinko Allocation model (PAM) represents correlations among topics using a DAG structure. It does not, however, represent a nested hierarchy of topics, with some topical word distributions representing the vocabulary that is shared among several more specific topics. This paper presents Hierarchical PAM---an enhancement that explicitly represents a topic hierarchy. This model can be seen as combining the advantages of hLDA's topical hierarchy representation with PAM's ability to mix multiple leaves of the topic hierarchy. Experimental results show improvements in likelihood of held-out documents, as well as mutual information between automatically-discovered topics and human-generated categories such as journals …


Cryptogram Decoding For OCR Using Numerization Strings, Gary Huang, Erik Learned-Miller, Andrew McCallum Jan 2007


OCR systems for printed documents typically require large numbers of font styles and character models to work well. When given an unseen font, performance degrades even in the absence of noise. In this paper, we perform OCR in an unsupervised fashion without using any character models by using a cryptogram decoding algorithm. We present results on real and artificial OCR data.
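
The simplest cryptogram-style decoder, far cruder than the paper's algorithm, assuming glyphs have already been clustered into symbol IDs upstream; this uses frequency-rank matching only, and `ENGLISH_BY_FREQ` is a stock frequency ordering:

```python
from collections import Counter

ENGLISH_BY_FREQ = "etaoinshrdlcumwfgypbvkjxqz"

def decode(symbol_stream):
    """Map each glyph-cluster ID to a letter by matching frequency ranks."""
    ranked = [s for s, _ in Counter(symbol_stream).most_common()]
    mapping = dict(zip(ranked, ENGLISH_BY_FREQ))
    return "".join(mapping.get(s, "?") for s in symbol_stream)
```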


Organizing The OCA: Learning Faceted Subjects From A Library Of Digital Books, David Mimno, Andrew McCallum Jan 2007


Large scale library digitization projects such as the Open Content Alliance are producing vast quantities of text, but little has been done to organize this data. Subject headings inherited from card catalogs are useful but limited, while full-text indexing is most appropriate for readers who already know exactly what they want. Statistical topic models provide a complementary function. These models can identify semantically coherent ``topics'' that are easily recognizable and meaningful to humans, but they have been too computationally intensive to run on library-scale corpora. This paper presents DCM-LDA, a topic model based on Dirichlet Compound Multinomial distributions. This model …
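
For reference, the Dirichlet Compound Multinomial (Polya) likelihood that gives the model its name, in its standard form for a word sequence of length N with per-word counts n_w (notation may differ from the paper's):

```latex
% Probability of an exchangeable word sequence of length N with counts n_w
% under DCM parameters alpha_w:
p(w_1, \ldots, w_N \mid \boldsymbol{\alpha}) =
  \frac{\Gamma\!\left(\sum_{w} \alpha_w\right)}
       {\Gamma\!\left(\sum_{w} \alpha_w + N\right)}
  \prod_{w} \frac{\Gamma(\alpha_w + n_w)}{\Gamma(\alpha_w)}
```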


Efficient Computation Of Entropy Gradient For Semi-Supervised Conditional Random Fields, Gideon S. Mann, Andrew McCallum Jan 2007


Entropy regularization is a straightforward and successful method of semi-supervised learning that augments the traditional conditional likelihood objective function with an additional term that aims to minimize the predicted label entropy on unlabeled data. It has previously been demonstrated to provide positive results in linear-chain CRFs, but the published method for calculating the entropy gradient requires significantly more computation than supervised CRF training. This paper presents a new derivation and dynamic program for calculating the entropy gradient that is significantly more efficient---having the same asymptotic time complexity as supervised CRF training. We also present efficient generalizations of this method for …
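
The objective being optimized, in the standard form of entropy regularization (L is the labeled set, U the unlabeled set, and lambda the tradeoff weight):

```latex
% Conditional likelihood on labeled data minus predicted label entropy
% on unlabeled data:
\max_{\theta} \;
  \sum_{(x, y) \in \mathcal{L}} \log p_{\theta}(y \mid x)
  \;-\; \lambda \sum_{x' \in \mathcal{U}} H\!\bigl(p_{\theta}(\cdot \mid x')\bigr)
```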


Canonicalization Of Database Records Using Adaptive Similarity Measures, Aron Culotta, Michael Wick, Robert Hall, Matthew Marzilli, Andrew McCallum Jan 2007


It is becoming increasingly common to construct databases from information automatically culled from many heterogeneous sources. For example, a research publication database can be constructed by automatically extracting titles, authors, and conference information from papers and their references. A common difficulty in consolidating data from multiple sources is that records are referenced in a variety of ways (e.g. abbreviations, aliases, and misspellings). Therefore, it can be difficult to construct a single, standard representation to present to the user. We refer to the task of constructing this representation as canonicalization. Despite its importance, there is very little existing work on canonicalization. …
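
A simple stand-in for the task (the paper learns the similarity measure adaptively; the fixed `difflib` ratio below is only a placeholder): pick the observed string most similar, on average, to all other mentions.

```python
import difflib

def canonicalize(mentions):
    """Return the mention most similar, on average, to all the others."""
    def avg_sim(s):
        return sum(difflib.SequenceMatcher(None, s, o).ratio()
                   for o in mentions) / len(mentions)
    return max(mentions, key=avg_sim)

print(canonicalize(["Proc. of ICML", "Proceedings of ICML", "ICML proc."]))
```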


Improved Dynamic Schedules For Belief Propagation, Charles Sutton, Andrew McCallum Jan 2007


Belief propagation and its variants are popular methods for approximate inference, but their running time and even their convergence depend greatly on the schedule used to send the messages. Recently, dynamic update schedules, notably the residual BP (RBP) schedule of Elidan et al. [2006], have been shown to converge much faster on hard networks than static schedules. But RBP wastes message updates: many messages are computed solely to determine their priority, and are never actually performed. In this paper, we show that estimating the residual, rather than calculating it directly, leads to significant decreases in the number of messages …
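
The scheduling loop at issue, in sketch form; this shows the generic residual-driven dynamics, while the paper's contribution is making `residual` a cheap estimate rather than a full message computation. All callables here are assumed inputs:

```python
import heapq

def residual_schedule(messages, residual, apply_update, dependents, tol=1e-6):
    """messages: iterable of message ids; residual(m): current (possibly
    estimated) residual; apply_update(m): compute and send message m;
    dependents(m): messages whose residuals change once m is sent."""
    heap = [(-residual(m), m) for m in messages]
    heapq.heapify(heap)
    while heap:
        _, m = heapq.heappop(heap)
        if residual(m) <= tol:
            continue                      # stale entry or converged message
        apply_update(m)                   # send the highest-residual message
        for d in dependents(m):           # downstream residuals just changed
            heapq.heappush(heap, (-residual(d), d))
```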


Learning Extractors From Unlabeled Text Using Relevant Databases, Kedar Bellare, Andrew McCallum Jan 2007


Supervised machine learning algorithms for information extraction generally require large amounts of training data. In many cases where labeling training data is burdensome, there may, however, already exist an incomplete database relevant to the task at hand. Records from this database can be used to label text strings that express the same information. For tasks where text strings do not follow the same format or layout, and additionally may contain extra information, labeling the strings completely may be problematic. This paper presents a method for training extractors which fill in missing labels of a text sequence that is partially labeled …


Nonparametric Bayes Pachinko Allocation, Wei Li, David Blei, Andrew McCallum Jan 2007


Recent advances in topic models have explored complicated structured distributions to represent topic correlation. For example, the pachinko allocation model (PAM) captures arbitrary, nested, and possibly sparse correlations between topics using a directed acyclic graph (DAG). While PAM provides more flexibility and greater expressive power than previous models like latent Dirichlet allocation (LDA), it is also more difficult to determine the appropriate topic structure for a specific dataset. In this paper, we propose a …


Leveraging Existing Resources Using Generalized Expectation Criteria, Gregory Druck, Gideon Mann, Andrew McCallum Jan 2007


It is difficult to apply machine learning to many real-world tasks because there are no existing labeled instances. In one solution to this problem, a human expert provides instance labels that are used in traditional supervised or semi-supervised training. Instead, we want a solution that allows us to leverage existing resources other than complete labeled instances. We propose the use of generalized expectation (GE) criteria to achieve this goal. A GE criterion is a term in a training objective function that assigns a score to values of a model expectation. In this paper, the expectations are model predicted class distributions …
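
A concrete example of a GE criterion of the kind described, written as a penalty on the divergence between a target class distribution and the model's expected class distribution on unlabeled data U (a generic form; the notation is illustrative, not the paper's):

```latex
% Reward model expectations that match a target class distribution
% \tilde{p}(y) on unlabeled data U:
G(\theta) = -\,\mathrm{KL}\!\left(
  \tilde{p}(y) \;\Big\|\;
  \tfrac{1}{|U|} \textstyle\sum_{x \in U} p_{\theta}(y \mid x)
\right)
```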


Undirected And Interpretable Continuous Topic Models Of Documents, X. Wang, K. Crammer, Andrew McCallum Jan 2007


We propose a new type of undirected graphical model suitable for topic modeling and dimensionality reduction for large text collections. Unlike previous Boltzmann machine and harmonium based methods, this new model represents words using discrete distributions akin to traditional `bag-of-words' methods. However, in contrast to directed topic models such as latent Dirichlet allocation, each word is drawn from a distribution that takes into account all possible topics, as opposed to a topic-specific distribution. Furthermore, our models use positive continuous valued latent variables and learn more interpretable latent topic spaces than previous undirected techniques. As with other undirected models, once such models …


Author Disambiguation Using Error-Driven Machine Learning With A Ranking Loss Function, Aron Culotta, Pallika Kanani, Robert Hall, Michael Wick, Andrew McCallum Jan 2007


Author disambiguation is the problem of determining whether records in a publications database that contain similar author names refer to the same person. This task can be especially difficult when the database is constructed from automatically extracted data, which can contain noisy and incomplete records. A common supervised machine learning approach to author disambiguation is to build a classifier that predicts whether a pair of records is coreferent, often followed by a collective inference step to enforce transitivity of the predictions. By restricting the classifier to pairwise predictions, standard training algorithms for binary classification can be used. However, this approach …
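
A sketch of an error-driven, ranking-style weight update of the flavor described (perceptron-like and illustrative, not the paper's exact training algorithm): when a lower-scoring candidate is actually better under the evaluation measure, move the weights toward its features.

```python
def rank_update(w, feats_better, feats_worse, lr=1.0):
    """Move weights toward the features of the better candidate and away
    from those of the worse one (feature dicts map names to values)."""
    for k in set(feats_better) | set(feats_worse):
        w[k] = w.get(k, 0.0) + lr * (feats_better.get(k, 0.0)
                                     - feats_worse.get(k, 0.0))
    return w
```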