Physical Sciences and Mathematics | Open Access Articles

Incremental Non-Greedy Clustering At Scale, Nicholas Monath Mar 2022

Doctoral Dissertations

Clustering is the task of organizing data into meaningful groups. Modern clustering applications such as entity resolution put several demands on clustering algorithms: (1) scalability to massive numbers of points as well as clusters, (2) incremental additions of data, and (3) support for any user-specified similarity function. Hierarchical clusterings are often desired as they represent multiple alternative flat clusterings (e.g., at different granularity levels). These tree-structured clusterings provide for both fine-grained clusters and uncertainty in the presence of newly arriving data. Previous work on hierarchical clustering does not fully address all three of the aforementioned desiderata. Work on incremental …

Compact Representations Of Uncertainty In Clustering, Craig Stuart Greenberg Apr 2021

Doctoral Dissertations

Flat clustering and hierarchical clustering are two fundamental tasks, often used to discover meaningful structures in data, such as subtypes of cancer, phylogenetic relationships, taxonomies of concepts, and cascades of particle decays in particle physics. When multiple clusterings of the data are possible, it is useful to represent uncertainty in clustering through various probabilistic quantities, such as the distribution over partitions or tree structures, and the marginal probabilities of subpartitions or subtrees. Many compact representations exist for structured prediction problems, enabling the efficient computation of probability distributions, e.g., a trellis structure and corresponding Forward-Backward algorithm for Markov models that model …

Reasoning About User Feedback Under Identity Uncertainty In Knowledge Base Construction, Ariel Kobren Dec 2020

Doctoral Dissertations

Intelligent, automated systems that are intertwined with everyday life, such as Google Search and virtual assistants like Amazon’s Alexa or Apple’s Siri, are often powered in part by knowledge bases (KBs), i.e., structured data repositories of entities, their attributes, and the relationships among them. Despite a wealth of research focused on automated KB construction methods, KBs are inevitably imperfect, with errors stemming from various points in the construction pipeline. Making matters more challenging, new data is created daily and must be integrated with existing KBs so that they remain up-to-date. As the primary consumers of KBs, human users have tremendous potential to …

A Proportionality-Based Approach To Search Result Diversification, Van Bac Dang Aug 2014

Doctoral Dissertations

Search result diversification addresses the problem of queries with unclear information needs. The aim of using diversification techniques is to find a ranking of documents that covers multiple possible interpretations, aspects, or topics for a given query. By explicitly providing diversity in search results, this approach can increase the likelihood that users will find documents relevant to their specific intent, thereby improving effectiveness. This dissertation introduces a new perspective on diversity: diversity by proportionality. We consider a result list more diverse, with respect to some set of topics related to the query, when the ratio between the number of relevant …

Bibliometric Impact Measures Leveraging Topic Analysis, Gideon S. Mann, David Mimno, Andrew McCallum Jan 2006

Andrew McCallum

Measurements of the impact and history of research literature provide a useful complement to scientific digital library collections. Bibliometric indicators have been extensively studied, mostly in the context of journals. However, journal-based metrics poorly capture topical distinctions in fast-moving fields, and are increasingly problematic in the context of open-access publishing. Recent developments in latent topic models have produced promising results for automatic sub-field discovery. The fine-grained, faceted topics produced by such models provide a clearer view of the topical divisions of a body of research literature and the interactions between those divisions. We demonstrate the usefulness of topic models …

Bayesian Clustering By Dynamics, Marco Ramoni, Paola Sebastiani, Paul Cohen Jan 2001

Computer Science Department Faculty Publication Series

This paper introduces a Bayesian method for clustering dynamic processes. The method models dynamics as Markov chains and then applies an agglomerative clustering procedure to discover the most probable set of clusters capturing different dynamics. To increase efficiency, the method uses an entropy-based heuristic search strategy. A controlled experiment suggests that the method is very accurate when applied to artificial time series in a broad range of conditions and, when applied to clustering sensor data from mobile robots, it produces clusters that are meaningful in the domain of application.
