Open Access. Powered by Scholars. Published by Universities.®
Articles 1 - 2 of 2
Full-Text Articles in Computer Sciences
Generalized Component Analysis For Text With Heterogeneous Attributes, Xuerui Wang, Chris Pal, Andrew Mccallum
Generalized Component Analysis For Text With Heterogeneous Attributes, Xuerui Wang, Chris Pal, Andrew Mccallum
Andrew McCallum
We present a class of richly structured, undirected hidden variable models suitable for simultaneously modeling text along with other attributes encoded in different modalities. Our model generalizes techniques such as Principal Component Analysis to heterogeneous data types. In contrast to other approaches, this framework allows modalities such as words, authors and timestamps to be captured in their natural, probabilistic encodings. We demonstrate the effectiveness of our framework on the task of author prediction from 13 years of the NIPS conference proceedings and for a recipient prediction task using a 10-month academic email archive of a researcher. Our approach should be …
Canonicalization Of Database Records Using Adaptive Similarity Measures, Aron Culotta, Michael Wick, Robert Hall, Matthew Marzilli, Andrew Mccallum
Canonicalization Of Database Records Using Adaptive Similarity Measures, Aron Culotta, Michael Wick, Robert Hall, Matthew Marzilli, Andrew Mccallum
Andrew McCallum
It is becoming increasingly common to construct databases from information automatically culled from many heterogeneous sources. For example, a research publication database can be constructed by automatically extracting titles, authors, and conference information from papers and their references. A common difficulty in consolidating data from multiple sources is that records are referenced in a variety of ways (e.g. abbreviations, aliases, and misspellings). Therefore, it can be difficult to construct a single, standard representation to present to the user. We refer to the task of constructing this representation as canonicalization. Despite its importance, there is very little existing work on canonicalization. …