Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Articles 1 - 4 of 4

Full-Text Articles in Physical Sciences and Mathematics

Unsupervised Deduplication Using Cross-Field Dependencies, Robert Hall, Charles Sutton, Andrew McCallum Jan 2008

Andrew McCallum

Recent work in deduplication has shown that collective deduplication of different attribute types can improve performance. But although these techniques cluster the attributes collectively, they do not model them collectively. For example, in citations in the research literature, canonical venue strings and title strings are dependent (because venues tend to focus on a few research areas), but this dependence is not modeled by current unsupervised techniques. We call this dependence between fields in a record a cross-field dependence. In this paper, we present an unsupervised generative model for the deduplication problem that explicitly models cross-field dependence. Our model uses a single set …


Enabling Synergy Between Psychology And Natural Language Processing For E-Government: Crime Reporting And Investigative Interview System, Alicia Iriberri '06, Chih Hao Ku '12, Gondy Leroy Jan 2008

CGU Faculty Publications and Research

We are developing an automated crime reporting and investigative interview system. The system incorporates cognitive interview techniques to maximize witness memory recall, and information extraction (IE) technology to extract and annotate crime entities from witness narratives and interview responses. Evaluations of the IE components of the system show that it captures 70 to 77% of information from witness narratives with 93 to 100% precision. Our development goal is for the system to progressively approximate the effectiveness of a human investigative interviewer and to generate graphical visualizations of crime report information.


Natural Language Processing And E-Government: Crime Information Extraction From Heterogeneous Data Sources, Chih Hao Ku '12, Alicia Iriberri '06, Gondy Leroy Jan 2008

CGU Faculty Publications and Research

Much information that could help solve and prevent crimes is never gathered because the reporting methods available to citizens and law enforcement personnel are not optimal. Detectives do not have sufficient time to interview crime victims and witnesses. Moreover, many victims and witnesses are too scared or embarrassed to report incidents. We are developing an interviewing system that will help collect such information. We report here on one component, the crime information extraction module, which uses natural language processing to extract crime information from police reports, newspaper articles, and victims’ and witnesses’ crime narratives. We tested our approach with two …


Unsupervised Discovery Of Compound Entities For Relationship Extraction, Cartic Ramakrishnan, Pablo N. Mendes, Shaojun Wang, Amit P. Sheth Jan 2008

Kno.e.sis Publications

In this paper we investigate unsupervised population of a biomedical ontology via information extraction from biomedical literature. Relationships in text seldom connect simple entities. We therefore focus on identifying compound entities rather than mentions of simple entities. We present a method based on rules over grammatical dependency structures for unsupervised segmentation of sentences into compound entities and relationships. We complement the rule-based approach with a statistical component that prunes structures with low information content, thereby reducing false positives in the prediction of compound entities, their constituents and relationships. The extraction is manually evaluated with respect to the UMLS Semantic Network …