Open Access. Powered by Scholars. Published by Universities.®

Social and Behavioral Sciences Commons

Articles 1 - 11 of 11

Full-Text Articles in Social and Behavioral Sciences

How Well Does Multiple OCR Error Correction Generalize?, William B. Lund, Eric K. Ringger, Daniel D. Walker Jan 2014

Faculty Publications

As the digitization of historical documents, such as newspapers, becomes more common, the need of the archive patron for accurate digital text from those documents increases. Building on our earlier work, the contributions of this paper are: 1. demonstrating the applicability of novel methods for correcting optical character recognition (OCR) on disparate data sets, including a new synthetic training set, 2. enhancing the correction algorithm with novel features, and 3. assessing the data requirements of the correction learning method. First, we correct errors using conditional random fields (CRF) trained on synthetic training data sets in order to demonstrate the …


Building An Access Database For Cookstove Research, Margaret L. Weddle Aug 2013

Student Works

This paper takes the reader through the thought process and the actual instructions for creating a Microsoft Access database, or for using the one provided with this paper. It also covers instructions for using the HBLL resources Compendex and RefWorks. While this work was built specifically for cookstove research, it could be adapted to any research that requires maintaining a record of the journal articles in use. Building a database proved to be time-consuming and difficult work, but once done, Access provides an easy way to work with …


A Synthetic Document Image Dataset For Developing And Evaluating Historical Document Processing Methods, Daniel Walker, William Lund, Eric Ringger Jan 2012

Faculty Publications

Document images accompanied by OCR output text and ground truth transcriptions are useful for developing and evaluating document recognition and processing methods, especially for historical document images. Additionally, research into improving the performance of such methods often requires further annotation of training and test data (e.g., topical document labels). However, transcribing and labeling historical documents is expensive. As a result, existing real-world document image datasets with such accompanying resources are rare and often relatively small. We introduce synthetic document image datasets of varying levels of noise that have been created from standard (English) text corpora using an existing document degradation …


Evaluating Models Of Latent Document Semantics In The Presence Of OCR Errors, Daniel D. Walker, William B. Lund, Eric K. Ringger Jan 2010

Faculty Publications

Models of latent document semantics such as the mixture of multinomials model and Latent Dirichlet Allocation have received substantial attention for their ability to discover topical semantics in large collections of text. In an effort to apply such models to noisy optical character recognition (OCR) text output, we endeavor to understand the effect that character-level noise can have on unsupervised topic modeling. We show the effects both with document-level topic analysis (document clustering) and with word-level topic analysis (LDA) on both synthetic and real-world OCR data. As expected, experimental results show that performance declines as word error rates increase. Common …


A Sophisticated Library Search Strategy Using Folksonomies And Similarity Matching, William Lund, Yiu-Kai D. Ng, Maria Soledad Pera Jul 2009

Faculty Publications

Libraries, private and public, offer valuable resources to library patrons. As of today, the only way to locate information archived exclusively in libraries is through their catalogs. Library patrons, however, often find it difficult to formulate a proper query, which requires using specific keywords assigned to different fields of desired library catalog records, to obtain relevant results. These improperly formulated queries often yield irrelevant results or no results at all. This negative experience in dealing with existing library systems turns library patrons away from library catalogs; instead, they rely on Web search engines to perform their searches first and upon …


Generating Ontologies Via Language Components And Ontology Reuse, Deryle W. Lonsdale, Yihong Ding, David W. Embley, Martin Hepp, Li Xu Jan 2007

Faculty Publications

Realizing the Semantic Web involves creating ontologies, a tedious and costly challenge. Reuse can reduce the cost of ontology engineering. Semantic Web ontologies can provide useful input for ontology reuse. However, the automated reuse of such ontologies remains underexplored. This paper presents a generic architecture for automated ontology reuse. With our implementation of this architecture, we show the practicality of automating ontology generation through ontology reuse. We experimented with a large generic ontology as a basis for automatically generating domain ontologies that fit the scope of sample natural-language web pages. The results were encouraging and yielded five lessons pertinent to …


Analogical Modeling: An Update, Deryle W. Lonsdale, David Eddington Jan 2007

Faculty Publications

Analogical modeling is a supervised exemplar-based approach that has been widely applied to predict linguistic behavior. The paradigm has been well documented in the linguistics and cognition literature, but is less well known to the machine learning community. This paper sets out some of the basics of the approach, including a simplified example of the fundamental algorithm’s operation. It then surveys some of the recent analogical modeling language applications, and sketches how the computational system has been enhanced lately to offer users increased flexibility and processing power. Some comparisons and contrasts are drawn between analogical modeling and other language modeling …


A Cognitive Robotics Approach To Comprehending Human Language And Behaviors, Deryle W. Lonsdale, D. Paul Benjamin, Damian Lyons Jan 2007

Faculty Publications

The ADAPT project is a collaboration of researchers in linguistics, robotics and artificial intelligence at three universities. We are building a complete robotic cognitive architecture for a mobile robot that is designed to interact with humans in a range of environments, uses natural language, and models human behavior. This paper concentrates on the human-robot interaction (HRI) aspects of ADAPT, and especially on how ADAPT models and interacts with humans.


Integrating Perception, Language And Problem Solving In A Cognitive Agent For A Mobile Robot, Deryle W. Lonsdale, D. Paul Benjamin, Damian M. Lyons Jan 2004

Faculty Publications

We are implementing a unified cognitive architecture for a mobile robot. Our goal is to endow a robot agent with the full range of cognitive abilities, including perception, use of natural language, learning and the ability to solve complex problems. The perspective of this work is that an architecture based on a unified theory of robot cognition has the best chance of attaining human-level performance.

This agent architecture is an integration of three theories: a theory of cognition embodied in the Soar system, the RS formal model of sensorimotor activity and an algebraic theory of decomposition and reformulation.

These three …


A Memory-Based Approach To Cantonese Tone Recognition, Deryle W. Lonsdale, Michael Emonts Jan 2003

Faculty Publications

This paper introduces memory-based learning as a viable approach for Cantonese tone recognition. The memory-based learning algorithm employed here outperforms other documented current approaches for this problem, which are based on neural networks. Various numbers of tones and features are modeled to find the best method for feature selection and extraction. To further optimize this approach, experiments are performed to isolate the best feature weighting method, the best class voting weights method, and the best number of k-values to implement. Results and possible future work are discussed.


Peppering Knowledge Sources With Salt: Boosting Conceptual Content For Ontology Generation, Deryle W. Lonsdale, Yihong Ding, David W. Embley, Alan Melby Jan 2002

Faculty Publications

This paper describes work done to explore the common ground between two different ongoing research projects: the standardization of lexical and terminological resources, and the use of conceptual ontologies for information extraction and data integration. Specifically, this paper explores improving the generation of extraction ontologies through use of a comprehensive terminology database that has been represented in a standardized format for easy tool-based implementation. We show how, via the successful integration of these two distinct efforts, it is possible to leverage large-scale terminological and conceptual information having relationship-rich semantic resources in order to reformulate, match, and merge retrieved information of …