Open Access. Powered by Scholars. Published by Universities.®

Social and Behavioral Sciences Commons

Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics

Faculty Publications

OCR

Publication Year

Articles 1 - 2 of 2

Full-Text Articles in Social and Behavioral Sciences

A Synthetic Document Image Dataset For Developing And Evaluating Historical Document Processing Methods, Daniel Walker, William Lund, Eric Ringger Jan 2012

A Synthetic Document Image Dataset For Developing And Evaluating Historical Document Processing Methods, Daniel Walker, William Lund, Eric Ringger

Faculty Publications

Document images accompanied by OCR output text and ground truth transcriptions are useful for developing and evaluating document recognition and processing methods, especially for historical document images. Additionally, research into improving the performance of such methods often requires further annotation of training and test data (e.g., topical document labels). However, transcribing and labeling historical documents is expensive. As a result, existing real-world document image datasets with such accompanying resources are rare and often relatively small. We introduce synthetic document image datasets of varying levels of noise that have been created from standard (English) text corpora using an existing document degradation …


Evaluating Models Of Latent Document Semantics In The Presence Of Ocr Errors, Daniel D. Walker, William B. Lund, Eric K. Ringger Jan 2010

Evaluating Models Of Latent Document Semantics In The Presence Of Ocr Errors, Daniel D. Walker, William B. Lund, Eric K. Ringger

Faculty Publications

Models of latent document semantics such as the mixture of multinomials model and Latent Dirichlet Allocation have received substantial attention for their ability to discover topical semantics in large collections of text. In an effort to apply such models to noisy optical character recognition (OCR) text output, we endeavor to understand the effect that character-level noise can have on unsupervised topic modeling. We show the effects both with document-level topic analysis (document clustering) and with word-level topic analysis (LDA) on both synthetic and real-world OCR data. As expected, experimental results show that performance declines as word error rates increase. Common …