Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Articles 1 - 30 of 39

Full-Text Articles in Physical Sciences and Mathematics

Creating An Improved Version Using Noisy OCR From Multiple Editions, David Wemhoener, Ismet Zeki Yalniz, R. Manmatha Dec 2012

R. Manmatha

This paper evaluates an automated scheme for aligning and combining optical character recognition (OCR) output from three scans of a book to generate a composite version with fewer OCR errors. While there has been some previous work on aligning multiple OCR versions of the same scan, the scheme introduced in this paper does not require that scans be from the same copy of the book, or even the same edition. The three OCR outputs are combined using an algorithm which builds upon a technique that aligns two sequences at a time. In the algorithm a multiple sequence alignment of the …


On Influence Of Line Segmentation In Efficient Word Segmentation In Old Manuscripts, D. Fernández, J. Lladós, A. Fornés, R. Manmatha Dec 2011

R. Manmatha

The objective of this work is to show the importance of good line segmentation for obtaining better results in the segmentation of words in historical documents. We have used the approach developed by Manmatha and Rothfeder [1] to segment words in old handwritten documents. In their work the lines of the documents are extracted using projections. In this work, we have developed an approach to segment lines more efficiently. The new line segmentation algorithm handles skewed, touching and noisy lines, so it significantly improves word segmentation. Experiments using Spanish documents from the Marriages Database of the …


A Framework For Manipulating And Searching Multiple Retrieval Types, Marc-Allen Cartright, Ethem F. Can, William Dabney, Jeff Dalton, Logan Giorda, Kriste Krstovski, Xiaoye Wu, Ismet Zeki Yalniz, James Allan, R. Manmatha, David Smith Dec 2011

R. Manmatha

Conventional retrieval systems view documents as a unit and look at different retrieval types within a document. We introduce Proteus, a framework for seamlessly navigating books as dynamic collections which are defined on the fly. Proteus allows us to search various retrieval types. Navigable types include pages, books, named persons, locations, and pictures in a collection of books taken from the Internet Archive. The demonstration shows the value of multi-type browsing in dynamic collections to peruse new data.


Finding Translations In Scanned Book Collections, Ismet Zeki Yalniz, R. Manmatha Dec 2011

R. Manmatha

This paper describes an approach for identifying translations of books in large scanned book collections with OCR errors. The method is based on the idea that although individual sentences do not necessarily preserve the word order when translated, a book must preserve the linear progression of ideas for it to be a valid translation. Consider two books in two different languages, say English and German. The English book in the collection is represented by the sequence of words (in the order they appear in the text) which appear only once in the book. Similarly, the book in German is represented …


Partial Duplicate Detection For Large Book Collections, Ismet Zeki Yalniz, Ethem F. Can, R. Manmatha Dec 2010

R. Manmatha

A framework is presented for discovering partial duplicates in large collections of scanned books with optical character recognition (OCR) errors. Each book in the collection is represented by the sequence of words (in the order they appear in the text) which appear only once in the book. These words are referred to as "unique words" and they constitute a small percentage of all the words in a typical book. Along with the order information the set of unique words provides a compact representation which is highly descriptive of the content and the flow of ideas in the book. By aligning …
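
The unique-word representation described in this abstract can be illustrated with a minimal Python sketch. The tokenizer (lowercased runs of ASCII letters) and the function name `unique_word_sequence` are assumptions made for illustration; the abstract does not specify how words are extracted, and the alignment step that follows is not shown.

```python
import re
from collections import Counter


def unique_word_sequence(text):
    """Return the words that occur exactly once, in order of appearance.

    Tokenization here (lowercase, ASCII letters only) is an assumption,
    not something stated in the abstract.
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return [w for w in words if counts[w] == 1]


sample = "the cat sat on the mat while the dog sat by the door"
print(unique_word_sequence(sample))
```

Because repeated words such as "the" and "sat" are dropped, the resulting sequence is much shorter than the text while still preserving the order in which the remaining words occur.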


A Novel Word Spotting Method Based On Recurrent Neural Networks, Volkmar Frinken, Andreas Fischer, R. Manmatha, Horst Bunke Dec 2010

R. Manmatha

Keyword spotting refers to the process of retrieving all instances of a given keyword from a document. In the present paper, a novel keyword spotting method for handwritten documents is described. It is derived from a neural network based system for unconstrained handwriting recognition. As such it performs template-free spotting, i.e. it is not necessary for a keyword to appear in the training set. The keyword spotting is done using a modification of the CTC Token Passing algorithm in conjunction with a recurrent neural network. We demonstrate that the proposed system outperforms not only a classical dynamic time warping based …


An Efficient Framework For Searching Text In Noisy Document Images, Ismet Zeki Yalniz, R. Manmatha Dec 2010

R. Manmatha

An efficient word spotting framework is proposed to search text in scanned books. The proposed method allows one to search for words when optical character recognition (OCR) fails due to noise or for languages where there is no OCR. Given a query word image, the aim is to retrieve matching words in the book sorted by similarity. In the offline stage, SIFT descriptors are extracted over the corner points of each word image. Those features are quantized into visual terms (visterms) using a hierarchical K-Means algorithm and indexed using an inverted file. In the query resolution stage, the candidate matches …
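
The inverted-file indexing of visterms mentioned in this abstract can be sketched as follows. The sketch assumes each word image has already been reduced to a list of integer visterm ids (the SIFT extraction and K-Means quantization are not shown), and the vote-counting lookup is a simple stand-in, since the abstract is truncated before it describes how candidates are scored. All names are hypothetical.

```python
from collections import Counter, defaultdict


def build_inverted_file(word_visterms):
    """Map each visterm id to the set of word-image ids that contain it."""
    index = defaultdict(set)
    for word_id, visterms in word_visterms.items():
        for v in visterms:
            index[v].add(word_id)
    return index


def candidate_matches(index, query_visterms):
    """Count shared visterms per indexed word image (an assumed scoring rule)."""
    votes = Counter()
    for v in set(query_visterms):
        for word_id in index.get(v, ()):
            votes[word_id] += 1
    return votes.most_common()


# Hypothetical visterm ids for three word images.
word_images = {"w1": [3, 7, 7, 9], "w2": [3, 4], "w3": [1, 2]}
index = build_inverted_file(word_images)
print(candidate_matches(index, [3, 7, 9]))
```

Only word images that share at least one visterm with the query are touched, which is the usual reason for using an inverted file instead of comparing the query against every word image.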


A Fast Alignment Scheme For Automatic OCR Evaluation Of Books, Ismet Zeki Yalniz, R. Manmatha Dec 2010

R. Manmatha

This paper aims to evaluate the accuracy of optical character recognition (OCR) systems on real scanned books. The ground truth e-texts are obtained from the Project Gutenberg website and aligned with their corresponding OCR output using a fast recursive text alignment scheme (RETAS). First, unique words in the vocabulary of the book are aligned with unique words in the OCR output. This process is recursively applied to each text segment in between matching unique words until the text segments become very small. In the final stage, an edit distance based alignment algorithm is used to align these short chunks of …
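
The recursive anchoring idea in this abstract can be sketched in Python. This is an illustrative sketch under stated assumptions, not the RETAS implementation: it assumes whitespace-tokenized input, uses a longest common subsequence to match unique words, and uses a hypothetical `min_len` cutoff below which a segment is left for the final edit-distance stage (not shown).

```python
from collections import Counter


def unique_positions(tokens):
    """Map each word that occurs exactly once to its position."""
    counts = Counter(tokens)
    return {t: i for i, t in enumerate(tokens) if counts[t] == 1}


def lcs_pairs(xs, ys):
    """Longest common subsequence of two lists, as (i, j) index pairs."""
    n, m = len(xs), len(ys)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if xs[i] == ys[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if xs[i] == ys[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def align(truth, ocr, t0=0, o0=0, min_len=4):
    """Return (truth_index, ocr_index) anchors found by recursive matching."""
    if len(truth) <= min_len or len(ocr) <= min_len:
        return []  # short chunk: left to an edit-distance aligner (not shown)
    ut, uo = unique_positions(truth), unique_positions(ocr)
    common = ut.keys() & uo.keys()
    at = sorted(ut[w] for w in common)
    ao = sorted(uo[w] for w in common)
    pairs = lcs_pairs([truth[i] for i in at], [ocr[j] for j in ao])
    if not pairs:
        return []
    anchors = [(at[p], ao[q]) for p, q in pairs]
    result, prev_t, prev_o = [], 0, 0
    # The final sentinel pair closes the segment after the last anchor.
    for t, o in anchors + [(len(truth), len(ocr))]:
        result += align(truth[prev_t:t], ocr[prev_o:o],
                        t0 + prev_t, o0 + prev_o, min_len)
        if t < len(truth):
            result.append((t0 + t, o0 + o))
        prev_t, prev_o = t + 1, o + 1
    return result


truth = "it was the best of times it was the worst of times".split()
ocr = "it was tbe best of timcs it was the worst of tines".split()
print(align(truth, ocr))
```

In this toy input only "best" and "worst" are unique in both texts at the top level; words such as "it" and "was" become unique inside the segment between those anchors, which is what lets the recursion add further anchors.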


Mining Relational Structure From Millions Of Books, David A. Smith, R. Manmatha, James Allan Dec 2010

R. Manmatha

Existing large-scale scanned book collections have many shortcomings for data-driven research, from OCR of variable quality to the lack of accurate descriptive and structural metadata. We argue that complementary research in inferring relational metadata is important in its own right to support use of these collections and that it can help to mitigate other problems with scanned book collections.


BLSTM Neural Network Based Word Retrieval For Hindi Documents, Raman Jain, Volkmar Frinken, C. V. Jawahar, R. Manmatha Dec 2010

R. Manmatha

Retrieval from Hindi document image collections is a challenging task. This is partly due to the complexity of the script, which has more than 800 unique ligatures. In addition, segmentation and recognition of individual characters often become difficult due to the writing style as well as degradations in the print. For these reasons, robust OCRs are nonexistent for Hindi. Therefore, Hindi document repositories are not amenable to indexing and retrieval. In this paper, we propose a scheme for retrieving relevant Hindi documents in response to a query word. This approach uses BLSTM neural networks. Designed to take contextual information …


Nearest Neighbor Based Collection OCR, Pramod Sankar K., C. V. Jawahar, R. Manmatha Dec 2009

R. Manmatha

Conventional optical character recognition (OCR) systems operate on individual characters and words, and do not normally exploit document or collection context. We describe a Collection OCR which takes advantage of the fact that multiple examples of the same word (often in the same font) may occur in a document or collection. The idea here is that an OCR or a reCAPTCHA like process generates a partial set of recognized words. In the second stage, a nearest neighbor algorithm compares the remaining word-images to those already recognized and propagates labels from the nearest neighbors. It is shown that by using an …
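
The second-stage label propagation described in this abstract can be sketched as a nearest-neighbor lookup. The feature vectors, the Euclidean distance, and the `max_dist` rejection threshold are all assumptions made for illustration; the abstract does not say which features or distance the authors use.

```python
import math


def propagate_labels(labeled, unlabeled, max_dist=1.0):
    """Give each unlabeled word image the label of its nearest labeled one.

    labeled:   list of (feature_vector, word) pairs from the first stage
    unlabeled: list of feature vectors for the remaining word images
    max_dist:  hypothetical cutoff; farther images stay unlabeled (None)
    """
    results = []
    for feat in unlabeled:
        nearest_feat, nearest_word = min(
            labeled, key=lambda pair: math.dist(pair[0], feat)
        )
        close_enough = math.dist(nearest_feat, feat) <= max_dist
        results.append(nearest_word if close_enough else None)
    return results


# Hypothetical 2-D features for two recognized words and three unknown images.
recognized = [((0.0, 0.0), "the"), ((5.0, 5.0), "book")]
remaining = [(0.2, 0.1), (4.8, 5.1), (20.0, 20.0)]
print(propagate_labels(recognized, remaining))
```

The third image is far from every recognized word, so it is left unlabeled instead of being forced onto the closest label.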


Finding Words In Alphabet Soup: Inference On Freeform Character Recognition For Historical Scripts, Nicholas R. Howe, Shaolei Feng, R. Manmatha Dec 2008

R. Manmatha

This paper develops word recognition methods for historical handwritten cursive and printed documents. It employs a powerful segmentation-free letter detection method based upon joint boosting with histogram-of-gradients features. Efficient inference on an ensemble of hidden Markov models can select the most probable sequence of candidate character detections to recognize complete words in ambiguous handwritten text, drawing on character n-gram and physical separation models. Experiments with two corpora of handwritten historic documents show that this approach recognizes known words more accurately than previous efforts, and can also recognize out-of-vocabulary words.


A Discrete Direct Retrieval Model For Image And Video Retrieval, Shaolei Feng, R. Manmatha Dec 2007

R. Manmatha

This paper proposes a formal framework for image and video retrieval using discrete Markov random fields (MRF). The training dataset consists of images with keywords (regions are not labeled). The model may be built using quantized region or point features generated from the training images. Unlike many previous techniques, our MRF based model doesn't require an explicit annotation step for retrieval. The model directly ranks all test images according to the posterior probability of an image given a query. Image and video retrieval experiments are performed on two standard datasets (one Corel dataset and a TRECVID3 dataset) which consist of 4,500 …


A Hierarchical, HMM-Based Automatic Evaluation Of OCR Accuracy For A Digital Library Of Books, Shaolei Feng, R. Manmatha Dec 2005

R. Manmatha

A number of projects are creating searchable digital libraries of printed books. These include the Million Book Project, the Google Book project and similar efforts from Yahoo and Microsoft. Content-based online book retrieval usually requires first converting printed text into machine readable (e.g. ASCII) text using an optical character recognition (OCR) engine and then doing full text search on the results. Many of these books are old and there are a variety of processing steps that are required to create an end-to-end system. Changing any step (including the scanning process) can affect OCR performance and hence a …


Combining Text And Audio-Visual Features In Video Indexing, Shih-Fu Chang, R. Manmatha, Tat-Seng Chua Dec 2004

R. Manmatha

We discuss the opportunities, state of the art, and open research issues in using multi-modal features in video indexing. Specifically, we focus on how imperfect text data obtained by automatic speech recognition (ASR) may be used to help solve challenging problems, such as story segmentation, concept detection, retrieval, and topic clustering. We review the frameworks and machine learning techniques that are used to fuse the text features with audio-visual features. Case studies showing promising performance will be described, primarily in the broadcast news video domain.


Boosted Decision Trees For Word Recognition In Handwritten Document Retrieval, Nicholas R. Howe, Toni M. Rath, R. Manmatha Dec 2004

R. Manmatha

Recognition and retrieval of historical handwritten material is an unsolved problem. We propose a novel approach to recognizing and retrieving handwritten manuscripts, based upon word image classification as a key step. Decision trees with normalized pixels as features form the basis of a highly accurate AdaBoost classifier, trained on a corpus of word images that have been resized and sampled at a pyramid of resolutions. To stem problems from the highly skewed distribution of class frequencies, word classes with very few training samples are augmented with stochastically altered versions of the originals. This increases recognition performance substantially. On a standard …


Joint Visual-Text Modeling For Automatic Retrieval Of Multimedia Documents, G. Iyengar, P. Duygulu, S. Feng, P. Ircing, S. P. Khudanpur, D. Klakow, M. R. Krause, R. Manmatha, H. J. Nock, D. Petkova, B. Pytlik, P. Virga Dec 2004

R. Manmatha

In this paper we describe our approach for jointly modeling the text part and the visual part of multimedia documents for the purpose of information retrieval (IR). In the prevalent state-of-the-art systems, a late combination between two independent systems, one analyzing just the text part of such documents, and the other analyzing the visual part without leveraging any knowledge acquired in the text processing, is the norm. Such systems rarely exceed the performance of any single modality (i.e. text or video) in information retrieval tasks. Our experiments indicate that allowing a rich interaction between the modalities results in significant improvement …


Classification Models For Historical Manuscript Recognition, S. L. Feng, R. Manmatha Dec 2004

R. Manmatha

This paper investigates different machine learning models to solve the historical handwritten manuscript recognition problem. In particular, we test and compare support vector machines, conditional maximum entropy models and Naive Bayes with kernel density estimates and explore their behaviors and properties when solving this problem. We focus on a whole word problem to avoid having to do character segmentation which is difficult with degraded handwritten documents. Our results on a publicly available standard dataset of 20 pages of George Washington's manuscripts show that Naive Bayes with Gaussian kernel density estimates significantly outperforms the other models and prior work using hidden …


Statistical Models For Automatic Video Annotation And Retrieval, V. Lavrenko, S. L. Feng, R. Manmatha Dec 2003

R. Manmatha

We apply a continuous relevance model (CRM) to the problem of directly retrieving the visual content of videos using text queries. The model computes a joint probability model for image features and words using a training set of annotated images. The model may then be used to annotate unseen test images. The probabilistic annotations are used for retrieval using text queries. We also propose a modified model - the normalized CRM - which substantially improves performance on a subset of the TREC Video dataset.


A Scale Space Approach For Automatically Segmenting Words From Historical Handwritten Documents, R. Manmatha, Jamie L. Rothfeder Dec 2003

R. Manmatha

Many libraries, museums, and other organizations hold large collections of handwritten historical documents, for example, the papers of early presidents like George Washington at the Library of Congress. The first step in providing recognition/retrieval tools is to automatically segment handwritten pages into words. State-of-the-art segmentation techniques like the gap metrics algorithm have been mostly developed and tested on highly constrained documents like bank checks and postal addresses. There has been little work on full handwritten pages and this work has usually involved testing on clean artificial documents created for the purpose of research. Historical manuscript images, …


Holistic Word Recognition For Handwritten Historical Documents, Victor Lavrenko, Toni M. Rath, R. Manmatha Dec 2003

R. Manmatha

Most offline handwriting recognition approaches proceed by segmenting words into smaller pieces (usually characters) which are recognized separately. The recognition result of a word is then the composition of the individually recognized parts. Inspired by results in cognitive psychology, researchers have begun to focus on holistic word recognition approaches. Here we present a holistic word recognition approach for single-author historical documents, which is motivated by the fact that for severely degraded documents a segmentation of words into characters will produce very poor results. The quality of the original documents does not allow us to recognize them with high accuracy - …


A Search Engine For Historical Manuscript Images, Toni M. Rath, R. Manmatha, Victor Lavrenko Dec 2003

R. Manmatha

Many museum and library archives are digitizing their large collections of handwritten historical manuscripts to enable public access to them. These collections are only available in image formats and require expensive manual annotation work for access to them. Current handwriting recognizers have word error rates in excess of 50% and therefore cannot be used for such material. We describe two statistical models for retrieval in large collections of handwritten manuscripts given a text query. Both use a set of transcribed page images to learn a joint probability distribution between features computed from word images and their transcriptions. The models can …


An Inference Network Approach To Image Retrieval, Donald Metzler, R. Manmatha Dec 2003

R. Manmatha

Most image retrieval systems only allow a fragment of text or an example image as a query. Most users have more complex information needs that are not easily expressed in either of these forms. This paper proposes a model based on the Inference Network framework from information retrieval that employs a powerful query language that allows structured query operators, term weighting, and the combination of text and images within a query. The model uses non-parametric methods to estimate probabilities within the inference network. Image annotation and retrieval results are reported and compared against other published systems and illustrative structured and …


Using Maximum Entropy For Automatic Image Annotation, Jiwoon Jeon, R. Manmatha Dec 2003

R. Manmatha

In this paper, we propose the use of the Maximum Entropy approach for the task of automatic image annotation. Given labeled training data, Maximum Entropy is a statistical technique which allows one to predict the probability of a label given test data. The technique allows relationships between features to be effectively captured and has been successfully applied to a number of language tasks including machine translation. In our case, we view the image annotation task as one where a training data set of images labeled with keywords is provided and we need to automatically label the test images with …


Server Selection Techniques For Distribution Information Retrieval, Yoshiya Kinuta, Brian Neil Levine, R. Manmatha Dec 2002

R. Manmatha

Server selection is typically defined as maximizing network performance under the assumption that each server holds an exact replica of all data. We propose and evaluate methods of server selection when servers are not exact replicas such that we maximize both network performance and information retrieval (IR) precision (i.e., the relevance of retrieved data). We show that naive composition of previously proposed techniques from networking and IR performs poorly. We propose improving the performance of current IR selection techniques by using language-model-based selection to construct local replicas of databases that network selection predicts are likely to be poor network …


A Statistical Approach To Retrieving Historical Manuscript Images Without Recognition, Toni M. Rath, Victor Lavrenko, R. Manmatha Dec 2002

R. Manmatha

Handwritten historical document collections in libraries and other areas are often of interest to researchers, students or the general public. Convenient access to such corpora generally requires an index, which allows one to locate individual text units (pages, sentences, lines) that are relevant to a given query (usually provided as ASCII text). Several solutions are possible: manual annotation (very expensive), handwriting recognition (poor results) and word spotting - an image matching approach (computationally expensive).

In this work, we present a novel retrieval approach for historical document collections, which does not require recognition. We assume that word images can be described …


Retrieving Historical Manuscripts Using Shape, Toni M. Rath, Victor Lavrenko, R. Manmatha Dec 2002

R. Manmatha

Convenient access to handwritten historical document collections in libraries generally requires an index, which allows one to locate individual text units (pages, sentences, lines) that are relevant to a given query (usually provided as text). Currently, extensive manual labor is used to annotate and organize such collections, because handwriting recognition approaches provide only poor results on old documents.

In this work, we present a novel retrieval approach for historical document collections, which does not require recognition. We assume that word images can be described using a vocabulary of discretized word features. From a training set of labeled word images, we …


Indexing Of Handwritten Historical Documents - Recent Progress, R. Manmatha, Toni M. Rath Dec 2002

R. Manmatha

Indexing and searching collections of handwritten archival documents and manuscripts has always been a challenge because handwriting recognizers do not perform well on such noisy documents. Given a collection of documents written by a single author (or a few authors), one can apply a technique called word spotting. The approach is to cluster word images based on their visual appearance, after segmenting them from the documents. Annotation can then be performed for clusters rather than documents.

Given segmented pages, matching handwritten word images in historical documents is a great challenge due to the variations in handwriting and the noise in …


Text Alignment With Handwritten Documents, E. Micah Kornfield, R. Manmatha, James Allan Dec 2002

R. Manmatha

Today's digital libraries increasingly include not only printed text but also scanned handwritten pages and other multimedia material. There are, however, few tools available for manipulating handwritten pages. Here, we propose an algorithm based on dynamic time warping (DTW) for a word-by-word alignment of handwritten documents with their (ASCII) transcripts. We see at least three uses for such alignment algorithms. First, alignment algorithms allow us to produce displays (for example on the web) which allow a person to easily find their place in the manuscript when reading a transcript. Second, such alignment algorithms will allow us to produce …
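
Dynamic time warping itself can be illustrated with a short, generic Python sketch. The inputs below are hypothetical: one scalar per transcript word (its character count) and one scalar per segmented word image (an estimated character count). The features and cost function used in the paper are not given in this abstract, so `abs(x - y)` is only a placeholder.

```python
def dtw_path(a, b, cost=lambda x, y: abs(x - y)):
    """Return (total_cost, path) for the standard DTW recurrence.

    path is a list of (i, j) pairs aligning a[i] with b[j]; one element of
    a may be paired with several elements of b and vice versa.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
            d[i][j] = cost(a[i - 1], b[j - 1]) + best_prev
    # Backtrack from the end, always moving to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (d[i - 1][j - 1], i - 1, j - 1),
            (d[i - 1][j], i - 1, j),
            (d[i][j - 1], i, j - 1),
        )
    path.reverse()
    return d[n][m], path


# Hypothetical example: the second transcript word was split into two images.
transcript_lengths = [3, 5, 2]
image_lengths = [3, 5, 5, 2]
total, path = dtw_path(transcript_lengths, image_lengths)
print(total, path)
```

The path pairs the second transcript word with two consecutive word images, which is the kind of one-to-many match that makes DTW tolerant of segmentation errors.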


Challenges In Information Retrieval And Language Modeling, Jay Aslam, Nicholas Belkin, Chris Buckley, Jamie Callan, Sue Dumais, Norbert Fuhr, Donna Harman, David J. Harper, Djoerd Hiemstra, Thomas Hofmann, Eduard Hovy, Wessel Kraaij, John Lafferty, Victor Lavrenko, David Lewis, Liz Liddy, R. Manmatha, Andrew McCallum, Jay Ponte, John Prager, Dragomir Radev, Philip Resnik, Stephen Robertson, Roni Rosenfeld, Salim Roukos, Mark Sanderson, Rich Schwartz, Amit Singhal, Alan Smeaton, Howard Turtle, Ellen Voorhees, Ralph Weischedel, Jinxi Xu, Chengxiang Zhai Dec 2002

R. Manmatha

Information retrieval (IR) research has reached a point where it is appropriate to assess progress and to define a research agenda for the next five to ten years. This report summarizes a discussion of IR research challenges that took place at a recent workshop.

The attendees of the workshop considered information retrieval research in a range of areas chosen to give broad coverage of topic areas that engage information retrieval researchers. Those areas are retrieval models, cross-lingual retrieval, Web search, user modeling, filtering, topic detection and tracking, classification, summarization, question answering, metasearch, distributed retrieval, multimedia retrieval, information extraction, as well …