Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Computer Sciences

Selected Works

R. Manmatha

Partial duplicate detection

Articles 1 - 2 of 2

Full-Text Articles in Physical Sciences and Mathematics

Partial Duplicate Detection For Large Book Collections, Ismet Zeki Yalniz, Ethem F. Can, R. Manmatha Dec 2010

A framework is presented for discovering partial duplicates in large collections of scanned books with optical character recognition (OCR) errors. Each book in the collection is represented by the sequence of words, in the order they appear in the text, that appear only once in the book. These words are referred to as "unique words" and they constitute a small percentage of all the words in a typical book. Along with the order information, the set of unique words provides a compact representation that is highly descriptive of the content and the flow of ideas in the book. By aligning …
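
The unique-word representation described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `unique_word_sequence` and the tokenization rule (lowercased runs of ASCII letters) are assumptions made for illustration only, and the alignment step is not shown.

```python
import re
from collections import Counter


def unique_word_sequence(text: str) -> list[str]:
    """Return the words that occur exactly once in `text`, in reading order.

    Tokenization is an assumption for this sketch: lowercase the text and
    treat each run of ASCII letters as one word.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # Keep only words seen once; list order preserves their order in the text.
    return [word for word in tokens if counts[word] == 1]


sample = "the cat sat on the mat while the dog slept on the rug"
print(unique_word_sequence(sample))
# prints ['cat', 'sat', 'mat', 'while', 'dog', 'slept', 'rug']
```

In the toy sentence, "the" and "on" repeat and are dropped, so the remaining sequence is shorter than the text while still preserving word order.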


Mining Relational Structure From Millions Of Books, David A. Smith, R. Manmatha, James Allan Dec 2010

Existing large-scale scanned book collections have many shortcomings for data-driven research, from OCR of variable quality to the lack of accurate descriptive and structural metadata. We argue that complementary research in inferring relational metadata is important in its own right to support use of these collections and that it can help to mitigate other problems with scanned book collections.