Open Access. Powered by Scholars. Published by Universities.®

Series

Computer Sciences

OCR

Publication Year

Articles 1 - 2 of 2

Full-Text Articles in Cataloging and Metadata

Automatic Metadata Extraction Incorporating Visual Features From Scanned Electronic Theses And Dissertations, Muntabir Hasan Choudhury, Himarsha R. Jayanetti, Jian Wu, William A. Ingram, Edward A. Fox Jan 2021

Automatic Metadata Extraction Incorporating Visual Features From Scanned Electronic Theses And Dissertations, Muntabir Hasan Choudhury, Himarsha R. Jayanetti, Jian Wu, William A. Ingram, Edward A. Fox

Computer Science Faculty Publications

Electronic Theses and Dissertations (ETDs) contain domain knowledge that can be used for many digital library tasks, such as analyzing citation networks and predicting research trends. Automatic metadata extraction is important to build scalable digital library search engines. Most existing methods are designed for born-digital documents, so they often fail to extract metadata from scanned documents such as ETDs. Traditional sequence tagging methods mainly rely on text-based features. In this paper, we propose a conditional random field (CRF) model that combines text-based and visual features. To verify the robustness of our model, we extended an existing corpus and created a …


A Heuristic Baseline Method For Metadata Extraction From Scanned Electronic Theses And Dissertations, Muntabir H. Choudhury, Jian Wu, William A. Ingam, Edward A. Fox Jan 2020

A Heuristic Baseline Method For Metadata Extraction From Scanned Electronic Theses And Dissertations, Muntabir H. Choudhury, Jian Wu, William A. Ingam, Edward A. Fox

Computer Science Faculty Publications

Extracting metadata from scholarly papers is an important text mining problem. Widely used open-source tools such as GROBID are designed for born-digital scholarly papers but often fail for scanned documents, such as Electronic Theses and Dissertations (ETDs). Here we present a preliminary baseline work with a heuristic model to extract metadata from the cover pages of scanned ETDs. The process started with converting scanned pages into images and then text files by applying OCR tools. Then a series of carefully designed regular expressions for each field is applied, capturing patterns for seven metadata fields: titles, authors, years, degrees, academic programs, …