Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Old Dominion University

Computer Sciences

Computer Science Faculty Publications

Digital libraries

Articles 1 - 22 of 22

Full-Text Articles in Physical Sciences and Mathematics

D-Lib Magazine Pioneered Web-Based Scholarly Communication, Michael L. Nelson, Herbert Van De Sompel Jan 2022

The web began with a vision of, as stated by Tim Berners-Lee in 1991, “that much academic information should be freely available to anyone”. For many years, the development of the web and the development of digital libraries and other scholarly communications infrastructure proceeded in tandem. A milestone occurred in July, 1995, when the first issue of D-Lib Magazine was published as an online, HTML-only, open access magazine, serving as the focal point for the then emerging digital library research community. In 2017 it ceased publication, in part due to the maturity of the community it served as well as …


Automatic Metadata Extraction Incorporating Visual Features From Scanned Electronic Theses And Dissertations, Muntabir Hasan Choudhury, Himarsha R. Jayanetti, Jian Wu, William A. Ingram, Edward A. Fox Jan 2021

Electronic Theses and Dissertations (ETDs) contain domain knowledge that can be used for many digital library tasks, such as analyzing citation networks and predicting research trends. Automatic metadata extraction is important to build scalable digital library search engines. Most existing methods are designed for born-digital documents, so they often fail to extract metadata from scanned documents such as ETDs. Traditional sequence tagging methods mainly rely on text-based features. In this paper, we propose a conditional random field (CRF) model that combines text-based and visual features. To verify the robustness of our model, we extended an existing corpus and created a …


A Heuristic Baseline Method For Metadata Extraction From Scanned Electronic Theses And Dissertations, Muntabir H. Choudhury, Jian Wu, William A. Ingram, Edward A. Fox Jan 2020

Extracting metadata from scholarly papers is an important text mining problem. Widely used open-source tools such as GROBID are designed for born-digital scholarly papers but often fail for scanned documents, such as Electronic Theses and Dissertations (ETDs). Here we present a preliminary baseline work with a heuristic model to extract metadata from the cover pages of scanned ETDs. The process started with converting scanned pages into images and then text files by applying OCR tools. Then a series of carefully designed regular expressions for each field is applied, capturing patterns for seven metadata fields: titles, authors, years, degrees, academic programs, …
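The regular-expression step described in this abstract can be illustrated with a minimal sketch. The patterns, field names, and the `extract_cover_metadata` helper below are assumptions made for illustration only; they are not the expressions used by the authors, and they cover only four of the seven fields.

```python
import re

# Hypothetical patterns for illustration; not the authors' expressions.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")
DEGREE_RE = re.compile(
    r"\b(Doctor of Philosophy|Master of Science|Master of Arts)\b",
    re.IGNORECASE,
)
# Assumes the author name follows a line that starts with "by".
AUTHOR_RE = re.compile(r"^\s*by\s+(.+)$", re.IGNORECASE | re.MULTILINE)


def extract_cover_metadata(ocr_text: str) -> dict:
    """Apply one pattern per field to the OCR text of a cover page."""
    lines = [ln.strip() for ln in ocr_text.splitlines() if ln.strip()]
    year = YEAR_RE.search(ocr_text)
    degree = DEGREE_RE.search(ocr_text)
    author = AUTHOR_RE.search(ocr_text)
    return {
        # Assumption: the first non-empty line is the title.
        "title": lines[0] if lines else None,
        "author": author.group(1).strip() if author else None,
        "year": year.group(0) if year else None,
        "degree": degree.group(1) if degree else None,
    }
```

A real cover page would need more robust title detection (titles often span several lines) and tolerance for OCR noise; the sketch only shows the one-pattern-per-field structure.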


Swimming In A Sea Of JavaScript Or: How I Learned To Stop Worrying And Love High-Fidelity Replay, John A. Berlin, Michael L. Nelson, Michele C. Weigle Jan 2018

[First paragraph] Preserving and replaying modern web pages in high fidelity has become an increasingly difficult task due to the increased usage of JavaScript. Reliance on server-side rewriting alone results in live-leakage and/or the inability to replay a page due to the preserved JavaScript performing an action not permissible from the archive. The current state-of-the-art high-fidelity archival preservation and replay solutions rely on handcrafted client-side URL rewriting libraries specifically tailored for the archive, namely Webrecorder's and Pywb's wombat.js [12]. Web archives not utilizing client-side rewriting rely on server-side rewriting that misses URLs used in a manner not accounted for …


Client-Assisted Memento Aggregation Using The Prefer Header, Mat Kelly, Sawood Alam, Michael L. Nelson, Michele C. Weigle Jan 2018

[First paragraph] Preservation of the Web ensures that future generations have a picture of how the web was. Web archives like Internet Archive's Wayback Machine, WebCite, and archive.is allow individuals to submit URIs to be archived, but the captures they preserve then reside at the archives. Traversing these captures in time as preserved by multiple archive sources (using Memento [8]) provides a more comprehensive picture of the past Web than relying on a single archive. Some content on the Web, such as content behind authentication, may be unsuitable or inaccessible for preservation by these organizations. Furthermore, this content may be …


Avoiding Zombies In Archival Replay Using ServiceWorker, Sawood Alam, Mat Kelly, Michele C. Weigle, Michael L. Nelson Jan 2017

[First paragraph] A Composite Memento is an archived representation of a web page with all the page requisites such as images and stylesheets. All embedded resources have their own URIs, hence, they are archived independently. For a meaningful archival replay, it is important to load all the page requisites from the archive within the temporal neighborhood of the base HTML page. To achieve this goal, archival replay systems try to rewrite all the resource references to appropriate archived versions before serving HTML, CSS, or JS. However, an effective server-side URL rewriting is difficult when URLs are generated dynamically using JavaScript. …


Moved But Not Gone: An Evaluation Of Real-Time Methods For Discovering Replacement Web Pages, Martin Klein, Michael L. Nelson Jan 2014

Inaccessible Web pages and 404 “Page Not Found” responses are a common Web phenomenon and a detriment to the user’s browsing experience. The rediscovery of missing Web pages is, therefore, a relevant research topic in the digital preservation as well as in the Information Retrieval realm. In this article, we bring these two areas together by analyzing four content- and link-based methods to rediscover missing Web pages. We investigate the retrieval performance of the methods individually as well as their combinations and give an insight into how effective these methods are over time. As the main result of this work, …


Creating Preservation-Ready Web Resources, Joan A. Smith, Michael L. Nelson Jan 2008

There are innumerable departmental, community, and personal web sites worthy of long-term preservation but proportionally fewer archivists available to properly prepare and process such sites. We propose a simple model for such everyday web sites which takes advantage of the web server itself to help prepare the site's resources for preservation. This is accomplished by having metadata utilities analyze the resource at the time of dissemination. The web server responds to the archiving repository crawler by sending both the resource and the just-in-time generated metadata as a straight-forward XML-formatted response. We call this complex object (resource + metadata) a CRATE. …


Synchronization And Multiple Group Server Support For Kepler, K. Maly, M. Zubair, H. Siripuram, S. Zunjarwad, Yannis Manolopoulos (Ed.), Joaquim Filipe (Ed.), Panos Constantopoulos (Ed.), José Cordeiro (Ed.) Jan 2006

In the last decade literally thousands of digital libraries have emerged, but one of the biggest obstacles for dissemination of information to a user community is that many digital libraries use different, proprietary technologies that inhibit interoperability. The Kepler framework addresses interoperability and gives publication control to individual publishers. In Kepler, OAI-PMH is used to support "personal data providers" or "archivelets". In our vision, individual publishers can be integrated with an institutional repository like DSpace by means of a Kepler Group Digital Library (GDL). The GDL aggregates metadata and full text from archivelets and can act as an OAI-compliant data provider …


FeDCOR: An Institutional CORDRA Registry, Giridhar Manepalli, Henry Jerez, Michael L. Nelson Jan 2006

FeDCOR (Federation of DSpace using CORDRA) is a registry-based federation system for DSpace instances. It is based on the CORDRA model. The first article in this issue of D-Lib Magazine describes the Advanced Distributed Learning-Registry (ADL-R) [1], which is the first operational CORDRA registry, and also includes an introduction to CORDRA. That introduction, or other prior knowledge of the CORDRA effort, is recommended for the best understanding of this article, which builds on that base to describe in detail the FeDCOR approach.


Final Report For The Development Of The NASA Technical Report Server (NTRS), Michael L. Nelson Jan 2005

The author performed a variety of research, development and consulting tasks for NASA Langley Research Center in the area of digital libraries (DLs) and supporting technologies, such as the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). In particular, the development focused on the NASA Technical Report Server (NTRS) and its transition from a distributed searching model to one that uses the OAI-PMH. The Open Archives Initiative (OAI) is an international consortium focused on furthering the interoperability of DLs through the use of "metadata harvesting". The OAI-PMH version of NTRS went into public production on April 28, 2003. Since that …


Lessons Learned With Arc, An OAI-PMH Service Provider, Xiaoming Liu, Kurt Maly, Michael L. Nelson Jan 2005

Web-based digital libraries have historically been built in isolation utilizing different technologies, protocols, and metadata. These differences hindered the development of digital library services that enable users to discover information from multiple libraries through a single unified interface. The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a major, international effort to address technical interoperability among distributed repositories. Arc debuted in 2000 as the first end-user OAI-PMH service provider. Since that time, Arc has grown to include nearly 7,000,000 metadata records. Arc has been deployed in a number of environments and has served as the basis for many other …
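For readers unfamiliar with the protocol named in this abstract, an OAI-PMH harvest begins with a plain HTTP request carrying a `verb` parameter. The sketch below builds a `ListRecords` request URL; the base URL and the `list_records_url` helper are hypothetical, and this is not Arc's code. The `verb`, `metadataPrefix`, `from`, and `set` parameter names and the `oai_dc` prefix come from the OAI-PMH specification.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    from_date: Optional[str] = None,
    set_spec: Optional[str] = None,
) -> str:
    """Build an OAI-PMH ListRecords request URL for a repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # selective harvesting by datestamp
    if set_spec:
        params["set"] = set_spec  # selective harvesting by set
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository endpoint, used only as an example.
url = list_records_url("https://repository.example.org/oai", from_date="2005-01-01")
```

A service provider would fetch this URL, parse the XML response, and follow any `resumptionToken` to page through the remaining records.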


Metadata And Buckets In The Smart Object, Dumb Archive (SODA) Model, Michael L. Nelson, Kurt Maly, Delwin R. Croom Jr., Steven W. Robbins Jan 2004

We present the Smart Object, Dumb Archive (SODA) model for digital libraries (DLs), and discuss the role of metadata in SODA. The premise of the SODA model is to "push down" many of the functionalities generally associated with archives into the data objects themselves. Thus the data objects become "smarter", and the archives "dumber". In the SODA model, archives become primarily set managers, and the objects themselves negotiate and handle presentation, enforce terms and conditions, and perform data content management. Buckets are our implementation of smart objects, and da is our reference implementation for dumb archives. We also present our …


Report On The Third ACM/IEEE Joint Conference On Digital Libraries (JCDL), Michael L. Nelson Jan 2003

The Third ACM/IEEE Joint Conference on Digital Libraries (JCDL 2003) was held on the campus of Rice University in Houston, Texas, May 27 - 31. Regarding the merging of the ACM and IEEE conference series, in the JCDL 2002 conference report published last year in D-Lib Magazine Edie Rasmussen noted, "Perhaps by next year...no one will remember that it wasn't always so" [1]. Judging by the number of participants I met who did not know that the ACM and IEEE used to hold separate digital library conferences, Rasmussen's prediction has come to pass.


Object Persistence And Availability In Digital Libraries, Michael L. Nelson, B. Danette Allen Jan 2002

We have studied object persistence and availability of 1,000 digital library (DL) objects. Twenty World Wide Web accessible DLs were chosen and from each DL, 50 objects were chosen at random. A script checked the availability of each object three times a week for just over 1 year for a total of 161 data samples. During this time span, we found 31 objects (3% of the total) that appear to no longer be available: 24 from PubMed Central, 5 from IDEAS, 1 from CogPrints, and 1 from ETD.
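The figures in this abstract can be cross-checked with a few lines of arithmetic. The variable names are illustrative; every number is taken from the abstract itself.

```python
# Sample size: 20 digital libraries, 50 objects drawn from each.
libraries = 20
objects_per_library = 50
total_objects = libraries * objects_per_library  # 1,000 objects

# Objects reported as no longer available, by source.
lost_by_source = {"PubMed Central": 24, "IDEAS": 5, "CogPrints": 1, "ETD": 1}
lost_total = sum(lost_by_source.values())  # 31 objects

# 31 of 1,000 is 3.1%, which the abstract rounds to 3%.
lost_percent = 100 * lost_total / total_objects

# 161 samples at three checks per week is a little under 54 weeks,
# consistent with "just over 1 year".
weeks_observed = 161 / 3
```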


A Scalable Architecture For Harvest-Based Digital Libraries, Xiaoming Liu, Tim Brody, Stevan Harnad, Les Carr, Kurt Maly, Mohammad Zubair, Michael L. Nelson Jan 2002

This article discusses the requirements of current and emerging applications based on the Open Archives Initiative (OAI) and emphasizes the need for a common infrastructure to support them. Inspired by HTTP proxy, cache, gateway and web service concepts, a design for a scalable and reliable infrastructure that aims at satisfying these requirements is presented. Moreover, it is shown how various applications can exploit the services included in the proposed infrastructure. The article concludes by discussing the current status of several prototype implementations.


Arc - An OAI Service Provider For Digital Library Federation, Xiaoming Liu, Kurt Maly, Mohammad Zubair, Michael L. Nelson Jan 2001

The usefulness of the many on-line journals and scientific digital libraries that exist today is limited by the inability to federate these resources through a unified interface. The Open Archives Initiative (OAI) is one major effort to address technical interoperability among distributed archives. The objective of OAI is to develop a framework to facilitate the discovery of content in distributed archives. In this paper, we describe our experience and lessons learned in building Arc, the first federated searching service based on the OAI protocol. Arc harvests metadata from several OAI-compliant archives, normalizes them, and stores them in a search …


Smart Objects And Open Archives, Michael L. Nelson, Kurt Maly Jan 2001

Within the context of digital libraries (DLs), we are making information objects "first-class citizens". We decouple information objects from the systems used for their storage and retrieval, allowing the technology for both DLs and information content to progress independently. We believe dismantling the stovepipe of "DL-archive-content" is the first step in building richer DL experiences for users and ensuring the long-term survivability of digital information. To demonstrate this partitioning between DLs, archives and information content, we introduce "buckets": aggregative, intelligent, object-oriented constructs for publishing in digital libraries. Buckets exist within the "Smart Object, Dumb Archive" (SODA) DL model, which promotes …


Buckets: Smart Objects For Digital Libraries, Michael L. Nelson Jan 2001

Current discussion of digital libraries (DLs) is often dominated by the merits of the respective storage, search and retrieval functionality of archives, repositories, search engines, search interfaces and database systems. While these technologies are necessary for information management, the information content is more important than the systems used for its storage and retrieval. Digital information should have the same long-term survivability prospects as traditional hardcopy information and should be protected to the extent possible from evolving search engine technologies and vendor vagaries in database management systems. Information content and information retrieval systems should progress on independent paths and make limited …


The UPS Prototype: An Experimental End-User Service Across E-Print Archives, Herbert Van De Sompel, Thomas Krichel, Michael L. Nelson, Patrick Hochstenbach, Victor Lyapunov, Kurt Maly, Mohammad Zubair, Mohamed Kholief, Xiaoming Liu, Heath O'Connell Jan 2000

A meeting was held in Santa Fe, New Mexico, October 21-22, 1999, to generate discussion and consensus about interoperability of publicly available scholarly information archives. The invitees represented several well known e-print and report archive initiatives, as well as organizations with interests in digital libraries and the transformation of scholarly communication. The central goal of the meeting was to agree on recommendations that would make the creation of end-user services -- such as scientific search engines and linking systems -- for data originating from distributed and dissimilar archives easier. The Universal Preprint Service (UPS) Prototype was developed in preparation for …


A Digital Library For The National Advisory Committee For Aeronautics, Michael L. Nelson Jan 1999

We describe the digital library (DL) for the National Advisory Committee for Aeronautics (NACA), the NACA Technical Report Server (NACATRS). The predecessor organization for the National Aeronautics and Space Administration (NASA), NACA existed from 1915 until 1958. The primary manifestation of NACA's research was the NACA report series. We describe the process of converting this collection of reports to digital format and making it available on the World Wide Web (WWW) and as a node in the NASA Technical Report Server (NTRS). We describe the current state of the project, the resulting DL technology developed from the project, and the …


Buckets: Aggregative, Intelligent Agents For Publishing, Michael L. Nelson, Kurt Maly, Stewart N. T. Shen, Mohammad Zubair Jan 1998

Buckets are an aggregative, intelligent construct for publishing in digital libraries. The goal of research projects is to produce information. This information is often instantiated in several forms, differentiated by semantic types (report, software, video, datasets, etc.). A given semantic type can be further differentiated by syntactic representations as well (PostScript version, PDF version, Word version, etc.). Although the information was created together and subtle relationships can exist between them, different semantic instantiations are generally segregated along currently obsolete media boundaries. Reports are placed in report archives, software might go into a software archive, but most of the data and …