Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Articles 1 - 13 of 13

Full-Text Articles in Physical Sciences and Mathematics

Hashes Are Not Suitable To Verify Fixity Of The Public Archived Web, Mohamed Aturban, Martin Klein, Herbert Van De Sompel, Sawood Alam, Michael L. Nelson, Michele C. Weigle Jan 2023

Computer Science Faculty Publications

Web archives, such as the Internet Archive, preserve the web and allow access to prior states of web pages. We implicitly trust their versions of archived pages, but as their role moves from preserving curios of the past to facilitating present day adjudication, we are concerned with verifying the fixity of archived web pages, or mementos, to ensure they have always remained unaltered. A widely used technique in digital preservation to verify the fixity of an archived resource is to periodically compute a cryptographic hash value on a resource and then compare it with a previous hash value. If the …
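The periodic hash comparison described above can be sketched as follows. This is a minimal illustration of the conventional fixity check, not the authors' tooling; the function names and the sample payloads are assumptions made for the example, and SHA-256 is chosen only as a common cryptographic hash.

```python
import hashlib


def sha256_hex(payload: bytes) -> str:
    """Return the SHA-256 digest of a payload as a hex string."""
    return hashlib.sha256(payload).hexdigest()


def fixity_unchanged(payload: bytes, recorded_digest: str) -> bool:
    """Compare a fresh digest of the payload with a previously recorded one."""
    return sha256_hex(payload) == recorded_digest


# Hypothetical payloads standing in for two replays of the same memento.
first_capture = b"<html><body>Archived page</body></html>"
recorded = sha256_hex(first_capture)

# The second replay differs by a single space character.
second_capture = b"<html><body>Archived page </body></html>"

print(fixity_unchanged(first_capture, recorded))
print(fixity_unchanged(second_capture, recorded))
```

Because any byte-level difference changes the digest, the check treats a trivial replay-time variation the same way as a substantive alteration.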


SmartCiteCon: Implicit Citation Context Extraction From Academic Literature Using Unsupervised Learning, Chenrui Gao, Haoran Cui, Li Zhang, Jiamin Wang, Wei Lu, Jian Wu Jan 2020

Computer Science Faculty Publications

We introduce SmartCiteCon (SCC), a Java API for extracting both explicit and implicit citation context from academic literature in English. The tool is built on a Support Vector Machine (SVM) model trained on a set of 7,058 manually annotated citation context sentences, curated from 34,000 papers in the ACL Anthology. The model with 19 features achieves F1=85.6%. SCC supports PDF, XML, and JSON files out of the box, provided that they conform to certain schemas. The API supports single document processing and batch processing in parallel. It takes about 12–45 seconds on average depending on the format to process a …


A Transformative Concept: From Data Being Passive Objects To Data Being Active Subjects, Hans-Peter Plag, Shelley-Ann Jules-Plag Dec 2019

OES Faculty Publications

The exploitation of potential societal benefits of Earth observations is hampered by users having to engage in often tedious processes to discover data and extract information and knowledge. A concept is introduced for a transition from the current perception of data as passive objects (DPO) to a new perception of data as active subjects (DAS). This transition would greatly increase data usage and exploitation, and support the extraction of knowledge from data products. Enabling the data subjects to actively reach out to potential users would revolutionize data dissemination and sharing and facilitate collaboration in user communities. The three core elements …


Document Classification In Support Of Automated Metadata Extraction From Heterogeneous Collections, Paul K. Flynn Apr 2014

Computer Science Theses & Dissertations

A number of federal agencies, universities, laboratories, and companies are placing their documents online and making them searchable via metadata fields such as author, title, and publishing organization. To enable this, every document in the collection must be catalogued using the metadata fields. Though time-consuming, the task of identifying metadata fields by inspecting the document is easy for a human. The visual cues in the formatting of the document, along with accumulated knowledge and intelligence, make it easy for a human to identify various metadata fields. Even with the best possible automated procedures, numerous sources of error exist, including …


A Method For Identifying Personalized Representations In Web Archives, Mat Kelly, Justin F. Brunelle, Michele C. Weigle, Michael L. Nelson Jan 2013

Computer Science Faculty Publications

Web resources are becoming increasingly personalized — two different users clicking on the same link at the same time can see content customized for each individual user. These changes result in multiple representations of a resource that cannot be canonicalized in Web archives. We identify characteristics of this problem by presenting a potential solution to generalize personalized representations in archives. We also present our proof-of-concept prototype that analyzes WARC (Web ARChive) format files, inserts metadata establishing relationships, and provides archive users the ability to navigate on the additional dimension of environment variables in a modified Wayback Machine.


Visualizing Digital Collections At Archive-It, Michele C. Weigle, Michael L. Nelson Dec 2012

Computer Science Presentations

PDF of a PowerPoint presentation from an Archive-It Partners Meeting in Annapolis, Maryland, December 3, 2012. Also available on Slideshare.


Tools For A Preservation-Ready Web, Joan A. Smith, Michael L. Nelson Jul 2008

Computer Science Presentations

PDF of a PowerPoint presentation from the National Digital Information Infrastructure and Preservation Program (NDIIPP) Partners Meeting, Washington, D.C., July 9, 2008. Also available on Slideshare.


Creating Preservation-Ready Web Resources, Joan A. Smith, Michael L. Nelson Jan 2008

Computer Science Faculty Publications

There are innumerable departmental, community, and personal web sites worthy of long-term preservation but proportionally fewer archivists available to properly prepare and process such sites. We propose a simple model for such everyday web sites that takes advantage of the web server itself to help prepare the site's resources for preservation. This is accomplished by having metadata utilities analyze the resource at the time of dissemination. The web server responds to the archiving repository crawler by sending both the resource and the just-in-time generated metadata as a straightforward XML-formatted response. We call this complex object (resource + metadata) a CRATE. …
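The idea of packaging a resource together with its just-in-time metadata in one XML response can be sketched as below. The element names (`crate`, `metadata`, `field`, `resource`), the attributes, and the base64 encoding are assumptions chosen for illustration; the abstract does not specify the actual CRATE schema.

```python
import base64
import xml.etree.ElementTree as ET


def build_crate(resource_bytes: bytes, content_type: str, metadata: dict) -> str:
    """Wrap a resource and its metadata in a single XML document (hypothetical layout)."""
    crate = ET.Element("crate")

    # Metadata generated at dissemination time, one element per field.
    meta_el = ET.SubElement(crate, "metadata")
    for name, value in metadata.items():
        field = ET.SubElement(meta_el, "field", {"name": name})
        field.text = value

    # The resource itself, base64-encoded so arbitrary bytes survive inside XML.
    res_el = ET.SubElement(
        crate, "resource", {"contentType": content_type, "encoding": "base64"}
    )
    res_el.text = base64.b64encode(resource_bytes).decode("ascii")

    return ET.tostring(crate, encoding="unicode")


# Hypothetical resource and metadata values.
page = b"<html><body>Department home page</body></html>"
crate_xml = build_crate(page, "text/html", {"size": str(len(page)), "utility": "example"})
print(crate_xml)
```

A harvesting crawler that receives such a response can parse the XML, decode the resource, and store the metadata alongside it without running its own analysis tools.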


Lightweight Federation Of Non-Cooperating Digital Libraries, Rong Shi Apr 2005

Computer Science Theses & Dissertations

This dissertation studies the challenges and issues faced in federating heterogeneous digital libraries (DLs). The objective of this research is to demonstrate the feasibility of interoperability among non-cooperating DLs by presenting a lightweight, data-driven approach, or Data Centered Interoperability (DCI). We build a Lightweight Federated Digital Library (LFDL) system to provide a federated search service for existing digital libraries with no prior coordination.

We describe the motivation, architecture, design and implementation of the LFDL. We develop, deploy, and evaluate key services of the federation. The major difference to existing DL interoperability approaches is one where we do not insist on …


Final Report For The Development Of The NASA Technical Report Server (NTRS), Michael L. Nelson Jan 2005

Computer Science Faculty Publications

The author performed a variety of research, development and consulting tasks for NASA Langley Research Center in the area of digital libraries (DLs) and supporting technologies, such as the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). In particular, the development focused on the NASA Technical Report Server (NTRS) and its transition from a distributed searching model to one that uses the OAI-PMH. The Open Archives Initiative (OAI) is an international consortium focused on furthering the interoperability of DLs through the use of "metadata harvesting". The OAI-PMH version of NTRS went into public production on April 28, 2003. Since that …


Lessons Learned With Arc, An OAI-PMH Service Provider, Xiaoming Liu, Kurt Maly, Michael L. Nelson Jan 2005

Computer Science Faculty Publications

Web-based digital libraries have historically been built in isolation utilizing different technologies, protocols, and metadata. These differences hindered the development of digital library services that enable users to discover information from multiple libraries through a single unified interface. The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a major, international effort to address technical interoperability among distributed repositories. Arc debuted in 2000 as the first end-user OAI-PMH service provider. Since that time, Arc has grown to include nearly 7,000,000 metadata records. Arc has been deployed in a number of environments and has served as the basis for many other …
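A service provider such as the one described above harvests metadata by issuing OAI-PMH requests to each repository. The sketch below only builds the request URLs for the protocol's `ListRecords` verb; the base URL is a placeholder, the function names are assumptions made for the example, and nothing here reflects how Arc itself is implemented.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an initial ListRecords request, optionally limited by date or set."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # selective harvesting by datestamp
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def build_resume_url(base_url, token):
    """Build a follow-up request; resumptionToken is sent without other arguments."""
    return f"{base_url}?{urlencode({'verb': 'ListRecords', 'resumptionToken': token})}"


# Placeholder endpoint, not a real repository.
BASE = "https://repository.example.org/oai"

first = build_list_records_url(BASE, from_date="2003-04-28")
follow_up = build_resume_url(BASE, "token123")
print(first)
print(follow_up)
```

A harvester would fetch the first URL, parse the returned XML, and keep requesting follow-up URLs for as long as the response carries a non-empty resumption token.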


Federated Searching Interface Techniques For Heterogeneous OAI Repositories, Xiaoming Liu, Kurt Maly, Mohammad Zubair, Qiaoling Hong, Michael L. Nelson, Frances Knudson, Irma Holtkamp Jan 2002

Computer Science Faculty Publications

Federating repositories by harvesting heterogeneous collections with varying degrees of metadata richness poses a number of challenging issues: (1) how to address the lack of uniform control for various metadata fields when building a rich unified search interface, and (2) how easily new collections and freshly harvested data in existing repositories can be incorporated into a federation that supports a unified interface. This paper focuses on the approaches taken to address these issues in Arc, an Open Archives Initiative compliant federated digital library. At present Arc contains over 1M metadata records from 75 data providers from various subject domains. …


A Scalable Architecture For Harvest-Based Digital Libraries, Xiaoming Liu, Tim Brody, Stevan Harnad, Les Carr, Kurt Maly, Mohammad Zubair, Michael L. Nelson Jan 2002

Computer Science Faculty Publications

This article discusses the requirements of current and emerging applications based on the Open Archives Initiative (OAI) and emphasizes the need for a common infrastructure to support them. Inspired by HTTP proxy, cache, gateway and web service concepts, a design for a scalable and reliable infrastructure that aims at satisfying these requirements is presented. Moreover, it is shown how various applications can exploit the services included in the proposed infrastructure. The article concludes by discussing the current status of several prototype implementations.