Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Articles 1 - 30 of 34

Full-Text Articles in Physical Sciences and Mathematics

Supporting Account-Based Queries For Archived Instagram Posts, Himarsha R. Jayanetti May 2023

Computer Science Theses & Dissertations

Social media has become one of the primary modes of communication in recent times, with popular platforms such as Facebook, Twitter, and Instagram leading the way. Despite its popularity, Instagram has not received as much attention in academic research compared to Facebook and Twitter, and its significant role in contemporary society is often overlooked. Web archives are making efforts to preserve social media content despite the challenges posed by the dynamic nature of these sites. The goal of our research is to facilitate the easy discovery of archived copies, or mementos, of all posts belonging to a specific Instagram account …
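
A minimal sketch of one way to enumerate such mementos, assuming the Internet Archive's public CDX API (the thesis's own query approach may differ); the account name and result limit are placeholders.

```python
import requests

# Sketch: list Wayback Machine captures whose URLs fall under a given Instagram
# account path, via the public CDX API. Account name and limit are placeholders.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

def mementos_for_account(account, limit=50):
    """Return (timestamp, original URL) pairs for captures under instagram.com/<account>/."""
    params = {
        "url": f"instagram.com/{account}/",
        "matchType": "prefix",   # match every capture whose URL starts with this path
        "output": "json",
        "limit": limit,
    }
    rows = requests.get(CDX_ENDPOINT, params=params, timeout=30).json()
    if not rows:
        return []
    header, records = rows[0], rows[1:]          # first row holds the field names
    ts, orig = header.index("timestamp"), header.index("original")
    return [(r[ts], r[orig]) for r in records]

for timestamp, url in mementos_for_account("nasa"):
    print(timestamp, url)
```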


Metaenhance: Metadata Quality Improvement For Electronic Theses And Dissertations, Muntabir H. Choudhury, Lamia Salsabil, Himarsha R. Jayanetti, Jian Wu Jan 2023

College of Sciences Posters

Metadata quality is crucial for digital objects to be discovered through digital library interfaces. Although digital library (DL) systems have adopted Dublin Core to standardize metadata formats (e.g., ETD-MS v1.11), the metadata of digital objects may contain incomplete, inconsistent, and incorrect values [1]. Most existing frameworks to improve metadata quality rely on crowdsourced correction approaches, e.g., [2]. Such methods are usually slow and biased toward documents that are more discoverable by users. Artificial intelligence (AI) based methods can be adopted to overcome this limit by automatically detecting, correcting, and canonicalizing the metadata, offering quick and unbiased treatment of document metadata. …
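
As an illustration of the kind of automated checks such a framework might run (not the METAEnhance pipeline itself), the sketch below flags incomplete, inconsistent, and ill-formed field values; the field names and rules are assumptions.

```python
import re

# Illustrative metadata-quality checks for one ETD record (field names assumed).
REQUIRED_FIELDS = ["title", "author", "degree", "year"]

def audit_record(record):
    """Return a list of human-readable problems found in one metadata record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not record.get(field, "").strip():
            problems.append(f"incomplete: missing '{field}'")
    year = record.get("year", "")
    if year and not re.fullmatch(r"(19|20)\d{2}", year):
        problems.append(f"incorrect: year '{year}' is not a plausible 4-digit year")
    author = record.get("author", "")
    if author and "," not in author:
        problems.append("inconsistent: author not in 'Last, First' form")
    return problems

print(audit_record({"title": "An Example ETD", "author": "Doe, Jane", "year": "20023"}))
```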


Machine Learning In Requirements Elicitation: A Literature Review, Cheligeer Cheligeer, Jingwei Huang, Guosong Wu, Nadia Bhuiyan, Yuan Xu, Yong Zeng Jan 2022

Engineering Management & Systems Engineering Faculty Publications

A growing trend in requirements elicitation is the use of machine learning (ML) techniques to automate the cumbersome requirement handling process. This literature review summarizes and analyzes studies that incorporate ML and natural language processing (NLP) into requirements elicitation. We answer the following research questions: (1) What requirement elicitation activities are supported by ML? (2) What data sources are used to build ML-based requirement solutions? (3) What technologies, algorithms, and tools are used to build ML-based requirement elicitation? (4) How can an ML-based requirements elicitation method be constructed? (5) What are the available tools to support ML-based requirements elicitation methodology? …


D-Lib Magazine Pioneered Web-Based Scholarly Communication, Michael L. Nelson, Herbert Van De Sompel Jan 2022

Computer Science Faculty Publications

The web began with a vision, as stated by Tim Berners-Lee in 1991, that "much academic information should be freely available to anyone". For many years, the development of the web and the development of digital libraries and other scholarly communications infrastructure proceeded in tandem. A milestone occurred in July 1995, when the first issue of D-Lib Magazine was published as an online, HTML-only, open access magazine, serving as the focal point for the then-emerging digital library research community. In 2017 it ceased publication, in part due to the maturity of the community it served as well as …


Automatic Metadata Extraction Incorporating Visual Features From Scanned Electronic Theses And Dissertations, Muntabir Hasan Choudhury, Himarsha R. Jayanetti, Jian Wu, William A. Ingram, Edward A. Fox Jan 2021

Computer Science Faculty Publications

Electronic Theses and Dissertations (ETDs) contain domain knowledge that can be used for many digital library tasks, such as analyzing citation networks and predicting research trends. Automatic metadata extraction is important to build scalable digital library search engines. Most existing methods are designed for born-digital documents, so they often fail to extract metadata from scanned documents such as ETDs. Traditional sequence tagging methods mainly rely on text-based features. In this paper, we propose a conditional random field (CRF) model that combines text-based and visual features. To verify the robustness of our model, we extended an existing corpus and created a …
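
The sketch below illustrates how text-based and visual (layout) features can be combined per token for a CRF sequence tagger; the sklearn-crfsuite library, feature names, and label scheme are illustrative choices, not necessarily the paper's implementation.

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

# Combine text-based and visual features per token for a CRF tagger (illustrative).
def token_features(tok):
    return {
        # text-based features
        "lower": tok["text"].lower(),
        "is_title_case": tok["text"].istitle(),
        "is_digit": tok["text"].isdigit(),
        # visual features from the scanned page layout
        "font_size": tok["font_size"],
        "is_bold": tok["bold"],
        "y_position": tok["y"] / tok["page_height"],   # normalized vertical position
    }

def featurize(pages):
    """pages: list of token sequences; each token is a dict of text + layout attributes."""
    return [[token_features(t) for t in page] for page in pages]

# Labels would be per-token tags such as B-title, I-author, O.
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
# crf.fit(featurize(train_pages), y_train)
# predicted = crf.predict(featurize(test_pages))
```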


A Heuristic Baseline Method For Metadata Extraction From Scanned Electronic Theses And Dissertations, Muntabir H. Choudhury, Jian Wu, William A. Ingram, Edward A. Fox Jan 2020

Computer Science Faculty Publications

Extracting metadata from scholarly papers is an important text mining problem. Widely used open-source tools such as GROBID are designed for born-digital scholarly papers but often fail for scanned documents, such as Electronic Theses and Dissertations (ETDs). Here we present a preliminary baseline work with a heuristic model to extract metadata from the cover pages of scanned ETDs. The process starts with converting scanned pages into images and then into text files by applying OCR tools. Then a series of carefully designed regular expressions is applied for each field, capturing patterns for seven metadata fields: titles, authors, years, degrees, academic programs, …
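
A minimal sketch of this regex-over-OCR approach; the patterns below are illustrative stand-ins for the paper's carefully designed expressions and cover only a few of the seven fields.

```python
import re

# Illustrative regex heuristics over OCR'd cover-page text (not the paper's exact rules).
YEAR_RE   = re.compile(r"\b(19|20)\d{2}\b")
DEGREE_RE = re.compile(r"\b(Doctor of Philosophy|Master of Science|Ph\.?D\.?|M\.?S\.?)\b", re.I)
BY_RE     = re.compile(r"^\s*by\s*$", re.I)

def extract_fields(cover_text):
    lines = [l.strip() for l in cover_text.splitlines() if l.strip()]
    fields = {"title": lines[0] if lines else None, "author": None, "year": None, "degree": None}
    for i, line in enumerate(lines):
        if BY_RE.match(line) and i + 1 < len(lines):
            fields["author"] = lines[i + 1]          # author usually follows a lone "by"
        if fields["year"] is None and (m := YEAR_RE.search(line)):
            fields["year"] = m.group(0)
        if fields["degree"] is None and (m := DEGREE_RE.search(line)):
            fields["degree"] = m.group(0)
    return fields

sample = "A STUDY OF THINGS\nby\nJane Q. Doe\nDoctor of Philosophy\nMay 1998"
print(extract_fields(sample))
```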


Client-Assisted Memento Aggregation Using The Prefer Header, Mat Kelly, Sawood Alam, Michael L. Nelson, Michele C. Weigle Jan 2018

Computer Science Faculty Publications

[First paragraph] Preservation of the Web ensures that future generations have a picture of how the web was. Web archives like Internet Archive's Wayback Machine, WebCite, and archive.is allow individuals to submit URIs to be archived, but the captures they preserve then reside at the archives. Traversing these captures in time as preserved by multiple archive sources (using Memento [8]) provides a more comprehensive picture of the past Web than relying on a single archive. Some content on the Web, such as content behind authentication, may be unsuitable or inaccessible for preservation by these organizations. Furthermore, this content may be …
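
The sketch below shows the general mechanism of a client expressing a preference to a Memento aggregator via the HTTP Prefer header (RFC 7240); the aggregator endpoint and the preference token are hypothetical placeholders, not the syntax proposed in the paper.

```python
import requests

# Sketch: request a Memento TimeMap while stating an aggregation preference.
# The endpoint and the "archives=..." token are hypothetical placeholders.
AGGREGATOR_TIMEMAP = "https://example.org/timemap/link/"     # hypothetical aggregator
target = "https://www.cnn.com/"

resp = requests.get(
    AGGREGATOR_TIMEMAP + target,
    headers={"Prefer": 'archives="ia archive.today"'},        # hypothetical preference
    timeout=30,
)
print(resp.status_code)
print(resp.headers.get("Preference-Applied"))   # RFC 7240 echo of honored preferences
print(resp.text[:500])                          # beginning of the link-format TimeMap
```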


Swimming In A Sea Of Javascript Or: How I Learned To Stop Worrying And Love High-Fidelity Replay, John A. Berlin, Michael L. Nelson, Michele C. Weigle Jan 2018

Computer Science Faculty Publications

[First paragraph] Preserving and replaying modern web pages in high fidelity has become an increasingly difficult task due to the increased usage of JavaScript. Reliance on server-side rewriting alone results in live-leakage and/or the inability to replay a page due to the preserved JavaScript performing an action not permissible from the archive. The current state-of-the-art high-fidelity archival preservation and replay solutions rely on handcrafted client-side URL rewriting libraries specifically tailored for the archive, namely Webrecorder's and Pywb's wombat.js [12]. Web archives not utilizing client-side rewriting rely on server-side rewriting that misses URLs used in a manner not accounted for …


Avoiding Zombies In Archival Replay Using Serviceworker, Sawood Alam, Mat Kelly, Michele C. Weigle, Michael L. Nelson Jan 2017

Computer Science Faculty Publications

[First paragraph] A Composite Memento is an archived representation of a web page with all the page requisites such as images and stylesheets. All embedded resources have their own URIs, hence, they are archived independently. For a meaningful archival replay, it is important to load all the page requisites from the archive within the temporal neighborhood of the base HTML page. To achieve this goal, archival replay systems try to rewrite all the resource references to appropriate archived versions before serving HTML, CSS, or JS. However, an effective server-side URL rewriting is difficult when URLs are generated dynamically using JavaScript. …


Scripts In A Frame: A Framework For Archiving Deferred Representations, Justin F. Brunelle Apr 2016

Computer Science Theses & Dissertations

Web archives provide a view of the Web as seen by Web crawlers. Because of rapid advancements and adoption of client-side technologies like JavaScript and Ajax, coupled with the inability of crawlers to execute these technologies effectively, Web resources become harder to archive as they become more interactive. At Web scale, we cannot capture client-side representations using the current state-of-the-art toolsets because of the migration from Web pages to Web applications. Web applications increasingly rely on JavaScript and other client-side programming languages to load embedded resources and change client-side state. We demonstrate that Web crawlers and other automatic archival …
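
As a rough illustration of capturing such deferred representations, the sketch below lets a headless browser execute a page's JavaScript before saving the resulting DOM; the tooling shown (Selenium with headless Chrome) is an assumption, not the dissertation's framework.

```python
import time
from selenium import webdriver

# Sketch: let client-side scripts run in a headless browser, then save the resulting
# DOM. Tooling (Selenium + headless Chrome) and the 5-second wait are assumptions.
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/")
    time.sleep(5)                        # crude wait for client-side scripts to finish
    rendered_html = driver.page_source   # DOM after JavaScript execution
    with open("deferred_representation.html", "w", encoding="utf-8") as f:
        f.write(rendered_html)
finally:
    driver.quit()
```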


Moved But Not Gone: An Evaluation Of Real-Time Methods For Discovering Replacement Web Pages, Martin Klein, Michael L. Nelson Jan 2014

Computer Science Faculty Publications

Inaccessible Web pages and 404 “Page Not Found” responses are a common Web phenomenon and a detriment to the user’s browsing experience. The rediscovery of missing Web pages is, therefore, a relevant research topic in the digital preservation as well as in the Information Retrieval realm. In this article, we bring these two areas together by analyzing four content- and link-based methods to rediscover missing Web pages. We investigate the retrieval performance of the methods individually as well as their combinations and give an insight into how effective these methods are over time. As the main result of this work, …
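
One of the content-based ideas can be sketched as a "lexical signature": the page's top-weighted TF-IDF terms, usable as a search-engine query for the missing page. The background corpus, signature length, and scoring below are illustrative, not the article's exact method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sketch: derive a short "lexical signature" (top-weighted TF-IDF terms) for a page,
# to be issued as a search-engine query. Corpus and signature length are illustrative.
def lexical_signature(target_text, background_corpus, k=5):
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(background_corpus + [target_text])
    weights = matrix[len(background_corpus)].toarray().ravel()   # TF-IDF row of the target page
    terms = vectorizer.get_feature_names_out()
    top = weights.argsort()[::-1][:k]
    return " ".join(terms[i] for i in top)

corpus = ["web archiving and digital preservation",
          "search engines index the public web",
          "information retrieval evaluation measures"]
page = "memento aggregation helps rediscover missing archived web pages via lexical signatures"
print(lexical_signature(page, corpus))   # e.g. a 5-term query string
```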


Integrating Preservation Functions Into The Web Server, Joan A. Smith Jul 2008

Computer Science Theses & Dissertations

Digital preservation of the World Wide Web poses unique challenges, different from the preservation issues facing professional Digital Libraries. The complete list of a website’s resources cannot be cited with confidence, and the descriptive metadata available for the resources is so minimal that it is sometimes insufficient for a browser to recognize. In short, the Web suffers from a counting problem and a representation problem. Refreshing the bits, migrating from an obsolete file format to a newer format, and other classic digital preservation problems also affect the Web. As digital collections devise solutions to these problems, the Web will also benefit. But the core …


Creating Preservation-Ready Web Resources, Joan A. Smith, Michael L. Nelson Jan 2008

Computer Science Faculty Publications

There are innumerable departmental, community, and personal web sites worthy of long-term preservation but proportionally fewer archivists available to properly prepare and process such sites. We propose a simple model for such everyday web sites which takes advantage of the web server itself to help prepare the site's resources for preservation. This is accomplished by having metadata utilities analyze the resource at the time of dissemination. The web server responds to the archiving repository crawler by sending both the resource and the just-in-time generated metadata as a straight-forward XML-formatted response. We call this complex object (resource + metadata) a CRATE. …
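
A toy sketch of packaging a resource with its just-in-time metadata as one XML response, in the spirit of a CRATE; the element names and digest choice are assumptions, not the actual CRATE schema.

```python
import base64
import hashlib
import xml.etree.ElementTree as ET

# Sketch: bundle a resource and generated metadata into one XML complex object.
# Element names are illustrative, not the actual CRATE schema.
def build_crate(resource_bytes, uri, mime_type):
    root = ET.Element("crate", {"uri": uri})
    meta = ET.SubElement(root, "metadata")
    ET.SubElement(meta, "content-type").text = mime_type
    ET.SubElement(meta, "length").text = str(len(resource_bytes))
    ET.SubElement(meta, "md5").text = hashlib.md5(resource_bytes).hexdigest()
    body = ET.SubElement(root, "resource", {"encoding": "base64"})
    body.text = base64.b64encode(resource_bytes).decode("ascii")
    return ET.tostring(root, encoding="unicode")

print(build_crate(b"<html><body>hello</body></html>",
                  "http://www.example.edu/index.html", "text/html"))
```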


Fedcor: An Institutional Cordra Registry, Giridhar Manepalli, Henry Jerez, Michael L. Nelson Jan 2006

Computer Science Faculty Publications

FeDCOR (Federation of DSpace using CORDRA) is a registry-based federation system for DSpace instances. It is based on the CORDRA model. The first article in this issue of D-Lib Magazine describes the Advanced Distributed Learning-Registry (ADL-R) [1], which is the first operational CORDRA registry, and also includes an introduction to CORDRA. That introduction, or other prior knowledge of the CORDRA effort, is recommended for the best understanding of this article, which builds on that base to describe in detail the FeDCOR approach.


Synchronization And Multiple Group Server Support For Kepler, K. Maly, M. Zubair, H. Siripuram, S. Zunjarwad, Yannis Manolopoulos (Ed.), Joaquim Filipe (Ed.), Panos Constantopoulos (Ed.), José Cordeiro (Ed.) Jan 2006

Computer Science Faculty Publications

In the last decade literally thousands of digital libraries have emerged, but one of the biggest obstacles for dissemination of information to a user community is that many digital libraries use different, proprietary technologies that inhibit interoperability. The Kepler framework addresses interoperability and gives publication control to individual publishers. In Kepler, OAI-PMH is used to support "personal data providers" or "archivelets". In our vision, individual publishers can be integrated with an institutional repository like DSpace by means of a Kepler Group Digital Library (GDL). The GDL aggregates metadata and full text from archivelets and can act as an OAI-compliant data provider …


Lightweight Federation Of Non-Cooperating Digital Libraries, Rong Shi Apr 2005

Computer Science Theses & Dissertations

This dissertation studies the challenges and issues faced in federating heterogeneous digital libraries (DLs). The objective of this research is to demonstrate the feasibility of interoperability among non-cooperating DLs by presenting a lightweight, data-driven approach, or Data Centered Interoperability (DCI). We build a Lightweight Federated Digital Library (LFDL) system to provide a federated search service for existing digital libraries with no prior coordination.

We describe the motivation, architecture, design and implementation of the LFDL. We develop, deploy, and evaluate key services of the federation. The major difference to existing DL interoperability approaches is one where we do not insist on …


Final Report For The Development Of The Nasa Technical Report Server (Ntrs), Michael L. Nelson Jan 2005

Computer Science Faculty Publications

The author performed a variety of research, development and consulting tasks for NASA Langley Research Center in the area of digital libraries (DLs) and supporting technologies, such as the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). In particular, the development focused on the NASA Technical Report Server (NTRS) and its transition from a distributed searching model to one that uses the OAI-PMH. The Open Archives Initiative (OAI) is an international consortium focused on furthering the interoperability of DLs through the use of "metadata harvesting". The OAI-PMH version of NTRS went into public production on April 28, 2003. Since that …


Lessons Learned With Arc, An Oai-Pmh Service Provider, Xiaoming Liu, Kurt Maly, Michael L. Nelson Jan 2005

Computer Science Faculty Publications

Web-based digital libraries have historically been built in isolation utilizing different technologies, protocols, and metadata. These differences hindered the development of digital library services that enable users to discover information from multiple libraries through a single unified interface. The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a major, international effort to address technical interoperability among distributed repositories. Arc debuted in 2000 as the first end-user OAI-PMH service provider. Since that time, Arc has grown to include nearly 7,000,000 metadata records. Arc has been deployed in a number of environments and has served as the basis for many other …


Recommender Systems For Multimedia Libraries: An Evaluation Of Different Models For Datamining Usage Data, Raquel Oliveira Araujo Dec 2004

Computer Science Theses & Dissertations

Many recommender systems exist today to help users deal with the large growth in the amount of information available on the Internet. Most of these recommender systems use collaborative filtering or content-based techniques to present new material that would be of interest to a user. While these methods have proven to be effective, they have not been designed specifically for multimedia collections. In this study we present a new method to find recommendations that is not dependent on traditional Information Retrieval (IR) methods and compare it to algorithms that do rely on traditional IR methods. We evaluated these algorithms using …
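
For contrast with the thesis's approach, a generic usage-based baseline (not the new method proposed in the dissertation) can be sketched as item co-occurrence within sessions mined from server logs.

```python
from collections import defaultdict
from itertools import combinations

# Generic baseline sketch: recommend items that frequently co-occur with a given item
# in the same usage sessions. Session data below is invented for illustration.
def cooccurrence_counts(sessions):
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for a, b in combinations(set(session), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def recommend(item, counts, n=3):
    return sorted(counts[item], key=counts[item].get, reverse=True)[:n]

sessions = [["lecture1.mp4", "slides1.pdf", "lecture2.mp4"],
            ["lecture1.mp4", "lecture2.mp4"],
            ["slides1.pdf", "notes1.txt"]]
print(recommend("lecture1.mp4", cooccurrence_counts(sessions)))
```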


Metadata And Buckets In The Smart Object, Dumb Archive (Soda) Model, Michael L. Nelson, Kurt Maly, Delwin R. Croom Jr., Steven W. Robbins Jan 2004

Computer Science Faculty Publications

We present the Smart Object, Dumb Archive (SODA) model for digital libraries (DLs), and discuss the role of metadata in SODA. The premise of the SODA model is to "push down" many of the functionalities generally associated with archives into the data objects themselves. Thus the data objects become "smarter", and the archives "dumber". In the SODA model, archives become primarily set managers, and the objects themselves negotiate and handle presentation, enforce terms and conditions, and perform data content management. Buckets are our implementation of smart objects, and da is our reference implementation for dumb archives. We also present our …
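
A toy sketch of the smart-object/dumb-archive split, with the object carrying metadata, terms and conditions, and presentation while the archive only manages a set; class and method names are hypothetical, not the buckets or da interfaces.

```python
# Toy sketch of the SODA partitioning; names are hypothetical, not the buckets/da APIs.
class SmartObject:
    def __init__(self, identifier, metadata, payload):
        self.identifier = identifier
        self.metadata = metadata          # e.g. a Dublin Core-style dict
        self.payload = payload

    def enforce_terms(self, requester):
        return self.metadata.get("access", "public") == "public" or requester == "admin"

    def present(self, requester):
        if not self.enforce_terms(requester):
            return "403: terms and conditions not satisfied"
        return f"<h1>{self.metadata['title']}</h1>\n{self.payload}"


class DumbArchive:
    """Primarily a set manager: put, get, list."""
    def __init__(self):
        self._objects = {}

    def put(self, obj):
        self._objects[obj.identifier] = obj

    def get(self, identifier):
        return self._objects.get(identifier)

    def list(self):
        return list(self._objects)


archive = DumbArchive()
archive.put(SmartObject("obj-1", {"title": "A Report", "access": "public"}, "body text"))
print(archive.get("obj-1").present(requester="anonymous"))
```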


Event Based Retrieval From Digital Libraries Containing Data Streams, Mohamed Hamed Kholief Jul 2003

Computer Science Theses & Dissertations

The objective of this research is to study the issues involved in building a digital library that contains data streams and allows event-based retrieval. “Digital Libraries are storehouses of information available through the Internet that provide ways to collect, store, and organize data and make it accessible for search, retrieval, and processing” [29]. Data streams are sources of information for applications such as news-on-demand, weather services, and scientific research, to name a few. A data stream is a sequence of data units produced over a period of time. Examples of data streams are video streams, audio streams, and sensor readings. …


Report On The Third Acm/Ieee Joint Conference On Digital Libraries (Jcdl), Michael L. Nelson Jan 2003

Computer Science Faculty Publications

The Third ACM/IEEE Joint Conference on Digital Libraries (JCDL 2003) was held on the campus of Rice University in Houston, Texas, May 27 - 31. Regarding the merging of the ACM and IEEE conference series, in the JCDL 2002 conference report published last year in D-Lib Magazine Edie Rasmussen noted, "Perhaps by next year...no one will remember that it wasn't always so" [1]. Judging by the number of participants I met who did not know that the ACM and IEEE used to hold separate digital library conferences, Rasmussen's prediction has come to pass.


Federating Heterogeneous Digital Libraries By Metadata Harvesting, Xiaoming Liu Jan 2002

Computer Science Theses & Dissertations

This dissertation studies the challenges and issues faced in federating heterogeneous digital libraries (DLs) by metadata harvesting. The objective of federation is to provide high-level services (e.g. transparent search across all DLs) on the collective metadata from different digital libraries. There are two main approaches to federating DLs: the distributed searching approach and the harvesting approach. As the distributed searching approach relies on executing queries against digital libraries in real time, it has problems with scalability. The difficulty of creating a distributed searching service for a large federation is the motivation behind the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). OAI-PMH supports …


A Scalable Architecture For Harvest-Based Digital Libraries, Xiaoming Liu, Tim Brody, Stevan Harnad, Les Carr, Kurt Maly, Mohammad Zubair, Michael L. Nelson Jan 2002

Computer Science Faculty Publications

This article discusses the requirements of current and emerging applications based on the Open Archives Initiative (OAI) and emphasizes the need for a common infrastructure to support them. Inspired by HTTP proxy, cache, gateway and web service concepts, a design for a scalable and reliable infrastructure that aims at satisfying these requirements is presented. Moreover, it is shown how various applications can exploit the services included in the proposed infrastructure. The article concludes by discussing the current status of several prototype implementations.


Object Persistence And Availability In Digital Libraries, Michael L. Nelson, B. Danette Allen Jan 2002

Computer Science Faculty Publications

We have studied object persistence and availability of 1,000 digital library (DL) objects. Twenty World Wide Web accessible DLs were chosen and from each DL, 50 objects were chosen at random. A script checked the availability of each object three times a week for just over 1 year for a total of 161 data samples. During this time span, we found 31 objects (3% of the total) that appear to no longer be available: 24 from PubMed Central, 5 from IDEAS, 1 from CogPrints, and 1 from ETD.
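
The kind of periodic availability check described above can be sketched as follows; the HEAD-request approach, CSV log, and example URIs are assumptions, and the three-times-a-week schedule would come from an external scheduler such as cron.

```python
import csv
import datetime
import requests

# Sketch: check each object URI and log its HTTP status with a timestamp.
def check_objects(uris, log_path="availability_log.csv"):
    with open(log_path, "a", newline="") as f:
        writer = csv.writer(f)
        for uri in uris:
            try:
                status = requests.head(uri, allow_redirects=True, timeout=30).status_code
            except requests.RequestException as exc:
                status = f"error: {exc.__class__.__name__}"
            writer.writerow([datetime.datetime.now(datetime.timezone.utc).isoformat(), uri, status])

# Example URIs are placeholders, not objects from the study.
check_objects(["https://pubmed.ncbi.nlm.nih.gov/", "https://example.org/missing-object"])
```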


Arc - An Oai Service Provider For Digital Library Federation, Xiaoming Liu, Kurt Maly, Mohammad Zubair, Michael L. Nelson Jan 2001

Computer Science Faculty Publications

The usefulness of the many on-line journals and scientific digital libraries that exist today is limited by the inability to federate these resources through a unified interface. The Open Archives Initiative (OAI) is one major effort to address technical interoperability among distributed archives. The objective of OAI is to develop a framework to facilitate the discovery of content in distributed archives. In this paper, we describe our experience and lessons learned in building Arc, the first federated searching service based on the OAI protocol. Arc harvests metadata from several OAI compliant archives, normalizes them, and stores them in a search …
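
A minimal sketch of OAI-PMH harvesting as a service provider performs it: issue ListRecords requests, pull Dublin Core titles and record identifiers, and follow resumption tokens. The repository base URL is a placeholder.

```python
import requests
import xml.etree.ElementTree as ET

# Sketch of an OAI-PMH ListRecords harvest with resumption-token paging.
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC  = "{http://purl.org/dc/elements/1.1/}"

def harvest(base_url, metadata_prefix="oai_dc"):
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(requests.get(base_url, params=params, timeout=60).content)
        for record in root.iter(f"{OAI}record"):
            ids = [i.text for i in record.iter(f"{OAI}identifier")]
            titles = [t.text for t in record.iter(f"{DC}title")]
            yield ids[0] if ids else None, titles[0] if titles else None
        token = root.find(f"{OAI}ListRecords/{OAI}resumptionToken")
        if token is None or not (token.text or "").strip():
            break
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

for identifier, title in harvest("https://example.edu/oai2"):   # placeholder endpoint
    print(identifier, "-", title)
```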


Smart Objects And Open Archives, Michael L. Nelson, Kurt Maly Jan 2001

Computer Science Faculty Publications

Within the context of digital libraries (DLs), we are making information objects "first-class citizens". We decouple information objects from the systems used for their storage and retrieval, allowing the technology for both DLs and information content to progress independently. We believe dismantling the stovepipe of "DL-archive-content" is the first step in building richer DL experiences for users and ensuring the long-term survivability of digital information. To demonstrate this partitioning between DLs, archives and information content, we introduce "buckets": aggregative, intelligent, object-oriented constructs for publishing in digital libraries. Buckets exist within the "Smart Object, Dumb Archive" (SODA) DL model, which promotes …


Buckets: Smart Objects For Digital Libraries, Michael L. Nelson Jan 2001

Computer Science Faculty Publications

Current discussion of digital libraries (DLs) is often dominated by the merits of the respective storage, search and retrieval functionality of archives, repositories, search engines, search interfaces and database systems. While these technologies are necessary for information management, the information content is more important than the systems used for its storage and retrieval. Digital information should have the same long-term survivability prospects as traditional hardcopy information and should be protected to the extent possible from evolving search engine technologies and vendor vagaries in database management systems. Information content and information retrieval systems should progress on independent paths and make limited …


Buckets: Smart Objects For Digital Libraries, Michael L. Nelson Jul 2000

Computer Science Theses & Dissertations

Discussion of digital libraries (DLs) is often dominated by the merits of various archives, repositories, search engines, search interfaces and database systems. While these technologies are necessary for information management, information content and information retrieval systems should progress on independent paths and each should make limited assumptions about the status or capabilities of the other. Information content is more important than the systems used for its storage and retrieval. Digital information should have the same long-term survivability prospects as traditional hardcopy information and should not be impacted by evolving search engine technologies or vendor vagaries in database management systems.

Digital …


The Ups Prototype: An Experimental End-User Service Across E-Print Archives, Herbert Van De Sompel, Thomas Krichel, Michael L. Nelson, Patrick Hochstenbach, Victor Lyapunov, Kurt Maly, Mohammad Zubair, Mohamed Kholief, Xiaoming Liu, Heath O'Connell Jan 2000

Computer Science Faculty Publications

A meeting was held in Santa Fe, New Mexico, October 21-22, 1999, to generate discussion and consensus about interoperability of publicly available scholarly information archives. The invitees represented several well known e-print and report archive initiatives, as well as organizations with interests in digital libraries and the transformation of scholarly communication. The central goal of the meeting was to agree on recommendations that would make the creation of end-user services -- such as scientific search engines and linking systems -- for data originating from distributed and dissimilar archives easier. The Universal Preprint Service (UPS) Prototype was developed in preparation for …