Open Access. Powered by Scholars. Published by Universities.®


Old Dominion University · Computer Sciences · Digital preservation


Articles 1–6 of 6

Full-Text Articles in Physical Sciences and Mathematics

Avoiding Zombies In Archival Replay Using ServiceWorker, Sawood Alam, Mat Kelly, Michele C. Weigle, Michael L. Nelson Jan 2017

Computer Science Faculty Publications

A Composite Memento is an archived representation of a web page together with all of its page requisites, such as images and stylesheets. Each embedded resource has its own URI and is therefore archived independently. For a meaningful archival replay, it is important to load all the page requisites from the archive within the temporal neighborhood of the base HTML page. To achieve this goal, archival replay systems try to rewrite all resource references to appropriate archived versions before serving HTML, CSS, or JavaScript. However, effective server-side URL rewriting is difficult when URLs are generated dynamically by JavaScript. …
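The rewriting the abstract describes can be illustrated with a small client-side sketch. Everything below is illustrative: the function name, archive prefix, and datetime format are assumptions, not the paper's actual implementation. The point of doing this in a ServiceWorker is that its fetch handler sees every outgoing request, including those whose URLs were constructed dynamically by JavaScript.

```typescript
// Minimal sketch of archive-style URL rewriting, the kind of logic a
// ServiceWorker fetch handler could apply on the client side. Names and
// the <prefix>/<datetime>/<url> layout are illustrative assumptions.

/** Rewrite an absolute URL into an archived form: <prefix>/<datetime>/<url>. */
function rewriteToArchive(url: string, archivePrefix: string, datetime: string): string {
  // Leave already-rewritten URLs alone to avoid double-prefixing; "zombies"
  // arise when live-web URLs leak past a check like this.
  if (url.startsWith(archivePrefix)) {
    return url;
  }
  return `${archivePrefix}/${datetime}/${url}`;
}

// Inside a ServiceWorker, the same logic would run in a fetch handler, e.g.:
//
//   self.addEventListener("fetch", (event) => {
//     const rewritten = rewriteToArchive(event.request.url, PREFIX, DATETIME);
//     event.respondWith(fetch(rewritten));
//   });
//
// so even URLs generated at runtime are intercepted before leaving the page.

console.log(rewriteToArchive("https://example.com/img/logo.png",
                             "https://archive.example.org/web", "20170101000000"));
```

Because the check runs at request time rather than at page-serving time, it catches references that server-side rewriting cannot see.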


Moved But Not Gone: An Evaluation Of Real-Time Methods For Discovering Replacement Web Pages, Martin Klein, Michael L. Nelson Jan 2014

Computer Science Faculty Publications

Inaccessible Web pages and 404 “Page Not Found” responses are a common Web phenomenon and a detriment to the user’s browsing experience. The rediscovery of missing Web pages is, therefore, a relevant research topic in both the digital preservation and the information retrieval realms. In this article, we bring these two areas together by analyzing four content- and link-based methods to rediscover missing Web pages. We investigate the retrieval performance of the methods individually as well as in combination, and give insight into how effective these methods are over time. As the main result of this work, …
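One well-known content-based rediscovery method from this line of work is the "lexical signature": a short list of the page's most characteristic terms, extracted from an old copy and submitted to a search engine as a query. The sketch below is a hedged illustration only; the function name, stopword list, signature length, and pure term-frequency ranking are simplifying assumptions, not the paper's exact procedure (which evaluates several methods and their combinations).

```typescript
// Hedged sketch of lexical-signature generation for page rediscovery.
// Stopword list and ranking are deliberately simplistic assumptions.

const STOPWORDS = new Set(["the", "a", "an", "and", "or", "of", "to", "in", "is", "for"]);

/** Return the k most frequent non-stopword terms of a document, as a query string. */
function lexicalSignature(text: string, k: number): string {
  const counts = new Map<string, number>();
  for (const raw of text.toLowerCase().split(/[^a-z0-9]+/)) {
    if (raw.length < 3 || STOPWORDS.has(raw)) continue;
    counts.set(raw, (counts.get(raw) ?? 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0])) // frequency, then alphabetical
    .slice(0, k)
    .map(([term]) => term)
    .join(" ");
}

// The resulting query would be submitted to a search engine; a high-ranked
// result is a candidate replacement for the missing page.
console.log(lexicalSignature("digital preservation of the web preservation digital web web", 2));
```

Real lexical-signature work typically weights terms by TF-IDF against a background corpus rather than raw frequency; the raw-count version here just shows the shape of the technique.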


Factors Affecting Website Reconstruction From The Web Infrastructure, Frank Mccown, Norou Diawara, Michael L. Nelson Jun 2007

Computer Science Faculty Publications

When a website is suddenly lost without a backup, it may be reconstituted by probing web archives and search engine caches for missing content. In this paper we describe an experiment where we crawled and reconstructed 300 randomly selected websites on a weekly basis for 14 weeks. The reconstructions were performed using our web-repository crawler named Warrick which recovers missing resources from the Web Infrastructure (WI), the collective preservation effort of web archives and search engine caches. We examine several characteristics of the websites over time including birth rate, decay and age of resources. We evaluate the reconstructions when compared …
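The "Web Infrastructure" lookup that Warrick performs can be sketched as expanding a lost URL into candidate repository lookup URLs and trying them in order. This is an illustration under stated assumptions: the Wayback Machine URL pattern is real, but the search-engine cache endpoint shown is a placeholder, not an actual API, and Warrick's real recovery logic is more involved.

```typescript
// Sketch of deriving Web Infrastructure (WI) lookup URLs for a lost resource.
// Only the Wayback Machine pattern is real; the cache endpoint is hypothetical.

/** Expand a lost URL into candidate repository lookup URLs, tried in order. */
function wiCandidates(lostUrl: string): string[] {
  const templates = [
    // Internet Archive Wayback Machine: latest capture of the URL.
    (u: string) => `https://web.archive.org/web/${u}`,
    // Hypothetical search-engine cache endpoint, for illustration only.
    (u: string) => `https://cache.searchengine.example/lookup?url=${encodeURIComponent(u)}`,
  ];
  return templates.map((t) => t(lostUrl));
}

// A recovery loop would fetch each candidate until one returns the resource,
// preferring whichever repository holds the freshest or most canonical copy.
console.log(wiCandidates("http://example.com/page.html"));
```

Which repository "wins" for a given resource matters for the reconstruction quality the paper evaluates, since archives and caches differ in coverage, freshness, and fidelity (e.g., caches may hold only HTML-converted copies).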


Brass: A Queueing Manager For Warrick, Frank Mccown, Amine Benjelloun, Michael L. Nelson Jan 2007

Computer Science Faculty Publications

When an individual loses their website and a backup cannot be found, they can download and run Warrick, a web-repository crawler which will recover their lost website by crawling the holdings of the Internet Archive and several search engine caches. Running Warrick locally requires some technical know-how, so we have created an online queueing system called Brass which simplifies the task of recovering lost websites. We discuss the technical aspects of reconstructing websites and the implementation of Brass. Our newly developed system allows anyone to recover a lost website with a few mouse clicks and allows us to track which …


Crate: A Simple Model For Self-Describing Web Resources, Joan A. Smith, Michael L. Nelson Jan 2007

Computer Science Faculty Publications

If not for the Internet Archive’s efforts to store periodic snapshots of the web, many sites would not have any preservation prospects at all. The barrier to entry is too high for everyday web sites, which may have skilled webmasters managing them, but which lack skilled archivists to preserve them. Digital preservation is not easy. One problem is the complexity of preservation models, which have specific metadata and structural requirements. Another problem is the time and effort it takes to properly prepare digital resources for preservation in the chosen model. In this paper, we propose a simple preservation model called …


Observed Web Robot Behavior On Decaying Web Subsites, Joan A. Smith, Frank Mccown, Michael L. Nelson Jan 2006

Computer Science Faculty Publications

We describe the observed crawling patterns of various search engines (including Google, Yahoo, and MSN) as they traverse a series of web subsites whose contents decay at predetermined rates. We plot the progress of the crawlers through the subsites and their behaviors regarding the various file types included in the web subsites. We chose decaying subsites because we were originally interested in tracking the implications of using search engine caches for digital preservation. However, some of the crawling behaviors themselves proved to be interesting and have implications for using a search engine as an interface to a digital library.