Open Access. Powered by Scholars. Published by Universities.®

Computer Sciences Commons

Articles 1 - 9 of 9

Full-Text Articles in Computer Sciences

Assessing The Prevalence And Archival Rate Of URIs To Git Hosting Platforms In Scholarly Publications, Emily Escamilla Aug 2023

Computer Science Theses & Dissertations

The definition of scholarly content has expanded to include the data and source code that contribute to a publication. While major archiving efforts to preserve conventional scholarly content, typically in PDFs (e.g., LOCKSS, CLOCKSS, Portico), are underway, no analogous effort has yet emerged to preserve the data and code referenced in those PDFs, particularly the scholarly code hosted online on Git Hosting Platforms (GHPs). Similarly, Software Heritage is working to archive public source code, but there is value in archiving the surrounding ephemera that provide important context to the code while maintaining their original URIs. In current implementations, source code …
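For a sense of how such prevalence might be measured, here is a minimal sketch that scans publication text for URIs to common GHPs. The platform list and regular expression are illustrative assumptions for this sketch, not the methodology of the thesis itself.

```python
import re

# Illustrative assumption: treat these four platforms as the GHPs of
# interest; the pattern is a simplification for demonstration only.
GHP_PATTERN = re.compile(
    r'https?://(?:www\.)?'
    r'(?:github\.com|gitlab\.com|bitbucket\.org|sourceforge\.net)'
    r'/[^\s<>")\]]+',
    re.IGNORECASE,
)

def extract_ghp_uris(text: str) -> list[str]:
    """Return all GHP URIs found in a block of publication text."""
    return [m.group(0) for m in GHP_PATTERN.finditer(text)]

sample = "Our code is at https://github.com/example/repo and is mirrored."
print(extract_ghp_uris(sample))  # ['https://github.com/example/repo']
```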


Hashes Are Not Suitable To Verify Fixity Of The Public Archived Web, Mohamed Aturban, Martin Klein, Herbert Van De Sompel, Sawood Alam, Michael L. Nelson, Michele C. Weigle Jan 2023

Computer Science Faculty Publications

Web archives, such as the Internet Archive, preserve the web and allow access to prior states of web pages. We implicitly trust their versions of archived pages, but as their role moves from preserving curios of the past to facilitating present-day adjudication, we are concerned with verifying the fixity of archived web pages, or mementos, to ensure they have always remained unaltered. A widely used technique in digital preservation to verify the fixity of an archived resource is to periodically compute a cryptographic hash value on a resource and then compare it with a previous hash value. If the …
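The technique described above is easy to state in code. A minimal sketch, assuming SHA-256 and a digest recorded at some earlier time; the paper's argument is that, for the public archived web, replay-time transformations make even this simple scheme unreliable.

```python
import hashlib

def fixity_digest(content: bytes) -> str:
    """Compute a SHA-256 digest over a memento's raw bytes."""
    return hashlib.sha256(content).hexdigest()

def verify_fixity(content: bytes, recorded_digest: str) -> bool:
    # A mismatch suggests the archived resource changed since the digest
    # was recorded; though, as the paper argues, replay transformations
    # can also cause mismatches on mementos that were never altered.
    return fixity_digest(content) == recorded_digest
```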


Legal And Technical Issues For Text And Data Mining In Greece, Maria Kanellopoulou - Botti, Marinos Papadopoulos, Christos Zampakolas, Paraskevi Ganatsiou May 2019

Computer Ethics - Philosophical Enquiry (CEPE) Proceedings

Web harvesting and archiving pertain to the processes of collecting works that reside on the Web and archiving them. Web harvesting and archiving is one of the most attractive applications for libraries that plan ahead for their future operation. When works retrieved from the Web are turned into archived and documented material to be found in a library, the number of works that can be found in said library can be far greater than the number of works harvested from the Web. The proposed participation in the 2019 CEPE Conference aims at presenting certain issues related to …


To Relive The Web: A Framework For The Transformation And Archival Replay Of Web Pages, John Andrew Berlin Apr 2018

Computer Science Theses & Dissertations

When replaying an archived web page (known as a memento), the fundamental expectation is that the page should be viewable and function exactly as it did at archival time. However, meeting this expectation requires web archives to modify the page and its embedded resources so that they no longer reference (link to) the original server(s) they were archived from, but instead reference the archive. Although these modifications necessarily change the state of the representation, it is understood that without them the replay of mementos from the archive would not be possible. Unfortunately, because the replay of mementos and the modifications made …
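To make the modification concrete, here is a minimal sketch of the kind of rewriting involved, assuming a Wayback-style URI scheme (archive host plus timestamp plus original URI). Real replay systems rewrite far more than this (relative links, JavaScript, CSS, headers); the prefix and attributes below are simplifying assumptions.

```python
import re

# Assumed Wayback-style prefix: archive host plus a 14-digit timestamp.
ARCHIVE_PREFIX = "https://web.archive.org/web/20180401000000/"

def rewrite_links(html: str) -> str:
    """Naively rewrite absolute href/src attributes so embedded
    resources resolve against the archive, not the live web."""
    def repl(match: re.Match) -> str:
        attr, url = match.group(1), match.group(2)
        return f'{attr}="{ARCHIVE_PREFIX}{url}"'
    return re.sub(r'(href|src)="(https?://[^"]+)"', repl, html)

page = '<a href="http://example.com/page.html">a link</a>'
print(rewrite_links(page))
```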


Leveraging Heritrix And The Wayback Machine On A Corporate Intranet: A Case Study On Improving Corporate Archives, Justin F. Brunelle, Krista Ferrante, Eliot Wilczek, Michele C. Weigle, Michael L. Nelson Jan 2016

Computer Science Faculty Publications

In this work, we present a case study in which we investigate using open-source, web-scale web archiving tools (i.e., Heritrix and the Wayback Machine installed on the MITRE Intranet) to automatically archive a corporate Intranet. We use this case study to outline the challenges of Intranet web archiving, identify situations in which the open source tools are not well suited for the needs of the corporate archivists, and make recommendations for future corporate archivists wishing to use such tools. We performed a crawl of 143,268 URIs (125 GB and 25 hours) to demonstrate that the crawlers are easy to set …
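For readers unfamiliar with the tooling: Heritrix 3 is controlled through a small REST interface. A hedged sketch of driving a crawl job, assuming a default local installation (port 8443, self-signed certificate, digest authentication) and a pre-configured job named intranet-crawl; the credentials and job name are placeholders, not details from the case study.

```python
import requests
from requests.auth import HTTPDigestAuth

# Placeholders: default local engine URL, job name, and credentials.
JOB_URL = "https://localhost:8443/engine/job/intranet-crawl"
AUTH = HTTPDigestAuth("admin", "admin")

# Heritrix's REST API steps a job through build -> launch -> unpause.
# verify=False because a default install uses a self-signed certificate.
for action in ("build", "launch", "unpause"):
    resp = requests.post(JOB_URL, data={"action": action},
                         auth=AUTH, verify=False)
    resp.raise_for_status()
```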


Web Archive Services Framework For Tighter Integration Between The Past And Present Web, Ahmed Alsum Apr 2014

Computer Science Theses & Dissertations

Web archives have preserved the cultural history of the web for many years, but they still offer only limited capabilities for access. Most web archiving research has focused on crawling and preservation activities, with little focus on delivery methods. The current access methods are tightly coupled with web archive infrastructure, hard to replicate or integrate with other web archives, and do not cover all of the users' needs. In this dissertation, we focus on access methods for archived web data that enable users, third-party developers, researchers, and others to gain knowledge from web archives. We build …
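The abstract does not name particular interfaces, but one widely deployed access method in this space is the Memento protocol (RFC 7089), which exposes archived versions of a URI in an archive-independent way. A minimal sketch, assuming the public Time Travel aggregator as the endpoint:

```python
import requests

def fetch_timemap(uri: str) -> str:
    """Fetch a Memento TimeMap (RFC 7089): a machine-readable list of
    known archived versions (mementos) of a URI across web archives."""
    timemap_url = f"http://timetravel.mementoweb.org/timemap/link/{uri}"
    resp = requests.get(timemap_url, timeout=30)
    resp.raise_for_status()
    return resp.text  # application/link-format listing of mementos

print(fetch_timemap("http://example.com/")[:500])
```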


An Extensible Framework For Creating Personal Archives Of Web Resources Requiring Authentication, Matthew Ryan Kelly Jul 2012

Computer Science Theses & Dissertations

The key factors for the success of the World Wide Web are its large size and the lack of centralized control over its contents. In recent years, many advances have been made in preserving web content, but much of this content (namely, social media content) was not archived, or still to this day is not being archived, for various reasons. Tools built to accomplish this frequently break because of the dynamic structure of social media websites. Because many social media websites exhibit a commonality in the hierarchy of their content, it would be worthwhile to set up a means to reference this …
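As a minimal sketch of the kind of personal archiving the thesis targets: fetch a resource with an authenticated session and write the response to a WARC file. The warcio library, cookie name, and target URI are assumptions of this sketch, not the framework the thesis actually built.

```python
from io import BytesIO

import requests
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

# Placeholders: the URI and session cookie would come from the user.
uri = "https://social.example.com/me/feed"
session = requests.Session()
session.cookies.set("sessionid", "REPLACE_WITH_REAL_COOKIE")
resp = session.get(uri)

# Write the authenticated response into a standard WARC record so the
# capture is replayable by ordinary web-archive tooling.
with open("personal-archive.warc.gz", "wb") as out:
    writer = WARCWriter(out, gzip=True)
    http_headers = StatusAndHeaders(
        "200 OK", list(resp.headers.items()), protocol="HTTP/1.1"
    )
    record = writer.create_warc_record(
        uri, "response",
        payload=BytesIO(resp.content),
        http_headers=http_headers,
    )
    writer.write_record(record)
```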


Using The Web Infrastructure For Real Time Recovery Of Missing Web Pages, Martin Klein Jul 2011

Computer Science Theses & Dissertations

Given the dynamic nature of the World Wide Web, missing web pages, or "404 Page not Found" responses, are part of our web browsing experience. It is our intuition that information on the web is rarely completely lost; it is just missing. In whole or in part, content often moves from one URI to another and hence just needs to be (re-)discovered. We evaluate several methods for a "just-in-time" approach to web page preservation. We investigate the suitability of lexical signatures and web page titles to rediscover missing content. It is understood that web pages change over time …
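A minimal sketch of a lexical signature as studied here: the top-k TF-IDF terms of a page, which can then be submitted to a search engine to rediscover the content at its new URI. The background corpus, the choice of k = 5, and the tokenization below are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def lexical_signature(page_text: str, corpus: list[str], k: int = 5) -> list[str]:
    """Top-k TF-IDF terms of page_text, scored against a background corpus."""
    vec = TfidfVectorizer(stop_words="english")
    matrix = vec.fit_transform(corpus + [page_text])
    scores = matrix[len(corpus)].toarray().ravel()  # row for page_text
    terms = vec.get_feature_names_out()
    top = scores.argsort()[::-1][:k]
    return [terms[i] for i in top if scores[i] > 0]

corpus = ["a general page about cooking", "a page about sports news"]
page = "digital preservation of web archives and lexical signatures"
print(lexical_signature(page, corpus))  # terms forming the search query
```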


Lazy Preservation: Reconstructing Websites From The Web Infrastructure, Frank Mccown Oct 2007

Computer Science Theses & Dissertations

Backup or preservation of websites is often not considered until after a catastrophic event has occurred. In the face of complete website loss, webmasters or concerned third parties have attempted to recover some of their websites from the Internet Archive. Still others have sought to retrieve missing resources from the caches of commercial search engines. Inspired by these post hoc reconstruction attempts, this dissertation introduces the concept of lazy preservation: digital preservation performed as a result of the normal operations of the Web Infrastructure (web archives, search engines, and caches). First, the Web Infrastructure (WI) is characterized by its preservation …
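A minimal sketch of recovering one resource from the Web Infrastructure, using the Internet Archive's public availability API; the search-engine-cache side of lazy preservation, and all error handling, are omitted here.

```python
import requests

def closest_memento(uri: str) -> str | None:
    """Ask the Internet Archive's availability API for the closest
    archived snapshot of a URI; returns its replay URL, if any."""
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": uri}, timeout=30,
    )
    resp.raise_for_status()
    snap = resp.json().get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap and snap.get("available") else None

print(closest_memento("http://example.com/"))
```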