The Toolkit
The Archives Unleashed Toolkit is an open-source platform for analyzing web archives. It is built on Apache Spark, which provides powerful tools for analytics and data processing.
This documentation takes a cookbook approach: it provides a series of "recipes" for common analytics tasks, which you can use as starting points for your own analysis. We generally provide examples for resilient distributed datasets (RDD) in Scala, and DataFrames in both Scala and Python. We leave it up to you to choose the Scala or Python flavour of Spark.
If you want to learn more about Apache Spark, we highly recommend Spark: The Definitive Guide.
Table of Contents
Our documentation is divided into several main sections, which cover the Archives Unleashed Toolkit workflow from analyzing collections to understanding and working with the results.
Getting Started
- Dependencies
- Usage
- Using the Archives Unleashed Toolkit at Scale
- Toolkit Walkthrough
- DataFrame Schemas
Generating Results
- Collection Analysis: How do I...
- Text Analysis: How do I...
  - Extract All Plain Text
  - Extract Plain Text Without HTTP Headers
  - Extract Plain Text By Domain
  - Extract Plain Text by URL Pattern
  - Extract Plain Text Minus Boilerplate
  - Extract Plain Text Filtered by Date
  - Extract Plain Text Filtered by Language
  - Extract Plain Text Filtered by Keyword
  - Extract Raw HTML
  - Extract Named Entities
- Link Analysis: How do I...
- Binary Analysis: How do I...
- Text Files (html, text, css, js, json, xml) Analysis: How do I...
Filtering Results
Standard Derivatives
How do I...
- Use the Toolkit with spark-submit
- Create the Archives Research Compute Hub (ARCH) Derivatives
- Extract Binary Info
- Extract Binaries to Disk
What to do with Results
Citing Archives Unleashed
How to cite the Archives Unleashed Toolkit or Cloud in your research:
Nick Ruest, Jimmy Lin, Ian Milligan, and Samantha Fritz. 2020. The Archives Unleashed Project: Technology, Process, and Community to Improve Scholarly Access to Web Archives. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 (JCDL '20). Association for Computing Machinery, New York, NY, USA, 157–166. DOI: https://doi.org/10.1145/3383583.3398513.
Your citations help to further the recognition of open-source tools for scientific inquiry, assist in growing the web archiving community, and acknowledge the efforts of contributors to this project.
Further Reading
The following two articles provide an overview of the project:
- Jimmy Lin, Ian Milligan, Jeremy Wiebe, and Alice Zhou. Warcbase: Scalable Analytics Infrastructure for Exploring Web Archives. ACM Journal on Computing and Cultural Heritage, 10(4), Article 22, 2017.
- Nick Ruest, Jimmy Lin, Ian Milligan, and Samantha Fritz. The Archives Unleashed Project: Technology, Process, and Community to Improve Scholarly Access to Web Archives. Proceedings of the 2020 IEEE/ACM Joint Conference on Digital Libraries (JCDL 2020), Wuhan, China.
Acknowledgments
This work is primarily supported by the Andrew W. Mellon Foundation. Other financial and in-kind support comes from the Social Sciences and Humanities Research Council, Compute Canada, the Ontario Ministry of Research, Innovation, and Science, York University Libraries, Start Smart Labs, and the Faculty of Arts and David R. Cheriton School of Computer Science at the University of Waterloo.
Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.