Archives Unleashed Toolkit · An open-source platform for analyzing web archives with Apache Spark

Gain Insights on Your Web Archives

Are you a researcher or web archivist looking to better understand your web archive collections? No matter the size -- gigabytes, terabytes, or petabytes -- the Archives Unleashed Toolkit can help!

Our documentation takes a cookbook approach: a series of "recipes" for common analysis tasks that you can adapt to your own work. We provide examples in Scala and Python, so the choice between Spark and PySpark is yours!

Extract Text From Your Web Archives

Do you have WARCs or ARCs and want just the text? With the Toolkit, you can extract all the text from a web archive. Combine that with a variety of filters, like filtering by date, language, keyword, domain, or URL pattern, and soon you'll be mining text to your heart's content.
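To make the idea of combining filters concrete, here is a minimal sketch in plain Python. It does not use the Toolkit's API: the `Page` record, the `domain_of` helper, and the `filter_pages` function are hypothetical names invented for illustration, and the three pages are made-up sample data standing in for text already extracted from an archive.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Page:
    """Hypothetical record for one extracted page (not a Toolkit type)."""
    crawl_date: str  # YYYYMMDD
    url: str
    text: str


def domain_of(url: str) -> str:
    """Return the hostname of a URL with a leading 'www.' removed."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def filter_pages(pages, domains=None, year=None, keyword=None):
    """Keep only the pages that pass every filter that was supplied."""
    kept = []
    for page in pages:
        if domains is not None and domain_of(page.url) not in domains:
            continue
        if year is not None and not page.crawl_date.startswith(year):
            continue
        if keyword is not None and keyword.lower() not in page.text.lower():
            continue
        kept.append(page)
    return kept


# Made-up sample data for illustration only.
pages = [
    Page("20090315", "http://www.example.org/news", "Election results announced"),
    Page("20100101", "http://example.org/about", "About this site"),
    Page("20090620", "http://other.example.com/", "Election coverage"),
]

matches = filter_pages(pages, domains={"example.org"}, year="2009", keyword="election")
for page in matches:
    print(page.crawl_date, page.url)
```

Each filter is optional, so the same function covers "everything from 2009" as well as "pages on one domain that mention a keyword". On a real collection the same logic would run as Spark filters rather than a Python loop.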

Explore Hyperlink Networks in a Web Archive

Hyperlinking practice can tell us a lot about web archives: where did people link to for their information? How did these links change over time? Which websites, based on their hyperlinks, were the most influential? The Toolkit allows you to extract web graphs, and organize them by URL pattern or crawl date. We also support GraphML and GEXF, for use with Gephi.
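As a rough illustration of what a web graph organized by crawl date looks like, the following plain-Python sketch turns a handful of links into weighted domain-to-domain edges and ranks domains by inbound links. It is not the Toolkit's API: the `links` tuples are invented sample data, `domain_of` is a hypothetical helper, and inbound link count is only one simple stand-in for "influence".

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the hostname of a URL with a leading 'www.' removed."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


# Made-up sample data: (crawl month as YYYYMM, source URL, destination URL).
links = [
    ("200903", "http://www.a.example/page1", "http://b.example/x"),
    ("200903", "http://a.example/page2", "http://www.b.example/y"),
    ("200903", "http://c.example/", "http://b.example/"),
    ("200904", "http://a.example/page1", "http://c.example/z"),
]

# Collapse page-level links into weighted edges keyed by
# (crawl month, source domain, destination domain).
edges = Counter(
    (date, domain_of(src), domain_of(dst)) for date, src, dst in links
)

# Sum edge weights per destination domain as a simple measure of inbound links.
inbound = Counter()
for (date, src, dst), weight in edges.items():
    inbound[dst] += weight

for (date, src, dst), weight in sorted(edges.items()):
    print(date, src, "->", dst, weight)
print(inbound.most_common())
```

Keeping the crawl date in the edge key is what lets you compare link patterns across time. Dropping it from the key would give one aggregate graph for the whole collection.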

Learn About the Content in Your Collections

That's not all! You can use the Toolkit for collection analysis to understand the frequency of top-level domains, domains, and subdomains, or to see the distribution of binary content like audio, images, videos, and documents. You can even extract all those PowerPoint presentations, spreadsheets, and PDFs in your web archive collections!
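The kind of frequency counts described above can be sketched in a few lines of plain Python. This is not the Toolkit's API: the `records` list is invented sample data, `host_of` is a hypothetical helper, and the sketch counts full hostnames rather than registered domains, since splitting a hostname into domain and subdomain properly needs a public suffix list.

```python
from collections import Counter
from urllib.parse import urlparse

# Made-up sample data: (URL, MIME type) pairs for archived records.
records = [
    ("http://www.example.org/a.html", "text/html"),
    ("http://example.org/report.pdf", "application/pdf"),
    ("http://media.example.org/clip.mp4", "video/mp4"),
    ("http://example.com/logo.png", "image/png"),
    ("http://example.com/photo.jpg", "image/jpeg"),
]


def host_of(url: str) -> str:
    """Return the hostname part of a URL, or an empty string if absent."""
    return urlparse(url).hostname or ""


# Frequency of top-level domains (the last label of each hostname).
tld_counts = Counter(host_of(url).rsplit(".", 1)[-1] for url, _ in records)

# Frequency of full hostnames, subdomains included.
host_counts = Counter(host_of(url) for url, _ in records)

# Distribution of content by the major part of the MIME type.
type_counts = Counter(mime.split("/", 1)[0] for _, mime in records)

print(tld_counts.most_common())
print(host_counts.most_common())
print(type_counts.most_common())
```

The same three counts, run over a full collection, give a quick profile of where the content came from and what kinds of files it contains.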