Collection Analysis · Archives Unleashed Toolkit

Extract All URLs

How do I get a list of all URLs in the collection?

Scala RDD

import io.archivesunleashed._

RecordLoader.loadArchives("/path/to/warcs", sc).keepValidPages()
  .map(r => r.getUrl)
  .take(10)

What do I do with the results? See this guide!

Scala DF

import io.archivesunleashed._
import io.archivesunleashed.udfs._

RecordLoader.loadArchives("/path/to/warcs", sc).webpages()
  .select($"url")
  .show(20, false)

What do I do with the results? See this guide!

Python DF

from aut import *

WebArchive(sc, sqlContext, "/path/to/warcs") \
  .webpages() \
  .select("url") \
  .show(20, False)

What do I do with the results? See this guide!

Extract Top-Level Domains

How do I extract a list of the top-level domains and count how many pages belong to each one?

Scala RDD

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("/path/to/warcs", sc).keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)

What do I do with the results? See this guide!

Scala DF

import io.archivesunleashed._
import io.archivesunleashed.udfs._

RecordLoader.loadArchives("/path/to/warcs", sc).webpages()
  .select(extractDomain($"url").as("domain"))
  .groupBy("domain").count().orderBy(desc("count"))
  .show(20, false)

What do I do with the results? See this guide!

Python DF

from aut import *
from pyspark.sql.functions import desc

WebArchive(sc, sqlContext, "/path/to/warcs") \
  .webpages() \
  .select(extract_domain("url").alias("domain")) \
  .groupBy("domain") \
  .count() \
  .sort(desc("count")) \
  .show(10, False)

What do I do with the results? See this guide!
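The counting step in the examples above can be tried out on a handful of URLs without Spark. The following is a minimal sketch in plain Python, not the toolkit's implementation: the sample URLs and the `count_domains` helper are hypothetical, and the toolkit's own domain extraction may normalize hostnames differently (for example, in how it treats a leading "www").

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical sample URLs, for illustration only.
urls = [
    "http://www.example.com/a.html",
    "http://www.example.com/b.html",
    "http://archive.example.org/",
    "http://www.example.com/c/d.html",
]

def count_domains(urls):
    """Count pages per hostname, most frequent first."""
    # urlparse(...).hostname is None for strings without a host, so skip those.
    hosts = (urlparse(u).hostname for u in urls)
    return Counter(h for h in hosts if h).most_common()

counts = count_domains(urls)
print(counts)
```

This mirrors the shape of the group-by-domain, count, and sort-descending pipeline, just on an in-memory list.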

Extract Different Subdomains

How do I use regular expressions to extract fine-grained URL information? For example, how would I extract the first-level directories?

Scala RDD

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("/path/to/warcs", sc).keepValidPages()
  .flatMap(r => """http://[^/]+/[^/]+/""".r.findAllIn(r.getUrl).toList)
  .take(10)

In the above example, """....""" declares a raw string literal, so the pattern can be written without extra escaping; .r turns that string into a regular expression; and .findAllIn looks for all matches in the URL. Because the pattern starts with http://, a URL will normally produce at most one match, which is generally good for our use cases. Finally, .toList turns the matches into a list so you can flatMap.
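To see what the pattern does and does not match, it can help to run it on a few URLs outside Spark. This is a small sketch in plain Python using the same pattern; the sample URLs and the `first_level_dirs` helper are made up for illustration.

```python
import re

# Same pattern as the Scala example: scheme, host, then one directory level.
FIRST_DIR = re.compile(r"http://[^/]+/[^/]+/")

# Hypothetical sample URLs, for illustration only.
urls = [
    "http://example.com/news/2020/story.html",
    "http://example.com/about/",
    "http://example.com/index.html",       # no directory level, so no match
    "https://example.com/news/item.html",  # https, so no match
]

def first_level_dirs(urls):
    """Return every first-level directory prefix found in the given URLs."""
    matches = []
    for url in urls:
        # findall plays the role of findAllIn(...).toList;
        # extend flattens the per-URL lists, as flatMap does.
        matches.extend(FIRST_DIR.findall(url))
    return matches

result = first_level_dirs(urls)
print(result)
```

Note that URLs with no first-level directory, and URLs using https, simply drop out of the results, since the pattern only matches http:// followed by a host and one path segment ending in a slash.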

What do I do with the results? See this guide!

Scala DF

import io.archivesunleashed._
import io.archivesunleashed.udfs._

val urlPattern = Array("""http://[^/]+/[^/]+/""")

RecordLoader.loadArchives("/path/to/warcs", sc)
  .webpages()
  .select($"url")
  .filter(hasUrlPatterns($"url", lit(urlPattern)))
  .show(10, false)

Python DF

from aut import *
from pyspark.sql.functions import col

url_pattern = "http://[^/]+/[^/]+/"

WebArchive(sc, sqlContext, "/path/to/warcs") \
  .webpages() \
  .select("url") \
  .filter(col("url").rlike(url_pattern)) \
  .show(10, False)

Extract HTTP Status Codes

How do I get the HTTP Status Code associated with each resource in the collection?

Scala RDD

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("/path/to/warcs", sc)
  .map(r => (r.getUrl, r.getHttpStatus))
  .take(10)

What do I do with the results? See this guide!

Scala DF

import io.archivesunleashed._
import io.archivesunleashed.udfs._

RecordLoader.loadArchives("/path/to/warcs", sc)
  .all()
  .select($"url", $"http_status_code")
  .show(10, false)

Python DF

from aut import *

WebArchive(sc, sqlContext, "/path/to/warcs") \
  .all() \
  .select("url", "http_status_code") \
  .show(10, False)

Extract the Location of the Resource in ARCs and WARCs

How do I find out the WARC or ARC that each page is contained in?

Scala RDD

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("/path/to/warcs", sc).keepValidPages()
  .map(r => (r.getUrl, r.getArchiveFilename))
  .take(10)

Or, if you just want the filename without the full path, the following script will do that.

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import org.apache.commons.io.FilenameUtils

RecordLoader.loadArchives("/path/to/warcs", sc).keepValidPages()
  .map(r => (r.getUrl, FilenameUtils.getName(r.getArchiveFilename)))
  .take(10)

What do I do with the results? See this guide!
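The path-stripping step can be illustrated on its own. The sketch below uses Python's standard library rather than the toolkit; the archive path and the `archive_name` helper are hypothetical, and it assumes archive locations use forward slashes.

```python
import posixpath

# Hypothetical archive location, for illustration only.
full_path = "/path/to/warcs/example-20200101.warc.gz"

def archive_name(path):
    """Return the final path component, i.e. the filename without its directories."""
    return posixpath.basename(path)

name = archive_name(full_path)
print(name)
```

Here `posixpath.basename` is used in place of `os.path.basename` so the result does not depend on the operating system running the script.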

Scala DF

import io.archivesunleashed._
import io.archivesunleashed.udfs._

RecordLoader.loadArchives("/path/to/warcs", sc)
  .all()
  .select($"url", $"archive_filename")
  .show(10, false)

Python DF

from aut import *

WebArchive(sc, sqlContext, "/path/to/warcs") \
  .all() \
  .select("url", "archive_filename") \
  .show(10, False)