Binary Analysis
Extract Audio Information
Scala RDD
Will not be implemented.
Scala DF
The following script:
import io.archivesunleashed._
import io.archivesunleashed.udfs._
import org.apache.spark.sql.functions.desc
val df = RecordLoader.loadArchives("/path/to/warcs", sc).audio();
df.select($"url", $"filename", $"extension", $"mime_type_web_server", $"mime_type_tika", $"md5", $"sha1", $"bytes")
.orderBy(desc("md5"))
.show()
Will extract the following information from audio files in a web collection:
- audio URL
- filename
- extension
- MIME type as identified by the hosting web server
- MIME type as identified by Apache Tika
- MD5 hash
- SHA1 hash
- bytes
+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
| url| filename|extension|mime_type_web_server|mime_type_tika| md5| sha1| bytes|
+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|http://geocities....| capasoligero.mp3| mp3| audio/mpeg| audio/mpeg|fffd1aa802392be0f...|88e254b4cab7848a9...|//MozAAAAAAAAAAAA...|
|http://www.geocit...| colorwnd.mid| mid| audio/midi| audio/midi|fff3f4e8a473f7c9a...|aea92a6f32dd1a1f4...|TVRoZAAAAAYAAQAGA...|
|http://geocities....|santana_rob_thoma...| mid| audio/midi| audio/midi|ffd4a24d4e4722d94...|28576c271898a1de5...|TVRoZAAAAAYAAQASA...|
|http://geocities....| music.mid| mid| audio/midi| audio/midi|ffcbe35e28b553481...|cf1ebdbe1a070d4f6...|TVRoZAAAAAYAAAABA...|
|http://geocities....| evrythng.mid| mid| audio/midi| audio/midi|ff751c86728ff09b5...|d22fc0911d3ceb17a...|TVRoZAAAAAYAAAABA...|
|http://geocities....| evrythn2.mid| mid| audio/midi| audio/midi|ff751c86728ff09b5...|d22fc0911d3ceb17a...|TVRoZAAAAAYAAAABA...|
|http://geocities....| picket.mid| mid| audio/midi| audio/midi|ff4d225a602630584...|ecef0a851cc028853...|TVRoZAAAAAYAAQAHA...|
|http://geocities....| simpsons.mid| mid| audio/midi| audio/midi|ff3bc375860979f2f...|9c1204dad686ddeea...|TVRoZAAAAAYAAQAPA...|
|http://www.geocit...| simpsons.mid| mid| audio/midi| audio/midi|ff3bc375860979f2f...|9c1204dad686ddeea...|TVRoZAAAAAYAAQAPA...|
|http://geocities....| mypretty.wav| wav| audio/x-wav|audio/vnd.wave|ff1a5015d3a380955...|113de5c1bb2f7ddb4...|UklGRvz8AABXQVZFZ...|
|http://geocities....| song37.mid| mid| audio/midi| audio/midi|fee0a67ff7c71e35c...|ccd4fdfa0483d1058...|TVRoZAAAAAYAAAABA...|
|http://geocities....| holdyourhand.mid| mid| audio/midi| audio/midi|fed14ecd7099e3fb9...|24fe5c097db5d506a...|TVRoZAAAAAYAAQANA...|
|http://geocities....| es_tu_sangre.mid| mid| audio/midi| audio/midi|fec196e8086d868f2...|eccb1551d1e7b236e...|TVRoZAAAAAYAAQASA...|
|http://www.geocit...| virgin.mid| mid| audio/midi| audio/midi|fec0ce795723b1287...|cc651312b1d57fe64...|TVRoZAAAAAYAAQAMA...|
|http://www.geocit...|tonibraxtonunbrea...| wav| audio/x-wav|audio/vnd.wave|feb7e31a8edb0a484...|9420bdeece0f23b78...|UklGRtQoCgBXQVZFZ...|
|http://geocities....| comeandsee.mid| mid| audio/midi| audio/midi|feb513cd7b6fab9cc...|51b4c2bb113cb43aa...|TVRoZAAAAAYAAAABA...|
|http://geocities....| song186t.mid| mid| audio/midi| audio/midi|fead61a5a439675a3...|c652eda8a4ec5d197...|TVRoZAAAAAYAAAABA...|
|http://geocities....| be_magnified.mid| mid| audio/midi| audio/midi|feac0e996e1555d84...|f51ec1e62a166fa82...|TVRoZAAAAAYAAQAPA...|
|http://geocities....| EVERYBOD.MID| mid| audio/midi| audio/midi|fea911b19f0cf709d...|58bcd1b3c0288cbe0...|TVRoZAAAAAYAAQAUA...|
|http://www.geocit...| ff9waltz.mid| mid| audio/midi| audio/midi|fe9eb1ea6d4b53a9f...|72e2467bfea6240b8...|TVRoZAAAAAYAAQAKA...|
+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
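Because the result is an ordinary Spark DataFrame, you can summarize it before looking at individual files. As a minimal sketch, assuming the same df as above, the following counts audio files by the MIME type Tika detected:
import org.apache.spark.sql.functions.desc

// tally audio records per Tika-detected MIME type, most common first
df.groupBy($"mime_type_tika")
  .count()
  .orderBy(desc("count"))
  .show()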
If you wanted to work with all the audio files in a collection, you could extract them with the following script:
import io.archivesunleashed._
import io.archivesunleashed.udfs._
val df = RecordLoader.loadArchives("/path/to/warcs", sc).audio();
df.select($"bytes", $"extension")
.saveToDisk("bytes", "/path/to/export/directory/your-preferred-filename-prefix", $"extension")
Python DF
The following script:
from aut import *
archive = WebArchive(sc, sqlContext, "/path/to/warcs")
df = archive.audio()
df.show()
Will extract the following information from audio files in a web collection:
- audio URL
- filename
- extension
- MIME type as identified by the hosting web server
- MIME type as identified by Apache Tika
- MD5 hash
- SHA1 hash
- bytes
+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
| url| filename|extension|mime_type_web_server|mime_type_tika| md5| sha1| bytes|
+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|http://www.geocit...| hc-tibet.wav| wav| audio/x-wav|audio/vnd.wave|416ad26133f63dc3e...|dfb764d759187d102...|UklGRg6eAABXQVZFZ...|
|http://geocities....|bookmarkthissite.wav| wav| audio/x-wav|audio/vnd.wave|7897ff71780a903ca...|cfb942aeb3bc881cd...|UklGRppkAABXQVZFZ...|
|http://geocities....| NeilYoung-Hey.mp3| mp3| audio/mpeg| audio/mpeg|40869eb3181e6035b...|19fa693521cd8125c...|//uQRAAAAcAAsNUEA...|
|http://geocities....| misty1.mp3| mp3| audio/mpeg| audio/mpeg|d8cb3ce54072a7d4b...|43b92e16932c13a43...|//uQBAAAAsJl22mBE...|
|http://geocities....| sale.mid| mid| audio/midi| audio/midi|5dfc0c3dd884e50c7...|071840b4822ae5e80...|TVRoZAAAAAYAAQALA...|
|http://geocities....| swaplink.mid| mid| audio/midi| audio/midi|f32117ce2bffa9902...|0346223861c87acc1...|TVRoZAAAAAYAAQALA...|
|http://geocities....| m5.mid| mid| audio/midi| audio/midi|7e5eedebafecd26c4...|393dfbc00c49fcdc9...|TVRoZAAAAAYAAQAJA...|
|http://geocities....| morder.mid| mid| audio/midi| audio/midi|6cec0785377f5bbaf...|a94f0a75c0c3b3cf5...|TVRoZAAAAAYAAQAMA...|
|http://geocities....| m2.mid| mid| audio/midi| audio/midi|58b0102f997e689a2...|51ad469ebc931e160...|TVRoZAAAAAYAAQALA...|
|http://geocities....| music.mid| mid| audio/midi| audio/midi|7917a5a9d6ddfb8dd...|009db9df73cdf5247...|TVRoZAAAAAYAAQALA...|
|http://www.geocit...| hcpopeye.wav| wav| audio/x-wav|audio/vnd.wave|04d7b45c70e0a496e...|9db0e61c16554af88...|UklGRrbAAABXQVZFZ...|
|http://geocities....| m7.mid| mid| audio/midi| audio/midi|3906ecaba32ba15a8...|e0d6e9f1c86b6204e...|TVRoZAAAAAYAAQAHA...|
|http://geocities....| words.mid| mid| audio/midi| audio/midi|30da01a4ed42ae469...|160b2e5aaa9b95641...|TVRoZAAAAAYAAQAIA...|
|http://geocities....| brock5.mp3| mp3| audio/mpeg| audio/mpeg|17f4e1c7a007983a5...|3bbdb27fafa4e8b12...|//MozAANkCLE/gjGA...|
|http://geocities....| brock1.mp3| mp3| audio/mpeg| audio/mpeg|67db65825afc326ed...|2ec4ac110cff19134...|//MozAAMyX7VmBjGl...|
|http://geocities....| funkytown.wav| wav| audio/x-wav|audio/vnd.wave|6f841bcffe4bbb61d...|ab1fdb143d5752cf1...|UklGRlLOCQBXQVZFZ...|
|http://geocities....| welcomemyworld.mid| mid| audio/midi| audio/midi|c546eac675e2dd974...|cb4f1fa32aa1e3205...|TVRoZAAAAAYAAQAMA...|
|http://www.geocit...| irisheye.mid| mid| audio/midi| audio/midi|d906f32953742fdef...|f3ca7449483b0ea65...|TVRoZAAAAAYAAQAFA...|
|http://geocities....| mission21.mid| mid| audio/midi| audio/midi|c507304afe6cddba1...|72a74c1914044746f...|TVRoZAAAAAYAAQAVA...|
|http://geocities....| tellit1.mid| mid| audio/midi| audio/midi|a604ae85251d55504...|95096668900a76dc8...|TVRoZAAAAAYAAQAQA...|
+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
only showing top 20 rows
Extract Image Information
Scala RDD
Will not be implemented.
Scala DF
The following script:
import io.archivesunleashed._
import io.archivesunleashed.udfs._
import org.apache.spark.sql.functions.desc
val df = RecordLoader.loadArchives("/path/to/warcs", sc).images();
df.select($"url", $"filename", $"extension", $"mime_type_web_server", $"mime_type_tika", $"width", $"height", $"md5", $"sha1", $"bytes")
.orderBy(desc("md5"))
.show()
Will extract the following information from images in a web collection:
- image URL
- filename
- extension
- MIME type as identified by the hosting web server
- MIME type as identified by Apache Tika
- width
- height
- MD5 hash
- SHA1 hash
- bytes
+--------------------+--------------------+---------+--------------------+--------------+-----+------+--------------------+--------------------+--------------------+
| url| filename|extension|mime_type_web_server|mime_type_tika|width|height| md5| sha1| bytes|
+--------------------+--------------------+---------+--------------------+--------------+-----+------+--------------------+--------------------+--------------------+
|http://www.archiv...|mediatype_movies.gif| gif| image/gif| image/gif| 21| 21|ff05f9b408519079c...|194800d702aab9b87...|R0lGODlhFQAVAKUpA...|
|http://www.archiv...| LOCLogoSmall.jpg| jpg| image/jpeg| image/jpeg| 275| 300|fbf1aec668101b960...|564c1a07152c12cea...|/9j/4AAQSkZJRgABA...|
|http://www.archiv...| archive.small.jpg| jpg| image/jpeg| image/jpeg| 300| 225|f611b554b9a44757d...|e9bf7ef0ae3fc50f5...|/9j/4RpBRXhpZgAAT...|
|http://tsunami.ar...| tsunamiweb1_02.jpg| jpg| image/jpeg| image/jpeg| 384| 229|f02005e29ffb485ca...|9eeb9c3c67d7efc51...|/9j/4AAQSkZJRgABA...|
|http://www.archiv...|alexa_websearch_l...| gif| image/gif| image/gif| 301| 47|eecc909992272ce0d...|ea18e226f3cf40005...|R0lGODlhLQEvAPcAA...|
|http://www.archiv...| lizardtech.gif| gif| image/gif| image/gif| 140| 37|e7166743861126e51...|cf26e9ffc27be133f...|R0lGODlhjAAlANUwA...|
|http://www.archiv...| half_star.png| png| image/png| image/png| 14| 12|e1e101f116d9f8251...|736abd06a978e2fd2...|iVBORw0KGgoAAAANS...|
|http://www.archiv...| hewlett.jpg| jpg| image/jpeg| image/jpeg| 300| 116|e1da27028b81db60e...|eb418c17901b1b313...|/9j/4AAQSkZJRgABA...|
|http://www.archiv...|prelinger-header-...| jpg| image/jpeg| image/jpeg| 84| 72|d39cce8b2f3aaa783...|1c41a644123e8f861...|/9j/4AAQSkZJRgABA...|
|http://www.archiv...| arrow.gif| gif| image/gif| image/gif| 13| 11|c7ee6d7c17045495e...|7013764e619066e60...|R0lGODlhDQALALMAA...|
|http://www.archiv...| folder.png| png| image/png| image/png| 20| 15|c1905fb5f16232525...|ff7b8c60e8397cb5d...|iVBORw0KGgoAAAANS...|
|http://www.archiv...| wayback-wtc.gif| gif| image/gif| image/gif| 35| 35|c15ec074d95fe7e1e...|f45425406600b136d...|R0lGODlhIwAjANUAA...|
|http://www.archiv...| clicktoplay.png| png| image/png| image/png| 320| 240|b148d9544a1a65ae4...|477105e3a93b60dd8...|iVBORw0KGgoAAAANS...|
|http://www.archiv...| orange_arrow.gif| gif| image/gif| image/gif| 8| 11|a820ac93e2a000c9d...|850b9daeef06bee6e...|R0lGODlhCAALAJECA...|
|http://www.archiv...| arc-it-tagline.gif| gif| image/gif| image/gif| 385| 30|9f70e6cc21ac55878...|4601e2f642d8e55ac...|R0lGODlhgQEeALMPA...|
|http://www.archiv...| guitar.jpg| jpg| image/jpeg| image/jpeg| 140| 171|9ed163df5065418db...|f6c9475009ae2416c...|/9j/4AAQSkZJRgABA...|
|http://www.archiv...| blendbar.jpg| jpg| image/jpeg| image/jpeg| 1800| 89|9e41e4d6bdd53cd9d...|dc780bf80720c87c9...|/9j/4AAQSkZJRgABA...|
|http://www.archiv...|alexalogo-archive...| gif| image/gif| image/gif| 304| 36|9da73cf504be0eb70...|03e530ef04e4b68f7...|R0lGODlhMAEkAOYAA...|
|http://www.archiv...| lma.jpg| jpg| image/jpeg| image/jpeg| 215| 71|97ebd3441323f9b5d...|ff9485b26300721b2...|/9j/4AAQSkZJRgABA...|
|http://i.creative...| 88x31.png| png| image/png| image/png| 88| 31|9772d34b683f8af83...|689bef4ffb8918612...|iVBORw0KGgoAAAANS...|
+--------------------+--------------------+---------+--------------------+--------------+-----+------+--------------------+--------------------+--------------------+
only showing top 20 rows
import io.archivesunleashed._
import io.archivesunleashed.udfs._
df: org.apache.spark.sql.DataFrame = [url: string, filename: string ... 7 more fields]
If you wanted to work with all the images in a collection, you could extract them with the following script:
import io.archivesunleashed._
import io.archivesunleashed.udfs._
val df = RecordLoader.loadArchives("/path/to/warcs", sc).images();
df.select($"bytes", $"extension")
.saveToDisk("bytes", "/path/to/export/directory/your-preferred-filename-prefix", $"extension")
Python DF
The following script:
from aut import *
archive = WebArchive(sc, sqlContext, "/path/to/warcs")
df = archive.images()
df.show()
Will extract the following information from images in a web collection:
- image URL
- filename
- extension
- MIME type as identified by the hosting web server
- MIME type as identified by Apache Tika
- width
- height
- MD5 hash
- SHA1 hash
- bytes
+--------------------+--------------------+---------+--------------------+--------------+-----+------+--------------------+--------------------+--------------------+
| url| filename|extension|mime_type_web_server|mime_type_tika|width|height| md5| sha1| bytes|
+--------------------+--------------------+---------+--------------------+--------------+-----+------+--------------------+--------------------+--------------------+
|http://farm3.stat...|4047878934_ef12ba...| jpg| image/jpeg| image/jpeg| 100| 75|e1a376f170b815f49...|2165fd2908950e9f6...|/9j/4AAQSkZJRgABA...|
|http://farm3.stat...|4047881126_fc6777...| jpg| image/jpeg| image/jpeg| 75| 100|371a2a5142c611405...|933f937c949826696...|/9j/4AAQSkZJRgABA...|
|http://farm3.stat...|4047879492_a72dd8...| jpg| image/jpeg| image/jpeg| 100| 75|8877679361cde970d...|31dbaaed2f7194c95...|/9j/4AAQSkZJRgABA...|
|http://farm3.stat...|4047877728_c6c118...| jpg| image/jpeg| image/jpeg| 75| 100|8f009a568d47e1888...|7b83e7d6c78ed65cf...|/9j/4AAQSkZJRgABA...|
|http://img.youtub...| 0.jpg| jpg| image/jpeg| image/jpeg| 480| 360|96d9290d060547781...|2d3005bd6e09ca064...|/9j/4AAQSkZJRgABA...|
|http://img.youtub...| 0.jpg| jpg| image/jpeg| image/jpeg| 480| 360|c69d65d4880445b31...|abe40cb96bfc79095...|/9j/4AAQSkZJRgABA...|
|http://img.youtub...| 0.jpg| jpg| image/jpeg| image/jpeg| 480| 360|cb11c08d43e25ec3b...|2060857d6cf41b141...|/9j/4AAQSkZJRgABA...|
|http://img.youtub...| 0.jpg| jpg| image/jpeg| image/jpeg| 480| 360|756b5a0a83a621eb7...|d4625efc80efb985e...|/9j/4AAQSkZJRgABA...|
|http://img.youtub...| 0.jpg| jpg| image/jpeg| image/jpeg| 480| 360|0b60007c3e3d9d63f...|a154035590a01efb4...|/9j/4AAQSkZJRgABA...|
|http://img.youtub...| 0.jpg| jpg| image/jpeg| image/jpeg| 480| 360|97fdea388e1245691...|e415a77a4369ecef8...|/9j/4AAQSkZJRgABA...|
|http://img.youtub...| 0.jpg| jpg| image/jpeg| image/jpeg| 480| 360|05c2d43f687f40b60...|ed3f6ca2f3d7e9569...|/9j/4AAQSkZJRgABA...|
|http://www.canadi...| WebResource.axd| gif| image/gif| image/gif| 1| 1|325472601571f31e1...|2daeaa8b5f19f0bc2...|R0lGODlhAQABAIAAA...|
|http://www.davids...|footprint-carbon.jpg| jpg| image/jpeg| image/jpeg| 200| 200|51f57de92e76f3edc...|c970137cd3bfdbbba...|/9j/4AAQSkZJRgABA...|
|http://www.gca.ca...| 15.jpg| jpg| image/jpeg| image/jpeg| 300| 230|8b3c192b9a0cc82d6...|851377ed11c9cd153...|/9j/4AAQSkZJRgABA...|
|http://www.equalv...|loadingAnimation.gif| gif| image/gif| image/gif| 208| 13|c33734a1bf58bec32...|2bb50e01775289c24...|R0lGODlh0AANAMQAA...|
|http://www.davids...|Keep-greening-gre...| jpg| image/jpeg| image/jpeg| 166| 252|4763383a8be13c735...|a42b963e18dc1e7d4...|/9j/4AAQSkZJRgABA...|
|http://www.davids...|Keep-greening-don...| jpg| image/jpeg| image/jpeg| 146| 252|515bd44bea759e169...|75abeb65cc4f54c7d...|/9j/4AAQSkZJRgABA...|
|http://www.davids...|Keep-greening-eca...| jpg| image/jpeg| image/jpeg| 158| 252|345f71df9702e99a0...|b6637ac654f6e2073...|/9j/4AAQSkZJRgABA...|
|http://www.davids...|Keep-greening-tit...| jpg| image/jpeg| image/jpeg| 470| 45|385522fde90ac7e96...|b42151cf8c3ce14e0...|/9j/4AAQSkZJRgABA...|
|http://www.davids...| last_minute2.jpg| jpg| image/jpeg| image/jpeg| 265| 33|3defee897d4c553fc...|37c790bbc23c369d8...|/9j/4AAQSkZJRgABA...|
+--------------------+--------------------+---------+--------------------+--------------+-----+------+--------------------+--------------------+--------------------+
only showing top 20 rows
Extract Most Frequent Image URLs
Scala RDD
The following script:
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
val links = RecordLoader.loadArchives("/path/to/warcs", sc)
.keepValidPages()
.flatMap(r => ExtractImageLinks(r.getUrl, r.getContentString))
.countItems()
.take(10)
Will extract the top ten URLs of images found within a collection, in an array like so:
links: Array[(String, Int)] = Array((http://www.archive.org/images/star.png,408), (http://www.archive.org/images/no_star.png,122), (http://www.archive.org/images/logo.jpg,118), (http://www.archive.org/images/main-header.jpg,84), (http://www.archive.org/images/rss.png,20), (http://www.archive.org/images/mail.gif,13), (http://www.archive.org/images/half_star.png,10), (http://www.archive.org/images/arrow.gif,7), (http://ia300142.us.archive.org/3/items/americana/am_libraries.gif?cnt=0,3), (http://ia310121.us.archive.org/2/items/GratefulDead/gratefuldead.gif?cnt=0,3), (http://www.archive.org/images/wayback.gif,2), (http://www.archive.org/images/wayback-election2000.gif,2), (http://www.archive.org/images/wayback-wt...
If you wanted to work with the images, you could download them from the Internet Archive.
Let's use the top-ranked example. The Wayback Machine's calendar view for that URL will show you the temporal distribution of captures of the image. For a snapshot from September 2007, this URL would work:
http://web.archive.org/web/20070913051458/http://www.archive.org/images/star.png
To do analysis on all images, you could thus prepend http://web.archive.org/web/20070913051458/ to each URL and wget them en masse.
For more information on wget, please consult this lesson available on the Programming Historian website.
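If you would rather script that step, here is a minimal sketch, assuming the same collection as above, that prepends the Wayback Machine prefix to every extracted image URL and writes the resulting list to disk (the output path is just an example) so it can be fed to wget later:
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// reuse the pipeline from above, then turn (url, count) pairs into prefixed URLs
val prefix = "http://web.archive.org/web/20070913051458/"
RecordLoader.loadArchives("/path/to/warcs", sc)
  .keepValidPages()
  .flatMap(r => ExtractImageLinks(r.getUrl, r.getContentString))
  .countItems()
  .map { case (url, _) => prefix + url }
  .saveAsTextFile("/path/to/image-url-list")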
Scala DF
The following script:
import io.archivesunleashed._
import io.archivesunleashed.udfs._
val df = RecordLoader.loadArchives("/path/to/warcs", sc).imagegraph();
df.groupBy($"image_url")
.count()
.orderBy($"count".desc)
.show(10)
Will extract the top ten URLs of images found within a collection, in a DataFrame like so:
+--------------------+-----+
| image_url|count|
+--------------------+-----+
|http://www.archiv...| 408|
|http://www.archiv...| 122|
|http://www.archiv...| 83|
|http://www.archiv...| 49|
|http://www.archiv...| 20|
|http://www.archiv...| 13|
|http://www.archiv...| 10|
|http://www.archiv...| 7|
|http://ia300142.u...| 3|
|http://ia310121.u...| 3|
+--------------------+-----+
only showing top 10 rows
import io.archivesunleashed._
import io.archivesunleashed.udfs._
df: org.apache.spark.sql.DataFrame = [src: string, image_url: string]
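The imagegraph DataFrame also records the page each image was linked from in its src column, so you can aggregate by linking domain instead of by image URL. A minimal sketch, assuming the same df as above:
import io.archivesunleashed.udfs._

// count image links per (www-stripped) source domain
df.groupBy(removePrefixWWW(extractDomain($"src")).as("src_domain"))
  .count()
  .orderBy($"count".desc)
  .show(10)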
Python DF
The following script:
from aut import *
archive = WebArchive(sc, sqlContext, "/path/to/warcs")
df = archive.imagegraph()
df.groupBy("image_url")
.count()
.orderBy("count", ascending=False)
.show(10)
Will extract the top ten URLs of images found within a collection, in a DataFrame like so:
+--------------------+-----+
| image_url|count|
+--------------------+-----+
|http://www.archiv...| 408|
|http://www.archiv...| 122|
|http://www.archiv...| 83|
|http://www.archiv...| 49|
|http://www.archiv...| 20|
|http://www.archiv...| 13|
|http://www.archiv...| 10|
|http://www.archiv...| 7|
|http://ia300142.u...| 3|
|http://ia310121.u...| 3|
+--------------------+-----+
Extract Most Frequent Images MD5 Hash
Some images may be the same but have different URLs. Popular images can instead be found by calculating the MD5 hash of each image and presenting the most frequent hashes.
Scala RDD
The following script:
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
val r = RecordLoader.loadArchives("/path/to/warcs", sc).persist()
ExtractPopularImages(r, 500, sc).saveAsTextFile("500-Popular-Images")
Will save the URLs of the 500 most popular images to an output directory (500-Popular-Images).
Scala DF
import io.archivesunleashed._
import io.archivesunleashed.app._
val df = RecordLoader.loadArchives("/path/to/warcs", sc).images()
ExtractPopularImagesDF(df, 10, 30, 30).show()
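The three numeric arguments are, as far as I can tell from the toolkit's defaults, a result limit and minimum width and height filters; treat that reading as an assumption rather than documented fact. Rather than only displaying the result, you can also persist it. A minimal sketch, assuming the returned columns are simple string and numeric types:
// keep the popular-images result and write it out instead of just showing it
val popular = ExtractPopularImagesDF(df, 10, 30, 30)
popular.write.csv("/path/to/output/popular-images")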
Python DF
from aut import *
images = WebArchive(sc, sqlContext, "/path/to/warcs").images()
popular_images = ExtractPopularImages(images, 20, 10, 10)
popular_images.show()
Find Images Shared Between Domains
How to find images that are shared between domains, i.e., the same image (by MD5 hash) appearing in more than one domain.
Scala DF
import io.archivesunleashed._
import io.archivesunleashed.udfs._
import org.apache.spark.sql.functions.{asc, countDistinct, first}
val images = RecordLoader.loadArchives("/path/to/warcs", sc)
.images()
.select(removePrefixWWW(extractDomain($"url")).as("domain"), $"url", $"md5")
val links = images.groupBy("md5").count().where(countDistinct("domain")>=2)
val result = images.join(links, "md5")
.groupBy("domain", "md5")
.agg(first("url").as("image_url"))
.orderBy(asc("md5"))
.write.csv("/path/to/output")
Python DF
from aut import *
from pyspark.sql.functions import asc, countDistinct, first
images = WebArchive(sc, sqlContext, "/path/to/warcs") \
.images() \
.select(remove_prefix_www(extract_domain("url")).alias("domain"), "url", "md5")
links = images.groupBy("md5") \
.count() \
.where(countDistinct("domain")>=2)
result = images.join(links, "md5") \
.groupBy("domain", "md5") \
.agg(first("url").alias("image_url")) \
.orderBy(asc("md5")) \
.write.csv("/path/to/output")
Extract PDF Information
Scala RDD
Will not be implemented.
Scala DF
The following script:
import io.archivesunleashed._
import io.archivesunleashed.udfs._
import org.apache.spark.sql.functions.desc
val df = RecordLoader.loadArchives("/path/to/warcs", sc).pdfs();
df.select($"url", $"filename", $"extension", $"mime_type_web_server", $"mime_type_tika", $"md5", $"sha1", $"bytes")
.orderBy(desc("md5"))
.show()
Will extract the following information from PDF files in a web collection:
- file URL
- filename
- extension
- MIME type as identified by the hosting web server
- MIME type as identified by Apache Tika
- MD5 hash
- SHA1 hash
- bytes
+--------------------+--------------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
| url| filename|extension|mime_type_web_server| mime_type_tika| md5| sha1| bytes|
+--------------------+--------------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
|http://geocities....|adicec_sopar_2009...| pdf|application/octet...|application/pdf|ffc2ccc373b8ffd39...|3831b0f228af1701e...|JVBERi0xLjMNJeLjz...|
|http://www.geocit...| IntSt_2301.pdf| pdf|application/octet...|application/pdf|ffa638c418dac2e19...|84dbaccde1ace4b24...|JVBERi0xLjQNJeLjz...|
|http://www.geocit...| lotg.pdf| pdf|application/octet...|application/pdf|ff871ef64d3739b03...|95a777f0b4c7703c6...|JVBERi0xLjINJeLjz...|
|http://geocities....| ebad.pdf| pdf|application/octet...|application/pdf|fe8feece5d08dc2ce...|0c01cc31b40a286da...|JVBERi0xLjMNJeLjz...|
|http://geocities....| regulament.pdf| pdf|application/octet...|application/pdf|fe8018451633fd76c...|9c7cc720e29cad6e8...|JVBERi0xLjMKJcfsj...|
|http://geocities....|dmatias_letterfor...| pdf|application/octet...|application/pdf|fe7dbc89e664ba790...|dbe965e7a288cce59...|JVBERi0xLjYNJeLjz...|
|http://geocities....|overcome_the_fear...| pdf|application/octet...|application/pdf|fe3ec0805564cd3fc...|d0d30ba4f7f40434d...|JVBERi0xLjMKJcfsj...|
|http://geocities....| CIM_marks.pdf| pdf|application/octet...|application/pdf|fe1622ac08b47cf60...|b97b57b3c77887324...|JVBERi0xLjMKJcTl8...|
|http://geocities....| board.PDF| pdf|application/octet...|application/pdf|fd969b57508d3b135...|fc121c07fefbb722b...|JVBERi0xLjIgDQol4...|
|http://geocities....| cowell.pdf| pdf|application/octet...|application/pdf|fbacc01cbe01aa0b4...|f9e9eba1b281ad800...|JVBERi0xLjMKJeLjz...|
|http://geocities....| gdbrasil.pdf| pdf|application/octet...|application/pdf|fadc9b9b2408a1112...|247671acb971ddc21...|JVBERi0xLjQNJeLjz...|
|http://www.geocit...| EBOrder.pdf| pdf|application/octet...|application/pdf|fa4a83d96441324b3...|5f6870832d035a5a9...|JVBERi0xLjINJeLjz...|
|http://geocities....| butlleta.pdf| pdf|application/octet...|application/pdf|fa13dfbf62acb5083...|9a8ec0c0e8a190f46...|JVBERi0xLjQNJeLjz...|
|http://www.geocit...|ALABAMAUNDERWOODM...| pdf|application/octet...|application/pdf|f9791c7df35d9092a...|3e4c0ca1031152d24...|JVBERi0xLjIgDQol4...|
|http://geocities....| chimera.pdf| pdf|application/octet...|application/pdf|f92a40f58cffcdc8e...|ba038d0146b0171f2...|JVBERi0xLjMKJcfsj...|
|http://geocities....| icarus.pdf| pdf|application/octet...|application/pdf|f8da963b714e684b3...|4444f5a12c9dbb1df...|JVBERi0xLjMKJcfsj...|
|http://geocities....|2008_ClubFinances...| pdf|application/octet...|application/pdf|f878c0373edbc89f9...|700393c7b6aaf93df...|JVBERi0xLjQNJeLjz...|
|http://geocities....| WILLOWSTScene5.pdf| pdf|application/octet...|application/pdf|f84fc521602fdf163...|5f03b19201536cbc8...|JVBERi0xLjQKJcfsj...|
|http://geocities....| isrherb2.pdf| pdf|application/octet...|application/pdf|f83390642e9fe6313...|60befa2b5913bb19d...|JVBERi0xLjMNJeLjz...|
|http://geocities....| joel.pdf| pdf|application/octet...|application/pdf|f828e4b447c085fdd...|2e3308c1a52f2f75a...|JVBERi0xLjQKJcOkw...|
+--------------------+--------------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
only showing top 20 rows
import io.archivesunleashed._
import io.archivesunleashed.udfs._
df: org.apache.spark.sql.DataFrame = [url: string, filename: string ... 6 more fields]
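Since the DataFrame keeps the full URL of each file, you can also see which hosts the PDFs come from. A minimal sketch, assuming the same df as above, that counts PDFs per domain:
import io.archivesunleashed.udfs._
import org.apache.spark.sql.functions.desc

// count PDF records per (www-stripped) domain, most common first
df.groupBy(removePrefixWWW(extractDomain($"url")).as("domain"))
  .count()
  .orderBy(desc("count"))
  .show()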
If you wanted to work with all the PDF files in a collection, you could extract them with the following script:
import io.archivesunleashed._
import io.archivesunleashed.udfs._
val df = RecordLoader.loadArchives("/path/to/warcs", sc).pdfs();
df.select($"bytes", $"extension")
.saveToDisk("bytes", "/path/to/export/directory/your-preferred-filename-prefix", $"extension")
Python DF
The following script:
from aut import *
archive = WebArchive(sc, sqlContext, "/path/to/warcs")
df = archive.pdfs()
df.show()
Will extract the following information from PDF files in a web collection:
- file URL
- filename
- extension
- MIME type as identified by the hosting web server
- MIME type as identified by Apache Tika
- MD5 hash
- SHA1 hash
- bytes
+--------------------+--------------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
| url| filename|extension|mime_type_web_server| mime_type_tika| md5| sha1| bytes|
+--------------------+--------------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
|http://geocities....|20080304ordinance...| pdf|application/octet...|application/pdf|ebbf9bf99b363493b...|f0b9a6788cbc1f8ab...|JVBERi0xLjMNJeLjz...|
|http://geocities....|FACTSHEET2008ADOP...| pdf|application/octet...|application/pdf|4fe261c2210189a52...|a91180b9170ff757f...|JVBERi0xLjQNJeLjz...|
|http://geocities....| Menu.pdf| pdf|application/octet...|application/pdf|75e4d587589a1d85d...|d18724100d4616a45...|JVBERi0xLjMNJeLjz...|
|http://geocities....|DSTC2009ContestFl...| pdf|application/octet...|application/pdf|c80f38f96480aab0c...|369c4415ed9c2476d...|JVBERi0xLjQNJeLjz...|
|http://geocities....| ebad.pdf| pdf|application/octet...|application/pdf|fe8feece5d08dc2ce...|0c01cc31b40a286da...|JVBERi0xLjMNJeLjz...|
|http://geocities....|FACTSHEET2008APPR...| pdf|application/octet...|application/pdf|8747971e78acb768b...|770f97a95c7e2ee16...|JVBERi0xLjQNJeLjz...|
|http://geocities....|FACTSHEET2008APPE...| pdf|application/octet...|application/pdf|32f57bbe5b28f4ab1...|d4f63b8d29f4c5dc5...|JVBERi0xLjQNJeLjz...|
|http://geocities....|FACTSHEET2008ADOP...| pdf|application/octet...|application/pdf|e9189eea563fde074...|f14b1846499dd4bd0...|JVBERi0xLjQNJeLjz...|
|http://geocities....| sharar.pdf| pdf|application/octet...|application/pdf|771f5bd1b72b8e324...|9cef1f6af9e5c127e...|JVBERi0xLjMNJeLjz...|
|http://geocities....|FACTSHEET2008UTIL...| pdf|application/octet...|application/pdf|7f45c93d16823e852...|b3a2d3b95efd77bd6...|JVBERi0xLjQNJeLjz...|
|http://geocities....|BakweriMarginalis...| pdf|application/octet...|application/pdf|d25863303ba46a872...|bbd6c9bce4c523f0f...|JVBERi0xLjINJeLjz...|
|http://geocities....|McCallaFoodSecuri...| pdf|application/octet...|application/pdf|1291b633f49f7e51d...|622144ed0fd56bae3...|JVBERi0xLjMNJeLjz...|
|http://geocities....|PovertyAndIncome.pdf| pdf|application/octet...|application/pdf|278e1f281905d419d...|9bc00a54147a4b350...|JVBERi0xLjIgDSXi4...|
|http://geocities....| behold.pdf| pdf|application/octet...|application/pdf|9fc1e4e1e0f567477...|63d324984d34eb168...|JVBERi0xLjMKJcfsj...|
|http://geocities....|overcome_the_fear...| pdf|application/octet...|application/pdf|fe3ec0805564cd3fc...|d0d30ba4f7f40434d...|JVBERi0xLjMKJcfsj...|
|http://geocities....| raven.pdf| pdf|application/octet...|application/pdf|acabc7f7dba954f99...|1ddf3e53813a805a1...|JVBERi0xLjMKJcfsj...|
|http://geocities....| sunset.pdf| pdf|application/octet...|application/pdf|1dc037712d47b11d9...|f502ca5cc2de2483b...|JVBERi0xLjMKJcfsj...|
|http://geocities....|night_lasts_less_...| pdf|application/octet...|application/pdf|1cda3dfab3bedaf04...|ad0f6e6fd53e4eb5f...|JVBERi0xLjMKJcfsj...|
|http://geocities....| angel_dust.pdf| pdf|application/octet...|application/pdf|92d14676e34dfcb7e...|1588b870928d56667...|JVBERi0xLjMKJcfsj...|
|http://geocities....| vampire.pdf| pdf|application/octet...|application/pdf|f1730689d52b9524e...|bf377a4e2580b8a29...|JVBERi0xLjMKJcfsj...|
+--------------------+--------------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
only showing top 20 rows
Extract Presentation Program Files Information
Scala RDD
Will not be implemented.
Scala DF
The following script:
import io.archivesunleashed._
import io.archivesunleashed.udfs._
import org.apache.spark.sql.functions.desc
val df = RecordLoader.loadArchives("/path/to/warcs", sc).presentationProgramFiles();
df.select($"url", $"filename", $"extension", $"mime_type_web_server", $"mime_type_tika", $"md5", $"sha1", $"bytes")
.orderBy(desc("md5"))
.show()
Will extract the following information from presentation program files in a web collection:
- file URL
- filename
- extension
- MIME type as identified by the hosting web server
- MIME type as identified by Apache Tika
- MD5 hash
- SHA1 hash
- bytes
+--------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
| url| filename|extension|mime_type_web_server| mime_type_tika| md5| sha1| bytes|
+--------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|http://geocities....| index.pps| pps|application/mspow...|application/vnd.m...|fbaed5a1df163270a...|afa4c82593ea5bfd6...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|MathForEveryoneCa...| ppt|application/mspow...|application/vnd.m...|f5fde5813a5aef2f3...|e791212ac91243f39...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|AD1-GE-Quiz4-Samp...| ppt|application/mspow...|application/vnd.m...|f5824d64bb74b1377...|aaea2a38d11682753...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...| Agrarianism1.ppt| ppt|application/mspow...|application/vnd.m...|f581932d9e4c57dc0...|3fbce2d175be293a8...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| lego.pps| pps|application/mspow...|application/vnd.m...|f0da5c58e7abbf102...|78bc45da68c6784be...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| HPIB.ppt| ppt|application/mspow...|application/vnd.m...|ef09c31bd8079d40b...|875a96d8b8dd3bf18...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|learningdisabilit...| ppt|application/mspow...|application/vnd.m...|e6bb4f98761839a3a...|5a4dcc8bab2ee15f3...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|assessmentsummer.ppt| ppt|application/mspow...|application/vnd.m...|e116a443b9031ec01...|141563f2f32687587...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|CommonlyConfusedW...| ppt|application/mspow...|application/vnd.m...|dde43870e0da8ebf6...|7a94bf766d931a046...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|AD1-Unit5-Achieve...| ppt|application/mspow...|application/vnd.m...|d4530e506c2e41f8f...|6c89c0e3d28ecceed...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...| Schwind.ppt| ppt|application/mspow...|application/vnd.m...|cfdd4bb6e7b04f24a...|9c26a8ac091f88a35...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| cpphtp4_PPT_07.ppt| ppt|application/mspow...|application/vnd.m...|cd98e6e18c3b0ada0...|b3651507f61bafa4d...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| mylife.ppt| ppt|application/mspow...|application/vnd.m...|cb146894f8a544ace...|0129cfdfd2f196346...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| refinterview.ppt| ppt|application/mspow...|application/vnd.m...|ca6fd4ec5fcb8237d...|8312ca4c0dbeb6008...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|MathForEveryoneAl...| ppt|application/mspow...|application/vnd.m...|c887f45fa58f273b0...|b253b732f8502f357...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| ch2-DataTypes.ppt| ppt|application/mspow...|application/vnd.m...|c74caee72b5ee6684...|f3bf878c775e2f72a...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|geographyofnortha...| ppt|application/mspow...|application/vnd.m...|c35b93ac59f2eb5af...|b5de05a856838328c...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| people1st.ppt| ppt|application/mspow...|application/vnd.m...|bf19cdc1ff3ad82fd...|99f14fe81d8a9587f...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| AD1-Reading.ppt| ppt|application/mspow...|application/vnd.m...|be020b4564f972218...|0761a2fd5c176ce1c...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| majalah.ppt| ppt|application/mspow...|application/vnd.m...|b6f219693ef1df49f...|1039013624cf8de35...|0M8R4KGxGuEAAAAAA...|
+--------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 20 rows
import io.archivesunleashed._
import io.archivesunleashed.udfs._
df: org.apache.spark.sql.DataFrame = [url: string, filename: string ... 6 more fields]
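Identical files often appear under several URLs, which the MD5 column makes visible, so you may want to de-duplicate before further analysis. A minimal sketch, assuming the same df as above:
// keep a single row per distinct MD5 hash
df.dropDuplicates("md5")
  .select($"url", $"filename", $"md5")
  .show()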
If you wanted to work with all the presentation program files in a collection, you could extract them with the following script:
import io.archivesunleashed._
import io.archivesunleashed.udfs._
val df = RecordLoader.loadArchives("/path/to/warcs", sc).presentationProgramFiles();
df.select($"bytes", $"extension")
.saveToDisk("bytes", "/path/to/export/directory/your-preferred-filename-prefix", $"extension")
Python DF
The following script:
from aut import *
archive = WebArchive(sc, sqlContext, "/path/to/warcs")
df = archive.presentation_program()
df.show()
Will extract the following information from presentation program files in a web collection:
- file URL
- filename
- extension
- MIME type as identified by the hosting web server
- MIME type as identified by Apache Tika
- MD5 hash
- SHA1 hash
- bytes
+--------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
| url| filename|extension|mime_type_web_server| mime_type_tika| md5| sha1| bytes|
+--------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|http://geocities....| wincvs.ppt| ppt|application/mspow...|application/vnd.m...|52ac23b58493234b2...|a2206af9847cceb06...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| index.pps| pps|application/mspow...|application/vnd.m...|fbaed5a1df163270a...|afa4c82593ea5bfd6...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|MathForEveryoneCa...| ppt|application/mspow...|application/vnd.m...|f5fde5813a5aef2f3...|e791212ac91243f39...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|MathForEveryone7t...| ppt|application/mspow...|application/vnd.m...|9893643e1cb87af0c...|2fa8301893ad21b2b...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|MathForEveryoneGe...| ppt|application/mspow...|application/vnd.m...|2a914a95a61b227dd...|5d783c1beaffc0b57...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|MathForEveryoneAl...| ppt|application/mspow...|application/vnd.m...|c887f45fa58f273b0...|b253b732f8502f357...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|MathForEveryone7t...| ppt|application/mspow...|application/vnd.m...|034906471a0c0b997...|16142a0aa69b2fb1f...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...| tiago.ppt| ppt|application/mspow...|application/vnd.m...|6871786192c187783...|e5a91a65ef9a4bade...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| energypp.ppt| ppt|application/mspow...|application/vnd.m...|94f9384ec57d8849c...|e943c5cf509f8f816...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| lego.pps| pps|application/mspow...|application/vnd.m...|f0da5c58e7abbf102...|78bc45da68c6784be...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| celtiberos.pps| pps|application/mspow...|application/vnd.m...|af897525acd31d359...|9c018a80253c38a57...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| porque.pps| pps|application/mspow...|application/vnd.m...|9c2cba37c64fd0ac8...|6f11733ddec0abc2d...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|SoftHandoffbyPara...| pps|application/mspow...|application/vnd.m...|0c5ef732ea466574f...|dc7dfe545b401aeab...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|A_Land_Remembered...| ppt|application/mspow...|application/vnd.m...|5b7273d03f8490490...|2d8721e7876cb6697...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| DANCE.ppt| ppt|application/mspow...|application/vnd.m...|5aa3308433666a30a...|4a23bd20768501dac...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| unit.ppt| ppt|application/mspow...|application/vnd.m...|6736886864069ee66...|e92031e6e0293cb73...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| majalah.ppt| ppt|application/mspow...|application/vnd.m...|b6f219693ef1df49f...|1039013624cf8de35...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|esos_si_son_probl...| pps|application/mspow...|application/vnd.m...|932221045b6154d7e...|b23a0238c852d28bb...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| refinterview.ppt| ppt|application/mspow...|application/vnd.m...|ca6fd4ec5fcb8237d...|8312ca4c0dbeb6008...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...| Schwind.ppt| ppt|application/mspow...|application/vnd.m...|cfdd4bb6e7b04f24a...|9c26a8ac091f88a35...|0M8R4KGxGuEAAAAAA...|
+--------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 20 rows
Extract Spreadsheet Information
Scala RDD
Will not be implemented.
Scala DF
The following script:
import io.archivesunleashed._
import io.archivesunleashed.udfs._
import org.apache.spark.sql.functions.desc
val df = RecordLoader.loadArchives("/path/to/warcs", sc).spreadsheets();
df.select($"url", $"filename", $"extension", $"mime_type_web_server", $"mime_type_tika", $"md5", $"sha1", $"bytes")
.orderBy(desc("md5"))
.show()
Will extract the following information from spreadsheet files in a web collection:
- file URL
- filename
- extension
- MIME type as identified by the hosting web server
- MIME type as identified by Apache Tika
- MD5 hash
- SHA1 hash
- bytes
+--------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
| url| filename|extension|mime_type_web_server| mime_type_tika| md5| sha1| bytes|
+--------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|http://geocities....| statuscarib.xls| xls|application/vnd.m...|application/vnd.m...|f9fd18b158df52ff2...|0d606f25ac3c9abc4...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| timesheet.xls| xls|application/vnd.m...|application/vnd.m...|f9549db15de69bc21...|e9c239d812705842f...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|statusccusspring0...| xls|application/vnd.m...|application/vnd.m...|ef99704e5a734f386...|f265fc5c581ad1762...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...| Laboratorio_05.xls| xls|application/vnd.m...|application/vnd.m...|eb0e39898ba513234...|976f69da07122d285...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|110_Laboratorio_6...| xls|application/vnd.m...|application/vnd.m...|e5b7fee6d4c45e171...|befd9670be70a4fdb...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| Pakuan.xls| xls|application/vnd.m...|application/vnd.m...|e386f85a7bd74b1ab...|5b2b142de2c57ec68...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|Spring08_statusre...| xls|application/vnd.m...|application/vnd.m...|df2d6792fb55c4e26...|6f4d2aef711aff4e1...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| CTtimetable2.xls| xls|application/vnd.m...|application/vnd.m...|dc987d3e996677ce9...|40bb63a4c0038a6ae...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|MalibuTrailChalle...| xls|application/vnd.m...|application/vnd.m...|dbba76ead82576178...|ffbe099441053b47b...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| CTtimetable.xls| xls|application/vnd.m...|application/vnd.m...|d9ee9117e70df43b5...|596c4c6d5cdc7ddb5...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...| 1071_Parcial_2.xls| xls|application/vnd.m...|application/vnd.m...|d90dc210138676a2c...|6e3ed07f50393815c...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|excelsubtractione...| xls|application/vnd.m...|application/vnd.m...|d6c8314e52f22e4aa...|1b1ebce0f85628921...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|Fall2008_statusre...| xls|application/vnd.m...|application/vnd.m...|cd9974430477b75ce...|0e756bbc38608cb51...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| report01.xls| xls|application/vnd.m...|application/vnd.m...|cd947fe4099df4fe3...|0f11d17d38a72977b...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|TrackRecords20010...| xls|application/vnd.m...|application/vnd.m...|c8aa0122443efa0e5...|fa9cdb4a329f926bf...|0M8R4KGxGuEAAAAAA...|
|http://br.geociti...| mycoinsforswap.xls| xls|application/vnd.m...|application/vnd.m...|c665c83bc2b54292f...|18f1f3a4559d5c40a...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| AAtimetable.xls| xls|application/vnd.m...|application/vnd.m...|c66201762bf5e473e...|4e9bac4f217b0605d...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...| carwashroster.xls| xls|application/vnd.m...|application/vnd.m...|c495d1b7dc954b975...|062167485baf9aa5d...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| RSL_MDP.xls| xls|application/vnd.m...|application/vnd.m...|bf6479bacbb758b52...|4d7ea33849447853d...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| status_report4.xls| xls|application/vnd.m...|application/vnd.m...|bc4d18e022522d185...|fc7b9fc64116c9ad1...|0M8R4KGxGuEAAAAAA...|
+--------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 20 rows
import io.archivesunleashed._
import io.archivesunleashed.udfs._
df: org.apache.spark.sql.DataFrame = [url: string, filename: string ... 6 more fields]
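To get a quick sense of which spreadsheet formats are present, you can tally the extension column. A minimal sketch, assuming the same df as above:
import org.apache.spark.sql.functions.desc

// count spreadsheet records per file extension, most common first
df.groupBy($"extension")
  .count()
  .orderBy(desc("count"))
  .show()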
If you wanted to work with all the spreadsheet files in a collection, you could extract them with the following script:
import io.archivesunleashed._
import io.archivesunleashed.udfs._
val df = RecordLoader.loadArchives("/path/to/warcs", sc).spreadsheets();
df.select($"bytes", $"extension")
.saveToDisk("bytes", "/path/to/export/directory/your-preferred-filename-prefix", $"extension")
Python DF
The following script:
from aut import *
archive = WebArchive(sc, sqlContext, "/path/to/warcs")
df = archive.spreadsheets()
df.show()
Will extract the following information from spreadsheet files in a web collection:
- file URL
- filename
- extension
- MIME type as identified by the hosting web server
- MIME type as identified by Apache Tika
- MD5 hash
- SHA1 hash
- bytes
+--------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
| url| filename|extension|mime_type_web_server| mime_type_tika| md5| sha1| bytes|
+--------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|http://geocities....| tkadrosu.xls| xls|application/vnd.m...|application/vnd.m...|8033532f88da42ad6...|a52b24bc760c5265b...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| cal_counter.xls| xls|application/vnd.m...|application/vnd.m...|56ad6c2f84fdd4a88...|ad0db35f2d7ff2cca...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| CTtimetable2.xls| xls|application/vnd.m...|application/vnd.m...|dc987d3e996677ce9...|40bb63a4c0038a6ae...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| AAtimetable.xls| xls|application/vnd.m...|application/vnd.m...|c66201762bf5e473e...|4e9bac4f217b0605d...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| CTtimetable.xls| xls|application/vnd.m...|application/vnd.m...|d9ee9117e70df43b5...|596c4c6d5cdc7ddb5...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| CTtimetable2.xls| xls|application/vnd.m...|application/vnd.m...|a4ed4330d5c18f1b2...|d8ce479596d49679d...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| teams.xls| xls|application/vnd.m...|application/vnd.m...|334fa42776cef7f81...|aa57fda7fb634c931...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| collection.xls| xls|application/vnd.m...|application/vnd.m...|30d7a67de8150f712...|841ba91f009d48b7a...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|music-collection.xls| xls|application/vnd.m...|application/vnd.m...|4def75fa96bae579d...|090a95923c9599454...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| 020103.xls| xls|application/vnd.m...|application/vnd.m...|48651a7592ca1b0f0...|1e2438c8247d33870...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| 011803.xls| xls|application/vnd.m...|application/vnd.m...|0aab8ed40f91c1c76...|8e02e408fe1ce40b9...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| RSL_TorOton.xls| xls|application/vnd.m...|application/vnd.m...|1d9c13c6407a2b696...|007010ecf5b208453...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| members.xls| xls|application/vnd.m...|application/vnd.m...|b045a6b118981c6eb...|3ae096d6602b7cb36...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...| round309.xls| xls|application/vnd.m...|application/vnd.m...|50bed4b3e9facb278...|f26e0c38082141598...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...| result109.xls| xls|application/vnd.m...|application/vnd.m...|2235d094897f10c3b...|6ed0b65fd43502a2b...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|TrackRecords20010...| xls|application/vnd.m...|application/vnd.m...|c8aa0122443efa0e5...|fa9cdb4a329f926bf...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| Digox.xls| xls|application/vnd.m...|application/vnd.m...|182d08821797269c7...|80e7ce8ecc1ecf389...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| RSL_SemBA.xls| xls|application/vnd.m...|application/vnd.m...|59613700fbf08b795...|44eac99a514141520...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| RSL_MDP.xls| xls|application/vnd.m...|application/vnd.m...|bf6479bacbb758b52...|4d7ea33849447853d...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| RSL_ARG99.xls| xls|application/vnd.m...|application/vnd.m...|a2f2fd063dd5689a7...|61568e0f4139ec568...|0M8R4KGxGuEAAAAAA...|
+--------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 20 rows
Extract Video Information
Scala RDD
Will not be implemented.
Scala DF
The following script:
import io.archivesunleashed._
import io.archivesunleashed.udfs._
import org.apache.spark.sql.functions.desc
val df = RecordLoader.loadArchives("/path/to/warcs", sc).videos();
df.select($"url", $"filename", $"extension", $"mime_type_web_server", $"mime_type_tika", $"md5", $"sha1", $"bytes")
.orderBy(desc("md5"))
.show()
Will extract the following information from videos in a web collection:
- file URL
- filename
- extension
- MIME type as identified by the hosting web server
- MIME type as identified by Apache Tika
- MD5 hash
- SHA1 hash
- bytes
+--------------------+--------------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
| url| filename|extension|mime_type_web_server| mime_type_tika| md5| sha1| bytes|
+--------------------+--------------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
|http://geocities....| videohead.avi| avi| video/x-msvideo|video/x-msvideo|fa9852748ba7b4829...|0be56f200f8e1cb83...|UklGRjoMIQBBVkkgT...|
|http://www.geocit...| HandWrap2.avi| avi| video/x-msvideo|video/x-msvideo|f680cb463e7cb291e...|1d2ea1df3f5af2599...|UklGRrBrAgBBVkkgT...|
|http://geocities....| 1kungfu.avi| avi| video/x-msvideo|video/x-msvideo|f4429277ed4b48efb...|5c542e8990efd484b...|UklGRkoSFwBBVkkgT...|
|http://geocities....| Vol_III_sample.mpg| mpg| video/mpeg| video/mpeg|f2bc34f7294edc376...|a939dc619c123f81b...|AAABuiEAAdLxgA7xA...|
|http://geocities....| wherego.avi| avi| video/x-msvideo|video/x-msvideo|f23976ddeb6f08810...|714a9a548f9b2a156...|UklGRkq4HgBBVkkgT...|
|http://geocities....| couch100k.wmv| asf| video/x-ms-wmv| video/x-ms-asf|ee316d5871acb7859...|0593ebb8e450a6c3e...|MCaydY5mzxGm2QCqA...|
|http://geocities....| Mitwa_Lagaan.mp3| qt| audio/mpeg|video/quicktime|ebc5db8d30edd0135...|d3ebdd6da2c732481...|AAAE6W1vb3YAAABsb...|
|http://www.geocit...| tydunking.mpg| mpg| video/mpeg| video/mpeg|eaa0d14dc05bdab98...|05d4ff2301d2d3818...|AAABuiEAAQALgBcdA...|
|http://geocities....| bigjleroy.avi| avi| video/x-msvideo|video/x-msvideo|e93538f0d76b86cca...|ebeb89fc2fa8f7cd6...|UklGRrjUCgBBVkkgT...|
|http://geocities....| NollieBs180.mov| mov| video/quicktime|video/quicktime|e7b97c287329340d5...|138fb8b0dea4c8e16...|AAAGwm1vb3YAAABsb...|
|http://www.geocit...| shirt.avi| avi| video/x-msvideo|video/x-msvideo|e36119d3c78225cbf...|11af72475ca754639...|UklGRvhdHQBBVkkgT...|
|http://geocities....| atdawn.wma| asf| audio/x-ms-wma| video/x-ms-asf|e1a85a79ea3ba5d96...|1be05aecdff99298c...|MCaydY5mzxGm2QCqA...|
|http://www.geocit...|non_will_go_to_wa...| mov| video/quicktime|video/quicktime|de6cc975363c4076b...|0c00d0be9c89f9e97...|AAs4JW1kYXQAAA70A...|
|http://geocities....| Movies_20.mpeg| mpeg| video/mpeg| video/mpeg|dd9d2af0c1318b5ff...|9d06f09744fe93408...|AAABuiEAV+PlgAU7A...|
|http://www.geocit...| artilery.mpg| mpg| video/mpeg| video/mpeg|dcecbdfe46448bffb...|0b292aab1078d9bfa...|AAABswsAkBP//+CkA...|
|http://geocities....| tancfigurakbbpl.wmv| wmv| video/x-ms-wmv| video/x-ms-wmv|dca4991392572dbc0...|cb349bdc35484d976...|MCaydY5mzxGm2QCqA...|
|http://www.geocit...| Trevi2.mov| mov| video/quicktime|video/quicktime|dc882205f5cae38f5...|c9dd804e1ee140221...|AAAEvG1vb3YAAAS0Y...|
|http://www.geocit...|skillful_driving_...| mpg| video/mpeg| video/mpeg|db8a767b00884e426...|f5a70cf5f091b530f...|AAABuiEAAQABgAORA...|
|http://geocities....| jeremy100k.wmv| asf| video/x-ms-wmv| video/x-ms-asf|dafba744438ae0110...|d3a217ce25507ae90...|MCaydY5mzxGm2QCqA...|
|http://www.geocit...| mbrl2.mpg| mpg| video/mpeg| video/mpeg|d8eb5a12f0da99ca0...|8686002a444cc9dce...|AAABswsAkBP//+CkA...|
+--------------------+--------------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
only showing top 20 rows
import io.archivesunleashed._
import io.archivesunleashed.udfs._
df: org.apache.spark.sql.DataFrame = [url: string, filename: string ... 6 more fields]
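As the sample output shows, the MIME type reported by the web server does not always agree with what Tika detects (for example, .wmv files served as video/x-ms-wmv but detected as video/x-ms-asf). A minimal sketch, assuming the same df as above, that surfaces those disagreements:
// show records where the server-reported and Tika-detected MIME types differ
df.filter($"mime_type_web_server" =!= $"mime_type_tika")
  .select($"url", $"filename", $"mime_type_web_server", $"mime_type_tika")
  .show()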
If you wanted to work with all the video files in a collection, you could extract them with the following script:
import io.archivesunleashed._
import io.archivesunleashed.udfs._
val df = RecordLoader.loadArchives("/path/to/warcs", sc).videos();
df.select($"bytes", $"extension")
.saveToDisk("bytes", "/path/to/export/directory/your-preferred-filename-prefix", $"extension")
Python DF
The following script:
from aut import *
archive = WebArchive(sc, sqlContext, "/path/to/warcs")
df = archive.video()
df.show()
Will extract the following information from videos in a web collection:
- file URL
- filename
- extension
- MIME type as identified by the hosting web server
- MIME type as identified by Apache Tika
- MD5 hash
- SHA1 hash
- bytes
+--------------------+--------------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
| url| filename|extension|mime_type_web_server| mime_type_tika| md5| sha1| bytes|
+--------------------+--------------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
|http://www.geocit...|Sea_Dawgs_2008_Ha...| wmv| video/x-ms-wmv| video/x-ms-wmv|7b35e4cf60a3cfa67...|b35ad7242e8135326...|MCaydY5mzxGm2QCqA...|
|http://www.geocit...| Excedrine.wmv| wmv| video/x-ms-wmv| video/x-ms-wmv|0aaf1d81ab6f2b354...|0b52af5f5facfd30f...|MCaydY5mzxGm2QCqA...|
|http://geocities....| homework.avi| avi| video/x-msvideo|video/x-msvideo|4e06cbd11764cd2ac...|770a8849375965b20...|UklGRsrLAgBBVkkgT...|
|http://geocities....| macarenababy.avi| avi| video/x-msvideo|video/x-msvideo|600084bbd732c0fda...|f99b2e31374d4ea18...|UklGRrC0AwBBVkkgT...|
|http://geocities....|orlando_viggokiss...| wmv| video/x-ms-asf| video/x-ms-wmv|79d093eb6184dba74...|395eaf6dcb29a66d2...|MCaydY5mzxGm2QCqA...|
|http://www.geocit...|skillful_driving_...| mpg| video/mpeg| video/mpeg|db8a767b00884e426...|f5a70cf5f091b530f...|AAABuiEAAQABgAORA...|
|http://www.geocit...|gray_havens_2.35.MPG| mpg| video/mpeg| video/mpeg|af71353d69af0b42f...|f3625b897339b0f23...|AAABuiEAAQABgAORA...|
|http://geocities....| movie2.mpeg| mpeg| video/mpeg| video/mpeg|3f6c7c48d2a990cf2...|760e6752bfd9e8a84...|AAABsxQA8MMCcSClE...|
|http://www.geocit...| Sequence1.mov| mov| video/quicktime|video/quicktime|931fc4dee8aa260f9...|5a5cf58e2a50cf942...|AAELrG1vb3YAAABsb...|
|http://www.geocit...| santa.mov| mov| video/quicktime|video/quicktime|8b9b98d0c567c4381...|49f49dd23c3bad61b...|AAAAIGZ0eXBxdCAgI...|
|http://geocities....| 0602-2.avi| avi| video/x-msvideo|video/x-msvideo|92d04dbe7f1bdc109...|65ed7327aece11bac...|UklGRkauOABBVkkgT...|
|http://geocities....| movie.mpg| mpg| video/mpeg| video/mpeg|a0e86539e5eb9bd35...|82eb4680a9f65ed1b...|AAABuiEAAdLxgASfA...|
|http://geocities....| misshawaii.mpeg| mpeg| video/mpeg| video/mpeg|45cbfc4d03547861b...|44c93f871ea602112...|AAABuiEAAQABgAORA...|
|http://geocities....| Explosions.wmv| wmv| video/x-ms-wmv| video/x-ms-wmv|22cb24bffbd7eabf9...|a44d261ef5d7e7993...|MCaydY5mzxGm2QCqA...|
|http://geocities....| couch100k.wmv| asf| video/x-ms-wmv| video/x-ms-asf|ee316d5871acb7859...|0593ebb8e450a6c3e...|MCaydY5mzxGm2QCqA...|
|http://geocities....| jeremy100k.wmv| asf| video/x-ms-wmv| video/x-ms-asf|dafba744438ae0110...|d3a217ce25507ae90...|MCaydY5mzxGm2QCqA...|
|http://geocities....| jedi_wade.mov| mov| video/quicktime|video/quicktime|674688fd09bf18d29...|cd21c3a5b9e2f18b6...|AAAFB21vb3YAAAT/Y...|
|http://geocities....|ylagallinanonosga...| asf| audio/x-ms-wma| video/x-ms-asf|9aac473134d7f2e7a...|3af7fbab238772f48...|MCaydY5mzxGm2QCqA...|
|http://geocities....|Chris-5050NollieS...| mov| video/quicktime|video/quicktime|93aa2ce07e01f90ad...|f066f29e5faf0cee1...|AAAHRG1vb3YAAABsb...|
|http://geocities....| floursack_jump2.avi| avi| video/x-msvideo|video/x-msvideo|a922441c0a7f0018d...|b82ca6fe1d46e16dc...|UklGRgjlAwBBVkkgT...|
+--------------------+--------------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
only showing top 20 rows
Extract Word Processor Files Information
Scala RDD
Will not be implemented.
Scala DF
The following script:
import io.archivesunleashed._
import io.archivesunleashed.udfs._
import org.apache.spark.sql.functions.desc
val df = RecordLoader.loadArchives("/path/to/warcs", sc).wordProcessorFiles();
df.select($"url", $"filename", $"extension", $"mime_type_web_server", $"mime_type_tika", $"md5", $"sha1", $"bytes")
.orderBy(desc("md5"))
.show()
Will extract the following information from word processor files in a web collection:
- file URL
- filename
- extension
- MIME type as identified by the hosting web server
- MIME type as identified by Apache Tika
- MD5 hash
- SHA1 hash
- bytes
+--------------------+--------------------+---------+--------------------+------------------+--------------------+--------------------+--------------------+
| url| filename|extension|mime_type_web_server| mime_type_tika| md5| sha1| bytes|
+--------------------+--------------------+---------+--------------------+------------------+--------------------+--------------------+--------------------+
|http://geocities....|infiniteproducts.doc| doc| application/msword|application/msword|ffa1ea83af6cb9508...|7a3ae86a7a22d2682...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| Everything.doc| doc| application/msword|application/msword|ff7216edf86fe196c...|082a889c27640fc9a...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| survey.doc| doc| application/msword|application/msword|ff48df5e64bd5adeb...|383ab6ead48795ff3...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| iepWrkshpFall01.doc| doc| application/msword|application/msword|ff421feb87b826d39...|ec60a48d393642629...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|24_reproduction_s...| doc| application/msword|application/msword|fec21eb30fac4588e...|36b41ba66801b10b9...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|Descendit_ad_Infe...| doc| application/msword|application/msword|fe66eeb7c04942c8b...|14f207787abef983e...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|Anthropology21FEB...| doc| application/msword|application/msword|fe079d498bd5e91f2...|ca54e6be7c0618ecc...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| Senses.doc| doc| application/msword|application/msword|fdf881ef998c227f7...|04d6e72132537053a...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|hopewel-loudon-cl...| doc| application/msword|application/msword|fddffbabcaf1976c9...|b7ade5d661dd597a1...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| billmprev9899.doc| doc| application/msword|application/msword|fdcc8b65cfb0a18c9...|602f323278c9fb726...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|approachesProject...| doc| application/msword|application/msword|fd4df7f89efe9cea7...|4e7be7664bfe992f3...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| batayan.doc| doc| application/msword|application/msword|fc6f45fdfce72d4a3...|e614c9b9e95d64aa6...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| VisitUnitPacket.doc| doc| application/msword|application/msword|fc2a0e45b627c3d4a...|dc7ba874b7b13d548...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|vc3c3ppstudyguide...| doc| application/msword|application/msword|fc293bbddb906615f...|538aa0d5e2f554258...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|30_chordates_fish...| doc| application/msword|application/msword|fc053770a82822f69...|9df86863983889373...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|c6artposterexampl...| doc| application/msword|application/msword|fbe2427b48f32d1d9...|47de792202dc3a059...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...| kun20509.doc| doc| application/msword|application/msword|fb8d1ae5e3db45131...|6b13d73759a956e62...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...| kun20509| doc| application/msword|application/msword|fb8d1ae5e3db45131...|6b13d73759a956e62...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...| Fishing.doc| doc| application/msword|application/msword|fb7df7ac80aa2cc8a...|eb4bb266226349bac...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| resumedoAw.doc| doc| application/msword|application/msword|fb6d5bf501b9b97b3...|1e0d6500192d4ee21...|0M8R4KGxGuEAAAAAA...|
+--------------------+--------------------+---------+--------------------+------------------+--------------------+--------------------+--------------------+
only showing top 20 rows
import io.archivesunleashed._
import io.archivesunleashed.udfs._
df: org.apache.spark.sql.DataFrame = [url: string, filename: string ... 6 more fields]
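You can also narrow the result by filename before exporting or reading further. A minimal sketch, assuming the same df as above, that keeps only documents whose filename contains a keyword (the keyword "resume" is just an example):
import org.apache.spark.sql.functions.lower

// case-insensitive filename keyword filter
df.filter(lower($"filename").contains("resume"))
  .select($"url", $"filename", $"md5")
  .show()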
If you wanted to work with all the word processor files in a collection, you could extract them with the following script:
import io.archivesunleashed._
import io.archivesunleashed.udfs._
val df = RecordLoader.loadArchives("/path/to/warcs", sc).wordProcessorFiles();
df.select($"bytes", $"extension")
.saveToDisk("bytes", "/path/to/export/directory/your-preferred-filename-prefix", $"extension")
Python DF
The following script:
from aut import *
archive = WebArchive(sc, sqlContext, "/path/to/warcs")
df = archive.word_processor()
df.show()
Will extract the following information from word processor files in a web collection:
- file URL
- filename
- extension
- MIME type as identified by the hosting web server
- MIME type as identified by Apache Tika
- MD5 hash
- SHA1 hash
- bytes
+--------------------+--------------------+---------+--------------------+------------------+--------------------+--------------------+--------------------+
| url| filename|extension|mime_type_web_server| mime_type_tika| md5| sha1| bytes|
+--------------------+--------------------+---------+--------------------+------------------+--------------------+--------------------+--------------------+
|http://geocities....| Doc2.doc| doc| application/msword|application/msword|09159efbefff59f64...|5412d6c55c2c8bec7...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| CV-ITjob.doc| doc| application/msword|application/msword|7f2b7540e558de24e...|96a6ece7202ab309b...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| CV-Teach.doc| doc| application/msword|application/msword|637bb22eff4bc5be5...|76130b6ffeac5c678...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| CV-covlet.doc| doc| application/msword|application/msword|466c06bfa5a47d5cb...|dc763126cbdb589eb...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| CV-extra.doc| doc| application/msword|application/msword|ab0fa931229c02a4b...|4c2a8200e6eaaafb2...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|020410Indonesia_N...| doc| application/msword|application/msword|b195e90841347be61...|6d2845902ad15a9a2...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| Chapter1.doc| doc| application/msword|application/msword|65383c8c0cf5b6a4f...|fcf3008e9478b773c...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|CathyKoning_resum...| doc| application/msword|application/msword|924ad3f9f66d3c6bd...|2d0887c93ffd3e78b...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| Greek_colonels.doc| doc| application/msword|application/msword|ee4b9db827086d0db...|94e5569e064195db5...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| resume2.doc| doc| application/msword|application/msword|c39fa601733093268...|108563de6ba6102a5...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| eta_writeup.doc| doc| application/msword|application/msword|661328d76ce3aa340...|debadb248da4dfbd3...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|Before_Night_Fall...| doc| application/msword|application/msword|a40371b35b4bf0838...|8f1dba8a46ea297b8...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|Membership_Form_2...| doc| application/msword|application/msword|bf3a3b8cc86b371c3...|472810e93a2245fb1...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| walkthroughff1.doc| doc| application/msword|application/msword|c97de6941c3fb4aed...|16851a5445bdce07d...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...| Encyclopedia.doc| doc| application/msword|application/msword|26a94e8f3358c878c...|07f9b2ce6342f73bc...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| Y.Kurulu.doc| doc| application/msword|application/msword|8e0ebe7c4f27b1841...|ebb5ce328f717f8e6...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...| fifty_eggs.doc| doc| application/msword|application/msword|2c1cdd4f75030650e...|d022311b2fc399750...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| 1pitagoras2.doc| doc| application/msword|application/msword|e07ff47cb8ebc4356...|97d46d781458f5a82...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| constitution.doc| doc| application/msword|application/msword|e38dc3e5d553d8799...|d50096b5208146ce9...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| feasibility.doc| doc| application/msword|application/msword|5574bf82d65935191...|53de74880c9ea2e2b...|0M8R4KGxGuEAAAAAA...|
+--------------------+--------------------+---------+--------------------+------------------+--------------------+--------------------+--------------------+
only showing top 20 rows