Binary Analysis · Archives Unleashed Toolkit

Extract Audio Information

Scala RDD

Will not be implemented.

Scala DF

The following script:

import io.archivesunleashed._
import io.archivesunleashed.udfs._

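// Load the archives and keep only the audio records.
val df = RecordLoader.loadArchives("/path/to/warcs", sc).audio()

// Show the metadata columns, sorted by MD5 hash in descending order.
df.select($"url", $"filename", $"extension", $"mime_type_web_server", $"mime_type_tika", $"md5", $"sha1", $"bytes")
  .orderBy(desc("md5"))
  .show()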

This extracts the following information from the audio files in a web collection:

audio URL
filename
extension
MIME type as identified by the hosting web server
MIME type as identified by Apache Tika
MD5 hash
SHA1 hash
bytes (the file content, shown Base64-encoded)

+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|                 url|            filename|extension|mime_type_web_server|mime_type_tika|                 md5|                sha1|               bytes|
+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|http://geocities....|    capasoligero.mp3|      mp3|          audio/mpeg|    audio/mpeg|fffd1aa802392be0f...|88e254b4cab7848a9...|//MozAAAAAAAAAAAA...|
|http://www.geocit...|        colorwnd.mid|      mid|          audio/midi|    audio/midi|fff3f4e8a473f7c9a...|aea92a6f32dd1a1f4...|TVRoZAAAAAYAAQAGA...|
|http://geocities....|santana_rob_thoma...|      mid|          audio/midi|    audio/midi|ffd4a24d4e4722d94...|28576c271898a1de5...|TVRoZAAAAAYAAQASA...|
|http://geocities....|           music.mid|      mid|          audio/midi|    audio/midi|ffcbe35e28b553481...|cf1ebdbe1a070d4f6...|TVRoZAAAAAYAAAABA...|
|http://geocities....|        evrythng.mid|      mid|          audio/midi|    audio/midi|ff751c86728ff09b5...|d22fc0911d3ceb17a...|TVRoZAAAAAYAAAABA...|
|http://geocities....|        evrythn2.mid|      mid|          audio/midi|    audio/midi|ff751c86728ff09b5...|d22fc0911d3ceb17a...|TVRoZAAAAAYAAAABA...|
|http://geocities....|          picket.mid|      mid|          audio/midi|    audio/midi|ff4d225a602630584...|ecef0a851cc028853...|TVRoZAAAAAYAAQAHA...|
|http://geocities....|        simpsons.mid|      mid|          audio/midi|    audio/midi|ff3bc375860979f2f...|9c1204dad686ddeea...|TVRoZAAAAAYAAQAPA...|
|http://www.geocit...|        simpsons.mid|      mid|          audio/midi|    audio/midi|ff3bc375860979f2f...|9c1204dad686ddeea...|TVRoZAAAAAYAAQAPA...|
|http://geocities....|        mypretty.wav|      wav|         audio/x-wav|audio/vnd.wave|ff1a5015d3a380955...|113de5c1bb2f7ddb4...|UklGRvz8AABXQVZFZ...|
|http://geocities....|          song37.mid|      mid|          audio/midi|    audio/midi|fee0a67ff7c71e35c...|ccd4fdfa0483d1058...|TVRoZAAAAAYAAAABA...|
|http://geocities....|    holdyourhand.mid|      mid|          audio/midi|    audio/midi|fed14ecd7099e3fb9...|24fe5c097db5d506a...|TVRoZAAAAAYAAQANA...|
|http://geocities....|    es_tu_sangre.mid|      mid|          audio/midi|    audio/midi|fec196e8086d868f2...|eccb1551d1e7b236e...|TVRoZAAAAAYAAQASA...|
|http://www.geocit...|          virgin.mid|      mid|          audio/midi|    audio/midi|fec0ce795723b1287...|cc651312b1d57fe64...|TVRoZAAAAAYAAQAMA...|
|http://www.geocit...|tonibraxtonunbrea...|      wav|         audio/x-wav|audio/vnd.wave|feb7e31a8edb0a484...|9420bdeece0f23b78...|UklGRtQoCgBXQVZFZ...|
|http://geocities....|      comeandsee.mid|      mid|          audio/midi|    audio/midi|feb513cd7b6fab9cc...|51b4c2bb113cb43aa...|TVRoZAAAAAYAAAABA...|
|http://geocities....|        song186t.mid|      mid|          audio/midi|    audio/midi|fead61a5a439675a3...|c652eda8a4ec5d197...|TVRoZAAAAAYAAAABA...|
|http://geocities....|    be_magnified.mid|      mid|          audio/midi|    audio/midi|feac0e996e1555d84...|f51ec1e62a166fa82...|TVRoZAAAAAYAAQAPA...|
|http://geocities....|        EVERYBOD.MID|      mid|          audio/midi|    audio/midi|fea911b19f0cf709d...|58bcd1b3c0288cbe0...|TVRoZAAAAAYAAQAUA...|
|http://www.geocit...|        ff9waltz.mid|      mid|          audio/midi|    audio/midi|fe9eb1ea6d4b53a9f...|72e2467bfea6240b8...|TVRoZAAAAAYAAQAKA...|
+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
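
Because the output is sorted by hash, repeated content is easy to spot: evrythng.mid and evrythn2.mid show the same md5 value, as do the two simpsons.mid rows. As a minimal sketch in plain Python (independent of the toolkit), rows could be grouped by hash to list files that appear to share content. The rows are copied by hand from the table above with the hashes still truncated as displayed, and `group_by_md5` is a hypothetical helper name.

```python
from collections import defaultdict

# (filename, truncated md5) pairs copied from the table above.
rows = [
    ("evrythng.mid", "ff751c86728ff09b5..."),
    ("evrythn2.mid", "ff751c86728ff09b5..."),
    ("picket.mid", "ff4d225a602630584..."),
    ("simpsons.mid", "ff3bc375860979f2f..."),
    ("simpsons.mid", "ff3bc375860979f2f..."),
]

def group_by_md5(rows):
    """Map each md5 value to its filenames, keeping only hashes seen more than once."""
    groups = defaultdict(list)
    for filename, md5 in rows:
        groups[md5].append(filename)
    return {md5: names for md5, names in groups.items() if len(names) > 1}

duplicates = group_by_md5(rows)
for md5, names in duplicates.items():
    print(md5, names)
```

Since the displayed hashes are truncated, a match here is only suggestive; the same grouping applied to the full md5 values would be conclusive.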

To work with all the audio files in a collection, you can extract them to disk with the following script:

import io.archivesunleashed._
import io.archivesunleashed.udfs._

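// Load the archives and keep only the audio records.
val df = RecordLoader.loadArchives("/path/to/warcs", sc).audio()

// Write each record's bytes to disk, using the given filename prefix and the extension column.
df.select($"bytes", $"extension")
  .saveToDisk("bytes", "/path/to/export/directory/your-preferred-filename-prefix", $"extension")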
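
To make concrete what exporting the bytes column involves, here is a minimal Python sketch that does not use the toolkit. It assumes the bytes column holds Base64 text (as the sample output suggests) and names each file `<prefix>-<md5>.<extension>`. That naming scheme, the helper name `save_record`, and the sample payload are assumptions for illustration, not a description of how saveToDisk is implemented.

```python
import base64
import hashlib
import os
import tempfile

def save_record(b64_bytes, extension, prefix):
    """Decode one Base64 payload and write it to '<prefix>-<md5>.<extension>'."""
    data = base64.b64decode(b64_bytes)
    md5 = hashlib.md5(data).hexdigest()
    path = f"{prefix}-{md5}.{extension}"
    with open(path, "wb") as handle:
        handle.write(data)
    return path

# Hypothetical payload: a 14-byte MIDI header chunk, not taken from a real archive.
payload = base64.b64encode(b"MThd\x00\x00\x00\x06\x00\x01\x00\x06\x00\x60").decode("ascii")

# Write into a temporary directory so the sketch leaves no files in the working directory.
export_dir = tempfile.mkdtemp()
written = save_record(payload, "mid", os.path.join(export_dir, "audio"))
print(written)
```

Including the hash in the filename keeps two different files from overwriting each other when they share a prefix and extension.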

Python DF

The following script:

from aut import *

# Load the archives.
archive = WebArchive(sc, sqlContext, "/path/to/warcs")

# Keep only the audio records and display the first 20 rows.
df = archive.audio()
df.show()

This extracts the following information from the audio files in a web collection:

audio URL
filename
extension
MIME type as identified by the hosting web server
MIME type as identified by Apache Tika
MD5 hash
SHA1 hash
bytes (the file content, shown Base64-encoded)

+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|                 url|            filename|extension|mime_type_web_server|mime_type_tika|                 md5|                sha1|               bytes|
+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|http://www.geocit...|        hc-tibet.wav|      wav|         audio/x-wav|audio/vnd.wave|416ad26133f63dc3e...|dfb764d759187d102...|UklGRg6eAABXQVZFZ...|
|http://geocities....|bookmarkthissite.wav|      wav|         audio/x-wav|audio/vnd.wave|7897ff71780a903ca...|cfb942aeb3bc881cd...|UklGRppkAABXQVZFZ...|
|http://geocities....|   NeilYoung-Hey.mp3|      mp3|          audio/mpeg|    audio/mpeg|40869eb3181e6035b...|19fa693521cd8125c...|//uQRAAAAcAAsNUEA...|
|http://geocities....|          misty1.mp3|      mp3|          audio/mpeg|    audio/mpeg|d8cb3ce54072a7d4b...|43b92e16932c13a43...|//uQBAAAAsJl22mBE...|
|http://geocities....|            sale.mid|      mid|          audio/midi|    audio/midi|5dfc0c3dd884e50c7...|071840b4822ae5e80...|TVRoZAAAAAYAAQALA...|
|http://geocities....|        swaplink.mid|      mid|          audio/midi|    audio/midi|f32117ce2bffa9902...|0346223861c87acc1...|TVRoZAAAAAYAAQALA...|
|http://geocities....|              m5.mid|      mid|          audio/midi|    audio/midi|7e5eedebafecd26c4...|393dfbc00c49fcdc9...|TVRoZAAAAAYAAQAJA...|
|http://geocities....|          morder.mid|      mid|          audio/midi|    audio/midi|6cec0785377f5bbaf...|a94f0a75c0c3b3cf5...|TVRoZAAAAAYAAQAMA...|
|http://geocities....|              m2.mid|      mid|          audio/midi|    audio/midi|58b0102f997e689a2...|51ad469ebc931e160...|TVRoZAAAAAYAAQALA...|
|http://geocities....|           music.mid|      mid|          audio/midi|    audio/midi|7917a5a9d6ddfb8dd...|009db9df73cdf5247...|TVRoZAAAAAYAAQALA...|
|http://www.geocit...|        hcpopeye.wav|      wav|         audio/x-wav|audio/vnd.wave|04d7b45c70e0a496e...|9db0e61c16554af88...|UklGRrbAAABXQVZFZ...|
|http://geocities....|              m7.mid|      mid|          audio/midi|    audio/midi|3906ecaba32ba15a8...|e0d6e9f1c86b6204e...|TVRoZAAAAAYAAQAHA...|
|http://geocities....|           words.mid|      mid|          audio/midi|    audio/midi|30da01a4ed42ae469...|160b2e5aaa9b95641...|TVRoZAAAAAYAAQAIA...|
|http://geocities....|          brock5.mp3|      mp3|          audio/mpeg|    audio/mpeg|17f4e1c7a007983a5...|3bbdb27fafa4e8b12...|//MozAANkCLE/gjGA...|
|http://geocities....|          brock1.mp3|      mp3|          audio/mpeg|    audio/mpeg|67db65825afc326ed...|2ec4ac110cff19134...|//MozAAMyX7VmBjGl...|
|http://geocities....|       funkytown.wav|      wav|         audio/x-wav|audio/vnd.wave|6f841bcffe4bbb61d...|ab1fdb143d5752cf1...|UklGRlLOCQBXQVZFZ...|
|http://geocities....|  welcomemyworld.mid|      mid|          audio/midi|    audio/midi|c546eac675e2dd974...|cb4f1fa32aa1e3205...|TVRoZAAAAAYAAQAMA...|
|http://www.geocit...|        irisheye.mid|      mid|          audio/midi|    audio/midi|d906f32953742fdef...|f3ca7449483b0ea65...|TVRoZAAAAAYAAQAFA...|
|http://geocities....|       mission21.mid|      mid|          audio/midi|    audio/midi|c507304afe6cddba1...|72a74c1914044746f...|TVRoZAAAAAYAAQAVA...|
|http://geocities....|         tellit1.mid|      mid|          audio/midi|    audio/midi|a604ae85251d55504...|95096668900a76dc8...|TVRoZAAAAAYAAQAQA...|
+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
only showing top 20 rows
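
The two MIME type columns need not agree: the WAV rows above report audio/x-wav from the web server but audio/vnd.wave from Tika. A minimal sketch in plain Python of how such rows could be singled out follows. The rows are copied by hand from the table above, and `mime_mismatches` is a hypothetical helper name.

```python
# (filename, mime_type_web_server, mime_type_tika) copied from the table above.
rows = [
    ("hc-tibet.wav", "audio/x-wav", "audio/vnd.wave"),
    ("NeilYoung-Hey.mp3", "audio/mpeg", "audio/mpeg"),
    ("sale.mid", "audio/midi", "audio/midi"),
    ("funkytown.wav", "audio/x-wav", "audio/vnd.wave"),
]

def mime_mismatches(rows):
    """Return the rows whose server-reported and Tika-detected MIME types differ."""
    return [row for row in rows if row[1] != row[2]]

mismatched = mime_mismatches(rows)
for filename, server, tika in mismatched:
    print(f"{filename}: server={server} tika={tika}")
```

A mismatch is not necessarily an error. Here both values denote WAV audio, and the two sources simply use different names for it.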

Extract PDF Information

Scala RDD

Will not be implemented.

Scala DF

The following script:

import io.archivesunleashed._
import io.archivesunleashed.udfs._

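// Load the archives and keep only the PDF records.
val df = RecordLoader.loadArchives("/path/to/warcs", sc).pdfs()

// Show the metadata columns, sorted by MD5 hash in descending order.
df.select($"url", $"filename", $"extension", $"mime_type_web_server", $"mime_type_tika", $"md5", $"sha1", $"bytes")
  .orderBy(desc("md5"))
  .show()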

This extracts the following information from the PDF files in a web collection:

file URL
filename
extension
MIME type as identified by the hosting web server
MIME type as identified by Apache Tika
MD5 hash
SHA1 hash
bytes (the file content, shown Base64-encoded)

+--------------------+--------------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
|                 url|            filename|extension|mime_type_web_server| mime_type_tika|                 md5|                sha1|               bytes|
+--------------------+--------------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
|http://geocities....|adicec_sopar_2009...|      pdf|application/octet...|application/pdf|ffc2ccc373b8ffd39...|3831b0f228af1701e...|JVBERi0xLjMNJeLjz...|
|http://www.geocit...|      IntSt_2301.pdf|      pdf|application/octet...|application/pdf|ffa638c418dac2e19...|84dbaccde1ace4b24...|JVBERi0xLjQNJeLjz...|
|http://www.geocit...|            lotg.pdf|      pdf|application/octet...|application/pdf|ff871ef64d3739b03...|95a777f0b4c7703c6...|JVBERi0xLjINJeLjz...|
|http://geocities....|            ebad.pdf|      pdf|application/octet...|application/pdf|fe8feece5d08dc2ce...|0c01cc31b40a286da...|JVBERi0xLjMNJeLjz...|
|http://geocities....|      regulament.pdf|      pdf|application/octet...|application/pdf|fe8018451633fd76c...|9c7cc720e29cad6e8...|JVBERi0xLjMKJcfsj...|
|http://geocities....|dmatias_letterfor...|      pdf|application/octet...|application/pdf|fe7dbc89e664ba790...|dbe965e7a288cce59...|JVBERi0xLjYNJeLjz...|
|http://geocities....|overcome_the_fear...|      pdf|application/octet...|application/pdf|fe3ec0805564cd3fc...|d0d30ba4f7f40434d...|JVBERi0xLjMKJcfsj...|
|http://geocities....|       CIM_marks.pdf|      pdf|application/octet...|application/pdf|fe1622ac08b47cf60...|b97b57b3c77887324...|JVBERi0xLjMKJcTl8...|
|http://geocities....|           board.PDF|      pdf|application/octet...|application/pdf|fd969b57508d3b135...|fc121c07fefbb722b...|JVBERi0xLjIgDQol4...|
|http://geocities....|          cowell.pdf|      pdf|application/octet...|application/pdf|fbacc01cbe01aa0b4...|f9e9eba1b281ad800...|JVBERi0xLjMKJeLjz...|
|http://geocities....|        gdbrasil.pdf|      pdf|application/octet...|application/pdf|fadc9b9b2408a1112...|247671acb971ddc21...|JVBERi0xLjQNJeLjz...|
|http://www.geocit...|         EBOrder.pdf|      pdf|application/octet...|application/pdf|fa4a83d96441324b3...|5f6870832d035a5a9...|JVBERi0xLjINJeLjz...|
|http://geocities....|        butlleta.pdf|      pdf|application/octet...|application/pdf|fa13dfbf62acb5083...|9a8ec0c0e8a190f46...|JVBERi0xLjQNJeLjz...|
|http://www.geocit...|ALABAMAUNDERWOODM...|      pdf|application/octet...|application/pdf|f9791c7df35d9092a...|3e4c0ca1031152d24...|JVBERi0xLjIgDQol4...|
|http://geocities....|         chimera.pdf|      pdf|application/octet...|application/pdf|f92a40f58cffcdc8e...|ba038d0146b0171f2...|JVBERi0xLjMKJcfsj...|
|http://geocities....|          icarus.pdf|      pdf|application/octet...|application/pdf|f8da963b714e684b3...|4444f5a12c9dbb1df...|JVBERi0xLjMKJcfsj...|
|http://geocities....|2008_ClubFinances...|      pdf|application/octet...|application/pdf|f878c0373edbc89f9...|700393c7b6aaf93df...|JVBERi0xLjQNJeLjz...|
|http://geocities....|  WILLOWSTScene5.pdf|      pdf|application/octet...|application/pdf|f84fc521602fdf163...|5f03b19201536cbc8...|JVBERi0xLjQKJcfsj...|
|http://geocities....|        isrherb2.pdf|      pdf|application/octet...|application/pdf|f83390642e9fe6313...|60befa2b5913bb19d...|JVBERi0xLjMNJeLjz...|
|http://geocities....|            joel.pdf|      pdf|application/octet...|application/pdf|f828e4b447c085fdd...|2e3308c1a52f2f75a...|JVBERi0xLjQKJcOkw...|
+--------------------+--------------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
only showing top 20 rows
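
Every bytes value above begins with JVBERi0x, which is the Base64 encoding of the PDF header "%PDF-1". As a minimal sketch in plain Python, the PDF version could be read from that header even when only the truncated value is available. The sample strings are copied from the table above with the trailing "..." removed, and `pdf_version` is a hypothetical helper name.

```python
import base64
import re

def pdf_version(b64_prefix):
    """Return the version from a '%PDF-x.y' header, or None if no header is found."""
    # Base64 decodes in groups of four characters, so trim to a multiple of four.
    usable = len(b64_prefix) - len(b64_prefix) % 4
    head = base64.b64decode(b64_prefix[:usable])
    match = re.match(rb"%PDF-(\d\.\d)", head)
    return match.group(1).decode("ascii") if match else None

# Truncated bytes values copied from the table above.
samples = {
    "lotg.pdf": "JVBERi0xLjINJeLjz",
    "ebad.pdf": "JVBERi0xLjMNJeLjz",
    "IntSt_2301.pdf": "JVBERi0xLjQNJeLjz",
}
for name, prefix in samples.items():
    print(name, pdf_version(prefix))
```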

To work with all the PDF files in a collection, you can extract them to disk with the following script:

import io.archivesunleashed._
import io.archivesunleashed.udfs._

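// Load the archives and keep only the PDF records.
val df = RecordLoader.loadArchives("/path/to/warcs", sc).pdfs()

// Write each record's bytes to disk, using the given filename prefix and the extension column.
df.select($"bytes", $"extension")
  .saveToDisk("bytes", "/path/to/export/directory/your-preferred-filename-prefix", $"extension")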

Python DF

The following script:

from aut import *

# Load the archives.
archive = WebArchive(sc, sqlContext, "/path/to/warcs")

# Keep only the PDF records and display the first 20 rows.
df = archive.pdfs()
df.show()

This extracts the following information from the PDF files in a web collection:

file URL
filename
extension
MIME type as identified by the hosting web server
MIME type as identified by Apache Tika
MD5 hash
SHA1 hash
bytes (the file content, shown Base64-encoded)

+--------------------+--------------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
|                 url|            filename|extension|mime_type_web_server| mime_type_tika|                 md5|                sha1|               bytes|
+--------------------+--------------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
|http://geocities....|20080304ordinance...|      pdf|application/octet...|application/pdf|ebbf9bf99b363493b...|f0b9a6788cbc1f8ab...|JVBERi0xLjMNJeLjz...|
|http://geocities....|FACTSHEET2008ADOP...|      pdf|application/octet...|application/pdf|4fe261c2210189a52...|a91180b9170ff757f...|JVBERi0xLjQNJeLjz...|
|http://geocities....|            Menu.pdf|      pdf|application/octet...|application/pdf|75e4d587589a1d85d...|d18724100d4616a45...|JVBERi0xLjMNJeLjz...|
|http://geocities....|DSTC2009ContestFl...|      pdf|application/octet...|application/pdf|c80f38f96480aab0c...|369c4415ed9c2476d...|JVBERi0xLjQNJeLjz...|
|http://geocities....|            ebad.pdf|      pdf|application/octet...|application/pdf|fe8feece5d08dc2ce...|0c01cc31b40a286da...|JVBERi0xLjMNJeLjz...|
|http://geocities....|FACTSHEET2008APPR...|      pdf|application/octet...|application/pdf|8747971e78acb768b...|770f97a95c7e2ee16...|JVBERi0xLjQNJeLjz...|
|http://geocities....|FACTSHEET2008APPE...|      pdf|application/octet...|application/pdf|32f57bbe5b28f4ab1...|d4f63b8d29f4c5dc5...|JVBERi0xLjQNJeLjz...|
|http://geocities....|FACTSHEET2008ADOP...|      pdf|application/octet...|application/pdf|e9189eea563fde074...|f14b1846499dd4bd0...|JVBERi0xLjQNJeLjz...|
|http://geocities....|          sharar.pdf|      pdf|application/octet...|application/pdf|771f5bd1b72b8e324...|9cef1f6af9e5c127e...|JVBERi0xLjMNJeLjz...|
|http://geocities....|FACTSHEET2008UTIL...|      pdf|application/octet...|application/pdf|7f45c93d16823e852...|b3a2d3b95efd77bd6...|JVBERi0xLjQNJeLjz...|
|http://geocities....|BakweriMarginalis...|      pdf|application/octet...|application/pdf|d25863303ba46a872...|bbd6c9bce4c523f0f...|JVBERi0xLjINJeLjz...|
|http://geocities....|McCallaFoodSecuri...|      pdf|application/octet...|application/pdf|1291b633f49f7e51d...|622144ed0fd56bae3...|JVBERi0xLjMNJeLjz...|
|http://geocities....|PovertyAndIncome.pdf|      pdf|application/octet...|application/pdf|278e1f281905d419d...|9bc00a54147a4b350...|JVBERi0xLjIgDSXi4...|
|http://geocities....|          behold.pdf|      pdf|application/octet...|application/pdf|9fc1e4e1e0f567477...|63d324984d34eb168...|JVBERi0xLjMKJcfsj...|
|http://geocities....|overcome_the_fear...|      pdf|application/octet...|application/pdf|fe3ec0805564cd3fc...|d0d30ba4f7f40434d...|JVBERi0xLjMKJcfsj...|
|http://geocities....|           raven.pdf|      pdf|application/octet...|application/pdf|acabc7f7dba954f99...|1ddf3e53813a805a1...|JVBERi0xLjMKJcfsj...|
|http://geocities....|          sunset.pdf|      pdf|application/octet...|application/pdf|1dc037712d47b11d9...|f502ca5cc2de2483b...|JVBERi0xLjMKJcfsj...|
|http://geocities....|night_lasts_less_...|      pdf|application/octet...|application/pdf|1cda3dfab3bedaf04...|ad0f6e6fd53e4eb5f...|JVBERi0xLjMKJcfsj...|
|http://geocities....|      angel_dust.pdf|      pdf|application/octet...|application/pdf|92d14676e34dfcb7e...|1588b870928d56667...|JVBERi0xLjMKJcfsj...|
|http://geocities....|         vampire.pdf|      pdf|application/octet...|application/pdf|f1730689d52b9524e...|bf377a4e2580b8a29...|JVBERi0xLjMKJcfsj...|
+--------------------+--------------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
only showing top 20 rows

Extract Presentation Program Files Information

Scala RDD

Will not be implemented.

Scala DF

The following script:

import io.archivesunleashed._
import io.archivesunleashed.udfs._

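// Load the archives and keep only the presentation program file records.
val df = RecordLoader.loadArchives("/path/to/warcs", sc).presentationProgramFiles()

// Show the metadata columns, sorted by MD5 hash in descending order.
df.select($"url", $"filename", $"extension", $"mime_type_web_server", $"mime_type_tika", $"md5", $"sha1", $"bytes")
  .orderBy(desc("md5"))
  .show()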

This extracts the following information from the presentation program files in a web collection:

file URL
filename
extension
MIME type as identified by the hosting web server
MIME type as identified by Apache Tika
MD5 hash
SHA1 hash
bytes (the file content, shown Base64-encoded)

+--------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                 url|            filename|extension|mime_type_web_server|      mime_type_tika|                 md5|                sha1|               bytes|
+--------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|http://geocities....|           index.pps|      pps|application/mspow...|application/vnd.m...|fbaed5a1df163270a...|afa4c82593ea5bfd6...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|MathForEveryoneCa...|      ppt|application/mspow...|application/vnd.m...|f5fde5813a5aef2f3...|e791212ac91243f39...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|AD1-GE-Quiz4-Samp...|      ppt|application/mspow...|application/vnd.m...|f5824d64bb74b1377...|aaea2a38d11682753...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|    Agrarianism1.ppt|      ppt|application/mspow...|application/vnd.m...|f581932d9e4c57dc0...|3fbce2d175be293a8...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|            lego.pps|      pps|application/mspow...|application/vnd.m...|f0da5c58e7abbf102...|78bc45da68c6784be...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|            HPIB.ppt|      ppt|application/mspow...|application/vnd.m...|ef09c31bd8079d40b...|875a96d8b8dd3bf18...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|learningdisabilit...|      ppt|application/mspow...|application/vnd.m...|e6bb4f98761839a3a...|5a4dcc8bab2ee15f3...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|assessmentsummer.ppt|      ppt|application/mspow...|application/vnd.m...|e116a443b9031ec01...|141563f2f32687587...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|CommonlyConfusedW...|      ppt|application/mspow...|application/vnd.m...|dde43870e0da8ebf6...|7a94bf766d931a046...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|AD1-Unit5-Achieve...|      ppt|application/mspow...|application/vnd.m...|d4530e506c2e41f8f...|6c89c0e3d28ecceed...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|         Schwind.ppt|      ppt|application/mspow...|application/vnd.m...|cfdd4bb6e7b04f24a...|9c26a8ac091f88a35...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|  cpphtp4_PPT_07.ppt|      ppt|application/mspow...|application/vnd.m...|cd98e6e18c3b0ada0...|b3651507f61bafa4d...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|          mylife.ppt|      ppt|application/mspow...|application/vnd.m...|cb146894f8a544ace...|0129cfdfd2f196346...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|    refinterview.ppt|      ppt|application/mspow...|application/vnd.m...|ca6fd4ec5fcb8237d...|8312ca4c0dbeb6008...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|MathForEveryoneAl...|      ppt|application/mspow...|application/vnd.m...|c887f45fa58f273b0...|b253b732f8502f357...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|   ch2-DataTypes.ppt|      ppt|application/mspow...|application/vnd.m...|c74caee72b5ee6684...|f3bf878c775e2f72a...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|geographyofnortha...|      ppt|application/mspow...|application/vnd.m...|c35b93ac59f2eb5af...|b5de05a856838328c...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|       people1st.ppt|      ppt|application/mspow...|application/vnd.m...|bf19cdc1ff3ad82fd...|99f14fe81d8a9587f...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|     AD1-Reading.ppt|      ppt|application/mspow...|application/vnd.m...|be020b4564f972218...|0761a2fd5c176ce1c...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|         majalah.ppt|      ppt|application/mspow...|application/vnd.m...|b6f219693ef1df49f...|1039013624cf8de35...|0M8R4KGxGuEAAAAAA...|
+--------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 20 rows
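
All of the bytes values above start with 0M8R4KGxGuE, the Base64 form of the signature D0 CF 11 E0 A1 B1 1A E1 that opens OLE2 compound files such as legacy .ppt and .pps documents. The sketch below, in plain Python, checks for that signature. The sample strings are copied from this page's tables with the trailing "..." removed, and `is_ole2` is a hypothetical helper name.

```python
import base64

# Signature that opens an OLE2 compound file.
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")

def is_ole2(b64_prefix):
    """Report whether a Base64 string decodes to data starting with the OLE2 signature."""
    # Base64 decodes in groups of four characters, so trim to a multiple of four.
    usable = len(b64_prefix) - len(b64_prefix) % 4
    return base64.b64decode(b64_prefix[:usable]).startswith(OLE2_MAGIC)

print(is_ole2("0M8R4KGxGuEAAAAAA"))  # value shown for index.pps
print(is_ole2("JVBERi0xLjMNJeLjz"))  # value shown for a PDF, for contrast
```

The spreadsheet rows later on this page begin with the same characters, so the signature alone cannot tell a presentation from a spreadsheet. Telling them apart requires looking beyond the header, which is what the mime_type_tika column provides.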

To work with all the presentation program files in a collection, you can extract them to disk with the following script:

import io.archivesunleashed._
import io.archivesunleashed.udfs._

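// Load the archives and keep only the presentation program file records.
val df = RecordLoader.loadArchives("/path/to/warcs", sc).presentationProgramFiles()

// Write each record's bytes to disk, using the given filename prefix and the extension column.
df.select($"bytes", $"extension")
  .saveToDisk("bytes", "/path/to/export/directory/your-preferred-filename-prefix", $"extension")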

Python DF

The following script:

from aut import *

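# Load the archives.
archive = WebArchive(sc, sqlContext, "/path/to/warcs")

# Keep only the presentation program file records and display the first 20 rows.
df = archive.presentation_program()
df.show()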

This extracts the following information from the presentation program files in a web collection:

file URL
filename
extension
MIME type as identified by the hosting web server
MIME type as identified by Apache Tika
MD5 hash
SHA1 hash
bytes (the file content, shown Base64-encoded)

+--------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                 url|            filename|extension|mime_type_web_server|      mime_type_tika|                 md5|                sha1|               bytes|
+--------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|http://geocities....|          wincvs.ppt|      ppt|application/mspow...|application/vnd.m...|52ac23b58493234b2...|a2206af9847cceb06...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|           index.pps|      pps|application/mspow...|application/vnd.m...|fbaed5a1df163270a...|afa4c82593ea5bfd6...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|MathForEveryoneCa...|      ppt|application/mspow...|application/vnd.m...|f5fde5813a5aef2f3...|e791212ac91243f39...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|MathForEveryone7t...|      ppt|application/mspow...|application/vnd.m...|9893643e1cb87af0c...|2fa8301893ad21b2b...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|MathForEveryoneGe...|      ppt|application/mspow...|application/vnd.m...|2a914a95a61b227dd...|5d783c1beaffc0b57...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|MathForEveryoneAl...|      ppt|application/mspow...|application/vnd.m...|c887f45fa58f273b0...|b253b732f8502f357...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|MathForEveryone7t...|      ppt|application/mspow...|application/vnd.m...|034906471a0c0b997...|16142a0aa69b2fb1f...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|           tiago.ppt|      ppt|application/mspow...|application/vnd.m...|6871786192c187783...|e5a91a65ef9a4bade...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|        energypp.ppt|      ppt|application/mspow...|application/vnd.m...|94f9384ec57d8849c...|e943c5cf509f8f816...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|            lego.pps|      pps|application/mspow...|application/vnd.m...|f0da5c58e7abbf102...|78bc45da68c6784be...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|      celtiberos.pps|      pps|application/mspow...|application/vnd.m...|af897525acd31d359...|9c018a80253c38a57...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|          porque.pps|      pps|application/mspow...|application/vnd.m...|9c2cba37c64fd0ac8...|6f11733ddec0abc2d...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|SoftHandoffbyPara...|      pps|application/mspow...|application/vnd.m...|0c5ef732ea466574f...|dc7dfe545b401aeab...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|A_Land_Remembered...|      ppt|application/mspow...|application/vnd.m...|5b7273d03f8490490...|2d8721e7876cb6697...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|           DANCE.ppt|      ppt|application/mspow...|application/vnd.m...|5aa3308433666a30a...|4a23bd20768501dac...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|            unit.ppt|      ppt|application/mspow...|application/vnd.m...|6736886864069ee66...|e92031e6e0293cb73...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|         majalah.ppt|      ppt|application/mspow...|application/vnd.m...|b6f219693ef1df49f...|1039013624cf8de35...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|esos_si_son_probl...|      pps|application/mspow...|application/vnd.m...|932221045b6154d7e...|b23a0238c852d28bb...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|    refinterview.ppt|      ppt|application/mspow...|application/vnd.m...|ca6fd4ec5fcb8237d...|8312ca4c0dbeb6008...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|         Schwind.ppt|      ppt|application/mspow...|application/vnd.m...|cfdd4bb6e7b04f24a...|9c26a8ac091f88a35...|0M8R4KGxGuEAAAAAA...|
+--------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 20 rows

Extract Spreadsheet Information

Scala RDD

Will not be implemented.

Scala DF

The following script:

import io.archivesunleashed._
import io.archivesunleashed.udfs._

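// Load the archives and keep only the spreadsheet records.
val df = RecordLoader.loadArchives("/path/to/warcs", sc).spreadsheets()

// Show the metadata columns, sorted by MD5 hash in descending order.
df.select($"url", $"filename", $"extension", $"mime_type_web_server", $"mime_type_tika", $"md5", $"sha1", $"bytes")
  .orderBy(desc("md5"))
  .show()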

This extracts the following information from the spreadsheet files in a web collection:

file URL
filename
extension
MIME type as identified by the hosting web server
MIME type as identified by Apache Tika
MD5 hash
SHA1 hash
bytes (the file content, shown Base64-encoded)

+--------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                 url|            filename|extension|mime_type_web_server|      mime_type_tika|                 md5|                sha1|               bytes|
+--------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|http://geocities....|     statuscarib.xls|      xls|application/vnd.m...|application/vnd.m...|f9fd18b158df52ff2...|0d606f25ac3c9abc4...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|       timesheet.xls|      xls|application/vnd.m...|application/vnd.m...|f9549db15de69bc21...|e9c239d812705842f...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|statusccusspring0...|      xls|application/vnd.m...|application/vnd.m...|ef99704e5a734f386...|f265fc5c581ad1762...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|  Laboratorio_05.xls|      xls|application/vnd.m...|application/vnd.m...|eb0e39898ba513234...|976f69da07122d285...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|110_Laboratorio_6...|      xls|application/vnd.m...|application/vnd.m...|e5b7fee6d4c45e171...|befd9670be70a4fdb...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|          Pakuan.xls|      xls|application/vnd.m...|application/vnd.m...|e386f85a7bd74b1ab...|5b2b142de2c57ec68...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|Spring08_statusre...|      xls|application/vnd.m...|application/vnd.m...|df2d6792fb55c4e26...|6f4d2aef711aff4e1...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|    CTtimetable2.xls|      xls|application/vnd.m...|application/vnd.m...|dc987d3e996677ce9...|40bb63a4c0038a6ae...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|MalibuTrailChalle...|      xls|application/vnd.m...|application/vnd.m...|dbba76ead82576178...|ffbe099441053b47b...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|     CTtimetable.xls|      xls|application/vnd.m...|application/vnd.m...|d9ee9117e70df43b5...|596c4c6d5cdc7ddb5...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|  1071_Parcial_2.xls|      xls|application/vnd.m...|application/vnd.m...|d90dc210138676a2c...|6e3ed07f50393815c...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|excelsubtractione...|      xls|application/vnd.m...|application/vnd.m...|d6c8314e52f22e4aa...|1b1ebce0f85628921...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|Fall2008_statusre...|      xls|application/vnd.m...|application/vnd.m...|cd9974430477b75ce...|0e756bbc38608cb51...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|        report01.xls|      xls|application/vnd.m...|application/vnd.m...|cd947fe4099df4fe3...|0f11d17d38a72977b...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|TrackRecords20010...|      xls|application/vnd.m...|application/vnd.m...|c8aa0122443efa0e5...|fa9cdb4a329f926bf...|0M8R4KGxGuEAAAAAA...|
|http://br.geociti...|  mycoinsforswap.xls|      xls|application/vnd.m...|application/vnd.m...|c665c83bc2b54292f...|18f1f3a4559d5c40a...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|     AAtimetable.xls|      xls|application/vnd.m...|application/vnd.m...|c66201762bf5e473e...|4e9bac4f217b0605d...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|   carwashroster.xls|      xls|application/vnd.m...|application/vnd.m...|c495d1b7dc954b975...|062167485baf9aa5d...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|         RSL_MDP.xls|      xls|application/vnd.m...|application/vnd.m...|bf6479bacbb758b52...|4d7ea33849447853d...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|  status_report4.xls|      xls|application/vnd.m...|application/vnd.m...|bc4d18e022522d185...|fc7b9fc64116c9ad1...|0M8R4KGxGuEAAAAAA...|
+--------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 20 rows

To work with all the spreadsheet files in a collection, you can extract them to disk with the following script:

import io.archivesunleashed._
import io.archivesunleashed.udfs._

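// Load the archives and keep only the spreadsheet records.
val df = RecordLoader.loadArchives("/path/to/warcs", sc).spreadsheets()

// Write each record's bytes to disk, using the given filename prefix and the extension column.
df.select($"bytes", $"extension")
  .saveToDisk("bytes", "/path/to/export/directory/your-preferred-filename-prefix", $"extension")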

Python DF

The following script:

from aut import *

archive = WebArchive(sc, sqlContext, "/path/to/warcs")

df = archive.spreadsheets()
df.show()

Will extract all of the following information from spreadsheet files in a web collection:

file URL
filename
extension
MIME type as identified by the hosting web server
MIME type as identified by Apache Tika
md5 hash
sha1 hash
bytes

+--------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                 url|            filename|extension|mime_type_web_server|      mime_type_tika|                 md5|                sha1|               bytes|
+--------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|http://geocities....|        tkadrosu.xls|      xls|application/vnd.m...|application/vnd.m...|8033532f88da42ad6...|a52b24bc760c5265b...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|     cal_counter.xls|      xls|application/vnd.m...|application/vnd.m...|56ad6c2f84fdd4a88...|ad0db35f2d7ff2cca...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|    CTtimetable2.xls|      xls|application/vnd.m...|application/vnd.m...|dc987d3e996677ce9...|40bb63a4c0038a6ae...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|     AAtimetable.xls|      xls|application/vnd.m...|application/vnd.m...|c66201762bf5e473e...|4e9bac4f217b0605d...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|     CTtimetable.xls|      xls|application/vnd.m...|application/vnd.m...|d9ee9117e70df43b5...|596c4c6d5cdc7ddb5...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|    CTtimetable2.xls|      xls|application/vnd.m...|application/vnd.m...|a4ed4330d5c18f1b2...|d8ce479596d49679d...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|           teams.xls|      xls|application/vnd.m...|application/vnd.m...|334fa42776cef7f81...|aa57fda7fb634c931...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|      collection.xls|      xls|application/vnd.m...|application/vnd.m...|30d7a67de8150f712...|841ba91f009d48b7a...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|music-collection.xls|      xls|application/vnd.m...|application/vnd.m...|4def75fa96bae579d...|090a95923c9599454...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|          020103.xls|      xls|application/vnd.m...|application/vnd.m...|48651a7592ca1b0f0...|1e2438c8247d33870...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|          011803.xls|      xls|application/vnd.m...|application/vnd.m...|0aab8ed40f91c1c76...|8e02e408fe1ce40b9...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|     RSL_TorOton.xls|      xls|application/vnd.m...|application/vnd.m...|1d9c13c6407a2b696...|007010ecf5b208453...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|         members.xls|      xls|application/vnd.m...|application/vnd.m...|b045a6b118981c6eb...|3ae096d6602b7cb36...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|        round309.xls|      xls|application/vnd.m...|application/vnd.m...|50bed4b3e9facb278...|f26e0c38082141598...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|       result109.xls|      xls|application/vnd.m...|application/vnd.m...|2235d094897f10c3b...|6ed0b65fd43502a2b...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|TrackRecords20010...|      xls|application/vnd.m...|application/vnd.m...|c8aa0122443efa0e5...|fa9cdb4a329f926bf...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|           Digox.xls|      xls|application/vnd.m...|application/vnd.m...|182d08821797269c7...|80e7ce8ecc1ecf389...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|       RSL_SemBA.xls|      xls|application/vnd.m...|application/vnd.m...|59613700fbf08b795...|44eac99a514141520...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|         RSL_MDP.xls|      xls|application/vnd.m...|application/vnd.m...|bf6479bacbb758b52...|4d7ea33849447853d...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|       RSL_ARG99.xls|      xls|application/vnd.m...|application/vnd.m...|a2f2fd063dd5689a7...|61568e0f4139ec568...|0M8R4KGxGuEAAAAAA...|
+--------------------+--------------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 20 rows
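The `md5` and `sha1` columns make it possible to check a file's content against its record. The following is a minimal sketch, assuming the `bytes` column holds Base64 text and that both hashes are computed over the decoded content (neither is stated above). The function name `matches_hashes` and the sample record are hypothetical.

```python
import base64
import hashlib


def matches_hashes(record):
    """Return True if the decoded payload hashes to the record's md5 and sha1."""
    payload = base64.b64decode(record["bytes"])
    return (
        hashlib.md5(payload).hexdigest() == record["md5"]
        and hashlib.sha1(payload).hexdigest() == record["sha1"]
    )


# Hypothetical record built from a stand-in payload, so its hashes are known to match.
payload = b"stand-in payload"
record = {
    "bytes": base64.b64encode(payload).decode("ascii"),
    "md5": hashlib.md5(payload).hexdigest(),
    "sha1": hashlib.sha1(payload).hexdigest(),
}

# A copy whose content was changed after hashing should no longer match.
tampered = dict(record, bytes=base64.b64encode(b"something else").decode("ascii"))

print(matches_hashes(record))
print(matches_hashes(tampered))
```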

Extract Video Information

Scala RDD

Will not be implemented.

Scala DF

The following script:

import io.archivesunleashed._
import io.archivesunleashed.udfs._

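// Load the archives and keep only video records.
val df = RecordLoader.loadArchives("/path/to/warcs", sc).videos()

// Show the first 20 rows, sorted by md5 in descending order.
df.select($"url", $"filename", $"extension", $"mime_type_web_server", $"mime_type_tika", $"md5", $"sha1", $"bytes")
  .orderBy(desc("md5"))
  .show()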

Will extract all of the following information from videos in a web collection:

file URL
filename
extension
MIME type as identified by the hosting web server
MIME type as identified by Apache Tika
md5 hash
sha1 hash
bytes

+--------------------+--------------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
|                 url|            filename|extension|mime_type_web_server| mime_type_tika|                 md5|                sha1|               bytes|
+--------------------+--------------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
|http://geocities....|       videohead.avi|      avi|     video/x-msvideo|video/x-msvideo|fa9852748ba7b4829...|0be56f200f8e1cb83...|UklGRjoMIQBBVkkgT...|
|http://www.geocit...|       HandWrap2.avi|      avi|     video/x-msvideo|video/x-msvideo|f680cb463e7cb291e...|1d2ea1df3f5af2599...|UklGRrBrAgBBVkkgT...|
|http://geocities....|         1kungfu.avi|      avi|     video/x-msvideo|video/x-msvideo|f4429277ed4b48efb...|5c542e8990efd484b...|UklGRkoSFwBBVkkgT...|
|http://geocities....|  Vol_III_sample.mpg|      mpg|          video/mpeg|     video/mpeg|f2bc34f7294edc376...|a939dc619c123f81b...|AAABuiEAAdLxgA7xA...|
|http://geocities....|         wherego.avi|      avi|     video/x-msvideo|video/x-msvideo|f23976ddeb6f08810...|714a9a548f9b2a156...|UklGRkq4HgBBVkkgT...|
|http://geocities....|       couch100k.wmv|      asf|      video/x-ms-wmv| video/x-ms-asf|ee316d5871acb7859...|0593ebb8e450a6c3e...|MCaydY5mzxGm2QCqA...|
|http://geocities....|    Mitwa_Lagaan.mp3|       qt|          audio/mpeg|video/quicktime|ebc5db8d30edd0135...|d3ebdd6da2c732481...|AAAE6W1vb3YAAABsb...|
|http://www.geocit...|       tydunking.mpg|      mpg|          video/mpeg|     video/mpeg|eaa0d14dc05bdab98...|05d4ff2301d2d3818...|AAABuiEAAQALgBcdA...|
|http://geocities....|       bigjleroy.avi|      avi|     video/x-msvideo|video/x-msvideo|e93538f0d76b86cca...|ebeb89fc2fa8f7cd6...|UklGRrjUCgBBVkkgT...|
|http://geocities....|     NollieBs180.mov|      mov|     video/quicktime|video/quicktime|e7b97c287329340d5...|138fb8b0dea4c8e16...|AAAGwm1vb3YAAABsb...|
|http://www.geocit...|           shirt.avi|      avi|     video/x-msvideo|video/x-msvideo|e36119d3c78225cbf...|11af72475ca754639...|UklGRvhdHQBBVkkgT...|
|http://geocities....|          atdawn.wma|      asf|      audio/x-ms-wma| video/x-ms-asf|e1a85a79ea3ba5d96...|1be05aecdff99298c...|MCaydY5mzxGm2QCqA...|
|http://www.geocit...|non_will_go_to_wa...|      mov|     video/quicktime|video/quicktime|de6cc975363c4076b...|0c00d0be9c89f9e97...|AAs4JW1kYXQAAA70A...|
|http://geocities....|      Movies_20.mpeg|     mpeg|          video/mpeg|     video/mpeg|dd9d2af0c1318b5ff...|9d06f09744fe93408...|AAABuiEAV+PlgAU7A...|
|http://www.geocit...|        artilery.mpg|      mpg|          video/mpeg|     video/mpeg|dcecbdfe46448bffb...|0b292aab1078d9bfa...|AAABswsAkBP//+CkA...|
|http://geocities....| tancfigurakbbpl.wmv|      wmv|      video/x-ms-wmv| video/x-ms-wmv|dca4991392572dbc0...|cb349bdc35484d976...|MCaydY5mzxGm2QCqA...|
|http://www.geocit...|          Trevi2.mov|      mov|     video/quicktime|video/quicktime|dc882205f5cae38f5...|c9dd804e1ee140221...|AAAEvG1vb3YAAAS0Y...|
|http://www.geocit...|skillful_driving_...|      mpg|          video/mpeg|     video/mpeg|db8a767b00884e426...|f5a70cf5f091b530f...|AAABuiEAAQABgAORA...|
|http://geocities....|      jeremy100k.wmv|      asf|      video/x-ms-wmv| video/x-ms-asf|dafba744438ae0110...|d3a217ce25507ae90...|MCaydY5mzxGm2QCqA...|
|http://www.geocit...|           mbrl2.mpg|      mpg|          video/mpeg|     video/mpeg|d8eb5a12f0da99ca0...|8686002a444cc9dce...|AAABswsAkBP//+CkA...|
+--------------------+--------------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
only showing top 20 rows

import io.archivesunleashed._
import io.archivesunleashed.udfs._
df: org.apache.spark.sql.DataFrame = [url: string, filename: string ... 6 more fields]
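Several rows in the output above show the hosting web server and Apache Tika disagreeing about a file's type. For example, `Mitwa_Lagaan.mp3` was served as `audio/mpeg` but identified by Tika as `video/quicktime`. The sketch below lists such rows in plain Python. The rows are copied from the table, while the helper name `mime_mismatches` and the idea of working on rows held as dictionaries are assumptions made for this example.

```python
FIELDS = ("filename", "mime_type_web_server", "mime_type_tika")

# A few rows copied from the output above.
rows = [dict(zip(FIELDS, values)) for values in [
    ("videohead.avi", "video/x-msvideo", "video/x-msvideo"),
    ("couch100k.wmv", "video/x-ms-wmv", "video/x-ms-asf"),
    ("Mitwa_Lagaan.mp3", "audio/mpeg", "video/quicktime"),
    ("atdawn.wma", "audio/x-ms-wma", "video/x-ms-asf"),
    ("Trevi2.mov", "video/quicktime", "video/quicktime"),
]]


def mime_mismatches(rows):
    """Return the rows where the web server and Tika report different MIME types."""
    return [
        row for row in rows
        if row["mime_type_web_server"] != row["mime_type_tika"]
    ]


for row in mime_mismatches(rows):
    print("{filename}: server={mime_type_web_server}, tika={mime_type_tika}".format(**row))
```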

To work with all the video files in a collection, extract them with the following script:

import io.archivesunleashed._
import io.archivesunleashed.udfs._

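// Load the archives and keep only video records.
val df = RecordLoader.loadArchives("/path/to/warcs", sc).videos()

// Write the content of the bytes column to disk, taking each file's extension from the extension column.
df.select($"bytes", $"extension")
  .saveToDisk("bytes", "/path/to/export/directory/your-preferred-filename-prefix", $"extension")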

Python DF

The following script:

from aut import *

archive = WebArchive(sc, sqlContext, "/path/to/warcs")

df = archive.video()
df.show()

Will extract all of the following information from videos in a web collection:

file URL
filename
extension
MIME type as identified by the hosting web server
MIME type as identified by Apache Tika
md5 hash
sha1 hash
bytes

+--------------------+--------------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
|                 url|            filename|extension|mime_type_web_server| mime_type_tika|                 md5|                sha1|               bytes|
+--------------------+--------------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
|http://www.geocit...|Sea_Dawgs_2008_Ha...|      wmv|      video/x-ms-wmv| video/x-ms-wmv|7b35e4cf60a3cfa67...|b35ad7242e8135326...|MCaydY5mzxGm2QCqA...|
|http://www.geocit...|       Excedrine.wmv|      wmv|      video/x-ms-wmv| video/x-ms-wmv|0aaf1d81ab6f2b354...|0b52af5f5facfd30f...|MCaydY5mzxGm2QCqA...|
|http://geocities....|        homework.avi|      avi|     video/x-msvideo|video/x-msvideo|4e06cbd11764cd2ac...|770a8849375965b20...|UklGRsrLAgBBVkkgT...|
|http://geocities....|    macarenababy.avi|      avi|     video/x-msvideo|video/x-msvideo|600084bbd732c0fda...|f99b2e31374d4ea18...|UklGRrC0AwBBVkkgT...|
|http://geocities....|orlando_viggokiss...|      wmv|      video/x-ms-asf| video/x-ms-wmv|79d093eb6184dba74...|395eaf6dcb29a66d2...|MCaydY5mzxGm2QCqA...|
|http://www.geocit...|skillful_driving_...|      mpg|          video/mpeg|     video/mpeg|db8a767b00884e426...|f5a70cf5f091b530f...|AAABuiEAAQABgAORA...|
|http://www.geocit...|gray_havens_2.35.MPG|      mpg|          video/mpeg|     video/mpeg|af71353d69af0b42f...|f3625b897339b0f23...|AAABuiEAAQABgAORA...|
|http://geocities....|         movie2.mpeg|     mpeg|          video/mpeg|     video/mpeg|3f6c7c48d2a990cf2...|760e6752bfd9e8a84...|AAABsxQA8MMCcSClE...|
|http://www.geocit...|       Sequence1.mov|      mov|     video/quicktime|video/quicktime|931fc4dee8aa260f9...|5a5cf58e2a50cf942...|AAELrG1vb3YAAABsb...|
|http://www.geocit...|           santa.mov|      mov|     video/quicktime|video/quicktime|8b9b98d0c567c4381...|49f49dd23c3bad61b...|AAAAIGZ0eXBxdCAgI...|
|http://geocities....|          0602-2.avi|      avi|     video/x-msvideo|video/x-msvideo|92d04dbe7f1bdc109...|65ed7327aece11bac...|UklGRkauOABBVkkgT...|
|http://geocities....|           movie.mpg|      mpg|          video/mpeg|     video/mpeg|a0e86539e5eb9bd35...|82eb4680a9f65ed1b...|AAABuiEAAdLxgASfA...|
|http://geocities....|     misshawaii.mpeg|     mpeg|          video/mpeg|     video/mpeg|45cbfc4d03547861b...|44c93f871ea602112...|AAABuiEAAQABgAORA...|
|http://geocities....|      Explosions.wmv|      wmv|      video/x-ms-wmv| video/x-ms-wmv|22cb24bffbd7eabf9...|a44d261ef5d7e7993...|MCaydY5mzxGm2QCqA...|
|http://geocities....|       couch100k.wmv|      asf|      video/x-ms-wmv| video/x-ms-asf|ee316d5871acb7859...|0593ebb8e450a6c3e...|MCaydY5mzxGm2QCqA...|
|http://geocities....|      jeremy100k.wmv|      asf|      video/x-ms-wmv| video/x-ms-asf|dafba744438ae0110...|d3a217ce25507ae90...|MCaydY5mzxGm2QCqA...|
|http://geocities....|       jedi_wade.mov|      mov|     video/quicktime|video/quicktime|674688fd09bf18d29...|cd21c3a5b9e2f18b6...|AAAFB21vb3YAAAT/Y...|
|http://geocities....|ylagallinanonosga...|      asf|      audio/x-ms-wma| video/x-ms-asf|9aac473134d7f2e7a...|3af7fbab238772f48...|MCaydY5mzxGm2QCqA...|
|http://geocities....|Chris-5050NollieS...|      mov|     video/quicktime|video/quicktime|93aa2ce07e01f90ad...|f066f29e5faf0cee1...|AAAHRG1vb3YAAABsb...|
|http://geocities....| floursack_jump2.avi|      avi|     video/x-msvideo|video/x-msvideo|a922441c0a7f0018d...|b82ca6fe1d46e16dc...|UklGRgjlAwBBVkkgT...|
+--------------------+--------------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
only showing top 20 rows
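The truncated `bytes` values above already hint at each file's container format: decoded as Base64, `UklG...` begins with `RIFF`, and `MCaydY5mzxGm...` begins with the ASF header identifier. The sketch below decodes the visible prefix of a value and guesses the container from well-known file signatures. The function name `sniff_container` and the returned labels are assumptions made for this example, and the check is far cruder than what Apache Tika does.

```python
import base64

# First bytes of the ASF header object identifier.
ASF_HEADER_PREFIX = bytes.fromhex("3026b2758e66cf11")


def sniff_container(b64_prefix):
    """Guess a container format from the first characters of a Base64 value."""
    # Base64 decodes in groups of 4 characters, so drop any incomplete group.
    usable = len(b64_prefix) - len(b64_prefix) % 4
    head = base64.b64decode(b64_prefix[:usable])
    if head[:4] == b"RIFF" and head[8:12] == b"AVI ":
        return "AVI (RIFF)"
    if head.startswith(ASF_HEADER_PREFIX):
        return "ASF/WMV"
    if head[:4] in (b"\x00\x00\x01\xba", b"\x00\x00\x01\xb3"):
        return "MPEG"
    if head[4:8] in (b"moov", b"ftyp", b"mdat"):
        return "QuickTime/MP4"
    return "unknown"


# The 17 visible characters of some truncated values from the table above.
prefixes = {
    "Excedrine.wmv": "MCaydY5mzxGm2QCqA",
    "homework.avi": "UklGRsrLAgBBVkkgT",
    "movie2.mpeg": "AAABsxQA8MMCcSClE",
    "santa.mov": "AAAAIGZ0eXBxdCAgI",
    "Sequence1.mov": "AAELrG1vb3YAAABsb",
}

for filename, prefix in prefixes.items():
    print(filename, "->", sniff_container(prefix))
```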

Extract Word Processor Files Information

Scala RDD

Will not be implemented.

Scala DF

The following script:

import io.archivesunleashed._
import io.archivesunleashed.udfs._

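// Load the archives and keep only word processor file records.
val df = RecordLoader.loadArchives("/path/to/warcs", sc).wordProcessorFiles()

// Show the first 20 rows, sorted by md5 in descending order.
df.select($"url", $"filename", $"extension", $"mime_type_web_server", $"mime_type_tika", $"md5", $"sha1", $"bytes")
  .orderBy(desc("md5"))
  .show()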

Will extract all of the following information from word processor files in a web collection:

file URL
filename
extension
MIME type as identified by the hosting web server
MIME type as identified by Apache Tika
md5 hash
sha1 hash
bytes

+--------------------+--------------------+---------+--------------------+------------------+--------------------+--------------------+--------------------+
|                 url|            filename|extension|mime_type_web_server|    mime_type_tika|                 md5|                sha1|               bytes|
+--------------------+--------------------+---------+--------------------+------------------+--------------------+--------------------+--------------------+
|http://geocities....|infiniteproducts.doc|      doc|  application/msword|application/msword|ffa1ea83af6cb9508...|7a3ae86a7a22d2682...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|      Everything.doc|      doc|  application/msword|application/msword|ff7216edf86fe196c...|082a889c27640fc9a...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|          survey.doc|      doc|  application/msword|application/msword|ff48df5e64bd5adeb...|383ab6ead48795ff3...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| iepWrkshpFall01.doc|      doc|  application/msword|application/msword|ff421feb87b826d39...|ec60a48d393642629...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|24_reproduction_s...|      doc|  application/msword|application/msword|fec21eb30fac4588e...|36b41ba66801b10b9...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|Descendit_ad_Infe...|      doc|  application/msword|application/msword|fe66eeb7c04942c8b...|14f207787abef983e...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|Anthropology21FEB...|      doc|  application/msword|application/msword|fe079d498bd5e91f2...|ca54e6be7c0618ecc...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|          Senses.doc|      doc|  application/msword|application/msword|fdf881ef998c227f7...|04d6e72132537053a...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|hopewel-loudon-cl...|      doc|  application/msword|application/msword|fddffbabcaf1976c9...|b7ade5d661dd597a1...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|   billmprev9899.doc|      doc|  application/msword|application/msword|fdcc8b65cfb0a18c9...|602f323278c9fb726...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|approachesProject...|      doc|  application/msword|application/msword|fd4df7f89efe9cea7...|4e7be7664bfe992f3...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|         batayan.doc|      doc|  application/msword|application/msword|fc6f45fdfce72d4a3...|e614c9b9e95d64aa6...|0M8R4KGxGuEAAAAAA...|
|http://geocities....| VisitUnitPacket.doc|      doc|  application/msword|application/msword|fc2a0e45b627c3d4a...|dc7ba874b7b13d548...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|vc3c3ppstudyguide...|      doc|  application/msword|application/msword|fc293bbddb906615f...|538aa0d5e2f554258...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|30_chordates_fish...|      doc|  application/msword|application/msword|fc053770a82822f69...|9df86863983889373...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|c6artposterexampl...|      doc|  application/msword|application/msword|fbe2427b48f32d1d9...|47de792202dc3a059...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|        kun20509.doc|      doc|  application/msword|application/msword|fb8d1ae5e3db45131...|6b13d73759a956e62...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|            kun20509|      doc|  application/msword|application/msword|fb8d1ae5e3db45131...|6b13d73759a956e62...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|         Fishing.doc|      doc|  application/msword|application/msword|fb7df7ac80aa2cc8a...|eb4bb266226349bac...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|      resumedoAw.doc|      doc|  application/msword|application/msword|fb6d5bf501b9b97b3...|1e0d6500192d4ee21...|0M8R4KGxGuEAAAAAA...|
+--------------------+--------------------+---------+--------------------+------------------+--------------------+--------------------+--------------------+
only showing top 20 rows

import io.archivesunleashed._
import io.archivesunleashed.udfs._
df: org.apache.spark.sql.DataFrame = [url: string, filename: string ... 6 more fields]
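Because the output is sorted by `md5`, identical files sit next to each other: `kun20509.doc` and `kun20509` above share the same `md5` and `sha1` values. The sketch below groups filenames by hash to surface such duplicates in plain Python. The helper name `group_by_md5` is hypothetical, and the hash strings are the truncated values exactly as displayed, so a real run should compare the full hashes instead.

```python
from collections import defaultdict


def group_by_md5(rows):
    """Map each md5 that occurs more than once to the filenames that share it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["md5"]].append(row["filename"])
    return {md5: names for md5, names in groups.items() if len(names) > 1}


# Rows copied from the output above; md5 values are truncated as displayed.
rows = [
    {"filename": "kun20509.doc", "md5": "fb8d1ae5e3db45131..."},
    {"filename": "kun20509", "md5": "fb8d1ae5e3db45131..."},
    {"filename": "Fishing.doc", "md5": "fb7df7ac80aa2cc8a..."},
    {"filename": "resumedoAw.doc", "md5": "fb6d5bf501b9b97b3..."},
]

duplicates = group_by_md5(rows)
for md5, names in duplicates.items():
    print(md5, names)
```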

To work with all the word processor files in a collection, extract them with the following script:

import io.archivesunleashed._
import io.archivesunleashed.udfs._

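// Load the archives and keep only word processor file records.
val df = RecordLoader.loadArchives("/path/to/warcs", sc).wordProcessorFiles()

// Write the content of the bytes column to disk, taking each file's extension from the extension column.
df.select($"bytes", $"extension")
  .saveToDisk("bytes", "/path/to/export/directory/your-preferred-filename-prefix", $"extension")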

Python DF

The following script:

from aut import *

archive = WebArchive(sc, sqlContext, "/path/to/warcs")

df = archive.word_processor()
df.show()

Will extract all of the following information from word processor files in a web collection:

file URL
filename
extension
MIME type as identified by the hosting web server
MIME type as identified by Apache Tika
md5 hash
sha1 hash
bytes

+--------------------+--------------------+---------+--------------------+------------------+--------------------+--------------------+--------------------+
|                 url|            filename|extension|mime_type_web_server|    mime_type_tika|                 md5|                sha1|               bytes|
+--------------------+--------------------+---------+--------------------+------------------+--------------------+--------------------+--------------------+
|http://geocities....|            Doc2.doc|      doc|  application/msword|application/msword|09159efbefff59f64...|5412d6c55c2c8bec7...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|        CV-ITjob.doc|      doc|  application/msword|application/msword|7f2b7540e558de24e...|96a6ece7202ab309b...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|        CV-Teach.doc|      doc|  application/msword|application/msword|637bb22eff4bc5be5...|76130b6ffeac5c678...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|       CV-covlet.doc|      doc|  application/msword|application/msword|466c06bfa5a47d5cb...|dc763126cbdb589eb...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|        CV-extra.doc|      doc|  application/msword|application/msword|ab0fa931229c02a4b...|4c2a8200e6eaaafb2...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|020410Indonesia_N...|      doc|  application/msword|application/msword|b195e90841347be61...|6d2845902ad15a9a2...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|        Chapter1.doc|      doc|  application/msword|application/msword|65383c8c0cf5b6a4f...|fcf3008e9478b773c...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|CathyKoning_resum...|      doc|  application/msword|application/msword|924ad3f9f66d3c6bd...|2d0887c93ffd3e78b...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|  Greek_colonels.doc|      doc|  application/msword|application/msword|ee4b9db827086d0db...|94e5569e064195db5...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|         resume2.doc|      doc|  application/msword|application/msword|c39fa601733093268...|108563de6ba6102a5...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|     eta_writeup.doc|      doc|  application/msword|application/msword|661328d76ce3aa340...|debadb248da4dfbd3...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|Before_Night_Fall...|      doc|  application/msword|application/msword|a40371b35b4bf0838...|8f1dba8a46ea297b8...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|Membership_Form_2...|      doc|  application/msword|application/msword|bf3a3b8cc86b371c3...|472810e93a2245fb1...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|  walkthroughff1.doc|      doc|  application/msword|application/msword|c97de6941c3fb4aed...|16851a5445bdce07d...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|    Encyclopedia.doc|      doc|  application/msword|application/msword|26a94e8f3358c878c...|07f9b2ce6342f73bc...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|        Y.Kurulu.doc|      doc|  application/msword|application/msword|8e0ebe7c4f27b1841...|ebb5ce328f717f8e6...|0M8R4KGxGuEAAAAAA...|
|http://www.geocit...|      fifty_eggs.doc|      doc|  application/msword|application/msword|2c1cdd4f75030650e...|d022311b2fc399750...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|     1pitagoras2.doc|      doc|  application/msword|application/msword|e07ff47cb8ebc4356...|97d46d781458f5a82...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|    constitution.doc|      doc|  application/msword|application/msword|e38dc3e5d553d8799...|d50096b5208146ce9...|0M8R4KGxGuEAAAAAA...|
|http://geocities....|     feasibility.doc|      doc|  application/msword|application/msword|5574bf82d65935191...|53de74880c9ea2e2b...|0M8R4KGxGuEAAAAAA...|
+--------------------+--------------------+---------+--------------------+------------------+--------------------+--------------------+--------------------+
only showing top 20 rows