Text Files (html, text, css, js, json, xml) Analysis
Extract CSS Information
Scala RDD
Will not be implemented.
Scala DF
The following script:
import io.archivesunleashed._
import io.archivesunleashed.udfs._
val df = RecordLoader.loadArchives("/path/to/warcs", sc).css();
df.show()
Will extract all following information from css files in a web collection:
- crawl date
- last modified date
- css url
- filename
- extension
- MimeType as identified by the hosting web server
- MimeType as identified by Apache Tika
- md5 hash
- sha1 hash
- content
+--------------+------------------+--------------------+-----------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|    crawl_date|last_modified_date|                 url|         filename|extension|mime_type_web_server|mime_type_tika|                 md5|                sha1|             content|
+--------------+------------------+--------------------+-----------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|20091027143356|    20081201152149|http://geocities....|     will1011.css|      css|            text/css|    text/plain|697e9a984ec5432f0...|ac2343fc128b90c22...|.mstheme{\r\nnav-...|
|20091027143400|    20021203112405|http://geocities....|affl_flmstyle.css|      css|            text/css|    text/plain|c3b976fe4d295f76e...|171967d23d5a44434...|td { font-family:...|
|20091027143406|    20040120180918|http://geocities....|        myCss.css|      css|            text/css|    text/plain|6dc7da3d87cd15674...|b0a3eeb60a809527e...|/* Generic Select...|
|20091027143408|    20040711165229|http://geocities....|         def2.css|      css|            text/css|    text/plain|806da7dcf79931c9e...|14ea1e704ce84f578...|body\r\n{\r\n\tsc...|
|20091027143410|    20030613194405|http://geocities....|    uni1style.css|      css|            text/css|    text/plain|ff2d35d5548169924...|d961c786665e69c30...|body { background...|
|20091027143431|    20090406121936|http://geocities....|        style.css|      css|            text/css|    text/plain|60a5a8cb4f9694179...|452a40545b570b043...|/*\nTheme  ...|
|20091027143453|    20010320155440|http://geocities....|        theme.css|      css|            text/css|    text/plain|afd79d6ebc4918d84...|233fd343a931a8a6b...|.mstheme\r\n{\r\n...|
|20091027143503|    20010320155430|http://geocities....|       color0.css|      css|            text/css|    text/plain|c237436a24f67c96c...|806351cc2fb654fc7...|a:link\r\n{\r\n\t...|
|20091027143511|    20010320155438|http://geocities....|       graph1.css|      css|            text/css|    text/plain|2d3bd2eed7b7290fc...|de4c2c0dc23d5d40d...|.mstheme\r\n{\r\n...|
|20091027143512|    20010320155436|http://geocities....|       graph0.css|      css|            text/css|    text/plain|af18d7c1ab29918e7...|78b3f781992894c9f...|.mstheme\r\n{\r\n...|
|20091027143540|    20000503224221|http://geocities....|       graph1.css|      css|            text/css|    text/plain|d67df8c9f7b5ff787...|338fa4a9a3d7174ef...|.mstheme\r\n{\r\n...|
|20091027143545|    20000503224217|http://geocities....|       color1.css|      css|            text/css|    text/plain|58f313e384d212b71...|193f89e84b25d1614...|a:link\r\n{\r\n\t...|
|20091027143551|    20000503224220|http://geocities....|       graph0.css|      css|            text/css|    text/plain|f5a58785538278992...|109cc3f90e40c66d2...|.mstheme\r\n{\r\n...|
|20091027143554|    20010824074320|http://geocities....|      formate.css|      css|            text/css|    text/plain|be7d072735ad829cf...|a5a0ea5aaf1404713...|h1 { font-size: 1...|
|20091027143600|    20030221224917|http://geocities....|        misc1.css|      css|            text/css|    text/plain|5852d4b0ed5191e47...|f73a37079de2987f6...|body {background:...|
|20091027143659|    20030119042931|http://geocities....|       census.css|      css|            text/css|    text/plain|2eb62774ed251df55...|f6165cc47bd8c46b9...|/* At-Rules */\r\...|
|20091027143721|    20010606220428|http://geocities....|   hauptseite.css|      css|            text/css|    text/plain|87848cebf5ac11eb8...|9948813069d63a0d0...|body {\r\n backgr...|
|20091027143755|    20040316190038|http://geocities....|        style.css|      css|            text/css|    text/plain|b47758ca22799bc7c...|994ecd200c1bfa6c5...|blockquote,div,p,...|
|20091027143757|    20020918082310|http://geocities....|    nomburdua.css|      css|            text/css|    text/plain|c6fe67b54b78b633f...|4dd03720d85f2f982...|.ranti {  font-fa...|
|20091027143842|    20040703142500|http://geocities....|         main.css|      css|            text/css|    text/plain|f7582c9838bebc55b...|cb6e02313c6a5bc99...|.commands {\r\n\t...|
+--------------+------------------+--------------------+-----------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
Python DF
The following script:
from aut import *
archive = WebArchive(sc, sqlContext, "/path/to/warcs")
df = archive.css()
df.show()
Will extract all following information from css files in a web collection:
- crawl date
- last modified date
- css url
- filename
- extension
- MimeType as identified by the hosting web server
- MimeType as identified by Apache Tika
- md5 hash
- sha1 hash
- content
+--------------+------------------+--------------------+-----------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|    crawl_date|last_modified_date|                 url|         filename|extension|mime_type_web_server|mime_type_tika|                 md5|                sha1|             content|
+--------------+------------------+--------------------+-----------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|20091027143356|    20081201152149|http://geocities....|     will1011.css|      css|            text/css|    text/plain|697e9a984ec5432f0...|ac2343fc128b90c22...|.mstheme{\r\nnav-...|
|20091027143400|    20021203112405|http://geocities....|affl_flmstyle.css|      css|            text/css|    text/plain|c3b976fe4d295f76e...|171967d23d5a44434...|td { font-family:...|
|20091027143406|    20040120180918|http://geocities....|        myCss.css|      css|            text/css|    text/plain|6dc7da3d87cd15674...|b0a3eeb60a809527e...|/* Generic Select...|
|20091027143408|    20040711165229|http://geocities....|         def2.css|      css|            text/css|    text/plain|806da7dcf79931c9e...|14ea1e704ce84f578...|body\r\n{\r\n\tsc...|
|20091027143410|    20030613194405|http://geocities....|    uni1style.css|      css|            text/css|    text/plain|ff2d35d5548169924...|d961c786665e69c30...|body { background...|
|20091027143431|    20090406121936|http://geocities....|        style.css|      css|            text/css|    text/plain|60a5a8cb4f9694179...|452a40545b570b043...|/*\nTheme  ...|
|20091027143453|    20010320155440|http://geocities....|        theme.css|      css|            text/css|    text/plain|afd79d6ebc4918d84...|233fd343a931a8a6b...|.mstheme\r\n{\r\n...|
|20091027143503|    20010320155430|http://geocities....|       color0.css|      css|            text/css|    text/plain|c237436a24f67c96c...|806351cc2fb654fc7...|a:link\r\n{\r\n\t...|
|20091027143511|    20010320155438|http://geocities....|       graph1.css|      css|            text/css|    text/plain|2d3bd2eed7b7290fc...|de4c2c0dc23d5d40d...|.mstheme\r\n{\r\n...|
|20091027143512|    20010320155436|http://geocities....|       graph0.css|      css|            text/css|    text/plain|af18d7c1ab29918e7...|78b3f781992894c9f...|.mstheme\r\n{\r\n...|
|20091027143540|    20000503224221|http://geocities....|       graph1.css|      css|            text/css|    text/plain|d67df8c9f7b5ff787...|338fa4a9a3d7174ef...|.mstheme\r\n{\r\n...|
|20091027143545|    20000503224217|http://geocities....|       color1.css|      css|            text/css|    text/plain|58f313e384d212b71...|193f89e84b25d1614...|a:link\r\n{\r\n\t...|
|20091027143551|    20000503224220|http://geocities....|       graph0.css|      css|            text/css|    text/plain|f5a58785538278992...|109cc3f90e40c66d2...|.mstheme\r\n{\r\n...|
|20091027143554|    20010824074320|http://geocities....|      formate.css|      css|            text/css|    text/plain|be7d072735ad829cf...|a5a0ea5aaf1404713...|h1 { font-size: 1...|
|20091027143600|    20030221224917|http://geocities....|        misc1.css|      css|            text/css|    text/plain|5852d4b0ed5191e47...|f73a37079de2987f6...|body {background:...|
|20091027143659|    20030119042931|http://geocities....|       census.css|      css|            text/css|    text/plain|2eb62774ed251df55...|f6165cc47bd8c46b9...|/* At-Rules */\r\...|
|20091027143721|    20010606220428|http://geocities....|   hauptseite.css|      css|            text/css|    text/plain|87848cebf5ac11eb8...|9948813069d63a0d0...|body {\r\n backgr...|
|20091027143755|    20040316190038|http://geocities....|        style.css|      css|            text/css|    text/plain|b47758ca22799bc7c...|994ecd200c1bfa6c5...|blockquote,div,p,...|
|20091027143757|    20020918082310|http://geocities....|    nomburdua.css|      css|            text/css|    text/plain|c6fe67b54b78b633f...|4dd03720d85f2f982...|.ranti {  font-fa...|
|20091027143842|    20040703142500|http://geocities....|         main.css|      css|            text/css|    text/plain|f7582c9838bebc55b...|cb6e02313c6a5bc99...|.commands {\r\n\t...|
+--------------+------------------+--------------------+-----------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
Extract HTML Information
Scala RDD
Will not be implemented.
Scala DF
The following script:
import io.archivesunleashed._
import io.archivesunleashed.udfs._
val df = RecordLoader.loadArchives("/path/to/warcs", sc).html();
df.show()
Will extract all following information from HTML files in a web collection:
- crawl date
- last modified date
- html url
- filename
- extension
- MimeType as identified by the hosting web server
- MimeType as identified by Apache Tika
- md5 hash
- sha1 hash
- content
+--------------+------------------+--------------------+------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|    crawl_date|last_modified_date|                 url|          filename|extension|mime_type_web_server|mime_type_tika|                 md5|                sha1|             content|
+--------------+------------------+--------------------+------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|20091027143351|    20000807145505|http://geocities....|        sld016.htm|      htm|           text/html|    text/plain|127471f3dffa3b0fe...|1c6e3cf8a95ff488f...|\r\n<!--  Present...|
|20091027143351|                  |http://geocities....|                  |     html|           text/html|     text/html|8529ca97e911200cd...|413009a3dd18247d9...|<!DOCTYPE HTML PU...|
|20091027143351|    20010418111505|http://geocities....|           07.html|     html|           text/html|     text/html|2ed0eb187604be1b5...|879ac81e08cde20df...|<html>\r\n<head>\...|
|20091027143351|    20090223023317|http://geocities....|       webpgs.html|     html|           text/html|     text/html|647f21084b24b4413...|d3bceb1e246e883d6...|<HTML><HEAD><TITL...|
|20091027143351|    19990606173754|http://www.geocit...|         ories.htm|      htm|           text/html|     text/html|c97248fb58f5471fe...|ee0935ab18ad6e484...|<!DOCTYPE HTML PU...|
|20091027143351|                  |http://geocities....|                  |     html|           text/html|     text/html|e097ef39e1b8808cb...|4c2488eb2555d7d47...|<!DOCTYPE HTML PU...|
|20091027143351|    20030916023140|http://www.geocit...|    Lienhardt.html|     html|           text/html|     text/html|e8e63dd31072b9fab...|cacc7f50003470cbe...|<html>\n<head>\n<...|
|20091027143351|    20011230150100|http://geocities....|HeartsDelight.html|     html|           text/html|     text/html|367899cf7951c8b71...|828e90996c75b0e8c...|<html>\r\n<head>\...|
|20091027143351|    20010418112637|http://geocities....|           13.html|     html|           text/html|     text/html|73512de4bd8a74d8e...|4911b7b2b6644d96d...|<html>\r\n<head>\...|
|20091027143351|    20090223023317|http://geocities....|          man2.jpg|     html|           text/html|     text/html|647f21084b24b4413...|d3bceb1e246e883d6...|<HTML><HEAD><TITL...|
|20091027143346|                  |http://geocities....|                  |     html|           text/html|     text/html|7f873717f2fd67b8a...|c0ca5f1ae47e57ba6...|<!DOCTYPE HTML PU...|
|20091027143351|                  |http://geocities....|                  |     html|           text/html|     text/html|0b0effa7a9b9ddb7b...|0e2761aabcd497698...|<!DOCTYPE HTML PU...|
|20091027143352|    20000807145520|http://geocities....|        sld014.htm|      htm|           text/html|    text/plain|6efe8c845e4766ca1...|6aa94ff9dee0a6728...|\r\n<!--  Present...|
|20091027143352|                  |http://geocities....|                  |     html|           text/html|     text/html|876a360d146635039...|f6b2ede1616aa2baa...|<!DOCTYPE HTML PU...|
|20091027143352|                  |http://geocities....|                  |     html|           text/html|     text/html|7d6c71d92d1682923...|30c30cdaf888390b9...|<!DOCTYPE HTML PU...|
|20091027143352|                  |http://geocities....|                  |     html|           text/html|     text/html|159df7ea020a07600...|7b7e5bc7a6c1c871b...|<!DOCTYPE HTML PU...|
|20091027143352|    20000807145500|http://geocities....|        sld012.htm|      htm|           text/html|    text/plain|1714d4cb34af991a8...|58289c263e392f1dd...|\r\n<!--  Present...|
|20091027143351|                  |http://geocities....|                  |     html|           text/html|     text/html|bbbfe9f8c5fa52d56...|e753a8e24518c0db4...|<!DOCTYPE HTML PU...|
|20091027143356|    19990606173419|http://www.geocit...|          maps.htm|      htm|           text/html|     text/html|28d2c8b2ffdd85f43...|9931f86198bebc831...|<!DOCTYPE HTML PU...|
|20091027143356|    20040119230348|http://geocities....|    3368class.html|     html|           text/html|     text/html|0ed6f056e7996ee62...|caaf8a4f17dab8932...|<!DOCTYPE HTML PU...|
+--------------+------------------+--------------------+------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
Python DF
The following script:
from aut import *
archive = WebArchive(sc, sqlContext, "/path/to/warcs")
df = archive.html()
df.show()
Will extract all following information from HTML files in a web collection:
- crawl date
- last modified date
- html url
- filename
- extension
- MimeType as identified by the hosting web server
- MimeType as identified by Apache Tika
- md5 hash
- sha1 hash
- content
+--------------+------------------+--------------------+------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|    crawl_date|last_modified_date|                 url|          filename|extension|mime_type_web_server|mime_type_tika|                 md5|                sha1|             content|
+--------------+------------------+--------------------+------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|20091027143351|    20000807145505|http://geocities....|        sld016.htm|      htm|           text/html|    text/plain|127471f3dffa3b0fe...|1c6e3cf8a95ff488f...|\r\n<!--  Present...|
|20091027143351|                  |http://geocities....|                  |     html|           text/html|     text/html|8529ca97e911200cd...|413009a3dd18247d9...|<!DOCTYPE HTML PU...|
|20091027143351|    20010418111505|http://geocities....|           07.html|     html|           text/html|     text/html|2ed0eb187604be1b5...|879ac81e08cde20df...|<html>\r\n<head>\...|
|20091027143351|    20090223023317|http://geocities....|       webpgs.html|     html|           text/html|     text/html|647f21084b24b4413...|d3bceb1e246e883d6...|<HTML><HEAD><TITL...|
|20091027143351|    19990606173754|http://www.geocit...|         ories.htm|      htm|           text/html|     text/html|c97248fb58f5471fe...|ee0935ab18ad6e484...|<!DOCTYPE HTML PU...|
|20091027143351|                  |http://geocities....|                  |     html|           text/html|     text/html|e097ef39e1b8808cb...|4c2488eb2555d7d47...|<!DOCTYPE HTML PU...|
|20091027143351|    20030916023140|http://www.geocit...|    Lienhardt.html|     html|           text/html|     text/html|e8e63dd31072b9fab...|cacc7f50003470cbe...|<html>\n<head>\n<...|
|20091027143351|    20011230150100|http://geocities....|HeartsDelight.html|     html|           text/html|     text/html|367899cf7951c8b71...|828e90996c75b0e8c...|<html>\r\n<head>\...|
|20091027143351|    20010418112637|http://geocities....|           13.html|     html|           text/html|     text/html|73512de4bd8a74d8e...|4911b7b2b6644d96d...|<html>\r\n<head>\...|
|20091027143351|    20090223023317|http://geocities....|          man2.jpg|     html|           text/html|     text/html|647f21084b24b4413...|d3bceb1e246e883d6...|<HTML><HEAD><TITL...|
|20091027143346|                  |http://geocities....|                  |     html|           text/html|     text/html|7f873717f2fd67b8a...|c0ca5f1ae47e57ba6...|<!DOCTYPE HTML PU...|
|20091027143351|                  |http://geocities....|                  |     html|           text/html|     text/html|0b0effa7a9b9ddb7b...|0e2761aabcd497698...|<!DOCTYPE HTML PU...|
|20091027143352|    20000807145520|http://geocities....|        sld014.htm|      htm|           text/html|    text/plain|6efe8c845e4766ca1...|6aa94ff9dee0a6728...|\r\n<!--  Present...|
|20091027143352|                  |http://geocities....|                  |     html|           text/html|     text/html|876a360d146635039...|f6b2ede1616aa2baa...|<!DOCTYPE HTML PU...|
|20091027143352|                  |http://geocities....|                  |     html|           text/html|     text/html|7d6c71d92d1682923...|30c30cdaf888390b9...|<!DOCTYPE HTML PU...|
|20091027143352|                  |http://geocities....|                  |     html|           text/html|     text/html|159df7ea020a07600...|7b7e5bc7a6c1c871b...|<!DOCTYPE HTML PU...|
|20091027143352|    20000807145500|http://geocities....|        sld012.htm|      htm|           text/html|    text/plain|1714d4cb34af991a8...|58289c263e392f1dd...|\r\n<!--  Present...|
|20091027143351|                  |http://geocities....|                  |     html|           text/html|     text/html|bbbfe9f8c5fa52d56...|e753a8e24518c0db4...|<!DOCTYPE HTML PU...|
|20091027143356|    19990606173419|http://www.geocit...|          maps.htm|      htm|           text/html|     text/html|28d2c8b2ffdd85f43...|9931f86198bebc831...|<!DOCTYPE HTML PU...|
|20091027143356|    20040119230348|http://geocities....|    3368class.html|     html|           text/html|     text/html|0ed6f056e7996ee62...|caaf8a4f17dab8932...|<!DOCTYPE HTML PU...|
+--------------+------------------+--------------------+------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
Extract Javascript Information
Scala RDD
Will not be implemented.
Scala DF
The following script:
import io.archivesunleashed._
import io.archivesunleashed.udfs._
val df = RecordLoader.loadArchives("/path/to/warcs", sc).js();
df.show()
Will extract all following information from Javascript files in a web collection:
- crawl date
- last modified date
- Javascript url
- filename
- extension
- MimeType as identified by the hosting web server
- MimeType as identified by Apache Tika
- md5 hash
- sha1 hash
- content
+--------------+------------------+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|    crawl_date|last_modified_date|                 url|            filename|extension|mime_type_web_server|mime_type_tika|                 md5|                sha1|             content|
+--------------+------------------+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|20091027143359|    20021220182656|http://geocities....|affl_2002teams_ve...|       js|application/x-jav...|    text/plain|ecc947cf41560248f...|e200d0cc8d167e588...|//  Copyright (c)...|
|20091027143359|    20021220182656|http://geocities....|affl_2002menu_ver...|       js|application/x-jav...|    text/plain|9ed4199ccb9aadbc6...|ade6e33984a571b4f...|//  Copyright (c)...|
|20091027143359|    20021220182657|http://geocities....|affl_2002weeks_ho...|       js|application/x-jav...|    text/plain|e1d530ba9a3113f76...|21fe28b0bb00fc2ea...|//  Copyright (c)...|
|20091027143401|    20071025193708|http://geocities....|          effects.js|       js|application/x-jav...|    text/plain|82e25a810f86d3b8c...|21ce51daa693e3716...|// Copyright (c) ...|
|20091027143402|    20071025193708|http://geocities....|    scriptaculous.js|       js|application/x-jav...|    text/plain|696bd054b0069b607...|914db330c7fe585df...|// Copyright (c) ...|
|20091027143431|    20090406122000|http://geocities....|              oea.js|       js|application/x-jav...|    text/plain|500ceaa723d95be31...|6185b986af821a054...|addComment={moveF...|
|20091027143433|    20090406122000|http://geocities....|             shot.js|       js|application/x-jav...|    text/plain|cc408c7eba68a6378...|23545b737b19f34c7...|//<!--\n/*! Snap ...|
|20091027143502|    20010925080657|http://geocities....|           bubble.js|       js|application/x-jav...|    text/plain|b11424d14656b3e5e...|36655a71af25d6360...|//+--------------...|
|20091027143509|    20010925101649|http://geocities....|          puzzlex.js|       js|application/x-jav...|    text/plain|0fddc85095cc2476f...|74be255b206285af2...|// ------ this fu...|
|20091027143724|    20000906162714|http://geocities....|            geov2.js|       js|application/x-jav...|    text/plain|26821d2d7f896da03...|29c53eaed516ff43c...|var ycsdone;\nfun...|
|20091027143726|    20090329164941|http://geocities....|             shot.js|       js|application/x-jav...|    text/plain|cc408c7eba68a6378...|23545b737b19f34c7...|//<!--\n/*! Snap ...|
|20091027143728|    20090329164941|http://geocities....|              yre.js|       js|application/x-jav...|    text/plain|500ceaa723d95be31...|6185b986af821a054...|addComment={moveF...|
|20091027143806|    20061210200851|http://geocities....|ActiveContent_Fla...|       js|application/x-jav...|    text/plain|7a1ee205b2dea3f29...|5c52c1be91092b6ef...|/*\r\nIE Flash Ac...|
|20091027143809|    20080407182913|http://geocities....|ActiveContent_Fla...|       js|application/x-jav...|    text/plain|7a1ee205b2dea3f29...|5c52c1be91092b6ef...|/*\r\nIE Flash Ac...|
|20091027143855|    20010303131242|http://www.geocit...|    VMaxDynoMenu3.js|       js|application/x-jav...|    text/plain|029fd2e9f9d2973c3...|c30650399cfc15e8e...|/****************...|
|20091027143908|    20020404031834|http://www.geocit...|vmaxformvalidator.js|       js|application/x-jav...|    text/plain|359e439b13a5fcfd6...|db3db85d75f1086f2...|/****************...|
|20091027143921|    20071031154845|http://geocities....|        op7-build.js|       js|application/x-jav...|    text/plain|c106186f7f04a432e...|082169240fa1a1927...|var docType = (do...|
|20091027143922|    20071031154845|http://geocities....|        saf-build.js|       js|application/x-jav...|    text/plain|93f06ba888f3ed236...|05c4966a8cd488a33...|var docType = (do...|
|20091027143923|    20071031154845|http://geocities....|       ie5m-build.js|       js|application/x-jav...|    text/plain|584a5769d7aa1a80e...|1e17f55615d38cbed...|var docType = (do...|
|20091027143924|    20071031154845|http://geocities....|        ns4-build.js|       js|application/x-jav...|    text/plain|0f7a424f607eba6ab...|3d184d00c2f568bfa...|window.onresize =...|
+--------------+------------------+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
Python DF
The following script:
from aut import *
archive = WebArchive(sc, sqlContext, "/path/to/warcs")
df = archive.js()
df.show()
Will extract all following information from Javascript files in a web collection:
- crawl date
- last modified date
- Javascript url
- filename
- extension
- MimeType as identified by the hosting web server
- MimeType as identified by Apache Tika
- md5 hash
- sha1 hash
- content
+--------------+------------------+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|    crawl_date|last_modified_date|                 url|            filename|extension|mime_type_web_server|mime_type_tika|                 md5|                sha1|             content|
+--------------+------------------+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|20091027143359|    20021220182656|http://geocities....|affl_2002teams_ve...|       js|application/x-jav...|    text/plain|ecc947cf41560248f...|e200d0cc8d167e588...|//  Copyright (c)...|
|20091027143359|    20021220182656|http://geocities....|affl_2002menu_ver...|       js|application/x-jav...|    text/plain|9ed4199ccb9aadbc6...|ade6e33984a571b4f...|//  Copyright (c)...|
|20091027143359|    20021220182657|http://geocities....|affl_2002weeks_ho...|       js|application/x-jav...|    text/plain|e1d530ba9a3113f76...|21fe28b0bb00fc2ea...|//  Copyright (c)...|
|20091027143401|    20071025193708|http://geocities....|          effects.js|       js|application/x-jav...|    text/plain|82e25a810f86d3b8c...|21ce51daa693e3716...|// Copyright (c) ...|
|20091027143402|    20071025193708|http://geocities....|    scriptaculous.js|       js|application/x-jav...|    text/plain|696bd054b0069b607...|914db330c7fe585df...|// Copyright (c) ...|
|20091027143431|    20090406122000|http://geocities....|              oea.js|       js|application/x-jav...|    text/plain|500ceaa723d95be31...|6185b986af821a054...|addComment={moveF...|
|20091027143433|    20090406122000|http://geocities....|             shot.js|       js|application/x-jav...|    text/plain|cc408c7eba68a6378...|23545b737b19f34c7...|//<!--\n/*! Snap ...|
|20091027143502|    20010925080657|http://geocities....|           bubble.js|       js|application/x-jav...|    text/plain|b11424d14656b3e5e...|36655a71af25d6360...|//+--------------...|
|20091027143509|    20010925101649|http://geocities....|          puzzlex.js|       js|application/x-jav...|    text/plain|0fddc85095cc2476f...|74be255b206285af2...|// ------ this fu...|
|20091027143724|    20000906162714|http://geocities....|            geov2.js|       js|application/x-jav...|    text/plain|26821d2d7f896da03...|29c53eaed516ff43c...|var ycsdone;\nfun...|
|20091027143726|    20090329164941|http://geocities....|             shot.js|       js|application/x-jav...|    text/plain|cc408c7eba68a6378...|23545b737b19f34c7...|//<!--\n/*! Snap ...|
|20091027143728|    20090329164941|http://geocities....|              yre.js|       js|application/x-jav...|    text/plain|500ceaa723d95be31...|6185b986af821a054...|addComment={moveF...|
|20091027143806|    20061210200851|http://geocities....|ActiveContent_Fla...|       js|application/x-jav...|    text/plain|7a1ee205b2dea3f29...|5c52c1be91092b6ef...|/*\r\nIE Flash Ac...|
|20091027143809|    20080407182913|http://geocities....|ActiveContent_Fla...|       js|application/x-jav...|    text/plain|7a1ee205b2dea3f29...|5c52c1be91092b6ef...|/*\r\nIE Flash Ac...|
|20091027143855|    20010303131242|http://www.geocit...|    VMaxDynoMenu3.js|       js|application/x-jav...|    text/plain|029fd2e9f9d2973c3...|c30650399cfc15e8e...|/****************...|
|20091027143908|    20020404031834|http://www.geocit...|vmaxformvalidator.js|       js|application/x-jav...|    text/plain|359e439b13a5fcfd6...|db3db85d75f1086f2...|/****************...|
|20091027143921|    20071031154845|http://geocities....|        op7-build.js|       js|application/x-jav...|    text/plain|c106186f7f04a432e...|082169240fa1a1927...|var docType = (do...|
|20091027143922|    20071031154845|http://geocities....|        saf-build.js|       js|application/x-jav...|    text/plain|93f06ba888f3ed236...|05c4966a8cd488a33...|var docType = (do...|
|20091027143923|    20071031154845|http://geocities....|       ie5m-build.js|       js|application/x-jav...|    text/plain|584a5769d7aa1a80e...|1e17f55615d38cbed...|var docType = (do...|
|20091027143924|    20071031154845|http://geocities....|        ns4-build.js|       js|application/x-jav...|    text/plain|0f7a424f607eba6ab...|3d184d00c2f568bfa...|window.onresize =...|
+--------------+------------------+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
Extract JSON Information
Scala RDD
Will not be implemented.
Scala DF
The following script:
import io.archivesunleashed._
import io.archivesunleashed.udfs._
val df = RecordLoader.loadArchives("/path/to/warcs", sc).json();
df.show()
Will extract all following information from JSON files in a web collection:
- crawl date
- last modified date
- JSON url
- filename
- extension
- MimeType as identified by the hosting web server
- MimeType as identified by Apache Tika
- md5 hash
- sha1 hash
- content
+--------------+------------------+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|    crawl_date|last_modified_date|                 url|            filename|extension|mime_type_web_server|mime_type_tika|                 md5|                sha1|             content|
+--------------+------------------+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|20200629190008|                  |https://map.toron...|findAddressCandid...|     json|    application/json|           N/A|d41d8cd98f00b204e...|da39a3ee5e6b4b0d3...|                    |
|20200629190022|                  |https://c.oraclei...|          robots.txt|     json|    application/json|           N/A|d41d8cd98f00b204e...|da39a3ee5e6b4b0d3...|                    |
|20200629190148|                  |https://www.toron...|                    |     json|    application/json|           N/A|d41d8cd98f00b204e...|da39a3ee5e6b4b0d3...|                    |
|20200629190152|                  |https://www.toron...|                    |     json|    application/json|           N/A|d41d8cd98f00b204e...|da39a3ee5e6b4b0d3...|                    |
|20200629190302|                  |https://www.toron...|                    |     json|    application/json|           N/A|d41d8cd98f00b204e...|da39a3ee5e6b4b0d3...|                    |
|20200817190116|                  |https://api.ontar...|page%2Fhow-ontari...|     json|    application/json|    text/plain|15c31f1d6321af994...|9fe80f13837add909...|{"_id":"5f3a984bc...|
|20200817190119|                  |https://api.ontar...|page%2Fhow-ontari...|     json|    application/json|    text/plain|05a4b7753855bdc62...|6fe1660d9664c3b78...|{"_id":"5f3a984bc...|
|20200721190037|                  |https://map.toron...|findAddressCandid...|     json|    application/json|           N/A|d41d8cd98f00b204e...|da39a3ee5e6b4b0d3...|                    |
|20200721190451|                  |https://www.toron...|                    |     json|    application/json|    text/plain|ec308cb0f98033491...|694def805e20eb826...|[{"answer":"Monit...|
|20200721191757|                  |https://c.oraclei...|          robots.txt|     json|    application/json|           N/A|d41d8cd98f00b204e...|da39a3ee5e6b4b0d3...|                    |
|20200721191846|                  |https://www.toron...|                    |     json|    application/json|           N/A|d41d8cd98f00b204e...|da39a3ee5e6b4b0d3...|                    |
|20200721191858|                  |https://www.toron...|                    |     json|    application/json|           N/A|d41d8cd98f00b204e...|da39a3ee5e6b4b0d3...|                    |
|20200808190252|                  |https://api.ontar...|page%2Fhow-ontari...|     json|    application/json|    text/plain|08816f4b398bb1228...|78e73437f1b8085ab...|{"_id":"5f2ec105c...|
|20200808190257|                  |https://api.ontar...|page%2Fhow-ontari...|     json|    application/json|    text/plain|554d765096250e4e4...|861c69ee9c671c716...|{"_id":"5f2ec105c...|
|20200721191749|                  |https://api.ontar...|page%2Fhow-ontari...|     json|    application/json|    text/plain|622058a03b029934f...|2fd5dbdf3c51678d0...|{"_id":"5f16ff51c...|
|20200721191751|                  |https://api.ontar...|page%2Fhow-ontari...|     json|    application/json|    text/plain|fe91df263ed26a5cb...|e59d5e0eb0eadff05...|{"_id":"5f16ff51c...|
|20200716190031|                  |https://c.oraclei...|          robots.txt|     json|    application/json|           N/A|d41d8cd98f00b204e...|da39a3ee5e6b4b0d3...|                    |
|20200716190055|                  |https://map.toron...|findAddressCandid...|     json|    application/json|           N/A|d41d8cd98f00b204e...|da39a3ee5e6b4b0d3...|                    |
|20200716190605|                  |https://www.toron...|                    |     json|    application/json|    text/plain|8d6accf32259c10e4...|b28ce854dee0e169e...|[{"answer":"Monit...|
|20200716190702|                  |https://www.toron...|                    |     json|    application/json|           N/A|d41d8cd98f00b204e...|da39a3ee5e6b4b0d3...|                    |
+--------------+------------------+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
Python DF
The following script:
from aut import *
archive = WebArchive(sc, sqlContext, "/path/to/warcs")
df = archive.json()
df.show()
Will extract all following information from JSON files in a web collection:
- crawl date
- last modified date
- JSON url
- filename
- extension
- MimeType as identified by the hosting web server
- MimeType as identified by Apache Tika
- md5 hash
- sha1 hash
- content
+--------------+------------------+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|    crawl_date|last_modified_date|                 url|            filename|extension|mime_type_web_server|mime_type_tika|                 md5|                sha1|             content|
+--------------+------------------+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|20200629190008|                  |https://map.toron...|findAddressCandid...|     json|    application/json|           N/A|d41d8cd98f00b204e...|da39a3ee5e6b4b0d3...|                    |
|20200629190022|                  |https://c.oraclei...|          robots.txt|     json|    application/json|           N/A|d41d8cd98f00b204e...|da39a3ee5e6b4b0d3...|                    |
|20200629190148|                  |https://www.toron...|                    |     json|    application/json|           N/A|d41d8cd98f00b204e...|da39a3ee5e6b4b0d3...|                    |
|20200629190152|                  |https://www.toron...|                    |     json|    application/json|           N/A|d41d8cd98f00b204e...|da39a3ee5e6b4b0d3...|                    |
|20200629190302|                  |https://www.toron...|                    |     json|    application/json|           N/A|d41d8cd98f00b204e...|da39a3ee5e6b4b0d3...|                    |
|20200817190116|                  |https://api.ontar...|page%2Fhow-ontari...|     json|    application/json|    text/plain|15c31f1d6321af994...|9fe80f13837add909...|{"_id":"5f3a984bc...|
|20200817190119|                  |https://api.ontar...|page%2Fhow-ontari...|     json|    application/json|    text/plain|05a4b7753855bdc62...|6fe1660d9664c3b78...|{"_id":"5f3a984bc...|
|20200721190037|                  |https://map.toron...|findAddressCandid...|     json|    application/json|           N/A|d41d8cd98f00b204e...|da39a3ee5e6b4b0d3...|                    |
|20200721190451|                  |https://www.toron...|                    |     json|    application/json|    text/plain|ec308cb0f98033491...|694def805e20eb826...|[{"answer":"Monit...|
|20200721191757|                  |https://c.oraclei...|          robots.txt|     json|    application/json|           N/A|d41d8cd98f00b204e...|da39a3ee5e6b4b0d3...|                    |
|20200721191846|                  |https://www.toron...|                    |     json|    application/json|           N/A|d41d8cd98f00b204e...|da39a3ee5e6b4b0d3...|                    |
|20200721191858|                  |https://www.toron...|                    |     json|    application/json|           N/A|d41d8cd98f00b204e...|da39a3ee5e6b4b0d3...|                    |
|20200808190252|                  |https://api.ontar...|page%2Fhow-ontari...|     json|    application/json|    text/plain|08816f4b398bb1228...|78e73437f1b8085ab...|{"_id":"5f2ec105c...|
|20200808190257|                  |https://api.ontar...|page%2Fhow-ontari...|     json|    application/json|    text/plain|554d765096250e4e4...|861c69ee9c671c716...|{"_id":"5f2ec105c...|
|20200721191749|                  |https://api.ontar...|page%2Fhow-ontari...|     json|    application/json|    text/plain|622058a03b029934f...|2fd5dbdf3c51678d0...|{"_id":"5f16ff51c...|
|20200721191751|                  |https://api.ontar...|page%2Fhow-ontari...|     json|    application/json|    text/plain|fe91df263ed26a5cb...|e59d5e0eb0eadff05...|{"_id":"5f16ff51c...|
|20200716190031|                  |https://c.oraclei...|          robots.txt|     json|    application/json|           N/A|d41d8cd98f00b204e...|da39a3ee5e6b4b0d3...|                    |
|20200716190055|                  |https://map.toron...|findAddressCandid...|     json|    application/json|           N/A|d41d8cd98f00b204e...|da39a3ee5e6b4b0d3...|                    |
|20200716190605|                  |https://www.toron...|                    |     json|    application/json|    text/plain|8d6accf32259c10e4...|b28ce854dee0e169e...|[{"answer":"Monit...|
|20200716190702|                  |https://www.toron...|                    |     json|    application/json|           N/A|d41d8cd98f00b204e...|da39a3ee5e6b4b0d3...|                    |
+--------------+------------------+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
Extract Plain Text Information
Scala RDD
Will not be implemented.
Scala DF
The following script:
import io.archivesunleashed._
import io.archivesunleashed.udfs._
val df = RecordLoader.loadArchives("/path/to/warcs", sc).plainText();
df.show()
Will extract all following information from plain text files in a web collection:
- crawl date
- last modified date
- Plain text url
- filename
- extension
- MimeType as identified by the hosting web server
- MimeType as identified by Apache Tika
- md5 hash
- sha1 hash
- content
+--------------+------------------+--------------------+-------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|    crawl_date|last_modified_date|                 url|     filename|extension|mime_type_web_server|mime_type_tika|                 md5|                sha1|             content|
+--------------+------------------+--------------------+-------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|20091027143357|    20050624112247|http://geocities....|      100.txt|      txt|          text/plain|    text/plain|d551b4e1341fedeb1...|6da698a82e6f2cd86...|Checklist With Pe...|
|20091027143453|    20030614122143|http://geocities....| Password.pas|      txt|          text/plain|    text/plain|af14db1afe4e91655...|94639d5ae81de1829...|michael\r\njacob\...|
|20091027143503|    20081106140745|http://www.knatte...|   robots.txt|      txt|          text/plain|    text/plain|48ca1fcf2991ae97e...|a2a115de2795276ab...|# parking, see cv...|
|20091027143514|    20071122094558|http://geocities....| plansite.txt|      txt|          text/plain|    text/plain|e6bda66886cf51a42...|70514bd62a54738dc...|Index.html\r\n\r\...|
|20091027143519|    20070628091530|http://geocities....|  xmldata.txt|      txt|          text/plain|    text/plain|3c66ec8b796fd012a...|934ba89998a9f9089...|set xmlDoc=Create...|
|20091027143530|    20090310131332|http://geocities....|Manhattan.txt|      txt|          text/plain|    text/plain|33bec040c2319f50d...|8b78c2a100dc13687...|Endwich Bank \r\n...|
|20091027143604|    20030802211911|http://geocities....|   WS_FTP.LOG|      txt|          text/plain|    text/plain|b9132d6f9170ece38...|0b66844444a894955...|2003.08.02 03:01 ...|
|20091027143610|    20040929172044|http://geocities....|1-teras-1.txt|      txt|          text/plain|    text/plain|ca1cdfd1b1ae53d02...|2e9d01175b3b549f1...|KASIH FOTO DJAKA ...|
|20091027143610|    20040929172044|http://geocities....|1-kanan-1.txt|      txt|          text/plain|    text/plain|516ed33a73a93f41c...|c7c9ff394e51e8be2...|Jaringan Azhari \...|
|20091027143612|    20040929172255|http://geocities....|    2open.txt|      txt|          text/plain|    text/plain|3e0f80d7b1a3bf060...|c681bd52d84e66a68...|Banteng Boyolali ...|
|20091027143612|    20040929172044|http://geocities....|1-teras-2.txt|      txt|          text/plain|    text/plain|a5a0abbe4138622a3...|61aa0c8a2455d1149...|Targetnya Ngerem ...|
|20091027143613|    20050401210115|http://geocities....|  scans85.txt|      txt|          text/plain|    text/plain|d1387c7eaf4377f02...|b2a18ea8434a34d94...|Sector Spy Report...|
|20091027143613|    20040929172044|http://geocities....|    1-box.txt|      txt|          text/plain|    text/plain|3105034b810cc7fbe...|d99c38f517cab35ab...|Wali Kota Slamet ...|
|20091027143614|    20040929172044|http://geocities....| 1-banner.txt|      txt|          text/plain|    text/plain|210378da87dad8670...|d441267103dc9f5bc...|Mencoba Membuka P...|
|20091027143614|    20040929172255|http://geocities....|     2bok.txt|      txt|          text/plain|    text/plain|5770633f8db92fa92...|c9fce5df4fc48fa32...|//Ada foto "sai-k...|
|20091027143614|    20040929183551|http://geocities....|or-persis.txt|      txt|          text/plain|    text/plain|67395ee85f65defeb...|1770bfc0429015c26...|OK TEJ\r\nKASIH F...|
|20091027143615|    20040929172255|http://geocities....|  2teras2.txt|      txt|          text/plain|    text/plain|5e220b3462bf1ddd3...|0a8e612d0bba58438...|//Ada foto Eko Yu...|
|20091027143615|    20060307030916|http://geocities....|     12-1.txt|      txt|          text/plain|    text/plain|ea593b76a0c226242...|007baa64ed745a838...|Planet Scan on Th...|
|20091027143615|    20040929172255|http://geocities....|  2teras3.txt|      txt|          text/plain|    text/plain|f4239d5c1258e559f...|b6f75b6f657eba68c...|*Upah\r\nRapat Al...|
|20091027143616|    20040929172255|http://geocities....|  2kanan1.txt|      txt|          text/plain|    text/plain|200615b82bfaf7d25...|ce11da3111186d24d...|Mantan DPRD Incar...|
+--------------+------------------+--------------------+-------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
Python DF
The following script:
from aut import *
archive = WebArchive(sc, sqlContext, "/path/to/warcs")
df = archive.plain_text()
df.show()
Will extract all following information from plain text files in a web collection:
- crawl date
- last modified date
- Plain text url
- filename
- extension
- MimeType as identified by the hosting web server
- MimeType as identified by Apache Tika
- md5 hash
- sha1 hash
- content
+--------------+------------------+--------------------+-------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|    crawl_date|last_modified_date|                 url|     filename|extension|mime_type_web_server|mime_type_tika|                 md5|                sha1|             content|
+--------------+------------------+--------------------+-------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|20091027143357|    20050624112247|http://geocities....|      100.txt|      txt|          text/plain|    text/plain|d551b4e1341fedeb1...|6da698a82e6f2cd86...|Checklist With Pe...|
|20091027143453|    20030614122143|http://geocities....| Password.pas|      txt|          text/plain|    text/plain|af14db1afe4e91655...|94639d5ae81de1829...|michael\r\njacob\...|
|20091027143503|    20081106140745|http://www.knatte...|   robots.txt|      txt|          text/plain|    text/plain|48ca1fcf2991ae97e...|a2a115de2795276ab...|# parking, see cv...|
|20091027143514|    20071122094558|http://geocities....| plansite.txt|      txt|          text/plain|    text/plain|e6bda66886cf51a42...|70514bd62a54738dc...|Index.html\r\n\r\...|
|20091027143519|    20070628091530|http://geocities....|  xmldata.txt|      txt|          text/plain|    text/plain|3c66ec8b796fd012a...|934ba89998a9f9089...|set xmlDoc=Create...|
|20091027143530|    20090310131332|http://geocities....|Manhattan.txt|      txt|          text/plain|    text/plain|33bec040c2319f50d...|8b78c2a100dc13687...|Endwich Bank \r\n...|
|20091027143604|    20030802211911|http://geocities....|   WS_FTP.LOG|      txt|          text/plain|    text/plain|b9132d6f9170ece38...|0b66844444a894955...|2003.08.02 03:01 ...|
|20091027143610|    20040929172044|http://geocities....|1-teras-1.txt|      txt|          text/plain|    text/plain|ca1cdfd1b1ae53d02...|2e9d01175b3b549f1...|KASIH FOTO DJAKA ...|
|20091027143610|    20040929172044|http://geocities....|1-kanan-1.txt|      txt|          text/plain|    text/plain|516ed33a73a93f41c...|c7c9ff394e51e8be2...|Jaringan Azhari \...|
|20091027143612|    20040929172255|http://geocities....|    2open.txt|      txt|          text/plain|    text/plain|3e0f80d7b1a3bf060...|c681bd52d84e66a68...|Banteng Boyolali ...|
|20091027143612|    20040929172044|http://geocities....|1-teras-2.txt|      txt|          text/plain|    text/plain|a5a0abbe4138622a3...|61aa0c8a2455d1149...|Targetnya Ngerem ...|
|20091027143613|    20050401210115|http://geocities....|  scans85.txt|      txt|          text/plain|    text/plain|d1387c7eaf4377f02...|b2a18ea8434a34d94...|Sector Spy Report...|
|20091027143613|    20040929172044|http://geocities....|    1-box.txt|      txt|          text/plain|    text/plain|3105034b810cc7fbe...|d99c38f517cab35ab...|Wali Kota Slamet ...|
|20091027143614|    20040929172044|http://geocities....| 1-banner.txt|      txt|          text/plain|    text/plain|210378da87dad8670...|d441267103dc9f5bc...|Mencoba Membuka P...|
|20091027143614|    20040929172255|http://geocities....|     2bok.txt|      txt|          text/plain|    text/plain|5770633f8db92fa92...|c9fce5df4fc48fa32...|//Ada foto "sai-k...|
|20091027143614|    20040929183551|http://geocities....|or-persis.txt|      txt|          text/plain|    text/plain|67395ee85f65defeb...|1770bfc0429015c26...|OK TEJ\r\nKASIH F...|
|20091027143615|    20040929172255|http://geocities....|  2teras2.txt|      txt|          text/plain|    text/plain|5e220b3462bf1ddd3...|0a8e612d0bba58438...|//Ada foto Eko Yu...|
|20091027143615|    20060307030916|http://geocities....|     12-1.txt|      txt|          text/plain|    text/plain|ea593b76a0c226242...|007baa64ed745a838...|Planet Scan on Th...|
|20091027143615|    20040929172255|http://geocities....|  2teras3.txt|      txt|          text/plain|    text/plain|f4239d5c1258e559f...|b6f75b6f657eba68c...|*Upah\r\nRapat Al...|
|20091027143616|    20040929172255|http://geocities....|  2kanan1.txt|      txt|          text/plain|    text/plain|200615b82bfaf7d25...|ce11da3111186d24d...|Mantan DPRD Incar...|
+--------------+------------------+--------------------+-------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
Extract XML Information
Scala RDD
Will not be implemented.
Scala DF
The following script:
import io.archivesunleashed._
import io.archivesunleashed.udfs._
val df = RecordLoader.loadArchives("/path/to/warcs", sc).xml();
df.show()
Will extract all following information from XML files in a web collection:
- crawl date
- last modified date
- XML url
- filename
- extension
- MimeType as identified by the hosting web server
- MimeType as identified by Apache Tika
- md5 hash
- sha1 hash
- content
+--------------+------------------+--------------------+------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
|    crawl_date|last_modified_date|                 url|    filename|extension|mime_type_web_server| mime_type_tika|                 md5|                sha1|             content|
+--------------+------------------+--------------------+------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
|20091027143451|    20040825111612|http://geocities....|filelist.xml|      xml|     application/xml|     text/plain|995d283bbec75fab9...|ab3510dd03546c0e2...|<xml xmlns:o="urn...|
|20091027143521|    20070628091530|http://geocities....| catalog.xml|      xml|     application/xml|     text/plain|8658fa2b69203f4ca...|48c083d2d2c44f7aa...|  <?xml version="...|
|20091027143605|                  |http://geocities....|  index.html|      xml|            text/xml|     text/plain|68b329da9893e3409...|adc83b19e793491b1...|                  \n|
|20091027143750|    20020410223310|http://www.geocit...|filelist.xml|      xml|     application/xml|     text/plain|cf06c050636f13004...|80dcc1e1c1c954da1...|<xml xmlns:o="urn...|
|20091027143759|    20020410224422|http://www.geocit...|filelist.xml|      xml|     application/xml|     text/plain|1399c0b1979207eed...|278fc968f5d300836...|<xml xmlns:o="urn...|
|20091027144100|    20090401195922|http://geocities....| sitemap.xml|      xml|     application/xml|application/xml|18e9b399cd0f9d6f9...|57db99ccc2449ad27...|<?xml version="1....|
|20091027144136|    20030812120928|http://geocities....|filelist.xml|      xml|     application/xml|     text/plain|4034a5e018168f946...|f9b43c3ca1811037b...|<xml xmlns:o="urn...|
|20091027144146|    20090326194824|http://geocities....| sitemap.xml|      xml|     application/xml|application/xml|ffbe9a625e027d851...|d78f1dd4f14d62057...|<?xml version="1....|
|20091027144203|    20011208010131|http://geocities....|filelist.xml|      xml|     application/xml|     text/plain|309381218ff9f6e10...|ef604ebedfb4a1bf1...|<xml xmlns:o="urn...|
|20091027144214|    20020204215450|http://geocities....|filelist.xml|      xml|     application/xml|     text/plain|125f24f030aafbf04...|75f1fbf18bfdc3bb7...|<xml xmlns:o="urn...|
|20091027144213|    20090227121308|http://geocities....| sitemap.xml|      xml|     application/xml|application/xml|8db8406401c4efd3f...|2f8292984adbc9554...|<?xml version="1....|
|20091027144240|    20020204215429|http://geocities....|filelist.xml|      xml|     application/xml|     text/plain|db87b5aaf61325c44...|fe8c9fb6cd1bf4b01...|<xml xmlns:o="urn...|
|20091027144353|    20020216004601|http://geocities....|master04.xml|      xml|     application/xml|     text/plain|c669a2a31f1eaec70...|6cde4202cff523081...|<xml xmlns:v="urn...|
|20091027144407|    20020216004514|http://geocities....|master03.xml|      xml|     application/xml|     text/plain|08b06e4f5f7f01290...|499a75007921e840d...|<xml xmlns:v="urn...|
|20091027144702|    20040112203005|http://geocities....|filelist.xml|      xml|     application/xml|     text/plain|abf2e9d135f128f70...|047f4d1ee41e05c05...|<xml xmlns:o="urn...|
|20091027144707|    20000823213238|http://geocities....|filelist.xml|      xml|     application/xml|     text/plain|94a6bb18c1d70e878...|a3ea0e620441e3d6a...|<xml xmlns:o="urn...|
|20091027144835|    20020322203618|http://geocities....|filelist.xml|      xml|     application/xml|     text/plain|aec7bf975bd5133bc...|9d1284eec48acf842...|<xml xmlns:o="urn...|
|20091027144926|    20060313173232|http://geocities....|filelist.xml|      xml|     application/xml|     text/plain|5db99547f7684fca3...|3a6edbd7cab3863ca...|<xml xmlns:o="urn...|
|20091027145209|    20011013135956|http://geocities....|filelist.xml|      xml|     application/xml|     text/plain|bb20bbce2bfe6c12d...|af5cc58c4cff52321...|<xml xmlns:o="urn...|
|20091027145221|    20011015183425|http://geocities....|filelist.xml|      xml|     application/xml|     text/plain|d62f8146726b13a9d...|efbd2bb87b72e56ea...|<xml xmlns:o="urn...|
+--------------+------------------+--------------------+------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
Python DF
The following script:
from aut import *
archive = WebArchive(sc, sqlContext, "/path/to/warcs")
df = archive.xml()
df.show()
Will extract all following information from XML files in a web collection:
- crawl date
- last modified date
- XML url
- filename
- extension
- MimeType as identified by the hosting web server
- MimeType as identified by Apache Tika
- md5 hash
- sha1 hash
- content
+--------------+------------------+--------------------+------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
|    crawl_date|last_modified_date|                 url|    filename|extension|mime_type_web_server| mime_type_tika|                 md5|                sha1|             content|
+--------------+------------------+--------------------+------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
|20091027143451|    20040825111612|http://geocities....|filelist.xml|      xml|     application/xml|     text/plain|995d283bbec75fab9...|ab3510dd03546c0e2...|<xml xmlns:o="urn...|
|20091027143521|    20070628091530|http://geocities....| catalog.xml|      xml|     application/xml|     text/plain|8658fa2b69203f4ca...|48c083d2d2c44f7aa...|  <?xml version="...|
|20091027143605|                  |http://geocities....|  index.html|      xml|            text/xml|     text/plain|68b329da9893e3409...|adc83b19e793491b1...|                  \n|
|20091027143750|    20020410223310|http://www.geocit...|filelist.xml|      xml|     application/xml|     text/plain|cf06c050636f13004...|80dcc1e1c1c954da1...|<xml xmlns:o="urn...|
|20091027143759|    20020410224422|http://www.geocit...|filelist.xml|      xml|     application/xml|     text/plain|1399c0b1979207eed...|278fc968f5d300836...|<xml xmlns:o="urn...|
|20091027144100|    20090401195922|http://geocities....| sitemap.xml|      xml|     application/xml|application/xml|18e9b399cd0f9d6f9...|57db99ccc2449ad27...|<?xml version="1....|
|20091027144136|    20030812120928|http://geocities....|filelist.xml|      xml|     application/xml|     text/plain|4034a5e018168f946...|f9b43c3ca1811037b...|<xml xmlns:o="urn...|
|20091027144146|    20090326194824|http://geocities....| sitemap.xml|      xml|     application/xml|application/xml|ffbe9a625e027d851...|d78f1dd4f14d62057...|<?xml version="1....|
|20091027144203|    20011208010131|http://geocities....|filelist.xml|      xml|     application/xml|     text/plain|309381218ff9f6e10...|ef604ebedfb4a1bf1...|<xml xmlns:o="urn...|
|20091027144214|    20020204215450|http://geocities....|filelist.xml|      xml|     application/xml|     text/plain|125f24f030aafbf04...|75f1fbf18bfdc3bb7...|<xml xmlns:o="urn...|
|20091027144213|    20090227121308|http://geocities....| sitemap.xml|      xml|     application/xml|application/xml|8db8406401c4efd3f...|2f8292984adbc9554...|<?xml version="1....|
|20091027144240|    20020204215429|http://geocities....|filelist.xml|      xml|     application/xml|     text/plain|db87b5aaf61325c44...|fe8c9fb6cd1bf4b01...|<xml xmlns:o="urn...|
|20091027144353|    20020216004601|http://geocities....|master04.xml|      xml|     application/xml|     text/plain|c669a2a31f1eaec70...|6cde4202cff523081...|<xml xmlns:v="urn...|
|20091027144407|    20020216004514|http://geocities....|master03.xml|      xml|     application/xml|     text/plain|08b06e4f5f7f01290...|499a75007921e840d...|<xml xmlns:v="urn...|
|20091027144702|    20040112203005|http://geocities....|filelist.xml|      xml|     application/xml|     text/plain|abf2e9d135f128f70...|047f4d1ee41e05c05...|<xml xmlns:o="urn...|
|20091027144707|    20000823213238|http://geocities....|filelist.xml|      xml|     application/xml|     text/plain|94a6bb18c1d70e878...|a3ea0e620441e3d6a...|<xml xmlns:o="urn...|
|20091027144835|    20020322203618|http://geocities....|filelist.xml|      xml|     application/xml|     text/plain|aec7bf975bd5133bc...|9d1284eec48acf842...|<xml xmlns:o="urn...|
|20091027144926|    20060313173232|http://geocities....|filelist.xml|      xml|     application/xml|     text/plain|5db99547f7684fca3...|3a6edbd7cab3863ca...|<xml xmlns:o="urn...|
|20091027145209|    20011013135956|http://geocities....|filelist.xml|      xml|     application/xml|     text/plain|bb20bbce2bfe6c12d...|af5cc58c4cff52321...|<xml xmlns:o="urn...|
|20091027145221|    20011015183425|http://geocities....|filelist.xml|      xml|     application/xml|     text/plain|d62f8146726b13a9d...|efbd2bb87b72e56ea...|<xml xmlns:o="urn...|
+--------------+------------------+--------------------+------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+