DataFrame Schemas
Below you can find all of the DataFrame schemas available in the Toolkit. For
example, you can use .all() to extract the overall content from a web archive
record. Some of the most popular ones include .all() (which includes content,
URLs, and file types); .webpages() (which includes full-text content and
language); and .webgraph() which includes hyperlink information.
All
.all()
crawl_date(string)url(string)mime_type_web_server(string)mime_type_tika(string)content(string)bytes(binary)http_status_code(string)archive_filename(string)
Web Pages
.webpages()
crawl_date(string)url(string)mime_type_web_server(string)mime_type_tika(string)language(string)content(string)
Web Graph
.webgraph()
crawl_date(string)src(string)dest(string)anchor(string)
Image Graph
.imagegraph()
crawl_date(string)src(string)image_url(string)alt_text(string)
Images
.images()
crawl_date(string)url(string)filename(string)extension(string)mime_type_web_server(string)mime_type_tika(string)width(string)height(string)md5(string)sha1(string)bytes(binary)
PDFs
.pdfs()
crawl_date(string)url(string)filename(string)extension(string)mime_type_web_server(string)mime_type_tika(string)md5(string)sha1(string)bytes(binary)
Audio
.audio()
crawl_date(string)url(string)filename(string)extension(string)mime_type_web_server(string)mime_type_tika(string)md5(string)sha1(string)bytes(binary)
Videos
.videos()
crawl_date(string)url(string)filename(string)extension(string)mime_type_web_server(string)mime_type_tika(string)md5(string)sha1(string)bytes(binary)
Spreadsheets
.spreadsheets()
crawl_date(string)url(string)filename(string)extension(string)mime_type_web_server(string)mime_type_tika(string)md5(string)sha1(string)bytes(binary)
Presentation Program Files
.presentationProgramFiles()
crawl_date(string)url(string)filename(string)extension(string)mime_type_web_server(string)mime_type_tika(string)md5(string)sha1(string)bytes(binary)
Word Processor Files
.wordProcessorFiles()
crawl_date(string)url(string)filename(string)extension(string)mime_type_web_server(string)mime_type_tika(string)md5(string)sha1(string)bytes(binary)