The Toolkit requires Java 11.
For macOS: We recommend OpenJDK; more information on Java builds is available from the AdoptOpenJDK project. The easiest way to install it is with Homebrew:

brew cask install adoptopenjdk/openjdk/adoptopenjdk11

If you run into difficulties with Homebrew, manual installation instructions are available from AdoptOpenJDK.
On Debian-based systems, you can install Java using:

apt install openjdk-11-jdk
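Once installed, you can confirm which Java version is active before moving on (the exact vendor string in the output will vary by distribution):

```shell
# Print the active Java version; the Toolkit needs this to report 11.x.
# `java -version` writes to stderr, so redirect it to capture the line.
java -version 2>&1 | head -n 1
```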
Before spark-shell can launch, JAVA_HOME must be set. If you receive an error that JAVA_HOME is not set, you need to point it to where Java is installed. On Linux, this might be export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64; on macOS, it might be the path reported by the /usr/libexec/java_home utility.
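Putting this together, a sketch of setting and verifying JAVA_HOME in your shell profile; the Linux path below is a typical Debian/Ubuntu package location and may differ on your machine:

```shell
# Linux (Debian/Ubuntu OpenJDK package layout assumed):
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

# macOS alternative: /usr/libexec/java_home locates an installed JDK.
# export JAVA_HOME="$(/usr/libexec/java_home -v 11)"

# Confirm the variable points at a working JDK.
"$JAVA_HOME/bin/java" -version
```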
The Toolkit requires Python 3.7.3+
If you would like to use the Archives Unleashed Toolkit with PySpark and
Jupyter Notebooks, you'll need to have a modern version of Python installed.
We recommend using a Python distribution such as Anaconda.
This should install Jupyter Notebook, as well as the PySpark bindings. If
it doesn't, you can install either with conda install or pip install.
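If you need to install the pieces manually, a minimal sketch using pip (package names as published on PyPI):

```shell
# Install Jupyter Notebook and the PySpark bindings
# into the current Python environment.
pip install jupyter pyspark
```

With conda, the equivalent would be `conda install jupyter pyspark`, assuming the packages are available in your configured channels.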
The Toolkit requires Apache Spark 3.0.0+
Download and unzip Apache Spark to a location of your choice.
curl -L "https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz" > spark-3.0.0-bin-hadoop2.7.tgz
tar -xvf spark-3.0.0-bin-hadoop2.7.tgz
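After unpacking, you can sanity-check the download by asking Spark for its version from the unpacked directory (the directory name below assumes the 3.0.0/Hadoop 2.7 archive above):

```shell
# Run from wherever you unpacked the archive;
# this prints the Spark version banner and exits.
cd spark-3.0.0-bin-hadoop2.7
./bin/spark-shell --version
```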