The Toolkit at Scale
As your collections grow, you may need to provide more resources and adjust
Apache Spark configuration options. Apache Spark has great
Configuration and
Tuning guides that are
worth checking out. If you're not sure where to start with scaling, join us in
Slack in the #aut
channel, and we might be
able to provide some guidance.
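For example, arbitrary Spark options can be passed to spark-shell with --conf. The command below is only a sketch: the spark.serializer setting and the memory value are illustrative, and the options that actually help will depend on your collection and hardware.
spark-shell --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --driver-memory 8G --jars /path/to/aut-0.80.0-fatjar.jar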
A Note on Memory and Cores
As your datasets grow, you may need to provide more memory to Apache Spark. You'll know this if you get an error saying that you have run out of "Java Heap Space."
You can add a configuration option for adjusting available memory like so:
spark-shell --driver-memory 4G --jars /path/to/aut-0.80.0-fatjar.jar
In the above case, you give Apache Spark 4GB of memory to execute the program.
In some other cases, despite giving AUT sufficient memory, you may still encounter Java Heap Space issues. In those cases, it is worth trying to lower the number of worker threads. When running locally (i.e., on a single laptop, desktop, or server), AUT by default runs as many threads as your machine has cores.
On a 16-core machine, you may want to drop to 12 cores if you are having memory issues. This will increase stability but decrease performance a bit.
You can do so like this (the example is using 12 threads on a 16-core machine):
spark-shell --master local[12] --driver-memory 4G --jars /path/to/aut-0.80.0-fatjar.jar
If you continue to have errors, look at your output and logs. They will usually
point you in the right direction. For instance, you may also need to increase
the network timeout value. Once in a while, AUT might get stuck on an odd
record and take longer than normal to process it. Adding --conf spark.network.timeout=10000000
will ensure that AUT continues to work on the
material, although it may take a while to process. This command then works:
spark-shell --master local[12] --driver-memory 90G --conf spark.network.timeout=10000000 --jars /path/to/aut-0.80.0-fatjar.jar
Reading Data from AWS S3
We also support loading data stored in Amazon S3. This advanced functionality requires that you provide Spark shell with your AWS Access Key and AWS Secret Key, which you will get when creating your AWS credentials (read more here).
This script, for example, will find the top ten domains from a set of WARCs found in an S3 bucket.
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
// Provide your AWS credentials to the S3A filesystem connector.
sc.hadoopConfiguration.set("fs.s3a.access.key", "<my-access-key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<my-secret-key>")
// Load the WARCs from the bucket, extract the domain of each valid page,
// count the domains, and take the top ten.
RecordLoader.loadArchives("s3a://<my-bucket>/*.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)
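Note that the S3A connector (the hadoop-aws module and its AWS SDK dependency) needs to be on the Spark classpath for s3a:// paths to work. If your Spark distribution does not bundle it, one way to pull it in is with --packages when starting spark-shell; the version shown below is an assumption and should match the Hadoop version of your Spark build.
spark-shell --packages org.apache.hadoop:hadoop-aws:3.3.4 --jars /path/to/aut-0.80.0-fatjar.jar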
Reading Data from an S3-like Endpoint
We also support loading data stored in an Amazon S3-like system such as Ceph RADOS. Similar to the above example, you'll need an access key and secret, and additionally, you'll need to define your endpoint.
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
// Provide your credentials and the S3-compatible endpoint to the S3A connector.
sc.hadoopConfiguration.set("fs.s3a.access.key", "<my-access-key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<my-secret-key>")
sc.hadoopConfiguration.set("fs.s3a.endpoint", "<my-end-point>")
// Load the WARCs, extract the domain of each valid page, and take the top ten counts.
RecordLoader.loadArchives("s3a://<my-bucket>/*.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)
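Some S3-compatible stores expect path-style rather than virtual-hosted-style requests. Whether you need this depends on how your endpoint is configured, but if connections fail with bucket-resolution errors, you can also try adding:
sc.hadoopConfiguration.setBoolean("fs.s3a.path.style.access", true)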
Troubleshooting S3
If you run into this AmazonHttpClient
timeout error:
19/10/24 11:12:51 INFO AmazonHttpClient: Unable to execute HTTP request: Timeout waiting for connection from pool
org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
at org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:231)
at org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:200)
at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
You can add the following two configuration lines to your script:
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.setInt("fs.s3a.connection.maximum", 100)
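If you would rather not hard-code these settings in your script, the same properties can be supplied when starting spark-shell using Spark's spark.hadoop.* prefix, which copies them into the Hadoop configuration. For example:
spark-shell --conf spark.hadoop.fs.s3a.connection.maximum=100 --jars /path/to/aut-0.80.0-fatjar.jar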