# The Toolkit at Scale
As your collections grow, you may need to provide more resources and adjust Apache Spark configuration options. Apache Spark has great [Configuration](https://spark.apache.org/docs/latest/configuration.html) and [Tuning](https://spark.apache.org/docs/latest/tuning.html) guides that are worth checking out. If you're not sure where to start with scaling, join us in Slack in the #aut channel, and we might be able to provide some guidance.
## A Note on Memory and Cores
As your datasets grow, you may need to provide more memory to Apache Spark. You'll know this if you get an error saying that you have run out of "Java Heap Space."
You can add a configuration option for adjusting available memory like so:
```shell
spark-shell --driver-memory 4G --jars /path/to/aut-1.1.0-fatjar.jar
```
In the above case, you give Apache Spark 4GB of memory to execute the program.
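If you want to sanity-check the memory setting, you can query it from inside the spark-shell session itself. This is a minimal sketch using standard Spark and JVM calls; note that the JVM's reported maximum heap will be somewhat lower than the value you requested, since the JVM reserves part of it for its own use.

```scala
// Confirm the driver memory Spark was configured with (set by --driver-memory).
sc.getConf.get("spark.driver.memory")

// Maximum heap actually available to the driver JVM, in GB;
// expect a bit less than the 4G requested above.
Runtime.getRuntime.maxMemory / (1024.0 * 1024 * 1024)
```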
In some other cases, despite giving AUT sufficient memory, you may still encounter Java Heap Space issues. In those cases, it is worth trying to lower the number of worker threads. When running locally (i.e. on a single laptop, desktop, or server), by default AUT runs a number of threads equivalent to the number of cores in your machine.
On a 16-core machine, you may want to drop to 12 worker threads if you are having memory issues. This will increase stability but decrease performance a bit.

You can do so like this (the example uses 12 threads on a 16-core machine):
```shell
spark-shell --master local[12] --driver-memory 4G --jars /path/to/aut-1.1.0-fatjar.jar
```
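To confirm how many worker threads your session actually has, you can query the SparkContext from inside spark-shell; a minimal sketch using standard Spark fields:

```scala
// The master URL the shell was launched with, e.g. local[12].
sc.master

// Default parallelism; for local[N] this matches the N worker threads.
sc.defaultParallelism
```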
If you continue to have errors, look at your output and logs. They will usually point you in the right direction. For instance, you may also need to increase the network timeout value. Once in a while, AUT might get stuck on an odd record and take longer than normal to process it. Passing `--conf spark.network.timeout=10000000` will ensure that AUT continues to work on material, although it may take a while to process. This command then works:
```shell
spark-shell --master local[12] --driver-memory 90G --conf spark.network.timeout=10000000 --jars /path/to/aut-1.1.0-fatjar.jar
```
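When digging through logs, it can also help to confirm exactly which settings your session is running with. A small sketch, again using the standard SparkConf API from inside spark-shell:

```scala
// Read back a single setting, e.g. the network timeout set above.
sc.getConf.get("spark.network.timeout")

// Or dump every explicitly-set configuration value for inspection.
sc.getConf.getAll.foreach(println)
```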