Currently, when wmfdata [creates Spark sessions](https://github.com/neilpquinn/wmfdata/blob/master/wmfdata/spark.py#L56), it doesn't pass any settings other than the HTTP proxy, so Spark's defaults are used.
This can lead to poor performance in some situations, so we should offer multiple bundles of settings appropriate for different circumstances.
Existing recommendations:
* Number of threads for a local Spark session
    * [SWAP local recommendation](https://wikitech.wikimedia.org/wiki/SWAP#Launching_as_SparkSession_in_a_Python_Notebook): 12
* `spark.sql.shuffle.partitions`
    * default: ?
    * [SWAP YARN recommendation](https://wikitech.wikimedia.org/wiki/SWAP#Launching_as_SparkSession_in_a_Python_Notebook): 600
* `spark.dynamicAllocation.maxExecutors`
    * default: no limit
    * PySpark - Local: 128 (this seems weird)
    * SWAP YARN recommendation: 100
    * PySpark - YARN: 128
    * PySpark - YARN (large): 128
* `spark.executor.memory`
    * default: 1 GiB
    * [shell recommendation](https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Spark#Start_a_spark_shell_in_yarn): 2 GiB
    * SWAP YARN recommendation: 4 GiB
    * PySpark - YARN (large): 4 GiB
* `spark.executor.cores`
    * default: 1
    * SWAP YARN recommendation: 2
* `spark.executor.memoryOverhead`
    * default: executorMemory * 0.10, with a minimum of 384 MiB
    * PySpark - YARN (large): 2 GiB
* `spark.driver.memory`
    * default: 1 GiB
    * [SWAP local recommendation](https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Spark#Start_a_spark_shell_in_yarn): 8 GiB
    * shell recommendation: 4 GiB
    * PySpark - YARN (large): 4 GiB
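One possible shape for this feature: a dict of named setting bundles that the session-creation function could merge with caller-supplied overrides before passing them to `SparkSession.builder.config`. This is only a sketch; the bundle names and the `resolve_settings` helper are illustrative, not existing wmfdata API, and the values are drawn from the recommendations above.

```python
# Hypothetical setting bundles for wmfdata's Spark session creation.
# Bundle names and values are illustrative, based on the recommendations above.
SETTING_BUNDLES = {
    "local": {
        "spark.master": "local[12]",          # SWAP local recommendation
        "spark.driver.memory": "8g",
    },
    "yarn-regular": {
        "spark.master": "yarn",
        "spark.sql.shuffle.partitions": "600",
        "spark.dynamicAllocation.maxExecutors": "100",
        "spark.executor.memory": "4g",
        "spark.executor.cores": "2",
    },
    "yarn-large": {
        "spark.master": "yarn",
        "spark.sql.shuffle.partitions": "600",
        "spark.dynamicAllocation.maxExecutors": "128",
        "spark.executor.memory": "4g",
        "spark.executor.memoryOverhead": "2g",
        "spark.executor.cores": "2",
        "spark.driver.memory": "4g",
    },
}


def resolve_settings(bundle="yarn-regular", extra_settings=None):
    """Merge a named bundle with caller overrides (overrides win)."""
    settings = dict(SETTING_BUNDLES[bundle])
    settings.update(extra_settings or {})
    return settings


# A caller-supplied override takes precedence over the bundle value:
merged = resolve_settings("yarn-regular", {"spark.executor.memory": "8g"})
print(merged["spark.executor.memory"])  # → 8g
```

The resolved settings would then be applied in a loop, e.g. `for key, value in settings.items(): builder.config(key, value)`, keeping the existing HTTP proxy configuration unchanged.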