Right now, Analytics has a variety of somewhat inconsistent recommendations about which Spark settings to use for different types of jobs. It would be very useful to have a few clearly recommended bundles of settings, along with guidance on when to use each.
Existing recommendations (a combined example session follows the list):
- Number of threads for a local Spark session (set via the local[N] master URL)
- spark.sql.shuffle.partitions
  - default: 200
  - SWAP YARN recommendation: 600
- spark.dynamicAllocation.maxExecutors
  - default: no limit
  - PySpark - Local: 128 (this seems weird, since dynamic allocation does not apply in local mode)
  - SWAP YARN recommendation: 100
  - PySpark - YARN: 128
  - PySpark - YARN (large): 128
- spark.executor.memory
  - default: 1 GiB
  - shell recommendation: 2 GiB
  - SWAP YARN recommendation: 4 GiB
  - PySpark - YARN (large): 4 GiB
- spark.executor.cores
  - default: 1
  - SWAP YARN recommendation: 2
- spark.executor.memoryOverhead
  - default: executorMemory * 0.10, with a minimum of 384 MiB
  - PySpark - YARN (large): 2 GiB
- spark.driver.memory
  - default: 1 GiB
  - SWAP local recommendation: 8 GiB
  - shell recommendation: 4 GiB
  - PySpark - YARN (large): 4 GiB
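
As a starting point, here is a minimal sketch of what a bundled "PySpark - YARN (large)" session could look like, using the values listed above. The app name is illustrative, and it assumes an environment already configured to submit to YARN.

```python
from pyspark.sql import SparkSession

# Hypothetical "PySpark - YARN (large)" bundle, built from the values above.
spark = (
    SparkSession.builder
    .master("yarn")
    .appName("large-yarn-job")  # illustrative name
    .config("spark.sql.shuffle.partitions", "600")          # SWAP YARN recommendation
    .config("spark.dynamicAllocation.maxExecutors", "128")  # YARN (large) value above
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "2")
    .config("spark.executor.memoryOverhead", "2g")
    .config("spark.driver.memory", "4g")
    .getOrCreate()
)

# For a local session, the thread count goes in the master URL instead:
# SparkSession.builder.master("local[4]").config("spark.driver.memory", "8g")...
```

One caveat: spark.driver.memory only takes effect if it is set before the driver JVM starts, so in an already-running shell or SWAP notebook it has to be supplied at launch time (e.g. via spark-submit or the kernel configuration) rather than through the builder.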