
Give clear recommendations for Spark settings
Closed, Resolved · Public · 0 Estimated Story Points

Description

Right now, there are a variety of somewhat inconsistent recommendations from Analytics about what Spark settings we should use for different types of jobs.

It would be very useful to have some clearly recommended bundles of settings, along with guidance about when they should be used.

Existing recommendations:

  • Number of threads for a local Spark session
  • spark.sql.shuffle.partitions
  • spark.dynamicAllocation.maxExecutors
    • default: no limit
    • PySpark - Local: 128 (this seems weird)
    • SWAP YARN recommendation: 100
    • PySpark - YARN: 128
    • PySpark - YARN (large): 128
  • spark.executor.memory
    • default: 1 GiB
    • shell recommendation: 2 GiB
    • SWAP YARN recommendation: 4 GiB
    • PySpark - YARN (large): 4 GiB
  • spark.executor.cores
    • default: 1
    • SWAP YARN recommendation: 2
  • spark.executor.memoryOverhead
    • default: executorMemory * 0.10, with a minimum of 384 MiB
    • PySpark - YARN (large): 2 GiB
  • spark.driver.memory
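
For context, a bundle of settings like these is typically applied when the Spark session is created. The following is only an illustrative sketch using plain PySpark; the values are placeholders, not recommendations:

from pyspark.sql import SparkSession

# Illustrative only: apply a bundle of settings when building a session.
# The values below are placeholders, not recommended defaults.
spark = (
    SparkSession.builder
    .appName("example-job")
    .config("spark.dynamicAllocation.maxExecutors", "64")
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "2")
    .config("spark.sql.shuffle.partitions", "256")
    .getOrCreate()
)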

Event Timeline

nshahquinn-wmf created this task.
nshahquinn-wmf moved this task from Triage to Tracking on the Product-Analytics board.

There is some misunderstanding here between recommendations and examples, IMO. The links pasted in the task description show examples, not recommendations, and I don't think they have been advertised as such.

About recommendations, there is no one-size-fits-all, as usual. The approach we have used so far with embedded spark kernels is to suggest 3 sizes depending on how much data is processed.
I'll replicate similar settings here except for the spark-local one.
The settings use heavy executors (8g memory and 4 CPU cores) and provide more or fewer of them depending on job size, adjusting the driver memory setting and the default SQL shuffle partitions accordingly (a rough tally of what each bundle adds up to is sketched after the two configs below).

  • Regular Jobs - Up to almost 15% of cluster resources
SPARK_CONFIG = {
    "spark.driver.memory": "2g",
    "spark.dynamicAllocation.maxExecutors": 64,
    "spark.executor.memory": "8g",
    "spark.executor.cores": 4,
    "spark.sql.shuffle.partitions": 256
}
  • Big Jobs - Up to almost 30% of cluster resources
SPARK_CONFIG = {
    "spark.driver.memory": "4g",
    "spark.dynamicAllocation.maxExecutors": 128,
    "spark.executor.memory": "8g",
    "spark.executor.cores": 4,
    "spark.sql.shuffle.partitions": 512
}
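
As a back-of-the-envelope check (not an official sizing), the peak footprint each bundle allows follows directly from the numbers above: the regular bundle caps at 64 × 8 GiB = 512 GiB of executor memory and 256 cores, the big one at 128 × 8 GiB = 1024 GiB and 512 cores, before the default ~10% executor memory overhead. The cluster-percentage figures quoted above are relative to total cluster capacity.

# Back-of-the-envelope tally of the peak executor footprint each bundle allows
# (driver memory and the default ~10% executor memory overhead excluded).
BUNDLES = {"regular": 64, "big": 128}   # spark.dynamicAllocation.maxExecutors
EXECUTOR_MEMORY_GIB = 8                 # spark.executor.memory = "8g"
EXECUTOR_CORES = 4                      # spark.executor.cores = 4

for name, max_executors in BUNDLES.items():
    print(
        f"{name}: up to {max_executors * EXECUTOR_MEMORY_GIB} GiB executor memory, "
        f"{max_executors * EXECUTOR_CORES} cores"
    )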

Another parameter we can play with is spark.driver.maxResultSize. It defaults to 1g and could be set higher if a bigger result needs to be collected.
1g seems big enough for the default, so I don't override it here.
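
If a job does need to collect a bigger result, the override is just one more entry in the config dict. A hypothetical variant of the big-jobs bundle is sketched below; the 2g value is illustrative only, and in practice the driver memory would likely need to grow along with it:

# Hypothetical variant of the big-jobs config for collecting a larger result;
# only spark.driver.maxResultSize differs from its 1g default (2g is illustrative).
SPARK_CONFIG_BIG_RESULT = {
    "spark.driver.memory": "4g",
    "spark.dynamicAllocation.maxExecutors": 128,
    "spark.executor.memory": "8g",
    "spark.executor.cores": 4,
    "spark.sql.shuffle.partitions": 512,
    "spark.driver.maxResultSize": "2g",
}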

I have tested the big-jobs setting with the query from P10482:

import wmfdata as wmf

SPARK_CONFIG = {
    "spark.driver.memory": "4g",
    "spark.dynamicAllocation.maxExecutors": 128,
    "spark.executor.memory": "8g",
    "spark.executor.cores": 4,
    "spark.sql.shuffle.partitions": 512
}


state_views = wmf.hive.run("""
SELECT
  subdivision AS state,
  CONCAT_WS("-", year, LPAD(month, 2, "0"), LPAD(day, 2, "0")) AS date,
  COUNT(1) AS views
FROM wmf.pageview_hourly
WHERE
  country_code = "IN" AND
  user_agent_map["os_family"] IN ("Firefox OS", "KaiOS") AND
  (
    year = 2019 AND month >= 11 OR
    year = 2020
   )
GROUP BY
  subdivision,
  CONCAT_WS("-", year, LPAD(month, 2, "0"), LPAD(day, 2, "0"))
""", spark_config=SPARK_CONFIG)

Job succeeded in ~7mins.
Let's continue to communicate and make sure we provide you folks with better defaults and a better understanding of how to tweak them :)

There is some misunderstanding here between recommendations and examples, IMO. The links pasted in the task description show examples, not recommendations, and I don't think they have been advertised as such.

I see your point, but when the examples prominently feature those settings, that does imply that they're good starting points. If these are the best starting points, maybe the examples should use them instead. Just something to keep in mind! 😁

About recommendations, there is no one-size-fits-all, as usual. The approach we have used so far with embedded spark kernels is to suggest 3 sizes depending on how much data is processed.
I'll replicate similar settings here except for the spark-local one.
The settings use heavy executors (8g memory and 4 CPU cores) and provide more or fewer of them depending on job size, adjusting the driver memory setting and the default SQL shuffle partitions accordingly.

  • Regular Jobs - Up to almost 15% of cluster resources
SPARK_CONFIG = {
    "spark.driver.memory": "2g",
    "spark.dynamicAllocation.maxExecutors": 64,
    "spark.executor.memory": "8g",
    "spark.executor.cores": 4,
    "spark.sql.shuffle.partitions": 256
}
  • Big Jobs - Up to almost 30% of cluster resources
SPARK_CONFIG = {
    "spark.driver.memory": "4g",
    "spark.dynamicAllocation.maxExecutors": 128,
    "spark.executor.memory": "8g",
    "spark.executor.cores": 4,
    "spark.sql.shuffle.partitions": 512
}

Thank you very much! This is exactly what I had in mind, and I will definitely incorporate these into wmfdata.

However, these are actually very different from the settings for the YARN and YARN - Large kernels (I included the settings for those kernels when I was collecting "recommendations" in the description).

Should those be updated?

Job succeeded in ~7mins.
Let's continue to communicate and make sure we provide you folks with better defaults and a better understanding of how to tweak them :)

Absolutely! I'm very happy to have your help.

Change 574710 had a related patch set uploaded (by Joal; owner: Joseph Allemandou):
[analytics/jupyterhub/deploy@master] Update spark kernels settings for consistency

https://gerrit.wikimedia.org/r/574710

If these are the best starting points, maybe the examples should use them instead.

I have updated the SWAP and Spark wikitech pages with default values that match the ones presented above, and sent a code review for the SWAP spark-kernels :)

JAllemandou added a project: Analytics-Kanban.
JAllemandou set Final Story Points to 3.
JAllemandou moved this task from Next Up to In Code Review on the Analytics-Kanban board.

Change 574710 merged by Ottomata:
[analytics/jupyterhub/deploy@master] Update spark kernels settings for consistency

https://gerrit.wikimedia.org/r/574710

Merged and deployed updated spark kernels on notebook1003 and notebook1004.

Thank y'all so much! This is very helpful, and the consistency really helps reduce the mental overhead for us.

@JAllemandou, thank you for updating Wikitech! I've added to that by updating the Spark page with a more explicit section about settings. Please review that and make sure I haven't introduced any errors 😊

FYI, from my perspective, this is done. Thanks again!

Nuria set the point value for this task to 0.