
Give clear recommendations for Spark settings
Closed, Resolved · Public · 0 Estimated Story Points

Description

Right now, there are a variety of somewhat inconsistent recommendations from Analytics about what Spark settings we should use for different types of jobs.

It would be very useful to have some clearly recommended bundles of settings, along with guidance about when they should be used.

Existing recommendations:

  • Number of threads for a local Spark session
  • spark.sql.shuffle.partitions
  • spark.dynamicAllocation.maxExecutors
    • default: no limit
    • PySpark - Local: 128 (this seems weird)
    • SWAP YARN recommendation: 100
    • PySpark - YARN: 128
    • PySpark - YARN (large): 128
  • spark.executor.memory
    • default: 1 GiB
    • shell recommendation: 2 GiB
    • SWAP YARN recommendation: 4 GiB
    • PySpark - YARN (large): 4 GiB
  • spark.executor.cores
    • default: 1
    • SWAP YARN recommendation: 2
  • spark.executor.memoryOverhead
    • default: executorMemory * 0.10, with a minimum of 384 MiB
    • PySpark - YARN (large): 2 GiB
  • spark.driver.memory
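
For context, a bundle of settings like these is typically applied when the Spark session is created. The following is only an illustrative sketch using plain PySpark; the values are placeholders, not recommendations:

from pyspark.sql import SparkSession

# Illustrative only: apply a bundle of settings when building a session.
# The values below are placeholders, not recommended defaults.
spark = (
    SparkSession.builder
    .appName("example-job")
    .config("spark.dynamicAllocation.maxExecutors", "64")
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "2")
    .config("spark.sql.shuffle.partitions", "256")
    .getOrCreate()
)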

Event Timeline

nshahquinn-wmf created this task.
nshahquinn-wmf moved this task from Triage to Tracking on the Product-Analytics board.

There is some misunderstanding here between recommendations and examples, IMO. The links pasted in the task description show examples, not recommendations, and I don't think they have been advertised as such.

About recommendations, there is no one-size-fits-all, as usual. The approach we have used so far with embedded spark kernels is to suggest 3 sizes depending on how much data is processed.
I'll replicate similar settings here except for the spark-local one.
The settings use heavy executors (8g memory and 4 CPU cores) and provide more or fewer of them depending on job size, adjusting the driver memory setting and the default SQL shuffle partitions accordingly (a rough tally of what each bundle adds up to is sketched after the two configs below).

  • Regular Jobs - Up to almost 15% of cluster resources
SPARK_CONFIG = {
    "spark.driver.memory": "2g",
    "spark.dynamicAllocation.maxExecutors": 64,
    "spark.executor.memory": "8g",
    "spark.executor.cores": 4,
    "spark.sql.shuffle.partitions": 256
}
  • Big Jobs - Up to almost 30% of cluster resources
SPARK_CONFIG = {
    "spark.driver.memory": "4g",
    "spark.dynamicAllocation.maxExecutors": 128,
    "spark.executor.memory": "8g",
    "spark.executor.cores": 4,
    "spark.sql.shuffle.partitions": 512
}
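
As a back-of-the-envelope check (not an official sizing), the peak footprint each bundle allows follows directly from the numbers above: the regular bundle caps at 64 × 8 GiB = 512 GiB of executor memory and 256 cores, the big one at 128 × 8 GiB = 1024 GiB and 512 cores, before the default ~10% executor memory overhead. The cluster-percentage figures quoted above are relative to total cluster capacity.

# Back-of-the-envelope tally of the peak executor footprint each bundle allows
# (driver memory and the default ~10% executor memory overhead excluded).
BUNDLES = {"regular": 64, "big": 128}   # spark.dynamicAllocation.maxExecutors
EXECUTOR_MEMORY_GIB = 8                 # spark.executor.memory = "8g"
EXECUTOR_CORES = 4                      # spark.executor.cores = 4

for name, max_executors in BUNDLES.items():
    print(
        f"{name}: up to {max_executors * EXECUTOR_MEMORY_GIB} GiB executor memory, "
        f"{max_executors * EXECUTOR_CORES} cores"
    )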

Another parameter we can play with is spark.driver.maxResultSize. It defaults to 1g and could be set higher if a bigger result needs to be collected.
1g seems big enough for the default, so I don't override it here.
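
If a job does need to collect a bigger result, the override is just one more entry in the config dict. A hypothetical variant of the big-jobs bundle is sketched below; the 2g value is illustrative only, and in practice the driver memory would likely need to grow along with it:

# Hypothetical variant of the big-jobs config for collecting a larger result;
# only spark.driver.maxResultSize differs from its 1g default (2g is illustrative).
SPARK_CONFIG_BIG_RESULT = {
    "spark.driver.memory": "4g",
    "spark.dynamicAllocation.maxExecutors": 128,
    "spark.executor.memory": "8g",
    "spark.executor.cores": 4,
    "spark.sql.shuffle.partitions": 512,
    "spark.driver.maxResultSize": "2g",
}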

I have tested the big-jobs setting with the query from P10482:

import wmfdata as wmf

SPARK_CONFIG = {
    "spark.driver.memory": "4g",
    "spark.dynamicAllocation.maxExecutors": 128,
    "spark.executor.memory": "8g",
    "spark.executor.cores": 4,
    "spark.sql.shuffle.partitions": 512
}


state_views = wmf.hive.run("""
SELECT
  subdivision AS state,
  CONCAT_WS("-", year, LPAD(month, 2, "0"), LPAD(day, 2, "0")) AS date,
  COUNT(1) AS views
FROM wmf.pageview_hourly
WHERE
  country_code = "IN" AND
  user_agent_map["os_family"] IN ("Firefox OS", "KaiOS") AND
  (
    year = 2019 AND month >= 11 OR
    year = 2020
   )
GROUP BY
  subdivision,
  CONCAT_WS("-", year, LPAD(month, 2, "0"), LPAD(day, 2, "0"))
""", spark_config=SPARK_CONFIG)

Job succeeded in ~7mins.
Let's continue to communicate and make sure we provide you folks with better defaults and a better understanding of how to tweak them :)

There is some misunderstanding here between recommendations and examples, IMO. The links pasted in the task description show examples, not recommendations, and I don't think they have been advertised as such.

I see your point, but when the examples prominently feature those settings, that does imply that they're good starting points. If these are the best starting points, maybe the examples should use them instead. Just something to keep in mind! 😁

About recommendations, there is no one-size-fits-all, as usual. The approach we have used so far with embedded spark kernels is to suggest 3 sizes depending on how much data is processed.
I'll replicate similar settings here except for the spark-local one.
The settings use heavy executors (8g memory and 4 CPU cores) and provide more or fewer of them depending on job size, adjusting the driver memory setting and the default SQL shuffle partitions accordingly.

  • Regular Jobs - Up to almost 15% of cluster resources
SPARK_CONFIG = {
    "spark.driver.memory": "2g",
    "spark.dynamicAllocation.maxExecutors": 64,
    "spark.executor.memory": "8g",
    "spark.executor.cores": 4,
    "spark.sql.shuffle.partitions": 256
}
  • Big Jobs - Up to almost 30% of cluster resources
SPARK_CONFIG = {
    "spark.driver.memory": "4g",
    "spark.dynamicAllocation.maxExecutors": 128,
    "spark.executor.memory": "8g",
    "spark.executor.cores": 4,
    "spark.sql.shuffle.partitions": 512
}

Thank you very much! This is exactly what I had in mind, and I will definitely incorporate these into wmfdata.

However, these are actually very different from the settings for the YARN and YARN - Large kernels (I included the settings for those kernels when I was collecting "recommendations" in the description).

Should those be updated?

Job succeeded in ~7mins.
Let's continue to communicate and make sure we provide you folks with better defaults and a better understanding of how to tweak them :)

Absolutely! I'm very happy to have your help.

Change 574710 had a related patch set uploaded (by Joal; owner: Joseph Allemandou):
[analytics/jupyterhub/deploy@master] Update spark kernels settings for consistency

https://gerrit.wikimedia.org/r/574710

If these are the best starting points, maybe the examples should use them instead.

I have updated the SWAP and Spark wikitech pages with default values that match the ones presented above, and sent a code review for the SWAP spark-kernels :)

JAllemandou added a project: Analytics-Kanban.
JAllemandou set Final Story Points to 3.
JAllemandou moved this task from Next Up to In Code Review on the Analytics-Kanban board.

Change 574710 merged by Ottomata:
[analytics/jupyterhub/deploy@master] Update spark kernels settings for consistency

https://gerrit.wikimedia.org/r/574710

Merged and deployed updated spark kernels on notebook1003 and notebook1004.

Thank y'all so much! This is very helpful, and the consistency really helps reduce the mental overhead for us.

@JAllemandou, thank you for updating Wikitech! I've added to that by updating the Spark page with a more explicit section about settings. Please review that and make sure I haven't introduced any errors 😊

FYI, from my perspective, this is done. Thanks again!

Nuria set the point value for this task to 0.