
Update wmfdata to use sensible Spark settings
Closed, ResolvedPublic

Description

Currently, when wmfdata creates Spark sessions, it doesn't pass any settings other than the HTTP proxy, so the defaults are used.

This can lead to poor performance in some situations, so we should offer multiple bundles of settings appropriate for different circumstances. It's not completely clear what the best settings are (T245897), but we have a reasonably good idea, so we should implement our best guess while we're waiting for that clarity.
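As a rough illustration of the "bundles of settings" idea, the library could ship named Spark config presets and copy the chosen one into the session it creates. This is only a sketch: the bundle names, values, and the `spark_config_for` helper are hypothetical, not wmfdata's actual API.

```python
# Hypothetical preset Spark configs for different circumstances.
# The names and values here are illustrative assumptions only.
SPARK_CONFIG_BUNDLES = {
    # Modest resources for everyday interactive queries
    "regular": {
        "spark.dynamicAllocation.maxExecutors": 64,
        "spark.executor.memory": "4g",
        "spark.executor.cores": 2,
    },
    # More executors and memory for heavy joins or large scans
    "large": {
        "spark.dynamicAllocation.maxExecutors": 128,
        "spark.executor.memory": "8g",
        "spark.executor.cores": 4,
    },
}

def spark_config_for(session_type="regular"):
    """Return a copy of the preset config for the requested session type."""
    if session_type not in SPARK_CONFIG_BUNDLES:
        raise ValueError(f"Unknown session type: {session_type!r}")
    # Copy so callers can tweak individual settings without
    # mutating the shared preset.
    return dict(SPARK_CONFIG_BUNDLES[session_type])
```

Callers who need something unusual could still start from a bundle and override individual keys before passing the dict to the session constructor.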

Event Timeline

nshahquinn-wmf triaged this task as Medium priority.
nshahquinn-wmf created this task.

I've requested input from Analytics on their email list, and will push the update once I've heard from them.

In the meantime, if you're having trouble with wmfdata.hive (particularly Java memory errors), try rerunning it with modified Spark options as follows and let me know how it goes.

SPARK_CONFIG = {
    "spark.dynamicAllocation.maxExecutors": 128,
    "spark.executor.memory": "4g",
    "spark.executor.cores": 2
}

hive.run(QUERY, spark_config=SPARK_CONFIG)
elukey added a subscriber: elukey.

Explicitly adding the Analytics tag to get this task triaged.

> Explicitly adding the Analytics tag to get this task triaged.

Thanks, @elukey! I'm trying to represent these various Spark issues better in Phab, so I've split my request to Analytics into T245897.

@JAllemandou can advise better, but these settings probably need to be query/job specific. It does not seem like having common ones for all usages of the library is the best way to proceed.

> @JAllemandou can advise better, but these settings probably need to be query/job specific. It does not seem like having common ones for all usages of the library is the best way to proceed.

He already did in T245897 (thanks, @JAllemandou!). While it would be optimal for us to tune Spark separately for each query, our work involves a constant stream of one-off queries and it just isn't feasible. I think the most we can reasonably do is choose between 3-4 query "engines" (such as Presto, the Hive CLI, a "regular" Spark session, and a "large" Spark session) depending on the workload; feel free to comment on my plan for enabling that: T246060.
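The "3-4 query engines" idea could be as simple as a helper that maps a rough sense of the workload to one of the named engines. This is a sketch of the concept only: the engine names follow the comment above, but the function, its parameters, and the size thresholds are made-up assumptions, not anything in wmfdata or T246060.

```python
# Hypothetical engine chooser for one-off analyst queries.
# The thresholds below are illustrative guesses, not tuned values.
ENGINES = ("presto", "hive-cli", "spark-regular", "spark-large")

def choose_engine(estimated_input_gb, interactive=True):
    """Pick a query engine based on a rough estimate of input size."""
    if interactive and estimated_input_gb < 1:
        return "presto"          # low-latency exploration of small data
    if estimated_input_gb < 50:
        return "spark-regular"   # default Spark session
    if estimated_input_gb < 500:
        return "spark-large"     # bigger executors for heavy scans
    return "hive-cli"            # slower but robust for the largest jobs
```

Even a crude rule like this keeps the decision down to one question ("roughly how big is the input?") rather than a 25-step flow chart.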

Also worth keeping in mind: when the number of options we have to juggle increases, so does the burden on your team in documenting the options and helping us debug problematic queries. "Use the Hive CLI for everything" is a lot easier to support than "follow this 25 step flow chart to decide how to run your query". 🙂

nshahquinn-wmf raised the priority of this task from Medium to High. Feb 26 2020, 6:52 PM
nshahquinn-wmf moved this task from Blocked to Doing on the Product-Analytics (Kanban) board.

The code is written, and is being reviewed along with the code for T246060!