
Update wmfdata to use sensible Spark settings
Closed, ResolvedPublic

Description

Currently, when wmfdata creates Spark sessions, it doesn't pass any settings other than the HTTP proxy, so the defaults are used.

This can lead to poor performance in some situations, so we should offer multiple bundles of settings appropriate for different circumstances. It's not completely clear what the best settings are (T245897), but we have a reasonably good idea, so we should implement our best guess while we're waiting for that clarity.
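As a rough illustration of the "bundles of settings" idea, the library could ship named Spark config presets and copy the chosen one into the session it creates. This is only a sketch: the bundle names, values, and the `spark_config_for` helper are hypothetical, not wmfdata's actual API.

```python
# Hypothetical preset Spark configs for different circumstances.
# The names and values here are illustrative assumptions only.
SPARK_CONFIG_BUNDLES = {
    # Modest resources for everyday interactive queries
    "regular": {
        "spark.dynamicAllocation.maxExecutors": 64,
        "spark.executor.memory": "4g",
        "spark.executor.cores": 2,
    },
    # More executors and memory for heavy joins or large scans
    "large": {
        "spark.dynamicAllocation.maxExecutors": 128,
        "spark.executor.memory": "8g",
        "spark.executor.cores": 4,
    },
}

def spark_config_for(session_type="regular"):
    """Return a copy of the preset config for the requested session type."""
    if session_type not in SPARK_CONFIG_BUNDLES:
        raise ValueError(f"Unknown session type: {session_type!r}")
    # Copy so callers can tweak individual settings without
    # mutating the shared preset.
    return dict(SPARK_CONFIG_BUNDLES[session_type])
```

Callers who need something unusual could still start from a bundle and override individual keys before passing the dict to the session constructor.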

Event Timeline

nshahquinn-wmf triaged this task as Medium priority.
nshahquinn-wmf created this task.

I've requested input from Analytics on their email list, and will push the update once I've heard from them.

In the meantime, if you're having trouble with wmfdata.hive (particularly Java memory errors), try rerunning it with modified Spark options as follows and let me know how it goes.

SPARK_CONFIG = {
    "spark.dynamicAllocation.maxExecutors": 128,
    "spark.executor.memory": "4g",
    "spark.executor.cores": 2
}

hive.run(QUERY, spark_config=SPARK_CONFIG)
elukey added a subscriber: elukey.

Explicitly adding the Analytics tag to get this task triaged.

> Explicitly adding the Analytics tag to get this task triaged.

Thanks, @elukey! I'm trying to represent these various Spark issues better in Phab, so I've split my request to Analytics into T245897.

@JAllemandou can advise better, but these settings probably need to be query/job specific. It does not seem like having common ones for all usages of the library is the best way to proceed.

> @JAllemandou can advise better, but these settings probably need to be query/job specific. It does not seem like having common ones for all usages of the library is the best way to proceed.

He already did in T245897 (thanks, @JAllemandou!). While it would be optimal for us to tune Spark separately for each query, our work involves a constant stream of one-off queries and it just isn't feasible. I think the most we can reasonably do is choose between 3-4 query "engines" (such as Presto, the Hive CLI, a "regular" Spark session, and a "large" Spark session) depending on the workload; feel free to comment on my plan for enabling that: T246060.
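The "3-4 query engines" idea could be as simple as a helper that maps a rough sense of the workload to one of the named engines. This is a sketch of the concept only: the engine names follow the comment above, but the function, its parameters, and the size thresholds are made-up assumptions, not anything in wmfdata or T246060.

```python
# Hypothetical engine chooser for one-off analyst queries.
# The thresholds below are illustrative guesses, not tuned values.
ENGINES = ("presto", "hive-cli", "spark-regular", "spark-large")

def choose_engine(estimated_input_gb, interactive=True):
    """Pick a query engine based on a rough estimate of input size."""
    if interactive and estimated_input_gb < 1:
        return "presto"          # low-latency exploration of small data
    if estimated_input_gb < 50:
        return "spark-regular"   # default Spark session
    if estimated_input_gb < 500:
        return "spark-large"     # bigger executors for heavy scans
    return "hive-cli"            # slower but robust for the largest jobs
```

Even a crude rule like this keeps the decision down to one question ("roughly how big is the input?") rather than a 25-step flow chart.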

Also worth keeping in mind: when the number of options we have to juggle increases, so does the burden on your team in documenting the options and helping us debug problematic queries. "Use the Hive CLI for everything" is a lot easier to support than "follow this 25 step flow chart to decide how to run your query". 🙂

nshahquinn-wmf raised the priority of this task from Medium to High. Feb 26 2020, 6:52 PM
nshahquinn-wmf moved this task from Blocked to Doing on the Product-Analytics (Kanban) board.

The code is written, and is being reviewed along with the code for T246060!