
Remove Spark session timeout functionality from Wmfdata-Python
Closed, Resolved · Public

Description

Currently, spark.run (and the draft spark.load_parquet) set a Spark session timeout before returning. This is to prevent application master processes from persisting for a long time when they're not being used. However, this system doesn't work perfectly:

  • The code is pretty complex and makes it harder to write code for the spark module.
  • We don't set timeouts on custom sessions (and clear any timeout whenever get_session or get_custom_session is called), because we can't track when those sessions are used. But this also means that a user could potentially get a session, save the handle in a variable, begin using it, use run, go back to using the session directly without calling get_session again, and then unexpectedly have their session die in the middle of using it because of the timeout set by run.

I suggest we simply remove timeouts entirely. This would mean somewhat more reserved application masters staying idle for a long time, but the resource consumption there (one executor) is tiny compared to what actually gets used when an application runs (tens to hundreds of executors). If we're concerned about resource use, there's a much better place to control it: the Jupyter kernel level. If the kernel gets shut down, the driver and application master go with it, and this neatly frees up the resources used by the kernel itself too.

Event Timeline

nshahquinn-wmf added a subscriber: nettrom_WMF.

But this also means that a user could potentially get a session, save the handle in a variable, begin using it, use run, go back to using the session directly without calling get_session again, and then unexpectedly have their session die in the middle of using it because of the timeout set by run.

@nettrom_WMF maybe this is why you sometimes come back to notebooks and find the Spark session/context dead?

ldelench_wmf moved this task from Triage to Backlog on the Product-Analytics board.

If the kernel gets shut down, the driver and application master go with it, and this neatly frees up the resources used by the kernel itself too.

Right, but closing a kernel depends on user action, while a timeout will happen regardless.

In any case, I agree that doing resource management on the client side is an anti-pattern; it's better to do it server-side. There doesn't seem to be a better server-side solution than what we already do: have dynamic allocation on by default and let the executor count drop all the way to zero (which is Spark's default). Perhaps, as a sanity measure, we could make it so that a user cannot set spark.dynamicAllocation.enabled=false?
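That sanity check could be sketched as a small guard over user-supplied Spark configuration. The names (validate_conf, PROTECTED_SETTINGS) are hypothetical, not part of Wmfdata-Python's API:

```python
# Settings a user must not override; values are the required settings.
# This dict and the function below are illustrative, not Wmfdata API.
PROTECTED_SETTINGS = {
    "spark.dynamicAllocation.enabled": "true",
}

def validate_conf(extra_conf):
    """Raise if a user-supplied Spark conf overrides a protected setting."""
    for key, required in PROTECTED_SETTINGS.items():
        value = extra_conf.get(key)
        if value is not None and str(value).lower() != required:
            raise ValueError(f"{key} must remain {required!r}, got {value!r}")
    return extra_conf
```

Running this check before building the session would let harmless overrides through while rejecting an attempt to disable dynamic allocation.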

Also, was the timeout implemented for historical reasons? As in, perhaps we didn't have dynamic allocation before?

If the kernel gets shut down, the driver and application master go with it, and this neatly frees up the resources used by the kernel itself too.

Right, but closing a kernel depends on user action, while a timeout will happen regardless.

Well, here we're talking about special functionality we built to time sessions out without user action. So my point is that similar functionality to cull unused kernels would be a much broader and better solution, since it controls resource use on the stat servers (which is a significant although not serious issue) as well as on the Hadoop cluster. I'm sure there would be challenges in how to do that, but that's equally true here.

In any case, I agree that doing resource management on the client side is an anti-pattern; it's better to do it server-side. There doesn't seem to be a better server-side solution than what we already do: have dynamic allocation on by default and let the executor count drop all the way to zero (which is Spark's default). Perhaps, as a sanity measure, we could make it so that a user cannot set spark.dynamicAllocation.enabled=false?

That definitely seems reasonable, even though I'm pretty sure we've never run into any issue with that.

Also, was the timeout implemented for historical reasons? As in, perhaps we didn't have dynamic allocation before?

No, we had dynamic allocation then. It's just that, in addition to the dynamically allocated executors, any YARN session keeps a permanent hold on one executor to serve as the application master (I'm not sure how big that executor is, but it's definitely not bigger than a normal one). The timeout minimizes such cases but doesn't eliminate them, because of the limitations mentioned in the description.

So we're talking about one executor per unused session. That's not nothing when you consider there will likely be multiple users with multiple open sessions, but we have the same situation with unused notebooks taking up memory on the stat servers, and nothing terrible has happened even though there are no controls or timeouts (which is why I said it was a significant but not serious issue).

If the timeout is removed, we could detect and alert when non-production YARN applications have been running for more than a week, for safety.

If the timeout is removed, we could detect and alert when non-production YARN applications have been running for more than a week, for safety.

Yes, this is definitely the type of thing I was thinking about when I mentioned "similar functionality to cull unused kernels"! Just for context, having an application open for more than a week will be common, since I guarantee that it's currently common to have notebooks open for much longer than a week (because you have to go out of your way to shut down notebooks, it's easy to leave them open almost indefinitely).

So if you have an alert for your team when an application has been open for a week, you'll get a ton of alerts, which probably won't be helpful. On the other hand, if it's an alert for an application owner, that will be extremely useful since it will remind people to be conscious of their resource use. Right now, you could leave a notebook or application open for a year and you wouldn't get any feedback (although eventually a server restart would silently close it), so an email alert for the owner would be a great first step.
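The per-owner detection discussed above could be sketched as a filter over application records shaped like the YARN ResourceManager REST API's /ws/v1/cluster/apps response (fields "state", "queue", and "startedTime" in epoch milliseconds). The assumption that production jobs run in a queue named "production" is illustrative only:

```python
import time

WEEK_MS = 7 * 24 * 60 * 60 * 1000

def long_running_apps(apps, now_ms=None):
    """Return non-production apps that have been RUNNING for over a week.

    `apps` is a list of dicts shaped like YARN ResourceManager app records;
    the "production" queue-name check is an assumed heuristic.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return [
        app for app in apps
        if app.get("state") == "RUNNING"
        and app.get("queue") != "production"
        and now_ms - app["startedTime"] > WEEK_MS
    ]
```

Each flagged record carries the application's owner, so a periodic job could email that user directly rather than flooding a team channel.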

It might be worth focusing on an alert for open notebooks over one for open applications. Every application is contained within a notebook, more or less, but not all notebooks have applications (though they might still be using other resources like stat server memory), so focusing on notebooks gets more return on your investment.

The pull request has been merged!