wmfdata cannot recover from a crashed Spark session
Closed, Resolved (Public)

Assigned To

Authored By

	nshahquinn-wmf
	Feb 20 2020, 5:05 AM

Description

After a Spark crash (such as those caused by T245896), a Spark application can be left in a defective state that blocks a new session from being created. Some of its functions that do not involve data processing (such as returning its display representation or passing a query to .sql) might complete without an error; however, anything that does involve data processing (like .collect or .toPandas) fails immediately with an error. Examples include:

java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext

Py4JJavaError: An error occurred while calling o638.sql.
: java.lang.NullPointerException

In some cases, the ApplicationMaster continues to run; running a new query causes the application to start a new job and keep its allocated executors, but never make any progress (see the screenshots attached to this task).

In this case, explicitly calling .stop on the SparkSession causes a series of additional ApplicationMasters to be created before the application is finally stopped (see the screenshots attached to this task).

In some cases, retrieving the PySpark session object using wmfdata.spark.get_session and explicitly calling .stop on it allows a new, functioning session to be created. In other cases, however, the "new" session still has errors like the following:

Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true.
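
Because lazy calls such as .sql can succeed on a broken session, one way to tell whether a session is actually usable is to force a trivial job. The following is a minimal sketch, not part of wmfdata: the function name `session_is_usable` is hypothetical, and `spark` is assumed to behave like a PySpark SparkSession.

```python
def session_is_usable(spark):
    """Return True if a trivial Spark action completes, False otherwise.

    `spark` is assumed to behave like a pyspark.sql.SparkSession.
    """
    try:
        # range(1).count() forces a real job, unlike .sql(), which can
        # return without doing any data processing.
        return spark.range(1).count() == 1
    except Exception:
        # Errors such as "Cannot call methods on a stopped SparkContext"
        # surface here as Python exceptions.
        return False
```

Note that this probe only catches the cases that fail immediately. In the case where the application starts a job but never makes progress, the call would block instead of returning False.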

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		nshahquinn-wmf	T245891 Analysts cannot reliably use wmfdata to run SQL queries against Hive databases
		Resolved		nshahquinn-wmf	T245713 wmfdata cannot recover from a crashed Spark session

Event Timeline

nshahquinn-wmf created this task.Feb 20 2020, 5:05 AM

nettrom_WMF subscribed.Feb 20 2020, 5:15 PM

Ottomata moved this task from Incoming to Radar on the Analytics board.Feb 20 2020, 5:49 PM

nshahquinn-wmf updated the task description. (Show Details)Feb 21 2020, 6:52 PM

nshahquinn-wmf updated the task description. (Show Details)Feb 21 2020, 6:59 PM

nshahquinn-wmf updated the task description. (Show Details)Feb 21 2020, 7:11 PM

nshahquinn-wmf updated the task description. (Show Details)

As I continue to encounter this problem in my work, I'm continuing to investigate and develop the description. I would appreciate any additions, suggestions, or clarifying questions!

Is there any pattern to the types of queries that cause this, or does it just happen randomly? Is there a query you can always reproduce this with? Does the same problem happen if you use python + wmfdata outside of Jupyter? What about just a spark session without wmfdata?

Also, I see in one of your screenshots that you have a toPandas call. (I guess this is what wmfdata.hive does). This will bring down the entire result from the spark executors in the local master process. Is the result you are trying to bring down too big?

nshahquinn-wmf added a parent task: T245891: Analysts cannot reliably use wmfdata to run SQL queries against Hive databases.Feb 21 2020, 11:52 PM

nshahquinn-wmf updated the task description. (Show Details)Feb 22 2020, 1:50 AM

nshahquinn-wmf mentioned this in T245896: Spark applications crash when running large queries.Feb 22 2020, 2:15 AM

@Ottomata, thanks for jumping in! See my responses in T245896.

@Ottomata, on the newly refined subject of this task (being unable to recover from a Spark error), I just compiled a complete log of the commands and output for one example I just experienced: P10482

In this case, I really don't want to restart since I have previous query results stored in Python variables that I don't want to discard. I could save them as data files and then reload them, but that gets really tedious and it really seems like there has to be a better way.

@nshahquinn-wmf sorry for not chiming in earlier.
I will try to provide explanations and ideas and suggest ways to deal with the problem.

Similarly to a python process (for instance), a spark-context only keeps its internal state as long as it lives. When it crashes, is killed, or just closed, all information it contains (computed data, defined variables, etc) is lost.
One thing that may differ and that should be remembered is that there is a YARN application handling the Spark one. The YARN application is the link between Spark and Hadoop so that Spark can access the resources on the cluster. Usually, when a spark-context dies, its YARN application dies as well, but there might be corner cases in which the spark-context is gone while the YARN application is not. In that case, the YARN application should be manually killed.
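
As a rough illustration of killing a leftover YARN application by hand, here is a small sketch built around the standard `yarn application -kill` command. The helper names are hypothetical and the application ID in the usage comment is a placeholder. While the context is still alive, the real ID can usually be read from `spark.sparkContext.applicationId`; otherwise `yarn application -list` shows the running applications.

```python
import subprocess


def yarn_kill_command(application_id):
    """Build the argv for killing a YARN application by its ID."""
    if not application_id.startswith("application_"):
        raise ValueError(f"not a YARN application ID: {application_id!r}")
    return ["yarn", "application", "-kill", application_id]


def kill_yarn_application(application_id, run=subprocess.run):
    """Run the kill command; True if the yarn CLI exits with status 0.

    `run` is injectable so the helper can be exercised without a cluster.
    """
    result = run(yarn_kill_command(application_id))
    return result.returncode == 0


# Usage (placeholder ID, requires the yarn CLI on the PATH):
# kill_yarn_application("application_0000000000000_0000")
```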
Another thing to keep in mind when using Spark in notebooks is that the spark-context is different from the python Kernel of the notebook. The spark-context can be closed while the python kernel is not, and vice-versa.

The way to deal with data handled by spark-context is to explicitly save it as needed, whether in Hdfs with Spark if big, or on smaller files (hdfs or local) using python, after having collected the data from spark (only valid for small data).
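
For small results that have already been collected into the Python kernel, saving and reloading can be made less tedious with a pair of helpers. This is only a sketch using the standard library's pickle module; the function names and the file name in the usage comment are made up for illustration, and large data would still be better written to HDFS with Spark itself.

```python
import pickle
from pathlib import Path


def save_checkpoint(path, **variables):
    """Pickle the given named variables to `path` and return the path."""
    path = Path(path)
    tmp = path.with_suffix(path.suffix + ".tmp")
    with open(tmp, "wb") as f:
        pickle.dump(variables, f)
    # Rename only after a complete write, so an interrupted save does not
    # clobber an earlier checkpoint.
    tmp.replace(path)
    return path


def load_checkpoint(path):
    """Return the dict of variables stored by save_checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Usage (hypothetical names):
# save_checkpoint("results.pkl", edits=edits_df, users=users_df)
# saved = load_checkpoint("results.pkl")
```

Pickle files should only be loaded when they come from a trusted source, since unpickling can execute arbitrary code.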

Now, spark-context failing regularly is not expected. The one corner case I can see, however, is Kerberos-related: if the notebook stays up for some time, I think the Kerberos ticket used to launch the Spark process will at some point become invalid, causing a crash. That means Spark-handled data should only be kept in flight while working on the notebook. Data to be kept alive longer should be collected onto the Python kernel.

Finally, when a spark-context dies from a notebook, restarting a new one should be feasible. That means decoupling code from spark and python in cells to allow rerunning spark-only code is a good practice. Let's see how relaunching a spark-context could be made possible through wmfdata.
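
A possible shape for relaunching, sketched under assumptions: the old session is stopped on a best-effort basis and a new one is requested from a factory. `recreate_session` and `make_session` are hypothetical names, not wmfdata functions. As the task description notes, stopping does not always leave the JVM in a state where a new context can be created, so this is not guaranteed to work.

```python
def recreate_session(old_session, make_session):
    """Stop `old_session`, ignoring errors, and return a fresh session.

    `make_session` is any zero-argument callable that returns a session.
    """
    if old_session is not None:
        try:
            old_session.stop()
        except Exception:
            # A crashed session may fail even while stopping; carry on.
            pass
    return make_session()


# Usage with plain PySpark (assumes pyspark is installed):
# from pyspark.sql import SparkSession
# spark = recreate_session(spark, lambda: SparkSession.builder.getOrCreate())
```

Keeping Spark-only code in its own cells, as suggested above, means those cells can simply be rerun against the new session.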

Let's continue to talk about that, I promise to be more responsive :)

Ottomata mentioned this in T245891: Analysts cannot reliably use wmfdata to run SQL queries against Hive databases.Feb 24 2020, 7:23 PM

LGoto assigned this task to kzimmerman.Feb 24 2020, 8:46 PM

LGoto triaged this task as High priority.

LGoto moved this task from Triage to Needs Investigation on the Product-Analytics board.

nshahquinn-wmf claimed this task.Feb 25 2020, 2:26 AM

nshahquinn-wmf added a subscriber: kzimmerman.

@nshahquinn-wmf Kerberos tickets expire after 1 day. Once they do, the Spark context will no longer work, so you need to restart your kernel entirely. We will be extending the expiration of those a bit, but regardless, a notebook left idle for a few days would need a restart. Our plan is to have a notebook killer, similar to what is done now for screen sessions, so that notebooks left idle are shut down. That way they get a clean start when reawakened and do not consume resources.
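
To check from inside a notebook whether an expired ticket might be the cause, one option is to ask `klist`. The sketch below relies on `klist -s`, which in MIT Kerberos prints nothing and exits with a non-zero status when the credentials cache is missing or expired. The function name is hypothetical, and it is an assumption that `klist` is available on the notebook host.

```python
import subprocess


def has_valid_kerberos_ticket(run=subprocess.run):
    """Return True if `klist -s` reports a readable, unexpired ticket cache.

    `run` is injectable so the check can be exercised without Kerberos.
    """
    try:
        return run(["klist", "-s"]).returncode == 0
    except FileNotFoundError:
        # klist is not installed on this host.
        return False
```

If this returns False, renewing the ticket and restarting the kernel, as described above, is the likely next step.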

kzimmerman edited projects, added Product-Analytics (Kanban); removed Product-Analytics.Mar 2 2020, 5:38 PM

kzimmerman moved this task from Next 2 weeks to Doing on the Product-Analytics (Kanban) board.

nshahquinn-wmf mentioned this in T247060: Create a wmfdata-python project.Mar 6 2020, 10:39 AM

Aklapper added a project: Wmfdata-Python.Mar 10 2020, 5:56 PM

Spark crashes are much less frequent now due to T245097, so I'm lowering the priority.

However, if a crash happens, it's still the case that we can't recover from it without a restart of the IPython kernel.

nshahquinn-wmf moved this task from Doing to Next 2 weeks on the Product-Analytics (Kanban) board.Mar 11 2020, 2:46 PM

Per conversation with Neil, moving to the backlog since this hasn't come up as a major issue.

kzimmerman moved this task from Needs Investigation to Backlog on the Product-Analytics board.Apr 16 2020, 6:46 PM

Aklapper edited projects, added Analytics-Radar; removed Analytics.Jun 10 2020, 6:33 AM

nshahquinn-wmf removed nshahquinn-wmf as the assignee of this task.Apr 19 2021, 4:52 PM

nshahquinn-wmf lowered the priority of this task from Medium to Low.

Adding Data-Engineering to all Wmfdata-Python tasks, as requested by Dan and Andrew.

• EChetty moved this task from Incoming (new tickets) to Event Platform Backlog on the Data-Engineering board.Mar 24 2022, 3:05 PM

• EChetty moved this task from Event Platform Backlog to WMF-Data on the Data-Engineering board.Mar 28 2022, 10:53 AM

Thanks to T273210, Wmfdata now has the ability to recreate Spark sessions in the same notebook, which should give it the ability to easily recover from a crashed Spark session.

	F31625801: Screen Shot 2020-02-21 at 14.07.58.png
	Feb 21 2020, 7:11 PM

	F31625791: Screen Shot 2020-02-21 at 13.56.58.png
	Feb 21 2020, 6:59 PM
