
wmfdata cannot recover from a crashed Spark session
Open, LowPublic


After a Spark crash (such as those caused by T245896), a Spark application can be left in a defective state that blocks a new session from being created. Some functions that do not involve data processing (such as returning the session's display representation or passing a query to .sql) may complete without error; however, anything that does involve data processing (like .collect or .toPandas) fails immediately. Examples include:

java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext
Py4JJavaError: An error occurred while calling o638.sql.
: java.lang.NullPointerException

In some cases, the ApplicationMaster continues to run; running a new query causes the application to start a new job and keep its allocated executors, but never make any progress. For example (see the attached screenshot):

In this case, explicitly calling .stop on the SparkSession causes a series of additional ApplicationMasters to be created before the application is finally stopped (see the attached screenshot).

In some cases, retrieving the PySpark session object using wmfdata.spark.get_session and explicitly calling .stop on it allows a new, functioning session to be created. In other cases, however, the "new" session still has errors like the following:

Py4JJavaError: An error occurred while calling
: org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true.
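One possible recovery pattern for the SPARK-2243 error above is to stop the stale session and clear PySpark's cached singletons before creating a new one. The sketch below is an assumption, not a wmfdata feature: `_instantiatedSession` and `_active_spark_context` are private PySpark 2.x internals and may change between versions.

```python
# Sketch: force-reset a wedged PySpark session so a fresh one can be built.
# The attribute names mirror private PySpark 2.x internals (an assumption;
# they may differ in other versions).
def force_reset(spark_module):
    """Stop any live session and clear the cached singletons that PySpark
    checks on session creation, avoiding the SPARK-2243 error."""
    session = spark_module.SparkSession._instantiatedSession
    if session is not None:
        try:
            session.stop()  # also stops the underlying SparkContext
        except Exception:
            pass  # the context may already be dead; ignore and clean up
    # Clear the module-level singletons so a new context can be created
    spark_module.SparkSession._instantiatedSession = None
    spark_module.SparkContext._active_spark_context = None
```

In a notebook this would be called with the real `pyspark` and `pyspark.sql` classes before asking wmfdata for a new session.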

Event Timeline

As I continue to encounter this problem in my work, I'm continuing to investigate and develop the description. I would appreciate any additions, suggestions, or clarifying questions!

Is there any pattern to the types of queries that cause this, or does it just happen randomly? Is there a query you can always reproduce this with? Does the same problem happen if you use python + wmfdata outside of Jupyter? What about just a spark session without wmfdata?

Also, I see in one of your screenshots that you have a toPandas call (I guess this is what wmfdata.hive does). This pulls the entire result from the Spark executors into the local driver process. Is the result you are trying to bring down too big?

@Ottomata, thanks for jumping in! See my responses in T245896.

@Ottomata, on the newly refined subject of this task (being unable to recover from a Spark error), I just compiled a complete log of the commands and output for one example I just experienced: P10482

In this case, I really don't want to restart, since I have previous query results stored in Python variables that I don't want to discard. I could save them as data files and then reload them, but that gets really tedious, and it really seems like there has to be a better way.

@nshahquinn-wmf sorry for not chiming in earlier.
I will try to provide explanations and ideas and suggest ways to deal with the problem.

Like a Python process, a Spark context only keeps its internal state as long as it lives. When it crashes, is killed, or is simply closed, all the information it contains (computed data, defined variables, etc.) is lost.
One thing that differs, and should be remembered, is that there is a YARN application managing the Spark one. The YARN application is the link between Spark and Hadoop that lets Spark access resources on the cluster. Usually, when a Spark context dies, its YARN application dies as well, but there are corner cases in which the Spark context is gone while the YARN application is not. In that case, the YARN application must be killed manually.
Another thing to keep in mind when using Spark in notebooks is that the Spark context is distinct from the notebook's Python kernel. The Spark context can be closed while the Python kernel is not, and vice versa.
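Because the Spark context and the Python kernel can die independently, it helps to check whether the context is still usable before reusing it. A minimal sketch, assuming the private PySpark attribute `_jsc` (the Java-side SparkContext handle), which is version-dependent:

```python
# Sketch: probe whether a SparkContext is still usable before reusing it.
# `_jsc` is a private PySpark attribute (an assumption; version-dependent).
def context_is_alive(sc):
    """Return True if `sc` looks like a live SparkContext."""
    if sc is None:
        return False
    try:
        # Ask the JVM-side context whether it has been stopped
        return not sc._jsc.sc().isStopped()
    except Exception:
        # Any error talking to the JVM means the context is unusable
        return False
```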

The way to deal with data held by a Spark context is to explicitly save it as needed: in HDFS via Spark if it is big, or in smaller files (HDFS or local) via Python after collecting the data from Spark (only feasible for small data).
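For small, already-collected results, the standard library is enough to snapshot driver-side variables so they survive a kernel or context restart. This is a generic sketch, not a wmfdata feature; the helper names and paths are illustrative:

```python
# Sketch: snapshot collected (driver-side) results to local files so they
# survive a Spark or kernel restart. Names and paths are illustrative.
import pickle
from pathlib import Path

def snapshot(obj, path):
    """Save a collected result to a local file."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def restore(path):
    """Reload a previously snapshotted result."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

For pandas DataFrames, a format like Parquet would be a more portable choice than pickle, but the pattern is the same: save after collecting, reload after restarting.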

Now, a Spark context failing regularly is not expected. The corner case I can see, however, is Kerberos-related: if the notebook stays up long enough, the Kerberos ticket used to launch the Spark process will at some point become invalid, causing a crash. That means Spark-held data should only be kept in flight while actively working on the notebook; data that needs to live longer should be collected into the Python kernel.

Finally, when a Spark context dies in a notebook, starting a new one should be feasible. Decoupling Spark code from plain Python code across cells, so that the Spark-only cells can be rerun on their own, is therefore good practice. Let's see how relaunching a Spark context could be made possible through wmfdata.
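The decoupling described above can be captured in a small get-or-restart helper that lives in its own cell. This is a sketch of the pattern only; `make_session` and `is_alive` stand in for whatever factory and liveness check you use (e.g. wmfdata's session helper), and are not a real wmfdata API:

```python
# Sketch: cache a session and rebuild it only when the old context has
# died. `make_session` and `is_alive` are placeholders, not wmfdata APIs.
_session = None

def get_or_restart_session(make_session, is_alive):
    """Return the cached live session, or build a fresh one if the old
    context is gone. Rerunning the cell that calls this recovers from a
    crash without restarting the whole kernel."""
    global _session
    if _session is None or not is_alive(_session):
        _session = make_session()
    return _session
```

Keeping this call in a dedicated cell means that after a crash, only that cell needs rerunning; cells holding pure-Python variables keep their state.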

Let's continue to talk about this; I promise to be more responsive :)

LGoto triaged this task as High priority.
LGoto moved this task from Triage to Needs Investigation on the Product-Analytics board.

@nshahquinn-wmf Kerberos tickets expire after one day; once they do, the Spark context will no longer work, so you need to restart your kernel entirely. We will be extending that expiration a bit, but regardless, a notebook left idle for a few days will need a restart. Our plan is to have, similar to what is done now for screen sessions, a notebook killer, so that idle notebooks are shut down, giving a clean start when reawakened and not consuming resources.

nshahquinn-wmf lowered the priority of this task from High to Medium. Mar 11 2020, 2:42 PM

Spark crashes are much less frequent now due to T245097, so I'm lowering the priority.

However, if a crash happens, it's still the case that we can't recover from it without a restart of the IPython kernel.

Per conversation with Neil, moving to the backlog since this hasn't come up as a major issue.

nshahquinn-wmf lowered the priority of this task from Medium to Low.