
Mjolnir is sometimes stuck in feature selection
Closed, ResolvedPublic5 Estimated Story Points

Description

Since October 2024, we have been seeing intermittent issues with the feature selection task.

The Spark job seems to get stuck in the approxQuantile stage, apparently doing nothing.

The cause is not clear at this point, and restarting the task might succeed. What is particularly problematic is that the job never ends and can keep running for several days until an operator kills it. Because it is configured to use the sequential pool, it blocks other unrelated tasks that use the same pool in the search Airflow instance.

AC:

  • mitigate the issue by putting mjolnir in its own pool
  • try to understand why it gets stuck and fix the cause
  • make sure this task does not run forever: prefer failing if it runs for more than X hours (X to be determined; this task usually takes between 1h30 and 2h00)
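The first and third acceptance criteria can be sketched as plain Airflow operator arguments. This is a hedged illustration, not the actual DAG change: the pool name, the timeout multiplier, and the helper are assumptions.

```python
from datetime import timedelta

# Typical runtime is 1h30-2h00; a cap of 3x that fails long before the
# multi-day hangs observed here, while leaving headroom for slow runs.
TYPICAL_RUNTIME = timedelta(hours=2)
TIMEOUT_MULTIPLIER = 3

def mjolnir_task_kwargs():
    """Keyword arguments for the feature-selection operator
    (names are illustrative, not the real DAG's)."""
    return {
        # Dedicated pool: a hung run no longer blocks the sequential pool.
        "pool": "mjolnir",
        # Hard runtime cap so Airflow fails the task instead of letting it
        # run for days.
        "execution_timeout": TYPICAL_RUNTIME * TIMEOUT_MULTIPLIER,
    }
```

Passing these kwargs to the task's operator would implement both mitigations in one place.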

Event Timeline

Restricted Application added a subscriber: Aklapper.

On 2025-01-13 we had a possibly related failure in the feature collection step.

The skein launcher failed without logging a Spark app ID, but a Spark job was seemingly executed. A related YARN app (mjolnir_weekly__feature_vectors-norm_query-20180215-query_explorer__20250103) was RUNNING (YARN app). It started Jan 11, 2025, matching the failed Airflow task timeline (Airflow task).

However, the Spark application (Job status) had no worker task marked as active/running, just what looks like an idle driver process. Output data exists in mjolnir.feature_vectors for 2025-01-03 for all target wikis, but the volume is roughly half that of the previous run. It's unclear if this was to be expected.
The main driver thread is waiting on PythonRunner:

java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:502)
java.lang.UNIXProcess.waitFor(UNIXProcess.java:395)
org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:101)
org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

The resolution for now was to kill the dangling Spark app and re-run the feature collection Airflow task.
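For reference, killing a dangling app can be done through the YARN ResourceManager REST API (or `yarn application -kill <app_id>` from the CLI). A minimal sketch; the RM host and port are assumptions about the cluster:

```python
import json
from urllib.request import Request

def build_kill_request(rm_host: str, app_id: str) -> Request:
    """PUT .../apps/{app_id}/state with {"state": "KILLED"} asks the
    ResourceManager to kill the application."""
    url = f"http://{rm_host}:8088/ws/v1/cluster/apps/{app_id}/state"
    body = json.dumps({"state": "KILLED"}).encode()
    return Request(url, data=body, method="PUT",
                   headers={"Content-Type": "application/json"})

# urllib.request.urlopen(build_kill_request(rm_host, app_id)) would send it.
```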

Gehel set the point value for this task to 5.Jan 13 2025, 4:51 PM

However, the Spark application (Job status) had no worker task marked as active/running, just what looks like an idle driver process […] it is unclear if this was to be expected.

We came across this scenario again in a 2025-01-16 attempt. This looks like expected behaviour (to a degree), triggered by:
https://gitlab.wikimedia.org/repos/search-platform/mjolnir/-/blob/main/mjolnir/kafka/client.py?ref_type=heads#L116

Spark was waiting for an ack of work completion; eventually the job resumed. We should probably generate some metrics regarding expected waiting times in this code path.
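A minimal sketch of such a metric: wrap the blocking wait in a timer and report how long it blocked. The metric name and the `emit` callback are placeholders for whatever metrics client (statsd, Prometheus, ...) is available.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed_wait(emit, metric="mjolnir.kafka.ack_wait_seconds"):
    """Report how long the wrapped block waited, via the emit(name, value)
    callback, even if the wait raises."""
    start = time.monotonic()
    try:
        yield
    finally:
        emit(metric, time.monotonic() - start)
```

With such a timer around the wait in client.py, we could alert when the "idle driver" phase exceeds its expected duration instead of discovering it days later.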

My assumption is that we might have encountered a race condition while waiting for or acknowledging a sigil event of work completion. At the time, multiple instances of this Spark job were running (due to a regression introduced during the migration of Airflow-Search to Kubernetes), all reading from and writing to the same topic. FWIW, in the runs executed after the multiple job spawning bug was fixed, the task completed successfully.
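A toy model of the suspected race, under loud assumptions (this is not mjolnir's actual protocol): suppose completion is signalled by per-partition sigil records and the driver resumes once it has seen one per partition. With two jobs sharing the topic, one job can consume the other's sigils, leaving the victim waiting forever; tagging sigils with a run id and filtering on it would close that hole.

```python
def should_resume(records, run_id, n_partitions):
    """True once this run has seen an end-sigil for every partition of the
    shared topic. Records from other runs are ignored."""
    seen = {r["partition"] for r in records
            if r.get("sigil") and r.get("run_id") == run_id}
    return len(seen) >= n_partitions
```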

This task got stuck again, and we did not notice for a while (~346 hours).

I suspect the approxQuantile execution point is a red herring. The Spark UI reports active tasks, when in fact none are running.

A thread dump for the driver shows it waiting for worker threads to reply, but all executors appear to be dead or deallocated. Killing the reportedly active approxQuantile task resulted in the entire job being killed.

I paired on this with @JAllemandou, but the root cause of the issue is not yet clear. What is surprising is that the Spark job
should have been killed by our tooling after a week of execution, but it was not. This will require further investigation.
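As part of that investigation, the missing week-old-job guard could be checked with a small sweep over the ResourceManager's application list. A sketch, assuming the standard `/ws/v1/cluster/apps` JSON fields (`state`, millisecond-epoch `startedTime`):

```python
import time

ONE_WEEK_S = 7 * 24 * 3600

def overdue_apps(apps, max_runtime_s=ONE_WEEK_S, now_ms=None):
    """Return RUNNING apps older than max_runtime_s, i.e. the ones our
    tooling should have killed already."""
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    return [a for a in apps
            if a.get("state") == "RUNNING"
            and (now_ms - a["startedTime"]) / 1000.0 > max_runtime_s]
```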

The linked MR implements the workaround suggested in the OP (moving to a dedicated pool and implementing a task timeout).
A second MR bumps memory for this task as an experiment to see if more resources will prevent a livelock.
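The memory bump amounts to a couple of Spark conf overrides; the values below are placeholders for illustration, not the numbers actually merged in the MR:

```python
def feature_selection_spark_conf(driver_memory="8g", driver_overhead="2g"):
    """Spark conf overrides for the memory experiment (hypothetical values)."""
    return {
        "spark.driver.memory": driver_memory,
        "spark.driver.memoryOverhead": driver_overhead,
    }
```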

This task has been running stably for the past few days, since bumping driver memory. Moving to In Review.