For a couple of days now, the test DAG test_generic_artifact_deployment_dag has been failing. The first failure was on 2023-08-18, 13:40:16 UTC.
We now have a production DAG failing as well: country_project_page_daily_dag.
They both fail with the same stack trace:
[2023-08-21, 01:25:36 UTC] {taskinstance.py:1824} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/lib/airflow/lib/python3.10/site-packages/airflow/providers/apache/spark/operators/spark_submit.py", line 157, in execute
    self._hook.submit(self._application)
  File "/srv/deployment/airflow-dags/platform_eng/wmf_airflow_common/hooks/spark.py", line 436, in submit
    return self._skein_hook.submit()
  File "/srv/deployment/airflow-dags/platform_eng/wmf_airflow_common/hooks/skein.py", line 210, in submit
    self._application_id = self._client.submit(self._application_spec)
  File "/usr/lib/airflow/lib/python3.10/functools.py", line 981, in __get__
    val = self.func(instance)
  File "/srv/deployment/airflow-dags/platform_eng/wmf_airflow_common/hooks/skein.py", line 94, in _client
    return skein.Client(**self._client_kwargs)
  File "/usr/lib/airflow/lib/python3.10/site-packages/skein/core.py", line 357, in __init__
    self._call('ping', proto.Empty())
  File "/usr/lib/airflow/lib/python3.10/site-packages/skein/core.py", line 280, in _call
    raise ConnectionError("Unable to connect to %s" % self._server_name)
skein.exceptions.ConnectionError: Unable to connect to driver
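Note that the failing operators are all Spark Skein operators. They fail before launching the YARN job: per the trace, the error is raised while constructing skein.Client, when its initial ping to the Skein driver does not get through.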
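To help narrow this down, here is a minimal sketch for reproducing the connection step outside Airflow. It assumes skein is importable on the Airflow host and that calling skein.Client() with default arguments is close enough to what our hook does; the hook's actual _client_kwargs are not shown in this task, so that part is a guess. The helper name check_skein_driver and its client_factory parameter are hypothetical, made up for this example.

```python
def check_skein_driver(client_factory=None):
    """Try to start a Skein client and report whether the driver answered.

    client_factory is a hypothetical hook for this sketch: any zero-argument
    callable returning an object with a close() method. When omitted, the
    real skein.Client is used with default arguments (an assumption; the
    Airflow hook passes its own kwargs).
    """
    if client_factory is None:
        import skein  # imported lazily so the helper can be exercised without skein
        client_factory = skein.Client

    try:
        # Per the traceback, skein.Client.__init__ pings the driver, so a
        # failure here corresponds to the failing line in the trace.
        client = client_factory()
    except Exception as exc:  # broad on purpose: this is only a diagnostic
        return False, f"{type(exc).__name__}: {exc}"

    # Shut down the driver process this client started.
    client.close()
    return True, "driver started and responded"
```

Running check_skein_driver() as the same user Airflow runs as should show whether the failure is independent of the DAGs. If it returns False with "Unable to connect to driver", that would point at the host or its configuration rather than at our code.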
It looks like @Ottomata ran into a similar issue before: https://github.com/jcrist/skein/issues/165
It seems we need SRE to check whether a config change was deployed close to 2023-08-18, 13:40:16 UTC.