
Multiple DAGs on platform_eng instance failing on Spark Skein operators with ConnectionError
Closed, Resolved · Public

Description

For a couple of days now we have had a test DAG, test_generic_artifact_deployment_dag, failing. The first failure was on 2023-08-18, 13:40:16 UTC.

We now have a production DAG failing as well: country_project_page_daily_dag.

They both fail with the same stack trace:

[2023-08-21, 01:25:36 UTC] {taskinstance.py:1824} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/lib/airflow/lib/python3.10/site-packages/airflow/providers/apache/spark/operators/spark_submit.py", line 157, in execute
    self._hook.submit(self._application)
  File "/srv/deployment/airflow-dags/platform_eng/wmf_airflow_common/hooks/spark.py", line 436, in submit
    return self._skein_hook.submit()
  File "/srv/deployment/airflow-dags/platform_eng/wmf_airflow_common/hooks/skein.py", line 210, in submit
    self._application_id = self._client.submit(self._application_spec)
  File "/usr/lib/airflow/lib/python3.10/functools.py", line 981, in __get__
    val = self.func(instance)
  File "/srv/deployment/airflow-dags/platform_eng/wmf_airflow_common/hooks/skein.py", line 94, in _client
    return skein.Client(**self._client_kwargs)
  File "/usr/lib/airflow/lib/python3.10/site-packages/skein/core.py", line 357, in __init__
    self._call('ping', proto.Empty())
  File "/usr/lib/airflow/lib/python3.10/site-packages/skein/core.py", line 280, in _call
    raise ConnectionError("Unable to connect to %s" % self._server_name)
skein.exceptions.ConnectionError: Unable to connect to driver

Note that the failing operators are all Spark Skein operators. They fail before launching the Yarn job.

It looks like @Ottomata had run into a similar issue before: https://github.com/jcrist/skein/issues/165
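
For reference, the failure can be reproduced outside Airflow with a minimal sketch like the one below (the exception alias and the catch-and-print are my own; the assumption, consistent with the issue linked above, is that constructing the Skein client pings its driver, so an expired certificate under ~/.skein surfaces as this same ConnectionError):

# Minimal reproduction sketch (assumptions: run from the Airflow venv as the
# analytics-platform-eng user, with HOME set so that skein finds the
# certificates under ~/.skein).
import skein
from skein.exceptions import ConnectionError as SkeinConnectionError

try:
    client = skein.Client()  # constructing the client pings the driver, as in the traceback
    print("Skein driver reachable:", client)
    client.close()
except SkeinConnectionError as exc:
    print("Reproduced the failure:", exc)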

It seems like we need SRE to take a look at whether a config change was implemented close to 2023-08-18, 13:40:16 UTC.

Event Timeline

Marking this as high priority because a production pipeline is compromised.

BTullis raised the priority of this task from High to Unbreak Now!. Aug 21 2023, 3:32 PM
BTullis moved this task from Incoming to In Progress on the Data-Platform-SRE board.

I have regenerated the skein certificates.

btullis@marlin:~/wmf/archiva$ ssh an-airflow1004.eqiad.wmnet
btullis@an-airflow1004:~$ sudo su - analytics-platform-eng 

analytics-platform-eng@an-airflow1004:/home/btullis$ export HOME=/srv/airflow-platform_eng/

analytics-platform-eng@an-airflow1004:/home/btullis$ source /usr/lib/airflow/bin/activate

Checked the date of expiry.

(airflow) analytics-platform-eng@an-airflow1004:/srv/airflow-platform_eng$ openssl x509 -in ~/.skein/skein.crt -dates |head -n 2
notBefore=Aug 17 15:46:03 2022 GMT
notAfter=Aug 17 15:46:03 2023 GMT

Regenerated the certificates, forcing overwriting of the existing files.

(airflow) analytics-platform-eng@an-airflow1004:/srv/airflow-platform_eng$ skein config gencerts --force

Checked the expiry dates again. They are good for another year.

(airflow) analytics-platform-eng@an-airflow1004:/srv/airflow-platform_eng$ openssl x509 -in ~/.skein/skein.crt -dates |head -n 2
notBefore=Aug 21 15:39:26 2023 GMT
notAfter=Aug 20 15:39:26 2024 GMT
BTullis lowered the priority of this task from Unbreak Now! to High. Aug 21 2023, 3:42 PM

@xcollazo - Could you see if this fixes the immediate issue please?

Thanks @BTullis! All runs of test_generic_artifact_deployment_dag are now green, and country_project_page_daily_dag is rerunning its first failed run. There is already a Yarn application ID, so it's looking good!

Will close this once the first production rerun succeeds.

Great! I'm sorry that this affected you. I will make sure that we get a handle on T329398: Puppetize Skein certificate generation, because it got away from me. It's an annual time bomb for each Airflow instance at the moment.
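
Until that's puppetized, something like the sketch below could be run as a periodic check to warn well before expiry (a hypothetical script, not an existing check; the certificate path, the 30-day threshold, and the use of the cryptography package are all assumptions):

# Hypothetical expiry-check sketch; assumes the skein.crt path below and that
# the `cryptography` package is importable from the Airflow virtualenv.
import sys
from datetime import datetime, timedelta, timezone
from pathlib import Path

from cryptography import x509

CERT_PATH = Path("/srv/airflow-platform_eng/.skein/skein.crt")  # assumed location
WARN_BEFORE = timedelta(days=30)

cert = x509.load_pem_x509_certificate(CERT_PATH.read_bytes())
expires = cert.not_valid_after.replace(tzinfo=timezone.utc)
remaining = expires - datetime.now(timezone.utc)

if remaining < WARN_BEFORE:
    print(f"Skein certificate expires in {remaining.days} days ({expires}); regenerate with 'skein config gencerts --force'")
    sys.exit(1)

print(f"Skein certificate OK until {expires}")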

Will close this once the first production rerun succeeds.

Success.

I will make sure that we get a handle on T329398: Puppetize Skein certificate generation because it got away from me.

Ah, will link these two. Thanks for the pointer!

Thanks for taking care of this @xcollazo and @BTullis! Really appreciate you catching this while I was OOO.