Page MenuHomePhabricator

Airflow search instance scheduler appears to be flapping
Closed, ResolvedPublic

Description

We have been receiving aporadic alerts about AirflowDeploymentUnavailable: airflow-scheduler.airflow-search is unavailable on dse-k8s-eqiad

Upon checking https://airflow-search.wikimedia.org/home we can see an alert about the scheduler having restarted recently.

image.png (1×1 px, 213 KB)

Details

Other Assignee
BTullis

Related Objects

Event Timeline

We can see that the scheduler pod has restarted 59 times and the airflow-gitsync pod has restarted 9 times.

btullis@deploy1003:/srv/deployment-charts/helmfile.d/dse-k8s-services/airflow-search$ kubectl get pods
NAME                                                              READY   STATUS      RESTARTS         AGE
airflow-gitsync-d76b48d7f-bj5jb                                   1/1     Running     9 (26d ago)      40d
airflow-kerberos-84b9567694-k48z6                                 1/1     Running     0                13d
airflow-scheduler-68b4b476f6-phchl                                1/2     Running     59 (2m15s ago)   13d
airflow-webserver-6f7756978f-bw8gh                                2/2     Running     0                13d
envoy-7759d655cf-588zl                                            1/1     Running     0                13d
hadoop-shell-7ffb86c579-8lctd                                     1/1     Running     0                13d
image-suggestions-weekly-wait-for-data-s3tv137a                   1/1     Running     0                128m
postgresql-airflow-search-1                                       1/1     Running     0                40d
postgresql-airflow-search-2                                       1/1     Running     0                40d
postgresql-airflow-search-pooler-rw-6db7749bd5-964hn              1/1     Running     0                40d
postgresql-airflow-search-pooler-rw-6db7749bd5-dsv9l              1/1     Running     0                40d
postgresql-airflow-search-pooler-rw-6db7749bd5-z6n7p              1/1     Running     0                40d
process-sparql-query-hourly-wait-for-data-h865p6fd                0/1     Completed   0                2m6s
process-sparql-query-hourly-wait-for-data-kh3947vb                0/1     Completed   0                2m5s
process-sparql-query-hourly-wait-for-data-sakrm52p                0/1     Completed   0                2m8s
subgraph-mapping-weekly-wait-for-data-2z2v1f8n                    0/1     Completed   0                117s
subgraph-query-mapping-daily-wait-for-data-3pqnmctc               0/1     Completed   0                118s
subgraph-query-metrics-daily-wait-for-data-4569eh27               0/1     Completed   0                2m
subgraph-query-metrics-daily-wait-for-data-haac0koj               0/1     Completed   0                2m1s
subgraph-query-metrics-daily-wait-for-data-rpzh6hlb               0/1     Completed   0                2m3s
wcqs-streaming-updater-reconcile-hourly-wait-for-event-ua57a7ic   0/1     Completed   0                116s
wdqs-streaming-updater-reconcile-hourly-wait-for-event-83uv1bh6   0/1     Error       0                40d

The symptoms are similar to those seen when we investigated here: https://wikimedia.slack.com/archives/C055QGPTC69/p1742325253186869

The outcome there was that we increased the resources available to the airflow-scheduler pod.
We tried patching the airflow-scheduler deployment, increasing the RAM to from 4GB to 8GB and the CPU from 1 to 4.

But it doesn't seem to have helped in this case.

We do have a Ceph cluster in warning state, which might be related. I checked on which host the scheduler was running and looked at the output from dmesg -T and it confirms that there were come kernel log messages mentioning Ceph.

[Tue Mar 25 18:46:54 2025] libceph: mon0 (1)[2620:0:861:109:10:64:130:13]:6789 socket closed (con state V1_BANNER)
[Tue Mar 25 18:46:54 2025] libceph: mon0 (1)[2620:0:861:109:10:64:130:13]:6789 socket closed (con state V1_BANNER)
[Tue Mar 25 18:46:54 2025] libceph: mon0 (1)[2620:0:861:109:10:64:130:13]:6789 socket closed (con state V1_BANNER)
[Tue Mar 25 18:46:54 2025] libceph: mon8 (1)[2620:0:861:10e:10:64:135:21]:6789 socket closed (con state V1_BANNER)
[Tue Mar 25 18:46:55 2025] libceph: mon8 (1)[2620:0:861:10e:10:64:135:21]:6789 socket closed (con state V1_BANNER)
[Tue Mar 25 18:46:55 2025] libceph: mon8 (1)[2620:0:861:10e:10:64:135:21]:6789 socket closed (con state V1_BANNER)
[Tue Mar 25 18:46:55 2025] libceph: mon0 (1)[2620:0:861:109:10:64:130:13]:6789 socket closed (con state V1_BANNER)
[Tue Mar 25 18:46:56 2025] libceph: mon8 (1)[2620:0:861:10e:10:64:135:21]:6789 socket closed (con state V1_BANNER)
[Tue Mar 25 18:46:57 2025] libceph: mon5 (1)10.64.132.23:6789 session established
[Tue Mar 25 18:46:57 2025] libceph: client18134227 fsid 6d4278e1-ea45-4d29-86fe-85b44c150813
[Tue Mar 25 18:46:58 2025] libceph: mon2 (1)[2620:0:861:10a:10:64:131:21]:6789 socket closed (con state V1_BANNER)
[Tue Mar 25 18:46:58 2025] libceph: mon2 (1)[2620:0:861:10a:10:64:131:21]:6789 socket closed (con state V1_BANNER)
[Tue Mar 25 18:46:59 2025] libceph: mon2 (1)[2620:0:861:10a:10:64:131:21]:6789 socket closed (con state V1_BANNER)
[Tue Mar 25 18:47:00 2025] libceph: mon2 (1)[2620:0:861:10a:10:64:131:21]:6789 socket closed (con state V1_BANNER)
[Tue Mar 25 18:47:01 2025] libceph: mon9 (1)10.64.135.21:6789 session established
[Tue Mar 25 18:47:01 2025] libceph: client18183189 fsid 6d4278e1-ea45-4d29-86fe-85b44c150813

It was related to the cephfs volume and seems to have been a repeat of this incident: T389470: CephOSDClusterInWarning

We went co cephosd1001 and executed:

btullis@cephosd1001:~$ sudo ceph -s

Which confirmed that it was slow requests again.
We found out which mds was affected by running:

btullis@cephosd1001:~$ sudo ceph -w

We then logged onto cephosd1005 and checked the output from:

systemctl status ceph-mds@cephosd1005.service

This confirmed that it was logging slow requests.
We then restarted the service with:

sudo systemctl restart ceph-mds@cephosd1005.service

This fixed the scheduler.

We reset the resources for the scheduler pod back to 1 CPU and 2GB of RAM, even though it is reasonable to have them higher. It just wasn't the cause of the issue, so we reset it for now.