
Investigate an incident where the airflow-test-k8s database was wiped
Closed, Resolved · Public

Description

Whilst working on T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes, we deployed a workload to the https://airflow-test-k8s.wikimedia.org/ instance that was intended to dump the database contents of 200 regular wikis to a CephFS volume.

Some kind of incident occurred as a result, which led to the PostgreSQL database serving this instance being deleted.

The webserver and scheduler pods were not available this morning; the pods shown below were the only ones present.

btullis@deploy1003:/srv/deployment-charts/helmfile.d/dse-k8s-services/airflow-test-k8s$ k get pods
NAME                                                              READY   STATUS      RESTARTS   AGE
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-05dk1kwr   0/1     Error       0          8h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-173tysdj   0/1     Error       0          8h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-6po93igi   0/1     Error       0          8h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-adcuijwi   0/1     Error       0          8h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-c13i7m3v   0/1     Error       0          8h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-cgw9cfxz   0/1     Error       0          8h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-fcox70tc   0/1     Error       0          8h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-igjmlqml   0/1     Completed   0          19h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-ir7f1qkx   0/1     Error       0          8h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-iszs46cc   0/1     Error       0          10h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-rahe5r6j   0/1     Error       0          8h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-sp6jk395   0/1     Error       0          8h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-t1wwhg0c   0/1     Error       0          8h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-u5256sm6   0/1     Error       0          8h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-v3t8oq39   0/1     Error       0          8h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-vix6js7w   0/1     Error       0          8h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-whe4qygg   0/1     Error       0          8h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-z0hq4drh   0/1     Error       0          8h
postgresql-airflow-test-k8s-1                                     1/1     Running     0          7h45m
postgresql-airflow-test-k8s-2                                     1/1     Running     0          7h44m
postgresql-airflow-test-k8s-pooler-rw-6dbb4bbb49-mpwgj            1/1     Running     0          7h45m
postgresql-airflow-test-k8s-pooler-rw-6dbb4bbb49-qrnq8            1/1     Running     0          7h45m
postgresql-airflow-test-k8s-pooler-rw-6dbb4bbb49-zdmkf            1/1     Running     0          7h45m

In order to redeploy the webserver, scheduler, and other components, I had to remove these mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-* pods, since they were holding on to a persistent volume claim.
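
For reference, errored pods like these can also be removed in bulk rather than one at a time. A minimal sketch, assuming the current kubectl context and namespace point at airflow-test-k8s:

# Delete every pod in the Failed phase (shown as "Error" by kubectl get pods).
kubectl get pods --field-selector=status.phase=Failed -o name \
  | xargs --no-run-if-empty kubectl delete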

When the webserver came back up, all DAGs were paused and the custom pools that we had made from the Airflow UI were no longer present. All historical run data had also been removed, which indicates that the database had been wiped.

Note also that the database pods were only 7h45m old. So if those pods were removed, perhaps by a resource contention issue, their removal may also have deleted the PVs that contained the database.
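
One way to check for eviction or deletion activity is the namespace's event stream, although the API server only retains events for a short window (one hour by default), so overnight activity may already have aged out. A hedged sketch:

# List recent events in the namespace, oldest first.
kubectl get events -n airflow-test-k8s --sort-by=.lastTimestamp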

We can restore from a backup, but it is even more important to be able to prevent this from happening again.

Event Timeline

BTullis triaged this task as High priority. (Apr 4 2025, 10:42 AM)

We need to understand:

  • Why the cluster was removed and recreated
  • Whether we can prevent this from happening
  • Whether we can stop the PVs from being deleted when this happens

I think that we want to set:

reclaimPolicy: Retain

on all of our storageclasses.

root@deploy1003:~# kube-env admin dse-k8s-eqiad
root@deploy1003:~# kubectl get storageclasses
NAME                PROVISIONER           RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
ceph-cephfs-dumps   cephfs.csi.ceph.com   Delete          Immediate           true                   87d
ceph-cephfs-hdd     cephfs.csi.ceph.com   Delete          Immediate           true                   122d
ceph-cephfs-ssd     cephfs.csi.ceph.com   Delete          Immediate           true                   122d
ceph-rbd-ssd        rbd.csi.ceph.com      Delete          Immediate           true                   122d

https://kubernetes.io/docs/concepts/storage/storage-classes/#reclaim-policy
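
For illustration, the relevant fragment of a StorageClass manifest with a Retain policy might look like the sketch below. Note that reclaimPolicy is immutable on an existing StorageClass, so changing it means recreating the object, and even then it only affects newly provisioned PVs:

# Sketch only: the real definition also needs the CSI parameters
# (clusterID, pool, secret references) matching the existing storageclass.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd-ssd
provisioner: rbd.csi.ceph.com
reclaimPolicy: Retain
volumeBindingMode: Immediate
allowVolumeExpansion: true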

There was no particular resource contention in the airflow-test-k8s namespace overnight:
https://grafana.wikimedia.org/goto/va7Z6mANR?orgId=1

[Attachment: image.png (Grafana screenshot, 300 KB)]

Change #1134200 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Configure the ceph-csi-rbd storageclass to retain PVs

https://gerrit.wikimedia.org/r/1134200

I have created a patch to update the reclaim policy of the ceph-rbd-ssd storageclass, but I don't think that this will update the existing PVs.

root@deploy1003:~# kubectl get pv|head
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                                                  STORAGECLASS        REASON   AGE
pvc-048e5481-7dbf-47de-9acb-d7ddd11b362c   15Gi       RWO            Delete           Bound    airflow-main/postgresql-airflow-main-2-wal                             ceph-rbd-ssd                 39d
pvc-04d56e24-a112-413f-b2f9-24e9cd7b8510   12Mi       RWX            Delete           Bound    airflow-platform-eng/airflow-kerberos-token-pvc                        ceph-cephfs-ssd              139d
pvc-050263c5-ca3f-41bc-8c4f-fcffebca4529   5Gi        RWO            Delete           Bound    airflow-analytics-test/postgresql-airflow-analytics-test-2             ceph-rbd-ssd                 130d
pvc-0a628edf-1993-494a-894d-7366fd0ac5b0   5Gi        RWO            Delete           Bound    airflow-analytics-test/postgresql-airflow-analytics-test-1             ceph-rbd-ssd                 130d
pvc-0e1a3c35-4a4d-40e6-8a8e-7e44c156d61f   15Gi       RWO            Delete           Bound    airflow-ml/postgresql-airflow-ml-1-wal                                 ceph-rbd-ssd                 113d
pvc-11721a33-7063-4657-b794-4f0d7754ac6d   15Gi       RWO            Delete           Bound    airflow-search/postgresql-airflow-search-1                             ceph-rbd-ssd                 84d
pvc-18eb3596-596e-4ed9-8fdc-037ab24881fe   15Gi       RWO            Delete           Bound    airflow-analytics-product/postgresql-airflow-analytics-product-2-wal   ceph-rbd-ssd                 36d
pvc-19635a7d-ec28-4151-9cb3-f288a7dc3771   15Gi       RWO            Delete           Bound    airflow-search/postgresql-airflow-search-2-wal                         ceph-rbd-ssd                 84d
pvc-1c904585-0a34-4fd4-aa91-ef8b4b4ab6a9   30Gi       RWO            Delete           Bound    airflow-main/postgresql-airflow-main-1                                 ceph-rbd-ssd                 39d

Happily, it seems that updating the reclaim policy of existing PVs is possible.
https://kubernetes.io/docs/tasks/administer-cluster/change-pv-reclaim-policy/

I patched all of the PostgreSQL persistent volumes for the test-k8s instance.

root@deploy1003:~# kubectl get pv|grep postgresql-airflow-test-k8s
pvc-44992b68-4837-4b5c-995d-4ceee6081993   5Gi        RWO            Delete           Bound    airflow-test-k8s/postgresql-airflow-test-k8s-2                         ceph-rbd-ssd                 10h
pvc-4a444ea8-aa42-422c-98d1-3df4acc7a5d7   5Gi        RWO            Delete           Bound    airflow-test-k8s/postgresql-airflow-test-k8s-1                         ceph-rbd-ssd                 11h
pvc-59dce036-6a23-4d60-b7d3-19e104b13a3c   15Gi       RWO            Delete           Bound    airflow-test-k8s/postgresql-airflow-test-k8s-2-wal                     ceph-rbd-ssd                 10h
pvc-72cb618c-89b8-46b1-9dff-cde4cf150f87   15Gi       RWO            Retain           Bound    airflow-test-k8s/postgresql-airflow-test-k8s-1-wal                     ceph-rbd-ssd                 11h
root@deploy1003:~# kubectl patch pv pvc-44992b68-4837-4b5c-995d-4ceee6081993 -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
persistentvolume/pvc-44992b68-4837-4b5c-995d-4ceee6081993 patched
root@deploy1003:~# kubectl patch pv pvc-4a444ea8-aa42-422c-98d1-3df4acc7a5d7 -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
persistentvolume/pvc-4a444ea8-aa42-422c-98d1-3df4acc7a5d7 patched
root@deploy1003:~# kubectl patch pv pvc-59dce036-6a23-4d60-b7d3-19e104b13a3c -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
persistentvolume/pvc-59dce036-6a23-4d60-b7d3-19e104b13a3c patched
root@deploy1003:~# kubectl patch pv pvc-72cb618c-89b8-46b1-9dff-cde4cf150f87 -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
persistentvolume/pvc-72cb618c-89b8-46b1-9dff-cde4cf150f87 patched
root@deploy1003:~# kubectl get pv|grep postgresql-airflow-test-k8s
pvc-44992b68-4837-4b5c-995d-4ceee6081993   5Gi        RWO            Retain           Bound    airflow-test-k8s/postgresql-airflow-test-k8s-2                         ceph-rbd-ssd                 11h
pvc-4a444ea8-aa42-422c-98d1-3df4acc7a5d7   5Gi        RWO            Retain           Bound    airflow-test-k8s/postgresql-airflow-test-k8s-1                         ceph-rbd-ssd                 11h
pvc-59dce036-6a23-4d60-b7d3-19e104b13a3c   15Gi       RWO            Retain           Bound    airflow-test-k8s/postgresql-airflow-test-k8s-2-wal                     ceph-rbd-ssd                 11h
pvc-72cb618c-89b8-46b1-9dff-cde4cf150f87   15Gi       RWO            Retain           Bound    airflow-test-k8s/postgresql-airflow-test-k8s-1-wal                     ceph-rbd-ssd                 11h
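
For instances with more volumes, the same patch can be applied in a loop. A sketch, mirroring the grep-based selection above:

# Set the reclaim policy to Retain on every PV bound to a claim in the
# airflow-test-k8s namespace.
for pv in $(kubectl get pv --no-headers | grep 'airflow-test-k8s/' | awk '{print $1}'); do
  kubectl patch pv "$pv" -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
done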

It looks like things started going wrong at 00:11 UTC. This is from the logs of the cloudnative-pg-operator pod.

{"level":"info","ts":"2025-04-04T00:01:31Z","msg":"Next backup schedule","controller":"scheduledbackup","controllerGroup":"postgresql.cnpg.io","controllerKind":"ScheduledBackup","ScheduledBackup":{"name":"postgresql-airflow-ml-daily-backup","namespace":"airflow-ml"},"namespace":"airflow-ml","name":"postgresql-airflow-ml-daily-backup","reconcileID":"2522a920-8d6c-4ee0-9d9b-f1348a6ed272","next":"2025-04-05T00:00:00Z"}

{"level":"info","ts":"2025-04-04T00:11:37Z","msg":"Resource has been deleted","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgresql-airflow-test-k8s","namespace":"airflow-test-k8s"},"namespace":"airflow-test-k8s","name":"postgresql-airflow-test-k8s","reconcileID":"dcdba6af-d548-487a-8554-a4c6f80a2a0c"}
{"level":"info","ts":"2025-04-04T00:11:37Z","msg":"Resource has been deleted","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgresql-airflow-test-k8s","namespace":"airflow-test-k8s"},"namespace":"airflow-test-k8s","name":"postgresql-airflow-test-k8s","reconcileID":"183e9502-da23-4f40-81b9-b7e28bab3970"}
{"level":"info","ts":"2025-04-04T00:11:37Z","msg":"Cluster not found, will retry in 30 seconds","controller":"pooler","controllerGroup":"postgresql.cnpg.io","controllerKind":"Pooler","Pooler":{"name":"postgresql-airflow-test-k8s-pooler-rw","namespace":"airflow-test-k8s"},"namespace":"airflow-test-k8s","name":"postgresql-airflow-test-k8s-pooler-rw","reconcileID":"cde8a255-f6b3-4d95-b57c-41dfc4ad6bfe","cluster":"postgresql-airflow-test-k8s"}

It had been running backups at 00:00, and the next log entry, at 00:11:37, says Resource has been deleted, naming the resource as postgresql-airflow-test-k8s.

There are some messages like this, which suggest that existing PVCs are skipped.

{"level":"info","ts":"2025-04-04T00:14:50Z","msg":"skipping pvc because it has owner metadata","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgresql-airflow-test-k8s","namespace":"airflow-test-k8s"},"namespace":"airflow-test-k8s","name":"postgresql-airflow-test-k8s","reconcileID":"210eaf1a-80a5-4afb-8270-fa4bdc0b8410","step":"get_orphan_pvcs","pvcName":"postgresql-airflow-test-k8s-2"}

But I'm not sure if that indicates a problem or not.
Generally, the logs from the cloudnative-pg-operator seem to show that it is working correctly. It had detected that the postgresql cluster had been deleted, so the next question is why it was deleted.
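
As an aside, the owner metadata referenced in the skipping message can be inspected directly on the PVC. A hedged example; for CNPG-managed volumes the owner is expected to be the Cluster resource:

# Show which object owns the PVC.
kubectl get pvc postgresql-airflow-test-k8s-2 -n airflow-test-k8s \
  -o jsonpath='{.metadata.ownerReferences}'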

I tried deleting the cluster and re-deploying it with:

kubectl delete cluster postgresql-airflow-test-k8s

helmfile -e dse-k8s-eqiad --selector 'name=postgresql-airflow-test-k8s' delete
helmfile -e dse-k8s-eqiad --selector 'name=postgresql-airflow-test-k8s' apply

I had hoped to see the operator reusing the existing PVs, but it looks like it released the previous PVs and created 4 new ones.

root@deploy1003:~# kubectl get pv|grep postgresql-airflow-test-k8s
pvc-44992b68-4837-4b5c-995d-4ceee6081993   5Gi        RWO            Retain           Released   airflow-test-k8s/postgresql-airflow-test-k8s-2                         ceph-rbd-ssd                 12h
pvc-4a444ea8-aa42-422c-98d1-3df4acc7a5d7   5Gi        RWO            Retain           Released   airflow-test-k8s/postgresql-airflow-test-k8s-1                         ceph-rbd-ssd                 12h
pvc-59dce036-6a23-4d60-b7d3-19e104b13a3c   15Gi       RWO            Retain           Released   airflow-test-k8s/postgresql-airflow-test-k8s-2-wal                     ceph-rbd-ssd                 12h
pvc-72cb618c-89b8-46b1-9dff-cde4cf150f87   15Gi       RWO            Retain           Released   airflow-test-k8s/postgresql-airflow-test-k8s-1-wal                     ceph-rbd-ssd                 12h
pvc-8194a837-799f-4f2e-9dfa-760b78c75fad   5Gi        RWO            Delete           Bound      airflow-test-k8s/postgresql-airflow-test-k8s-1                         ceph-rbd-ssd                 15m
pvc-87c1589c-2b2a-4c36-a613-84fe293e856c   15Gi       RWO            Delete           Bound      airflow-test-k8s/postgresql-airflow-test-k8s-1-wal                     ceph-rbd-ssd                 15m
pvc-8bd7da9f-71d9-4f30-988e-fcdc561147bb   5Gi        RWO            Delete           Bound      airflow-test-k8s/postgresql-airflow-test-k8s-2                         ceph-rbd-ssd                 14m
pvc-9a67e309-026d-4757-a7ff-81c9342e6f34   15Gi       RWO            Delete           Bound      airflow-test-k8s/postgresql-airflow-test-k8s-2-wal                     ceph-rbd-ssd                 14m

That's not great.
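
For the record, a Released volume can in principle be made Available for rebinding by clearing its stale claim reference, although we did not take that route here. A hedged sketch:

# Remove the stale claimRef so the PV returns to Available; the data on the
# volume is kept, and a new PVC with a matching size and storageclass can bind it.
kubectl patch pv pvc-4a444ea8-aa42-422c-98d1-3df4acc7a5d7 \
  --type merge -p '{"spec":{"claimRef":null}}'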

I did a point-in-time recovery from the object store and it worked.

  • I went to a deployment server
  • Checked out a copy of deployment-charts into my home directory: git clone "https://gerrit.wikimedia.org/r/operations/deployment-charts"
  • Went to the deployment directory: cd deployment-charts/helmfile.d/dse-k8s-services/airflow-test-k8s/
  • Deleted the deployments with: 'helmfile -e dse-k8s-eqiad delete'
  • Edited the empty file: values-postgresql-airflow-test-k8s.yaml and added the following content.
mode: recovery
recovery:
  method: object_store
  clusterName: postgresql-airflow-test-k8s
  source: clusterBackup

externalClusters:
  - name: clusterBackup
    barmanObjectStore:
      wal:
        compression: gzip
        encryption: ''
        maxParallel: 1
      data:
        compression: gzip
        encryption: ''
        jobs: 2
    
      endpointURL: "https://rgw.eqiad.dpe.anycast.wmnet"
      endpointCA:
        name: postgresql-airflow-test-k8s-ca-bundle
        key: ca-bundle.crt 
      destinationPath: "s3://postgresql-airflow-test-k8s.dse-k8s-eqiad/"
      s3Credentials:
        accessKeyId:
          name: postgresql-airflow-test-k8s-backup-s3-creds
          key: "{{ $.Values.backups.s3.access_key }}"
        secretAccessKey:
          name: postgresql-airflow-test-k8s-backup-s3-creds
          key: "{{ $.Values.backups.s3.secret_key }}"

backups:
  enabled: false
  • Did a helmfile -e dse-k8s-eqiad --selector 'name=postgresql-airflow-test-k8s' diff followed by a helmfile -e dse-k8s-eqiad --selector 'name=postgresql-airflow-test-k8s' apply
  • I watched the full-recovery pods stream back the base backup and WALs that had been taken at midnight.
  • All looked good, so I emptied values-postgresql-airflow-test-k8s.yaml and redeployed.
  • I then redeployed Airflow and it was back to how it was before the incident.
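
After a recovery like this, the cluster state can be double-checked with the CloudNativePG kubectl plugin, assuming it is available on the deployment host. A sketch:

# Show instance, replication, and backup status for the recovered cluster.
kubectl cnpg status postgresql-airflow-test-k8s -n airflow-test-k8s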

It happened again. The job that was dumping the 200 wikis was restored to a state where it was still running, so after 36 minutes the Airflow instance seems to have been borked again.

btullis@deploy1003:~/deployment-charts/helmfile.d/dse-k8s-services/airflow-test-k8s$ k get pods
NAME                                                              READY   STATUS    RESTARTS   AGE
airflow-gitsync-6f6d7cc97c-2stwf                                  0/1     Pending   0          115s
airflow-kerberos-79b6b964dc-69pll                                 0/1     Pending   0          115s
airflow-scheduler-6db68bb655-j4zp9                                0/2     Pending   0          115s
airflow-webserver-86985dfc6b-7ws9z                                0/2     Pending   0          115s
envoy-6c5f78d89b-6bzjc                                            1/1     Running   0          115s
hadoop-shell-74dd465b5-s25k7                                      0/1     Pending   0          114s
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-2t7jqjaf   0/1     Error     0          12m
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-2wnly56r   0/1     Error     0          14m
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-3kwohs4e   0/1     Error     0          8m30s
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-40dn8ex0   0/1     Error     0          5m3s
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-6uoxnxe3   0/1     Error     0          13m
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-77r9ub6x   0/1     Error     0          7m23s
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-9nhm7ixg   0/1     Error     0          14m
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-ansf2h5i   0/1     Error     0          9m4s
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-ek3uym5j   0/1     Error     0          11m
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-frnqbqwa   0/1     Error     0          6m48s
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-hosx8ez4   0/1     Error     0          10m
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-i3n8x893   0/1     Error     0          4m27s
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-j348jvqa   0/1     Error     0          11m
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-jbws1uv5   0/1     Error     0          7m55s
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-lf9lq36p   0/1     Error     0          6m13s
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-n4kf84gl   0/1     Error     0          13m
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-oaskc3wx   1/1     Running   0          3m53s
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-os8qj9a9   0/1     Error     0          10m
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-qxx3v18r   0/1     Error     0          9m38s
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-uaimrkwd   0/1     Error     0          13m
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-up5ajviy   0/1     Error     0          15m
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-vl1jd44h   0/1     Error     0          5m39s
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-xjqfm9jx   0/1     Error     0          11m
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-xkovdzaz   0/1     Error     0          12m
postgresql-airflow-test-k8s-1                                     1/1     Running   0          85s
postgresql-airflow-test-k8s-2                                     1/1     Running   0          40s
postgresql-airflow-test-k8s-pooler-rw-6dbb4bbb49-2n88j            1/1     Running   0          2m
postgresql-airflow-test-k8s-pooler-rw-6dbb4bbb49-4xj9x            1/1     Running   0          2m
postgresql-airflow-test-k8s-pooler-rw-6dbb4bbb49-wvwq9            1/1     Running   0          2m

Note that the two database servers are only 85s and 40s old, and the main Airflow components are all in a Pending state.
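
The usual way to see why a pod is stuck in Pending is to read its scheduling events. A hedged example, using the scheduler pod from the listing above:

# The Events section at the end of the output shows why scheduling failed,
# e.g. unbound PVCs or insufficient resources.
kubectl describe pod airflow-scheduler-6db68bb655-j4zp9 -n airflow-test-k8s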

We have some PostgreSQL errors from the running job.

btullis@deploy1003:~/deployment-charts/helmfile.d/dse-k8s-services/airflow-test-k8s$ k logs mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-oaskc3wx
/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/metrics/statsd_logger.py:184 RemovedInAirflow3Warning: The basic metric validator will be deprecated in the future in favor of pattern-matching.  You can try this now by setting config option metrics_use_pattern_match to True.
[2025-04-04T16:36:17.706+0000] {cli_action_loggers.py:177} WARNING - Failed to log action (psycopg2.OperationalError) connection to server at "10.67.46.9", port 5432 failed: Connection timed out
	Is the server running on that host and accepting TCP/IP connections?

(Background on this error at: https://sqlalche.me/e/14/e3q8)
[2025-04-04T16:36:17.732+0000] {__init__.py:24} WARNING - 
        OpenLineage support for Airflow version 2.10.5 is REMOVED.
        For Airflow 2.7 and later, use the native Airflow Openlineage provider package.
        Documentation can be found at https://airflow.apache.org/docs/apache-airflow-providers-openlineage
        
[2025-04-04T16:36:18.582+0000] {utils.py:434} WARNING - No module named 'paramiko'
[2025-04-04T16:36:18.591+0000] {utils.py:434} WARNING - No module named 'airflow.providers.dbt'
[2025-04-04T16:36:18.687+0000] {base.py:84} INFO - Retrieving connection 'datahub_gms'
[2025-04-04T16:36:18.694+0000] {base.py:84} INFO - Retrieving connection 'datahub_gms'
[2025-04-04T16:36:18.694+0000] {datahub_listener.py:143} INFO - DataHub plugin v2 using DataHubRestEmitter: configured to talk to http://datahub-gms-staging.datahub-next.svc:8080
[2025-04-04T16:36:18.700+0000] {dagbag.py:588} INFO - Filling up the DagBag from /opt/airflow/dags/airflow_dags/test_k8s/dags/mediawiki_sql_xml_dumps.py
[2025-04-04T16:36:18.702+0000] {cli.py:251} WARNING - Dag 'mediawiki_sql_xml_regular_dumps' not found in path /opt/airflow/dags/airflow_dags/test_k8s/dags/mediawiki_sql_xml_dumps.py; trying path /opt/airflow/dags/airflow_dags/main
[2025-04-04T16:36:18.702+0000] {dagbag.py:588} INFO - Filling up the DagBag from /opt/airflow/dags/airflow_dags/main
[2025-04-04T16:36:19.249+0000] {dag_default_args.py:264} INFO - Dag-default-args set is `wmf`.
[2025-04-04T16:36:19.313+0000] {dag_default_args.py:264} INFO - Dag-default-args set is `wmf`.
[2025-04-04T16:38:28.778+0000] {timeout.py:68} ERROR - Process timed out, PID: 1
[2025-04-04T16:38:28.778+0000] {dagbag.py:387} ERROR - Failed to import: /opt/airflow/dags/airflow_dags/main/dags/webrequest/refine_webrequest_hourly_dag.py
Traceback (most recent call last):
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 686, in __connect
    self.dbapi_connection = connection = pool._invoke_creator(self)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/sqlalchemy/engine/create.py", line 574, in connect
    return dialect.connect(*cargs, **cparams)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 598, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: connection to server at "10.67.46.9", port 5432 failed: Connection timed out
	Is the server running on that host and accepting TCP/IP connections?
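
The 10.67.46.9 address in these errors should correspond to the pooler service, and since the PostgreSQL deployment had just been recreated, that address may no longer have had live endpoints behind it. A hedged check:

# Compare the pooler service's ClusterIP with the address the job is trying to reach.
kubectl get svc -n airflow-test-k8s | grep pooler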

Our latest thinking is that this incident was caused by human error rather than a systems failure. In a way, this is very reassuring, but it also raises challenging questions.

The core issue seems to stem from a slight misunderstanding of how the Airflow and PostgreSQL deployments relate to each other, and of what the expected outcome would be from destroying and re-creating both deployments.
In addition, we have been encouraging users to make ad-hoc deployments to either the analytics-test or test-k8s Airflow instances, as part of the migration of DAGs from the analytics instance to the main instance.

Part of the workflow for this migration seems to have been for users to do the following:

  • take a copy of the deployment-charts repository into their home directory on a deployment server
  • make a local change to update the source branch of the airflow-dags repository that the airflow-gitsync pod pulls from
  • deploy this to the dse-k8s-eqiad cluster in order to test that the DAG functions in the Kubernetes environment
  • address any problems in the DAG, using a fast feedback loop and pushing changes to their feature branch, which was synced frequently to the instance on which they were testing

Users are also able to deploy Airflow themselves, and to use kubectl as well as helmfile to troubleshoot any issues with their DAG runs.

However, a problem arose because multiple teams each believed that they could have exclusive use of the test-k8s instance.
The original purpose of this instance was to be a facility for the Data-Platform-SRE team to test our Airflow-on-Kubernetes infrastructure, including features such as the CloudNativePG database, the KubernetesPodOperator, and the S3 logging.
We also deploy a number of integration-testing DAGs, which exercise functionality such as sending email, writing logs to S3, launching a SparkApplication on Kubernetes, and launching a Spark application on Skein/YARN.

In recent weeks, @brouberol and I have been using it extensively to carry out work on T388378: Orchestrate dumps v1 from an airflow instance in support of its associated epic: T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes
The most recent task that we undertook was to launch a dump of 200 wikis to a CephFS volume. This was a heavy job that we knew might run for days, and it was exposing certain performance problems in the Airflow scheduler setup.
We knew that the test-k8s instance was operating somewhere between unusable and unavailable, but it was still a useful test case for us.

What seems to have happened in this case is that @amastilovic (and I do not wish to cast blame here) wanted to do some work on T390249: Migrate Gobblin to Airflow and decided to use the test-k8s instance for it.
He switched the branch from main to a feature branch, but may have become aware of instability in the scheduler, since it was heavily loaded by the 200-wiki dump.

Anyway, at some point Aleksander executed the following on deploy2002.

git clone https://gerrit.wikimedia.org/r/operations/deployment-charts
cd deployment-charts/
cd helmfile.d/dse-k8s-services/airflow-test-k8s/
helmfile -e dse-k8s-eqiad -i destroy

The effect of this would have been to delete not only the airflow instance, which is stateless, but also the postgresql cluster supporting it, which is stateful.

This explains the log messages from the cloudnativepg operator, which stated that the database cluster had been deleted.

So I think that we have several things to think about as to how to prevent this kind of incident from happening on a more important Airflow instance.

I'll follow up with some suggestions for discussion.
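
One candidate guard-rail, sketched here as an untested assumption rather than a change we have made, is Helm's resource-policy annotation. If the chart marked the Cluster object with it, a helm delete (and hence a helmfile destroy) would orphan the database rather than deleting it:

# Hypothetical excerpt from the postgresql chart's Cluster template.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgresql-airflow-test-k8s
  annotations:
    "helm.sh/resource-policy": keep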

Just wanted to confirm that the assumptions @BTullis made here, about what actions I performed and why, are indeed correct.

There's an obvious need for multiple different testing modalities for our Airflow Kubernetes deployments, and I think we should address it as soon as possible. I'm looking forward to seeing Ben's suggestions.

Change #1134200 abandoned by Btullis:

[operations/deployment-charts@master] Configure the ceph-csi-rbd storageclass to retain PVs

Reason:

We don't need to do this at the storageclass level, as we use streaming WAL backup.

https://gerrit.wikimedia.org/r/1134200

Closing this ticket, since service has been fully restored and the root cause identified.
There might still be some more follow-up work, so I'll link any follow-up tasks to this ticket.