Whilst working on T352650 (WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes), we deployed a workload to the https://airflow-test-k8s.wikimedia.org/ instance. It was intended to dump the database contents of 200 regular wikis to a CephFS volume.
This triggered an incident of some kind, which resulted in the PostgreSQL database serving this instance being deleted.
The webserver and scheduler pods were not available this morning. The pods listed below were the only ones present:
btullis@deploy1003:/srv/deployment-charts/helmfile.d/dse-k8s-services/airflow-test-k8s$ k get pods
NAME                                                              READY   STATUS      RESTARTS   AGE
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-05dk1kwr   0/1     Error       0          8h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-173tysdj   0/1     Error       0          8h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-6po93igi   0/1     Error       0          8h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-adcuijwi   0/1     Error       0          8h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-c13i7m3v   0/1     Error       0          8h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-cgw9cfxz   0/1     Error       0          8h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-fcox70tc   0/1     Error       0          8h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-igjmlqml   0/1     Completed   0          19h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-ir7f1qkx   0/1     Error       0          8h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-iszs46cc   0/1     Error       0          10h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-rahe5r6j   0/1     Error       0          8h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-sp6jk395   0/1     Error       0          8h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-t1wwhg0c   0/1     Error       0          8h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-u5256sm6   0/1     Error       0          8h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-v3t8oq39   0/1     Error       0          8h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-vix6js7w   0/1     Error       0          8h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-whe4qygg   0/1     Error       0          8h
mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-z0hq4drh   0/1     Error       0          8h
postgresql-airflow-test-k8s-1                                     1/1     Running     0          7h45m
postgresql-airflow-test-k8s-2                                     1/1     Running     0          7h44m
postgresql-airflow-test-k8s-pooler-rw-6dbb4bbb49-mpwgj            1/1     Running     0          7h45m
postgresql-airflow-test-k8s-pooler-rw-6dbb4bbb49-qrnq8            1/1     Running     0          7h45m
postgresql-airflow-test-k8s-pooler-rw-6dbb4bbb49-zdmkf            1/1     Running     0          7h45m
In order to redeploy the webserver and scheduler etc. I had to remove these mediawiki-sql-xml-regular-dumps-run-wiki-dump-jobs-dum-* pods, since they were hanging on to a persistent volume claim.
When the webserver came back up, all DAGs were paused and the custom pools that we had made from the Airflow UI were no longer present. All historical run data had also been removed, which indicates that the database had been wiped.
Note also that the database pods were only 7h45m old. If these pods were removed, perhaps because of resource contention, their removal may also have removed the PVs that contained the database.
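To make the timing in the pod listing easier to compare, here is a minimal Python sketch that converts the AGE values to a common unit and reports how long before the database pods each group of dump pods was created. The helper name `age_to_seconds` and the labels are illustrative only. kubectl prints older ages in coarser units, so the differences are approximate.

```python
import re

# Seconds per unit for the suffixes kubectl uses in the AGE column.
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def age_to_seconds(age: str) -> int:
    """Convert an AGE string such as '7h45m', '8h' or '2d3h' to seconds."""
    if not re.fullmatch(r"(?:\d+[dhms])+", age):
        raise ValueError(f"unrecognised age: {age!r}")
    parts = re.findall(r"(\d+)([dhms])", age)
    return sum(int(number) * UNIT_SECONDS[unit] for number, unit in parts)


# Ages copied from the pod listing; the labels are hypothetical shorthand.
db_pod_age = age_to_seconds("7h45m")
dump_pod_ages = {
    "most Error pods": "8h",
    "Error pod iszs46cc": "10h",
    "Completed pod igjmlqml": "19h",
}

for label, age in dump_pod_ages.items():
    delta_minutes = (age_to_seconds(age) - db_pod_age) // 60
    print(f"{label}: created roughly {delta_minutes} minutes before the database pods")
```

On these figures, every dump pod predates the current database pods, which fits the theory that the database pods were recreated after the dump jobs had already started.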
We can restore from a backup, but it is even more important to be able to prevent this from happening again.
