
Deploy the dumps monitor script, which generates the html status files, to the dse-k8s cluster
Closed, ResolvedPublic

Description

On the bare metal snapshot hosts, we currently have a dumps monitor process that is configured to run on snapshot1010.

The function of that script is described here: https://wikitech.wikimedia.org/wiki/Dumps/Current_Architecture#Monitor_node

On one host, the monitor runs a Python script for each wiki that checks for and removes stale lock files left by dump processes that have died. It also updates the central index.html file (i.e. http://dumps.wikimedia.org/backup-index.html ), which shows the dumps in progress and the status of completed dumps.

We need to migrate this functionality to the dse-k8s cluster.

Event Timeline

BTullis triaged this task as High priority.

My first attempt to run the script in the toolbox was OK. No errors shown. I was also able to run this from the /var/www directory, rather than /srv/deployment/dumps/dumps/xmldumps-backup, which is a bonus.

www-data@mediawiki-dumps-legacy-toolbox-5b847f4557-mkxds:~$ python3 /srv/deployment/dumps/xmldumps-backup/monitor.py /etc/dumps/confs/wikidump.conf.dumps:monitor

www-data@mediawiki-dumps-legacy-toolbox-5b847f4557-mkxds:~$ find /mnt/dumpsdata/xmldatadumps/ \( -name backup-index-bydb.html -o -name backup-index.html -o -name index.json \) -exec ls -l {} \;
-rw-r--r-- 1 www-data www-data 5449 Mar 28 14:55 /mnt/dumpsdata/xmldatadumps/public/backup-index.html
-rw-r--r-- 1 www-data www-data 94351 Mar 28 14:55 /mnt/dumpsdata/xmldatadumps/public/index.json
-rw-r--r-- 1 www-data www-data 5444 Mar 28 14:55 /mnt/dumpsdata/xmldatadumps/public/backup-index-bydb.html

I bypassed the bash process that just sets up the while loop: /srv/deployment/dumps/dumps/xmldumps-backup/monitor

So I think that this is a good start. It looks like we can just use a CronJob object to run this every 5 minutes, or something similar.

We should be aware that this script does several other things, as well as writing the index files:

  • cleanup_stale_dumplocks
  • cleanup_stale_batch_jobfiles

I will check to see whether this is functionality that we want to keep when running the dumps under Airflow, or whether we can just choose to skip it and handle these requirements a different way.
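For context, the stale-lock cleanup is roughly of this shape. This is a hedged sketch only, assuming lock files are plain `*.lock` files whose age indicates staleness; the actual logic in monitor.py differs in detail (e.g. it knows about the per-wiki lock naming).

```python
import time
from pathlib import Path

def cleanup_stale_dumplocks(lockdir: str, max_age_secs: int = 3600) -> list:
    """Remove lock files older than max_age_secs; return the paths removed.

    Illustrative sketch, not the real monitor.py implementation: it treats
    any sufficiently old *.lock file as belonging to a dead dump process.
    """
    removed = []
    now = time.time()
    for lock in Path(lockdir).glob("*.lock"):
        try:
            age = now - lock.stat().st_mtime
        except FileNotFoundError:
            continue  # another run removed it between glob and stat
        if age > max_age_secs:
            lock.unlink(missing_ok=True)
            removed.append(str(lock))
    return removed
```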

Should we make it the main process of the toolbox?

> Should we make it the main process of the toolbox?

It's definitely an option, but I'm wondering if we really need it to be a daemon at all.
The while true loop here is just launching a script and then sleeping, so I was pondering whether we would be better off just running the script on a schedule.
We could set the concurrency policy to Forbid to prevent concurrent runs.
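If we went the CronJob route, the manifest would look something like this. This is a sketch with placeholder names, image, and schedule, not the real chart values; the one load-bearing detail is `concurrencyPolicy: Forbid`, which makes Kubernetes skip a scheduled run if the previous one is still going.

```yaml
# Sketch only: name, image, and schedule are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: dumps-monitor
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid   # skip a run if the previous one is still running
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: monitor
              image: dumps-toolbox:placeholder
              command:
                - python3
                - /srv/deployment/dumps/xmldumps-backup/monitor.py
                - /etc/dumps/confs/wikidump.conf.dumps:monitor
```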

But if you think that it would be good to run it as the main process of the toolbox, we could do that.

Random thought: if we're thinking about having a regular Job with a max concurrency level perform operations on the dumps "platform", shouldn't we run it from airflow itself? This would make the dumps v1 jobs observable in a "single pane of glass". WDYT?

> Random thought: if we're thinking about having a regular Job with a max concurrency level perform operations on the dumps "platform", shouldn't we run it from airflow itself? This would make the dumps v1 jobs observable in a "single pane of glass". WDYT?

I think you're probably right. Probably better than getting one or more CronJob objects involved.
We could have three tasks:

  • Reuse the fetch_job_pod_spec task from the main dumps DAG
  • Run the python3 /srv/deployment/dumps/xmldumps-backup/monitor.py /etc/dumps/confs/wikidump.conf.dumps:monitor command
  • Run parallel-rsync -f ~/rsync_targets -a -x '--include *.html --include *.json' /mnt/dumpsdata/xmldatadumps/public/ /srv/mediawiki-dumps-legacy/xmldatadumps/public (or something similar)

Then we could schedule this every few minutes or so.
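Whichever scheduler we pick, the per-run logic is just "run the monitor, then sync, and never run two at once". A minimal sketch of that guard in Python, with placeholder command lists and lock path (not the real DAG code, which uses the fetch_job_pod_spec task and pod operators):

```python
import fcntl
import subprocess

def run_monitor_and_sync(monitor_cmd, sync_cmd, lock_path="/tmp/dumps-monitor.lock"):
    """Run the monitor script, then the sync step, skipping entirely if a
    previous run still holds the lock -- the same effect as a CronJob's
    concurrencyPolicy: Forbid, or max_active_runs=1 on an Airflow DAG."""
    lock_file = open(lock_path, "w")
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        return "skipped"  # another run is in progress
    try:
        subprocess.run(monitor_cmd, check=True)
        subprocess.run(sync_cmd, check=True)
        return "ok"
    finally:
        fcntl.flock(lock_file, fcntl.LOCK_UN)
        lock_file.close()
```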

This is now starting to work properly.
My test run of the DAG has succeeded.

image.png (475×1 px, 79 KB)

The three files that are expected are present on the cephfs volume.

www-data@mediawiki-dumps-legacy-toolbox-c465d7598-96d97:/mnt/dumpsdata/xmldatadumps/public$ ls -lrt|tail -n 3
-rw-r--r--  1 www-data www-data   21113 May 12 09:13 backup-index.html
-rw-r--r--  1 www-data www-data   21108 May 12 09:13 backup-index-bydb.html
-rw-r--r--  1 www-data www-data 1719151 May 12 09:13 index.json
www-data@mediawiki-dumps-legacy-toolbox-c465d7598-96d97:/mnt/dumpsdata/xmldatadumps/public$

And those files have now been correctly synced to the clouddumps servers.

btullis@clouddumps1002:/srv/mediawiki-dumps-legacy/xmldatadumps/public$ ls -lrt|tail -n 3
-rw-r--r--  1 dumpsgen dumpsgen   21113 May 12 09:13 backup-index.html
-rw-r--r--  1 dumpsgen dumpsgen   21108 May 12 09:13 backup-index-bydb.html
-rw-r--r--  1 dumpsgen dumpsgen 1719151 May 12 09:13 index.json
btullis@clouddumps1002:/srv/mediawiki-dumps-legacy/xmldatadumps/public$

brouberol merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1301

test_k8s/dumps_monitor: ensure a single monitor DAG is running at any given time