Update Dec 2024
As per the APP FY24-25: OKRs and Hypotheses spreadsheet, this work is now associated with a hypothesis under the WE 5.4 KR. The hypothesis has the ID 5.4.4 and its wording is as follows:
> If we decouple the legacy dumps processes from their current bare-metal hosts and instead run them as workloads on the DSE Kubernetes cluster, this will bring about demonstrable benefit to the maintainability of these data pipelines and facilitate the upgrade of PHP to version 8.1 by using shared mediawiki containers.
The ideal solution is still to schedule the dumps with Airflow, but the hypothesis deliberately makes no mention of it, since we do not wish to put Airflow integration on the critical path to achieving the stated objective by April 1st, 2025. Using Kubernetes CronJob objects (or similar) would meet the objective.
Original ticket description below
While there's ongoing work to create a new architecture for dumps, we still need to support the current version. In concise terms, what happens now is that a set of Python scripts periodically launches some MediaWiki maintenance scripts, and the (very large) output of those is written to an NFS volume mounted from the dumpsdata host.
After some consideration of other plans that were originally outlined in this task, we've decided to port the dumps to run on k8s, as outlined below.
In short (for the longer version, see Ben's document), we want to do the following:
- Create a new MediaWiki image that includes a Python interpreter and the dumps code
- Allow the MediaWiki chart to include a PersistentVolume mounted under /mnt/dumpsdata and the dumps configuration (currently managed in Puppet) under /etc/dumps/ (see the volume claim sketch after this list)
- Run a MediaWiki deployment on the DSE k8s cluster, using a specific Ceph volume (in place of NFS) as the destination for the dumps, running the Python scripts as CronJobs in a first implementation and moving to Airflow later (see the CronJob sketch after this list)
- Run a dependent job to copy the dump files over to the server that will serve them to the public (see the copy-job sketch after this list)
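To make the storage step concrete, here is a minimal sketch of what a Ceph-backed claim for the dumps output could look like on the DSE cluster. The namespace, storage class name, access mode, and size are illustrative assumptions, not final values:

```yaml
# Sketch only: a Ceph-backed volume claim to replace the NFS mount.
# storageClassName, namespace and size are assumptions for illustration.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dumpsdata
  namespace: mediawiki-dumps-legacy   # hypothetical namespace
spec:
  accessModes:
    - ReadWriteMany                   # assumes several dump jobs may write concurrently
  storageClassName: cephfs            # assumed name of a Ceph-backed class on DSE
  resources:
    requests:
      storage: 10Ti                   # placeholder size
```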
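For the first implementation, a dump run could be scheduled as a Kubernetes CronJob along the lines below. The image name, entry point, schedule, and ConfigMap name are illustrative assumptions; the actual values would come from the MediaWiki chart:

```yaml
# Sketch only: one dump run scheduled as a CronJob on the DSE cluster.
# Image, command, schedule and ConfigMap name are hypothetical.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: xmldumps                       # hypothetical name
  namespace: mediawiki-dumps-legacy
spec:
  schedule: "0 2 1,20 * *"             # e.g. 02:00 on the 1st and 20th; real cadence TBD
  concurrencyPolicy: Forbid            # do not start a run while the previous one is still writing
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: dumps
              # hypothetical MediaWiki image that also ships a Python interpreter and the dumps code
              image: docker-registry.example.org/mediawiki-dumps:latest
              command: ["python3", "/srv/dumps/worker.py"]   # hypothetical entry point
              volumeMounts:
                - name: dumpsdata
                  mountPath: /mnt/dumpsdata
                - name: dumps-config
                  mountPath: /etc/dumps
                  readOnly: true
          volumes:
            - name: dumpsdata
              persistentVolumeClaim:
                claimName: dumpsdata           # the claim sketched above
            - name: dumps-config
              configMap:
                name: dumps-config             # the configuration currently generated by Puppet
```

Setting concurrencyPolicy: Forbid simply tells Kubernetes not to start a new run while a previous one is still in progress.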
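The final copy step could likewise be a plain Kubernetes Job that pushes the finished files to the public-facing server, for example with rsync. The image, destination host, and paths below are placeholders:

```yaml
# Sketch only: dependent job that copies finished dumps to the public server.
# Image, destination host and paths are hypothetical.
apiVersion: batch/v1
kind: Job
metadata:
  name: xmldumps-publish
  namespace: mediawiki-dumps-legacy
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: publish
          image: docker-registry.example.org/dumps-rsync:latest   # hypothetical image with rsync
          command:
            - rsync
            - "-a"
            - /mnt/dumpsdata/public/                               # placeholder source path on the Ceph volume
            - rsync://dumps-web.example.org/dumps/                 # placeholder public host
          volumeMounts:
            - name: dumpsdata
              mountPath: /mnt/dumpsdata
              readOnly: true
      volumes:
        - name: dumpsdata
          persistentVolumeClaim:
            claimName: dumpsdata
```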