== Update Dec 2024
As per the [[https://docs.google.com/spreadsheets/d/18Wh1bBR3Eq7NX_uM9QVKUGJftjzly2j7HdbKCuyG_sw/edit?gid=969299336#gid=969299336|APP FY24-25: OKRs and Hypotheses]] spreadsheet, this work has now been associated with a hypothesis in the **WE 5.4** KR. The hypothesis has the ID of **5.4.4** and the wording is as follows:
> If we decouple the legacy dumps processes from their current bare-metal hosts and instead run them as workloads on the DSE Kubernetes cluster, this will bring about demonstrable benefit to the maintainability of these data pipelines and facilitate the upgrade of PHP to version 8.1 by using shared mediawiki containers.
The //ideal// solution is still to schedule the dumps with Airflow, but this wording of the hypothesis deliberately makes no mention of this, since we do not wish to place Airflow integration on the critical path to achieving the stated objective by April 1st 2025. The option of using Kubernetes //CronJob// objects (or similar) would meet the objective.
== Original ticket description below
While there's ongoing work to create a new architecture for dumps, we still need to support the current version. In concise terms, what happens now is that a set of Python scripts periodically launch MediaWiki maintenance scripts, and the (very large) output of those scripts is written to an NFS volume mounted from the `dumpsdata` host.
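As an entirely hypothetical sketch of that pattern (the real wrapper scripts are considerably more involved, with chunking, retries and status files), the wrappers essentially build and periodically run maintenance-script command lines whose output lands on the NFS mount:

```python
import subprocess
from pathlib import Path

# Hypothetical sketch of what the dumps wrappers do: build a MediaWiki
# maintenance-script command line whose (large) output is written under
# the NFS volume mounted from the dumpsdata host. All paths and flags
# here are illustrative, not the actual production configuration.
def build_dump_command(wiki: str, output_root: str = "/mnt/dumpsdata") -> list[str]:
    out_dir = Path(output_root) / "xmldatadumps" / wiki
    return [
        "php",
        "/srv/mediawiki/maintenance/dumpBackup.php",
        f"--wiki={wiki}",
        "--full",
        f"--output=gzip:{out_dir}/pages-full.xml.gz",
    ]

def run_dump(wiki: str) -> None:
    # The real setup schedules many such invocations; this only shows the
    # launch-a-maintenance-script shape of the current architecture.
    subprocess.run(build_dump_command(wiki), check=True)
```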
After some consideration of other plans that were originally outlined in this task, we've decided to port the dumps to run on k8s, as outlined below.
In short (for the longer version see [[ https://docs.google.com/document/d/1jG89YmTC4RyVztgvPF_sXtmcHichB9u9A7KYyew5W3c/edit?tab=t.0#heading=h.vyt1m0p2t0j6 | Ben's document ]]) we want to do the following:
* Create a new MediaWiki image that includes a python interpreter and the dumps code
* Allow the MediaWiki chart to include a PersistentVolume mounted under `/mnt/dumpsdata` and the dumps configuration (currently in Puppet) under `/etc/dumps/`
* Run a MediaWiki deployment on the DSE k8s cluster that writes the dumps to a dedicated Ceph volume (replacing the NFS mount), running the Python scripts as `CronJob` objects in a first implementation and moving to Airflow later
* Run a dependent job to copy the dump files over to the server that will serve them to the public
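Put together, a first `CronJob`-based implementation might look roughly like the manifest below. This is a sketch only: the names, schedule, image and volume claim are assumptions, and in practice this would be templated through the MediaWiki chart rather than written by hand:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: xmldumps-enwiki            # hypothetical name
spec:
  schedule: "0 2 1,20 * *"         # illustrative: twice-monthly runs
  concurrencyPolicy: Forbid        # never overlap two runs of the same dump
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: dumps
              image: mediawiki-dumps:example   # the new MW image with a Python interpreter and the dumps code
              command: ["python3", "/srv/dumps/worker.py", "--wiki=enwiki"]  # hypothetical entry point
              volumeMounts:
                - name: dumpsdata
                  mountPath: /mnt/dumpsdata    # Ceph-backed volume replacing NFS
                - name: dumps-config
                  mountPath: /etc/dumps        # configuration previously managed by Puppet
          volumes:
            - name: dumpsdata
              persistentVolumeClaim:
                claimName: dumpsdata           # hypothetical PVC name
            - name: dumps-config
              configMap:
                name: dumps-config             # hypothetical ConfigMap name
```

The dependent copy step could be expressed in the same way, either as a second container in the same Job or as a separate `CronJob` that copies the finished files to the public-facing server.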