
WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes
Closed, ResolvedPublic

Description

Update Dec 2024

As per the APP FY24-25: OKRs and Hypotheses spreadsheet, this work has now been associated with a hypothesis in the WE 5.4 KR. The hypothesis has the ID of 5.4.4 and the wording is as follows:

If we decouple the legacy dumps processes from their current bare-metal hosts and instead run them as workloads on the DSE Kubernetes cluster, this will bring about demonstrable benefit to the maintainability of these data pipelines and facilitate the upgrade of PHP to version 8.1 by using shared mediawiki containers.

The ideal solution is still to schedule the dumps with Airflow, but this wording of the hypothesis deliberately makes no mention of this, since we do not wish to place Airflow integration on the critical path to achieving the stated objective by April 1st 2025. The option of using Kubernetes CronJob objects (or similar) would meet the objective.

Original ticket description below

While there's ongoing work to create a new architecture for dumps, we still need to support the current version. In concise terms, what happens now is that a set of python scripts periodically launches some mediawiki maintenance scripts, and the (very large) output of those is written to an NFS volume that is mounted from the dumpsdata host.
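
At a high level, the current flow can be sketched like this. This is a minimal illustration only: the paths, script names, and arguments below are assumptions for the sake of the example, not the real puppet configuration.

```python
# Illustrative sketch of the current-generation dumps flow: a python
# wrapper periodically invokes MediaWiki maintenance scripts and writes
# their (very large) output to an NFS mount from the dumpsdata host.
# All paths and script names here are assumptions, for illustration only.

NFS_ROOT = "/mnt/dumpsdata"  # NFS volume mounted from the dumpsdata host

def build_dump_command(wiki: str, job: str) -> list[str]:
    """Build an (illustrative) maintenance-script invocation for one wiki/job."""
    return [
        "php",
        "/srv/mediawiki/multiversion/MWScript.php",  # assumed entry point
        "dumpBackup.php",
        f"--wiki={wiki}",
        f"--output=file:{NFS_ROOT}/{wiki}/{job}.xml.gz",
    ]

print(build_dump_command("enwiki", "articlesdump"))
```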

After some consideration of other plans that were originally outlined in this task, we've decided to port the dumps to run on k8s, as outlined below.

In short (for the longer version see Ben's document) we want to do the following:

  • Create a new MediaWiki image that includes a python interpreter and the dumps code
  • Allow the MediaWiki chart to include a PersistentVolume mounted under /mnt/dumpsdata and the dumps configuration (now in puppet) under /etc/dumps/
  • Run a MediaWiki deployment on the DSE k8s cluster, using a specific ceph volume (substituting for NFS) to write the dumps to; run the python scripts as a CronJob in a first implementation, moving to Airflow later
  • Run a dependent job to copy the dumps files over to the server that will serve them to the public

Related Objects

Status    Subtype   Assigned
Open      -         None
Resolved  -         BTullis
Resolved  -         Joe
Resolved  -         Clement_Goubert
Resolved  -         Scott_French
Resolved  -         BTullis
Resolved  -         BTullis
Resolved  -         brouberol
Resolved  -         brouberol
Resolved  -         brouberol
Resolved  -         brouberol
Resolved  -         BTullis
Resolved  -         brouberol
Resolved  -         BTullis
Resolved  -         brouberol
Resolved  -         BTullis
Resolved  -         BTullis
Resolved  -         BTullis
Resolved  -         brouberol
Resolved  -         brouberol
Resolved  -         BTullis
Resolved  -         brouberol
Resolved  -         brouberol
Resolved  -         brouberol
Resolved  -         brouberol
Resolved  -         brouberol
Resolved  -         brouberol
Resolved  -         brouberol
Resolved  -         brouberol
Resolved  -         brouberol
Resolved  -         brouberol
Resolved  -         brouberol
Resolved  -         brouberol
Resolved  -         brouberol
Resolved  -         brouberol
Resolved  -         brouberol
Resolved  -         BTullis
Resolved  -         brouberol
Resolved  -         brouberol
Resolved  -         brouberol
Resolved  -         brouberol
Resolved  -         BTullis
Resolved  -         BTullis
Resolved  -         brouberol
Resolved  -         brouberol
Resolved  -         brouberol
Resolved  -         brouberol
Resolved  -         brouberol
Resolved  -         BTullis
Resolved  -         brouberol
Resolved  -         BTullis
Resolved  -         brouberol
Resolved  -         BTullis
Resolved  -         brouberol
Resolved  -         brouberol
Resolved  -         BTullis
Resolved  -         BTullis
Resolved  -         BTullis
Resolved  -         BTullis
Resolved  -         brouberol
Resolved  -         BTullis
Resolved  -         BTullis
Resolved  -         BTullis
Resolved  Request   VRiley-WMF
Resolved  Request   None
Resolved  -         bd808
Resolved  -         VRiley-WMF
Resolved  -         BTullis

Event Timeline


I have added a patch to the mediawiki chart to allow claiming persistent volumes specifically for dumps. @BTullis, once both this patch and the one adding the needed pieces to the debug image are ready, you should be able to create a deployment on the DSE cluster.

I think it might make sense to also link the files under helmfile.d/services/_mediawiki-common_ in your space, as we definitely want to keep the settings in sync.

Let me know if you need further assistance in setting that up.

Thanks @Joe - I think that looks useful.

I'd also like to catch up on the current work with regard to how this chart is being used to deploy Job objects, if possible.
I see various references to mwscript and mercurius here, so if you (or someone) could let me know the current direction of travel regarding these two things, I'd be grateful.

[...]

I see various references to mwscript and mercurius here, so if you (or someone) could let me know the current direction of travel regarding these two things, I'd be grateful.

Not sure that's the right file, but Mercurius and mwscript are two Job-type deployments of mediawiki that don't include httpd. Yours would be a third one. You'd also have to add the command that's going to be run to the MediaWiki container part of the templates/lamp/deployment.yaml.tpl file, and exclude dumps from the liveness probe in that same file.

The rest of the sidecars etc. can be configured from the values.yaml files.

I'm in the process of refactoring the mediawiki chart, but I don't want to block you moving forwards, so I'll port any changes you make when I have something that works.

You probably also want to if-guard the service definition in templates/service.yaml.tpl

BTullis renamed this task from Migrate current-generation dumps to run on kubernetes to WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/55 - Migrate current-generation dumps to run on kubernetes.Dec 19 2024, 12:30 PM
BTullis updated the task description. (Show Details)
BTullis set Due Date to Apr 1 2025, 8:00 AM.

For reference, here is the full list of jobs that are understood by the dumps worker process.

  • noop (runs no job but rewrites checksum files and resets latest links)
  • latestlinks (runs no job but resets latest links)
  • createdirs (runs no job but creates dump dirs for the given date)
  • tables (includes all items below that end in 'table')
    • sitestatstable
    • imagetable
    • pagelinkstable
    • categorylinkstable
    • imagelinkstable
    • templatelinkstable
    • linktargettable
    • externallinkstable
    • langlinkstable
    • usergroupstable
    • userformergroupstable
    • categorytable
    • pagetable
    • pagerestrictionstable
    • pagepropstable
    • protectedtitlestable
    • redirecttable
    • iwlinkstable
    • geotagstable
    • changetagstable
    • changetagdeftable
    • flaggedpagestable
    • flaggedrevstable
    • flaggedpageconfigtable
    • flaggedrevspromotetable
    • flaggedrevsstatisticstable
    • flaggedrevstrackingtable
    • sitestable
    • wbcentityusagetable
    • babeltable
  • namespaces
  • namespacesv2
  • pagetitlesdump
  • allpagetitlesdump
  • abstractsdump
  • abstractsdumprecombine
  • xmlstubsdump
  • xmlstubsdumprecombine
  • articlesdump
  • articlesdumprecombine
  • metacurrentdump
  • metacurrentdumprecombine
  • xmlpagelogsdump
  • xmlpagelogsdumprecombine
  • metahistorybz2dump
  • metahistory7zdump
  • articlesmultistreamdump
  • articlesmultistreamdumprecombine

I want to make sure that I understand the logic of when these get called by the worker and/or scheduler process, so that we can work out an efficient command to call for each of the deployments or jobs that we create.
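
As a quick illustration of how the 'tables' grouping above works (the job names come from the list; the expansion logic itself is my assumption about how the worker resolves it):

```python
# Sketch: the 'tables' meta-job expands to every job name ending in
# 'table', per the list above. ALL_JOBS is abridged here for brevity.
ALL_JOBS = [
    "noop", "latestlinks", "createdirs",
    "sitestatstable", "imagetable", "pagelinkstable",  # ...and the rest
    "namespaces", "xmlstubsdump", "articlesdump",
]

def expand_job(job: str) -> list[str]:
    """Expand a meta-job like 'tables' into its concrete job names."""
    if job == "tables":
        return [j for j in ALL_JOBS if j.endswith("table")]
    return [job]

print(expand_job("tables"))
# -> ['sitestatstable', 'imagetable', 'pagelinkstable']
```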

BTullis renamed this task from WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/55 - Migrate current-generation dumps to run on kubernetes to WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes.Jan 7 2025, 4:11 PM

Change #1114001 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] mediwiki-dumps-legacy: Create helmfile deployment of a suspended job

https://gerrit.wikimedia.org/r/1114001

I have had some more thoughts about how to get this to work and I have made a few changes to this mediawiki chart patch to support it.

Also, I started work on a helmfile.d/dse-k8s-services/mediawiki-dumps-legacy release, to accompany it.

The overall idea is this:

  • We do a standard helmfile -e dse-k8s-eqiad apply in the dse-k8s-services/mediawiki-dumps-legacy directory.
  • This creates the following objects from the mediawiki chart in the mediawiki-dumps-legacy namespace:
    • networkpolicy: mediawiki-production
    • configmaps:
      • mediawiki-production-envoy-config-volume
      • mediawiki-production-wikimedia-cluster-config
      • mediawiki-production-wmerrors
      • mediawiki-production-rsyslog-config
      • (Probably the mcrouter config as well)
    • persistentvolumeclaim: mediawiki-dumps-legacy-fs
    • job: mediawiki-production-dumps-job-template
  • We also need to get the files under /etc/dumps/confs and /etc/dumps/dblists into the container, so maybe we need a dumps chart for this.
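One option for that last point is to render those files into a ConfigMap. A rough sketch of what such a manifest could contain, assuming the files are available at chart-build time (the ConfigMap name below is hypothetical):

```python
from pathlib import Path

def dumps_config_map(name: str, conf_dir: str) -> dict:
    """Build a ConfigMap manifest carrying the files from a dumps config
    directory, e.g. the ones currently shipped by puppet under
    /etc/dumps/confs. Name and namespace are illustrative assumptions."""
    data = {
        p.name: p.read_text()
        for p in sorted(Path(conf_dir).iterdir())
        if p.is_file()
    }
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": "mediawiki-dumps-legacy"},
        "data": data,
    }
```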

The mediawiki-production-dumps-job-template job will be a suspended job, so it will not actually run anything, but it will contain a pod spec for running jobs.

Then, when we start the airflow DAG to carry out the dumps, the first task will be to use kubectl or curl to describe the mediawiki-production-dumps-job-template and extract the spec.template from it.
This pod template will then be saved as an XCom, which will be used by subsequent tasks using the KubernetesPodOperator.
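
That first task could look something like this. This is a sketch only: the namespace and job name are taken from the list above, and error handling and the Airflow wiring are omitted.

```python
import json
import subprocess

def extract_pod_template(job_json: str) -> dict:
    """Pull spec.template out of a Job object serialized as JSON."""
    return json.loads(job_json)["spec"]["template"]

def get_dumps_pod_template(
    namespace: str = "mediawiki-dumps-legacy",
    job: str = "mediawiki-production-dumps-job-template",
) -> dict:
    """Fetch the suspended Job with kubectl and return its pod template,
    ready to be pushed as an XCom for later KubernetesPodOperator tasks."""
    out = subprocess.run(
        ["kubectl", "get", "job", job, "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return extract_pod_template(out)
```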

We can also generate an XCom containing the list of wikis, then create a task group with a worker task for each wiki.

If we do this, then I think that we can avoid having to use the custom loop over all wikis in worker.py.
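
To make the fan-out concrete, here is a minimal, Airflow-free sketch of turning a dblist into one worker invocation per wiki (the worker.py arguments shown are illustrative assumptions, not the real CLI):

```python
# Sketch: derive the per-wiki fan-out from a dblist file. A dblist has
# one wiki db name per line; blanks and comments are ignored.

def load_wiki_list(dblist_text: str) -> list[str]:
    """Parse dblist content into a list of wiki db names."""
    return [
        line.strip()
        for line in dblist_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]

def worker_commands(wikis: list[str], date: str) -> list[list[str]]:
    """One dump-worker invocation per wiki, one task each (args assumed)."""
    return [["python3", "worker.py", "--date", date, wiki] for wiki in wikis]

wikis = load_wiki_list("enwiki\nfrwiki\n# closed wiki\n\ndewiki\n")
print(worker_commands(wikis, "20250120"))
```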

Change #1104605 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Add support for dumps suspended job

https://gerrit.wikimedia.org/r/1104605

Change #1114001 merged by jenkins-bot:

[operations/deployment-charts@master] mediwiki-dumps-legacy: Create helmfile deployment of a suspended job

https://gerrit.wikimedia.org/r/1114001

Change #1121587 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] mediawiki: bump version

https://gerrit.wikimedia.org/r/1121587

Change #1121587 merged by Brouberol:

[operations/deployment-charts@master] mediawiki: bump version

https://gerrit.wikimedia.org/r/1121587

The Cronjob and PVC have been deployed!

brouberol@deploy2002:/srv/deployment-charts/helmfile.d/dse-k8s-services/mediawiki-dumps-legacy$ kube_env mediawiki-dumps-legacy-deploy dse-k8s-eqiad
brouberol@deploy2002:/srv/deployment-charts/helmfile.d/dse-k8s-services/mediawiki-dumps-legacy$ kubectl get pvc -n mediawiki-dumps-legacy mediawiki-dumps-legacy-fs
NAME                        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS        AGE
mediawiki-dumps-legacy-fs   Bound    pvc-94490633-4041-4985-bb13-74e8764c71e6   10Gi       RWX            ceph-cephfs-dumps   53s
brouberol@deploy2002:/srv/deployment-charts/helmfile.d/dse-k8s-services/mediawiki-dumps-legacy$ kubectl get all
NAME                                                COMPLETIONS   DURATION   AGE
job.batch/mediawiki-production-dumps-job-template   0/1                      69s

I'll let @BTullis assess whether we can close.

I believe the configmaps mentioned in T352650#10493191 still need to be implemented, along with these two things:

We also need to get the files under /etc/dumps/confs and /etc/dumps/dblists into the container, so maybe we need a dumps chart for this.
We can also generate an XCom which is a list of wikis, then we can create a task group, which is a worker process for each of the wikis.

I'm basing that on not seeing associated patches and not seeing the configmaps present in the k8s namespace. We'll have Ben back next week to confirm whether my understanding is correct, and if so we can start work on those pieces.

Change #1127879 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] mediawiki: Update the dumps job template to support write access

https://gerrit.wikimedia.org/r/1127879

Change #1127916 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] mediawiki: Use the servergroup to configure the dumps feature flag

https://gerrit.wikimedia.org/r/1127916

dancy merged https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/153

Add the bzip2 package to the mediawiki-debug image for legacy dumps

Change #1127879 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Update the dumps job template to support write access

https://gerrit.wikimedia.org/r/1127879

Change #1127916 abandoned by Btullis:

[operations/deployment-charts@master] mediawiki: Use the servergroup to configure the dumps feature flag

Reason:

No longer required.

https://gerrit.wikimedia.org/r/1127916

BTullis claimed this task.

I have created a sub-ticket, T398438: Decommission or recommission all snapshot and dumpsdata servers - although I don't propose that we let it defer our closing of this epic.

Similarly, I created T398436: Future improvements to the Dumps on Airflow configuration - tracking task, to track improvements that we were not able to make to the Dumps on Airflow system before go-live. I moved most of the outstanding tickets from this tree to that new one.
So I think that we can close this ticket now.

Change #1179639 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Add another 10 TB to the legacy dumps pvc

https://gerrit.wikimedia.org/r/1179639

Change #1179639 merged by jenkins-bot:

[operations/deployment-charts@master] Add another 10 TB to the legacy dumps pvc

https://gerrit.wikimedia.org/r/1179639