
WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes
Open, High, Public

Description

Update Dec 2024

As per the APP FY24-25: OKRs and Hypotheses spreadsheet, this work has now been associated with a hypothesis in the WE 5.4 KR. The hypothesis has the ID of 5.4.4 and the wording is as follows:

If we decouple the legacy dumps processes from their current bare-metal hosts and instead run them as workloads on the DSE Kubernetes cluster, this will bring about demonstrable benefit to the maintainability of these data pipelines and facilitate the upgrade of PHP to version 8.1 by using shared mediawiki containers.

The ideal solution is still to schedule the dumps with Airflow, but this wording of the hypothesis deliberately makes no mention of this, since we do not wish to place Airflow integration on the critical path to achieving the stated objective by April 1st 2025. The option of using Kubernetes CronJob objects (or similar) would meet the objective.

Original ticket description below

While there's ongoing work to create a new architecture for dumps, we still need to support the current version. In concise terms, what happens now is that a set of Python scripts periodically launches some MediaWiki maintenance scripts, and the (very large) output of those is written to an NFS volume mounted from the dumpsdata host.

After some consideration of other plans that were outlined in this task originally, we've decided to actually port the dumps to run on k8s, as outlined below.

In short (for the longer version, see Ben's document) we want to do the following:

  • Create a new MediaWiki image that includes a Python interpreter and the dumps code
  • Allow the MediaWiki chart to include a PersistentVolume mounted under /mnt/dumpsdata and the dumps configuration (now in puppet) under /etc/dumps/
  • Run a MediaWiki deployment on the DSE k8s cluster that writes the dumps to a dedicated Ceph volume, substituting the function of NFS, running the Python scripts as CronJob objects in a first implementation and moving to Airflow later (a sketch of such a CronJob follows this list)
  • Run a dependent job to copy the dump files over to the server that will serve them to the public
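To make the CronJob idea more concrete, here is a minimal sketch of the kind of object we have in mind, expressed with the Kubernetes Python client purely for illustration; in practice it would be templated via the mediawiki Helm chart. The image name, PVC name, schedule and entry point are placeholders, not actual values.

# Illustrative only: image, claim name, schedule and script path are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster

container = client.V1Container(
    name="dumps-worker",
    image="docker-registry.example.org/mediawiki-dumps:latest",  # hypothetical dumps-enabled MW image
    command=["python3", "/srv/dumps/xmldumps-backup/worker.py"],  # hypothetical entry point
    volume_mounts=[client.V1VolumeMount(name="dumpsdata", mount_path="/mnt/dumpsdata")],
)

pod_spec = client.V1PodSpec(
    restart_policy="Never",
    containers=[container],
    volumes=[
        client.V1Volume(
            name="dumpsdata",
            persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                claim_name="dumps-cephfs"  # hypothetical PVC backed by the Ceph volume
            ),
        )
    ],
)

cronjob = client.V1CronJob(
    metadata=client.V1ObjectMeta(name="dumps-smallwikis"),
    spec=client.V1CronJobSpec(
        schedule="0 2 1,20 * *",  # illustrative: a twice-monthly dump cadence
        job_template=client.V1JobTemplateSpec(
            spec=client.V1JobSpec(template=client.V1PodTemplateSpec(spec=pod_spec)),
        ),
    ),
)

client.BatchV1Api().create_namespaced_cron_job(namespace="dumps", body=cronjob)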

Event Timeline

Joe triaged this task as Medium priority. Dec 4 2023, 11:25 AM
Joe added a subscriber: ArielGlenn.

This sounds like it would work... but I do want to point out a potential maintenance issue:

The three of us (Dan, Xabriel, Jennifer) from Data Products who onboarded onto Dumps 1.0 were doing so with the assumption that we were going to get *more* resources to finish Dumps 2.0 and that it would be done by now. So we never learned in-depth what maintaining the dumps scripts would entail. Since then, the Dumps 2.0 work was deprioritized almost completely for the last few months.

The problem is, what little we learned about Dumps 1.0 is changing a bit with the wrapper scripts and containerization described above. So that's a potential risk for long-term maintenance of Dumps 1.0. We are trying to advocate for re-prioritization but I also have a sabbatical coming up. Just wanted to detail the precarious nature of this situation so that decision makers know what's going on. cc @VirginiaPoundstone and @WDoranWMF.

Thanks for the flag @Milimetric.

@Joe thanks for thinking this through. I have three follow-up questions:

  1. What is the timeline you are thinking about for this work?
  2. Are there any hard deadlines?
  3. If the new dumps is not ready to replace the current dumps by the end of the fiscal year, what gets blocked and/or what toil is endured?

While we have made progress, we still cannot accurately predict how long it will take us to get to the new dumps (there are a couple of big puzzles we still need to solve: scale & drift), but we do plan on prioritizing the work towards the end of Q3 23/24.

@Joe thanks for thinking this through. I have three follow-up questions:

Answers inline :)

  1. What is the timeline you are thinking about for this work?

We'd hope to start working on this late next quarter, or at the start of Q4 at the latest. We went down this path when we were asked whether we could live without Dumps 2.0 before the end of the fiscal year, so we got creative. To be clear, what I am proposing here is a terrible hack we should never ever do.

And yes, we'll still need support from the team responsible for maintaining dumps.

  2. Are there any hard deadlines?

We need to have switched old dumps off completely, or to this new setup, by end of June 2024.

  3. If the new dumps is not ready to replace the current dumps by the end of the fiscal year, what gets blocked and/or what toil is endured?

We're building even more technical debt on top of an infrastructure, dumps, that is both fundamental to our community and already a good pile of tech debt itself. It is a platform that, while fundamental, doesn't have proper software-side support (based on what @Milimetric was saying).

While we have made progress, we still cannot accurately predict how long it will take us to get to the new dumps (there are a couple of big puzzles we still need to solve: scale & drift), but we do plan on prioritizing the work towards the end of Q3 23/24.

I don't think it's realistic to expect Dumps 2.0 to have fully replaced the old dumps by the end of the fiscal year. And if we don't tie up these remaining loose ends, all the work we've done for the annual plan to migrate MediaWiki to k8s would reap only half the benefits.

So I think this work still needs to be done at this point.

Joe raised the priority of this task from Medium to High. Jul 8 2024, 5:40 AM

The priority of this task has become high, as doing this is currently a blocker to finishing the k8s migration of MediaWiki and thus to moving to PHP 8.x, which is desirable for various reasons and has been requested by a lot of teams.

So I see two ways forward:

  1. We do this work to detach the dumps infra from the puppet-driven mediawiki installation as described in this task
  2. Data engineering forks the puppet code for mediawiki into a separate class structure just for dumps and works on supporting PHP 8.x there themselves.

I don't really want to see the latter happening, as I think it's a larger effort and a larger maintenance nuisance (it would mean that DE needs to do work every time we update the PHP version for the main cluster), but it's up to you.

@Milimetric @VirginiaPoundstone @WDoranWMF the timeline serviceops was looking at for tentatively updating php to version 8 is Q2 of this FY. So this work has now become urgent, and we need to make a decision on which direction we want to take.

Thanks @Joe for the update and for pushing this forward. The Dumps 2.0 work has already been picked back up and is now even under an OKR. We are aiming for October for the second phase to be completed - that is, the intermediary table up in Iceberg: https://docs.google.com/document/d/19KVpwWCLMJKtPy8VdcIXzPeKkeREvD1tAHeHY9I5778/edit

That gives us the foundation to build the files - for the wikis, there are additional dumps that need to be migrated and that will likely be a long tail.

@Milimetric and I reviewed just now, and we're not positive we understand the level of effort you need from us on this and we're worried about the potential hidden monsters inside the changes given how fragile this is as a whole. We're going to discuss it and then come back to you quickly.

@Joe thanks for the ping. Just to keep you posted about our progress: Will is out this week but he and Data SRE are in conversation to see what we can accomplish. Should know more mid next week.

Is there an annual plan OKR that the upgrade work is connected to?

Sorry to ask this very basic question, but I found that a bunch of others didn't know either: how exactly is Dumps blocking the PHP 8 upgrade? Like, if we leave everything exactly as-is and just upgrade PHP, would it not run the way it is currently set up? On the surface I see no big difference between the current setup and a containerized MW running on the same servers, so I'm curious about the nuance I'm missing here.

IIUC, the PHP 8 issue will be the same with containerized MW. I also don't know exactly what it is. :)

This proposal will allow SRE to decommission all the existing scap and puppet deployment machinery for deploying MW to bare metal.

Sorry to ask this very basic question, but I found that a bunch of others didn't know either: how exactly is Dumps blocking the PHP 8 upgrade? Like, if we leave everything exactly as-is and just upgrade PHP, would it not run the way it is currently set up? On the surface I see no big difference between the current setup and a containerized MW running on the same servers, so I'm curious about the nuance I'm missing here.

Well, there are quite a few things to consider, but it boils down to a few facts:

  • You can't "just upgrade PHP". You need to update the puppet code related to it and the Debian distribution we're running on (as we only provide PHP 8.x packages from bullseye onwards).
  • Puppet code for MediaWiki is on the way out, so there is no point in spending the time to support PHP 8 there. You can of course do that work instead of modifying dumps to run containerized MediaWiki, but in my experience (I wrote quite a bit of that code) I think that is going to be significantly more work.
  • To use PHP 8, we first need to upgrade to Debian bullseye and (eventually) bookworm. This means you will also have to upgrade to a newer distribution and a newer Python version in a short-ish timeline. If we run a containerized MediaWiki, we decouple the process of updating PHP from that of updating the old dumps servers.
  • If these servers aren't decommissioned, upgrading them to bookworm during this FY or early in the next will become mandatory, as we will move forward more quickly now that we are building containers, and you shouldn't be running an outdated version of MediaWiki.

The alternative is to install docker on those servers, distribute the mediawiki image there (both things serviceops can take care of easily), and then experiment with running mediawiki-in-docker in the dumps scripts. It seems like the convenient option to me, but YMMV.

I think it will be quite a while before we are fully able to decom Dumps 1. This task will unblock SRE's MW to k8s migration, and will allow them to remove all the complicated puppet and scap code supporting the bare metal MW deployment.

We should do this! But who will do it and how much work is it? ;)

Just for the record, we met and discussed @Joe's proposal (this task's description) and were in general agreement that it's the best way forward. We have follow-up discussions to have and coordination to do, but we're aligned on the idea.

To confirm understanding, do we have a leaning on whether the containerized version would pin to an older version of MediaWiki, or whether it would need to keep getting MediaWiki updates?

If pinning to a specific version, would that pose problems for running the code as-is in perpetuity (meaning, let's say as a thought experiment, 12-18 months) and producing stable dumps?

We discussed some of this earlier in the week, I think, but I wanted to make sure we captured it in writing as well. Part of the basis for the question is understanding how much extra effort there may be if we need to keep building new versions of the image, versus how much effort there may be (or not be) if we stick to a particular image until the system is superseded.

Following up on some discussions:

  • @BTullis confirmed that Data Platform SRE is okay with Service Ops being the one to containerize, as opposed to Data Platform SRE being the one to containerize. This was an open question from the last meeting given the attendees (Ben was out), and I wanted to close the loop on this.
  • @Joe can you confirm the Service Ops approach to cutting new images after the introduction of the first image? Specifically, would you anticipate MediaWiki updates (core, extensions, and so on) being applied to new images periodically, and what about security patches? I guess there's the pre-PHP 8 world and there's the world once things are on PHP 8 as well, so maybe it's useful to understand these piecewise.
  • It sounds like Q2 FY 24-25 (i.e., October 2024 - December 2024) is the most realistic timeframe for containerization. Does that sound good @Joe @WDoranWMF? We're mindful of the realities of mid-November to mid-January (holidays, banner runs, freeze periods, etc.) and the need for stability and continuity for dumps runs and the queried databases.
  • If I'm understanding correctly, people are thinking that we cutover from bare metal execution to use of container entrypoints for the jobs, and that we aren't necessarily running bare metal side-by-side with containers. This seems like a risk, at least if we consider the possibility that new stuff may be more prone to breakage and that such breakage could lead to lagged data delivery which can have a compounding effect in the infrastructure. On the one hand we could execute bare metal followed by container to verify like-for-like behavior for given job runs without crushing the databases, but on the other I wonder if we end up bumping into job sequencing issues there given the times of month when things run for the really big dumps. I'd be really interested to hear @xcollazo's thoughts here and what seems like the safest way forward.
  • If I'm understanding correctly, people are thinking that we cutover from bare metal execution to use of container entrypoints for the jobs, and that we aren't necessarily running bare metal side-by-side with containers. This seems like a risk, at least if we consider the possibility that new stuff may be more prone to breakage and that such breakage could lead to lagged data delivery which can have a compounding effect in the infrastructure.

I think there were 2 ideas:

  1. Chase all the places in which our code runs mwscript in the Dumps code, and replace it with equivalent docker run calls.
  2. Do a rewrite that would run completely from the container, and perhaps executable from k8s.

IIRC, folks were leaning towards #1 due to way narrower scope.
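To illustrate option 1, here is a rough sketch of what replacing an mwscript call with an equivalent docker run call could look like inside the dumps Python code. The image name, mount paths and wrapper function are hypothetical rather than taken from the actual dumps codebase, and exactly how the maintenance script is invoked inside the image would need to match how the shared MediaWiki image is built.

# Hypothetical wrapper: replace an mwscript invocation on the bare-metal host with
# an equivalent `docker run` against the shared MediaWiki image.
import subprocess

MEDIAWIKI_IMAGE = "docker-registry.example.org/mediawiki-multiversion:latest"  # placeholder image

def run_maintenance_script(script, wiki, *extra_args):
    """Run a MediaWiki maintenance script in a container instead of via mwscript."""
    cmd = [
        "docker", "run", "--rm",
        "--volume", "/mnt/dumpsdata:/mnt/dumpsdata",  # dump output still lands on the shared volume
        MEDIAWIKI_IMAGE,
        "php", f"maintenance/{script}",  # assumes the image exposes maintenance scripts at this path
        f"--wiki={wiki}", *extra_args,
    ]
    return subprocess.run(cmd, check=True, capture_output=True, text=True)

# e.g. the rough equivalent of: mwscript dumpBackup.php --wiki=simplewiki --current
run_maintenance_script("dumpBackup.php", "simplewiki", "--current")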

On the one hand we could execute bare metal followed by container to verify like-for-like behavior for given job runs without crushing the databases, but on the other I wonder if we end up bumping into job sequencing issues there given the times of month when things run for the really big dumps. I'd be really interested to hear @xcollazo's thoughts here and what seems like the safest way forward.

We can and should do a couple of test runs with, say, simplewiki. But yeah, big dumps like enwiki will be difficult to run twice and compare. We can always compare to the previous month and judge whether the files look reasonable, as we did in T365155.

@Joe checking - is Q2 FY 24-25 still looking good for Service Ops containerization work? Or is it looking more like Q3? Either way, any guesses on rough range of dates in which this effort might start?

@Joe checking - is Q2 FY 24-25 still looking good for Service Ops containerization work? Or is it looking more like Q3? Either way, any guesses on rough range of dates in which this effort might start?

Yes, my hope is to get to work on this in November, or at the latest in early December when the change freeze goes into effect.

I think there were 2 ideas:

  1. Chase all the places in which our code runs mwscript in the Dumps code, and replace it with equivalent docker run calls.
  2. Do a rewrite that would run completely from the container, and perhaps executable from k8s.

IIRC, folks were leaning towards #1 due to way narrower scope.

I hope nobody minds, but I'd like to put forward another idea regarding the containerisation/kubernetes element.

I wrote this proposal a couple of weeks ago about how the miscellaneous dumps (i.e. everything apart from the SQL/XML dumps) could be run on the dse-k8s cluster and scheduled with Airflow: Migrating the Misc Dumps to Airflow. The reason I started writing it is that some of these dumps aren't within scope for the Dumps 2.0 project, but will still be needed for a long time.

However, the more I think about it, the more I think that it makes sense to expand the proposal to include these XML/SQL dumps in this migration as well, for as long as we would need to look after them.

The key elements of this proposal are:

  • We use the analytics Airflow instance (or a new instance) as the scheduler, to launch DAGs that perform the dumps as tasks.
  • A dump task uses the KubernetesPodOperator to launch a MediaWiki container and perform the dumps using the existing dumps scripts (a sketch of such a DAG follows this list).
  • The Data-Platform-SRE team will begin writing and deploying these DAGs and will work with other teams, as required, to make sure that the MediaWiki pod uses common tooling and standards.
  • The dumps will run as the dumpsgen POSIX user, as they currently do on the snapshot10[10-17] hosts.
  • Instead of being written to the current NFS servers (dumpsdata100[3-7]), the dumps would be written to a CephFS volume that is mounted in the MediaWiki task pod.
  • Upon completion (or at regular intervals), another task pod starts up to rsync the dumps from the CephFS volume to the current dumps distribution servers, clouddumps100[1-2].
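For illustration, here is a minimal sketch of what one of these DAGs could look like, assuming a recent Airflow 2.x with the cncf.kubernetes provider installed. The namespace, image names, PVC name, schedule, and command lines are all placeholders rather than actual values.

# Illustrative DAG shape only: names, images and paths are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s

with DAG(
    dag_id="xml_sql_dumps_simplewiki",
    start_date=datetime(2025, 1, 1),
    schedule="0 2 1 * *",  # illustrative: once per month
    catchup=False,
) as dag:
    dumps_volume = k8s.V1Volume(
        name="dumpsdata",
        persistent_volume_claim=k8s.V1PersistentVolumeClaimVolumeSource(
            claim_name="dumps-cephfs"  # hypothetical PVC backed by the CephFS volume
        ),
    )
    dumps_mount = k8s.V1VolumeMount(name="dumpsdata", mount_path="/mnt/dumpsdata")

    run_dump = KubernetesPodOperator(
        task_id="run_dump",
        name="simplewiki-dump",
        namespace="dumps",  # hypothetical namespace on dse-k8s
        image="docker-registry.example.org/mediawiki-dumps:latest",  # hypothetical MW image
        cmds=["python3", "/srv/dumps/xmldumps-backup/worker.py"],     # hypothetical entry point
        arguments=["--configfile", "/etc/dumps/confs/wikidump.conf.dumps", "simplewiki"],
        volumes=[dumps_volume],
        volume_mounts=[dumps_mount],
    )

    publish_dump = KubernetesPodOperator(
        task_id="rsync_to_clouddumps",
        name="simplewiki-dump-rsync",
        namespace="dumps",
        image="docker-registry.example.org/rsync:latest",  # hypothetical rsync image
        cmds=["rsync", "-av", "/mnt/dumpsdata/", "dumpsgen@clouddumps.example:/srv/dumps/"],  # illustrative target
        volumes=[dumps_volume],
        volume_mounts=[dumps_mount],
    )

    run_dump >> publish_dump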

I've included a diagram from the original proposal below, which illustrates this.
In addition to the rsync task pod, this diagram also includes a task to write the completed dump to an S3 bucket, but that's not necessarily a requirement for these dumps. However, it's still an option.

Misc snapshot - proposed architecture.drawio.png (1×846 px, 150 KB)

I think that replacing NFS with CephFS would be really beneficial in this case, because NFS is intrinsically a single point of failure and requires a lot of work to keep standby servers in sync and then fail over and back.

The proposal doesn't include replacing the current clouddumps100[1-2] servers, which are the public-facing side of dumps. I thought it best to try to keep this element stable in order to keep the proposal within a reasonable scope.

To my mind, there is a strong element of lift-and-shift about this, because we will be using the existing dump scripts and mwmaint, but the main benefit is that it would free us from the hard dependency on individual snapshot and dumpsdata servers. We would also benefit from much greater observability by using Airflow as the scheduler, rather than the systemd timers we have at the moment.

I also feel that it would be achievable in a reasonable timeframe. We have all of the infrastructure in place right now.
If we can free up the current snapshot101[0-7] servers, these could then be renamed as dse-k8s-worker nodes and added to the dse-k8s cluster, in the same way that the mediawiki application servers were added to wikikube.

I'm keen to hear your feedback, even if you think that the idea is bunkum.

I think @BTullis' idea is great - there are a few unknowns regarding how to keep what runs via Airflow in sync with what runs on k8s via helm charts, but that's something I'd rather invest my time in than writing some shim to run mwscript on a physical host. That shim was a solution I proposed to reduce overall effort, but for multiple reasons, first of all the operational toil of the DE SRE team, I think it makes sense to go with his proposal.

I'm keen to hear your feedback, even if you think that the idea is bunkum.

It is not bunkum. Quite the contrary. I left a few comments in the doc, but overall this looks pretty promising. It has the potential to revamp the dumps experience both for users and operators and lower maintenance costs for DE SRE.

AeinBagheri set Due Date to Oct 28 2024, 12:00 AM.
AeinBagheri set the point value for this task to 100.
Reedy removed the point value for this task.
Reedy removed Due Date.
Reedy added a subscriber: AeinBagheri.

My opinion is totally not relevant, but I've read the proposal and it seems great, thanks Ben! +1

I suspect I've missed something, so forgive an obvious question, but: how does this tie in with remediating T368098, i.e. the impact upon production databases from running dumps?

I suspect I've missed something, so forgive an obvious question, but: how does this tie in with remediating T368098, i.e. the impact upon production databases from running dumps?

It's a very good question. As it stands, this proposal doesn't really do anything to remediate T368098 other than increasing the observability around dumps 1.0 and the database traffic associated with each task.

When developing the dumps DAGs we may well find opportunities to vary the parallelism of dumps, by means of bypassing the current scheduler, but I wouldn't like to make any specific claims up-front about reducing the database load.

The real gains in terms of reducing the database load will come with the Dumps 2.0 project, which is happening in parallel with this work. For information about that, see:

I'll add the Epic tag and move this back to the main Data-Platform-SRE workboard, so we can start creating child tickets to work on this.
There's still some work to do on re-drafting the proposal based on what we have discussed, but I think that we can start outlining some tasks anyway.

On the DPE SRE side, we will certainly need to:

  • Create a ceph filesystem for dumps
  • Identify a suitable dump command that we can run to test the functionality, such as a single wiki

For the image creation, we will need an image based on mediawiki with the following additional components...

  • A python3 interpreter - We currently use the system default python 3.9 from bullseye on the snapshot hosts
  • A local checkout of https://gerrit.wikimedia.org/g/operations/dumps - it's currently /srv/deployment/dumps on the snapshot hosts, but this location isn't important.

We have a number of configuration files that are currently managed by puppet on the snapshot hosts.

btullis@snapshot1013:/etc/dumps$ find . -type f
./confs/wikidump.conf.tests
./confs/addschanges.conf.labs
./confs/wikidump.conf.dumps
./confs/table_jobs.yaml
./confs/wikidump.conf.other
./confs/addschanges.conf
./confs/wikidump.conf.labs
./confs/wikidump.conf.other.labs
./templs/feed.xml
./templs/incrs-index.html
./templs/download-index.html
./templs/README
./templs/report.html
./templs/errormail.txt
./stages/stages_create_smallwikis
./stages/stages_create_wikidatawiki
./stages/stages_full_enwiki
./stages/stages_create_enwiki
./stages/stages_full
./stages/stages_partial
./stages/stages_partial_wikidatawiki
./stages/stages_partial_enwiki
./stages/stages_create_bigwikis
./stages/stages_full_wikidatawiki
./dblists/bigwikis-labs.dblist
./dblists/wikidatawiki.dblist
./dblists/skip-labs.dblist
./dblists/skip.dblist
./dblists/globalusage.dblist
./dblists/bigwikis.dblist
./dblists/README
./dblists/skipnone.dblist
./dblists/enwiki.dblist
./dblists/skipmonitor.dblist
./cache/running_cache.txt

I think that we can look at whether or not we need these on a case-by-case basis, then handle what we need with configmaps etc. I suspect that we will need very few of them, in the end.
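As an example of the configmap approach, here is a small sketch using the Kubernetes Python client, purely for illustration; in practice this would more likely be rendered from the Helm chart values. The namespace and ConfigMap name are made up, and the two files chosen are just examples from the listing above.

# Illustrative only: publishes two of the puppet-managed config files as a ConfigMap.
from pathlib import Path
from kubernetes import client, config

config.load_kube_config()

data = {
    "wikidump.conf.dumps": Path("/etc/dumps/confs/wikidump.conf.dumps").read_text(),
    "table_jobs.yaml": Path("/etc/dumps/confs/table_jobs.yaml").read_text(),
}

configmap = client.V1ConfigMap(
    metadata=client.V1ObjectMeta(name="dumps-confs", namespace="dumps"),  # hypothetical names
    data=data,
)

client.CoreV1Api().create_namespaced_config_map(namespace="dumps", body=configmap)

The MediaWiki task pod could then mount such a ConfigMap under /etc/dumps/confs/ so that the existing scripts find their configuration in the usual place.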

We have now created a cephfs file system for dumps.

btullis@cephosd1001:~$ sudo ceph fs volume create dumps
Volume created successfully (no MDS daemons created)

We can see that one of the existing standby MDS servers has now been assigned to the active role for this file system.

btullis@cephosd1001:~$ sudo ceph -s
  cluster:
    id:     6d4278e1-ea45-4d29-86fe-85b44c150813
    health: HEALTH_OK
 
  services:
    mon: 5 daemons, quorum cephosd1001,cephosd1002,cephosd1003,cephosd1004,cephosd1005 (age 23h)
    mgr: cephosd1005(active, since 24h), standbys: cephosd1001, cephosd1002, cephosd1003, cephosd1004
    mds: 2/2 daemons up, 3 standby
    osd: 100 osds: 100 up (since 23h), 100 in (since 6M)
    rgw: 5 daemons active (5 hosts, 1 zones)
 
  data:
    volumes: 2/2 healthy
    pools:   14 pools, 3347 pgs
    objects: 67.85k objects, 17 GiB
    usage:   28 TiB used, 1.1 PiB / 1.1 PiB avail
    pgs:     3347 active+clean
 
  io:
    client:   35 KiB/s wr, 0 op/s rd, 9 op/s wr

We want to put the data for this file system on the HDDs and the metadata on the SSDs.

Currently, they are both assigned the hdd CRUSH rule.

btullis@cephosd1001:~$ sudo ceph osd pool get cephfs.dumps.meta crush_rule
crush_rule: hdd
btullis@cephosd1001:~$ sudo ceph osd pool get cephfs.dumps.data crush_rule
crush_rule: hdd

We can change the CRUSH rule associated with the cephfs.dumps.meta pool like this.

btullis@cephosd1001:~$ sudo ceph osd pool set cephfs.dumps.meta crush_rule ssd
set pool 18 crush_rule to ssd
btullis@cephosd1001:~$ sudo ceph osd pool get cephfs.dumps.meta crush_rule
crush_rule: ssd
Joe renamed this task from Migrate current-generation dumps to run from our containerized images to Migrate current-generation dumps to run on kubernetes. Wed, Dec 4, 10:23 AM
Joe updated the task description. (Show Details)

Change #1104605 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mediawiki: Add support for dumps persistent volumes

https://gerrit.wikimedia.org/r/1104605

I have added a patch to the mediawiki chart to allow claiming persistent volumes for dumps specifically. @BTullis I guess that when both this patch and the one to add the needed stuff to the debug image are ready, you'll be ready to create a deployment on the DSE cluster.

I think it might make sense to also link the files under helmfile.d/services/_mediawiki-common_ in your space, as we definitely want to keep the settings in sync.

Let me know if you need further assistance in setting that up.

Thanks @Joe - I think that looks useful.

I'd also like to catch up on the current work with regard to how this chart is being used to deploy Job objects, if possible.
I see various references to mwscript and mercurius here, so if you (or someone) could let me know the current direction of travel regarding these two things, I'd be grateful.

[...]

I see various references to mwscript and mercurius here, so if you (or someone) could let me know the current direction of travel regarding these two things, I'd be grateful.

Not sure that's the right file, but Mercurius and mwscript are two Job-type deployments of mediawiki that don't include httpd. Yours would be a third one. You'd also have to add the command that's going to be run to the MediaWiki container part of the templates/lamp/deployment.yaml.tpl file, and exclude dumps from the liveness probe in that same file.

The rest of the sidecars etc. can be configured from the values.yaml files.

I'm in the process of refactoring the mediawiki chart, but I don't want to block you moving forwards, so I'll port any changes you make when I have something that works.

You probably also want to if-guard the service definition in templates/service.yaml.tpl

BTullis renamed this task from Migrate current-generation dumps to run on kubernetes to WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes. Thu, Dec 19, 12:30 PM
BTullis updated the task description. (Show Details)
BTullis set Due Date to Apr 1 2025, 8:00 AM.