
Migrate current-generation dumps to run from our containerized images
Open, Medium, Public

Description

While there's ongoing work to create a new architecture for dumps, we still need to support the current version. In concise terms, what happens now is that a set of Python scripts periodically launches some MediaWiki maintenance scripts, and the (very large) output of those is written to an NFS volume that is mounted from the dumpsdata host.

While I think it could be possible, with some extreme gymnastics, to make these run in Kubernetes, I see little value in doing so, especially as it's a system we want to eventually sunset. If this assumption of mine is wrong, we might want to reconsider the plan I'm going to lay out in this task.

Specifically, I start from the assumption that by migrating to Kubernetes we want two things:

  • Get rid of all the MediaWiki-related puppet code, eventually
  • Get rid of any non-k8s-related scap features, unifying our deployments

I think this can be achieved as follows:

  • We install the container runtime on the snapshot hosts
  • As part of scap, we already presync the MediaWiki PHP image to all Kubernetes nodes; we would now also push the images to the snapshot hosts
  • We provide a wrapper script that runs MediaWiki maintenance scripts from a docker image, mounting the right volumes
  • We run the needed sidecars (I think we only need mcrouter) as docker containers on these hosts
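For illustration, here is a minimal sketch of what the wrapper script could look like. The image name, registry, mount path, and entry point below are assumptions for the sake of the example, not the actual setup; a real wrapper would exec the docker command rather than print it:

```shell
# Hypothetical wrapper sketch: build the docker invocation that runs a
# MediaWiki maintenance script inside the pre-synced PHP image, with the
# dumpsdata NFS volume bind-mounted. Defaults below are illustrative.
IMAGE="${MW_IMAGE:-docker-registry.example/mediawiki-php:latest}"
DUMPS_MOUNT="${DUMPS_MOUNT:-/mnt/dumpsdata}"

mw_maint_cmd() {
    # Print the command for inspection; a real wrapper would exec it.
    echo docker run --rm \
        --network host \
        -v "${DUMPS_MOUNT}:${DUMPS_MOUNT}" \
        "${IMAGE}" \
        php maintenance/run.php "$@"
}

mw_maint_cmd dumpBackup.php --full
```

Running the sidecars would follow the same pattern: a long-lived `docker run` for mcrouter on each snapshot host, with `--network host` so the maintenance containers can reach it on localhost.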

Event Timeline

Joe triaged this task as Medium priority.Dec 4 2023, 11:25 AM
Joe created this task.
Joe added a subscriber: ArielGlenn.

This sounds like it would work... but I do want to point out a potential maintenance issue:

The three of us from Data Products (Dan, Xabriel, Jennifer) onboarded onto Dumps 1.0 with the assumption that we were going to get *more* resources to finish Dumps 2.0 and that it would be done by now. So we never learned in depth what maintaining the dumps scripts would entail. Since then, the Dumps 2.0 work has been almost completely deprioritized for the last few months.

The problem is, what little we learned about Dumps 1.0 is changing a bit with the wrapper scripts and containerization described above. So that's a potential risk for long-term maintenance of Dumps 1.0. We are trying to advocate for re-prioritization but I also have a sabbatical coming up. Just wanted to detail the precarious nature of this situation so that decision makers know what's going on. cc @VirginiaPoundstone and @WDoranWMF.

Thanks for the flag @Milimetric.

@Joe thanks for thinking this through. I have three follow-up questions:

  1. what is the timeline you are thinking about for this work?
  2. Are there any hard deadlines?
  3. If the new dumps is not ready to replace the current dumps by end of fiscal year, what gets blocked and/or what toil is endured?

While we have made progress, we still cannot accurately predict how long it will take us to get to the new dumps (there are a couple of big puzzles we still need to solve: scale & drift), but we do plan on prioritizing the work towards the end of Q3 23/24.

@Joe thanks for thinking this through. I have three follow-up questions:

Answers inline :)

  1. what is the timeline you are thinking about for this work?

We'd hope to start working on this late next quarter, or at the start of Q4 at the latest. We went this way because we were asked if we could live without dumps 2.0 before the end of the fiscal year, so we got creative. To be clear, what I am proposing here is a terrible hack we should never ever do.

And yes, we'll still need support from the team responsible for maintaining dumps.

  2. Are there any hard deadlines?

We need to have switched old dumps off completely, or to this new setup, by end of June 2024.

  3. If the new dumps is not ready to replace the current dumps by end of fiscal year, what gets blocked and/or what toil is endured?

We're building even more technical debt on top of an infrastructure, dumps, that is both fundamental to our community and also a good pile of tech debt itself: a platform that, while fundamental, doesn't have proper software-side support (based on what @Milimetric was saying).

While we have made progress, we still cannot accurately predict how long it will take us to get to the new dumps (there are a couple of big puzzles we still need to solve: scale & drift), but we do plan on prioritizing the work towards the end of Q3 23/24.

I don't think it's realistic to expect dumps 2.0 to have fully replaced the old dumps by the end of the fiscal year. And if we don't tie up these remaining loose ends, all the work we've done for the annual plan to migrate MediaWiki onto k8s would reap only half the benefits.

So I think this work still needs to be done at this point.