
Draft and get approval for next hypothesis to follow WE6.2.1
Closed, Resolved · Public

Description

As we near completion on T369115: [FY24-25 WE6.2.1] Publish pre-train single version containers, it is time to figure out what our next measurable milestone on the way to implementing the full Group -1 concept will be.

At a tactical level we need to:

  • Work with the folks in ServiceOps new to select a Kubernetes deployment environment for the new wmf/next containers. This might take a number of different forms (new Deployment, new namespace, isolated Kubernetes cluster, etc) depending on the threat modeling and workload isolation requirements that are discovered.
  • Work with Quality-and-Test-Engineering-Team to select an initial scope/role for the deployed containers. Running a new testing wiki, taking over one or more existing wikis, or powering an mwdebug testing server are some of the easily listed options here.
  • Determine compensating controls needed (if any) before enabling automated deployment into the target environment.
  • Implement the necessary code and configuration to deploy the container to the selected location with the selected scope/role.

Event Timeline

bd808 triaged this task as High priority.

@thcipriani For your consideration. I would love some help thinking about what our measurable outcomes might be for this next phase of the project. It seems likely that picking the scope/role that we want to target before committing to the hypothesis would be helpful. Then we could say something like:

If we deploy a wmf/next container to production for use as [scope/role here] 4 or more times per week, we will ...

I have spent some time daydreaming about deploying wmf/next to a new mwdebug pod so that anyone with the WikimediaDebug add-on could test wmf/next. Mwdebug nodes are also something that we already know how to do special edge routing for, which removes a potential new question from this phase of the project. A desirable thing I see with mwdebug is that it allows testing with any/all wikis in the cluster, including workflows that cross wiki boundaries (SUL login, Commons publishing, Wikidata retrieval). It would not, however, allow testing of changes executed via the job queue.

The biggest arguments against mwdebug I have found so far are concerns about cache pollution/invalidation from traffic running an N+1 (or even N+2, depending on timing) version of MediaWiki from wmf/next mixing with normal user traffic. A train rollback provides an opportunity for this sort of data problem today, but with generally much shorter windows of opportunity, in that we tend to see rollbacks happen relatively soon after roll forward. A counterpoint is that it would be unlikely to see adoption of wmf/next via WikimediaDebug at a massive scale; I would assume tens of users, with say 200-400 as a relatively high upper bound, vs. the thousands of users that rolling to Group 2 exposes to newer code.

I have also heard some concerns about potential data loss or corruption, which also seem valid, but perhaps not something entirely new: the train already brings the possibility of data corruption each week.

I still see some interesting possibilities in an mwdebug approach, but it also seems like it would be easier to reason about after we have some existing experience with wmf/next in general.

If we take the mwdebug idea off the table, then the choice is between a new test wiki just for this group (test3wiki?) and moving one or more existing wikis to the group.

Adding a new wiki would be the least disruptive to any existing workflows. It would, however, also require all the things that inventing a new wiki requires, the biggest of which is attracting enough wiki functionary role holders to keep things from drowning in bad-faith edits.

Today there are a number of group 0 wikis that could be candidates for moving to group -1: testwiki, testwikidatawiki, testcommonswiki, officewiki, mediawikiwiki. I would not personally mind seeing labswiki (Wikitech), which is currently in group 1, moved to group -1 too.

A tiered rollout starting with one wiki and then adding more as confidence in the general system improves also seems prudent.

A single wiki would side-step many of the problems you've unearthed investigating the WikimediaDebug idea (though I do love the WikimediaDebug idea, we may not yet be ready for it/it may be too big for a next step). Deploying a single wiki with wmf/next seems like an easier step to take.

I'm curious: what is the benefit of building a test3wiki vs. using test2wiki?

I still wonder if there's work to do before deploying wmf/next to a single wiki. After we're building a single-version wmf/next image, is it still an open question what mechanism will deploy it in a way that doesn't clash with backports/train? Or was that within scope of the initial phase of this project? I remember discussing this, but it seems like work has been focused on image building, is that right?

If that is correct, there may be a few things to work out there:

  1. What deployment locks (if any) are set.
  2. Will it be deployed from the deployment server? Manually or automatically? How often and by whom?
  3. How do we stop it from being deployed? Will we ever need to roll back?

Design and initial implementation to address these problems are things we could work on while ServiceOps new (cc @akosiaris) works towards request routing for single-version images. When those two pieces come together, we'd both be ready for test2wiki (or test3wiki or officewiki or whatever wiki feels safe to try).

I'm curious: what is the benefit of building a test3wiki vs. using test2wiki?

This might be a good question for QTE folks to help us think about. Today we have testwiki in Group 0 and test2wiki in Group 1. Would we lose anything important by moving one of those existing wikis to Group -1? I think another way to ask that question is to ask if there is a specific reason to have a dedicated test wiki in Group 0 or Group 1, assuming that there is at least one dedicated test wiki in an earlier group.

I still wonder if there's work to do before deploying wmf/next to a single wiki. After we're building a single-version wmf/next image, is it still an open question what mechanism will deploy it in a way that doesn't clash with backports/train? Or was that within scope of the initial phase of this project? I remember discussing this, but it seems like work has been focused on image building, is that right?

The work in the first hypothesis will produce an image for us based on a timer schedule. At the moment that schedule is once per day, but we will have the ability to change this cadence as we desire. This image build process was designed so that it will not contend for locks with other uses of scap on the deployment servers. I have been generally assuming that the deployment will be something like helmfile -e $CLUSTER -i apply --context 5, which would target a unique Group -1 Deployment and thus only conflict with itself. That is an assumption, however, and not a known fact at this point.
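
To make that assumption concrete, here is a minimal sketch of what such a manual deploy might look like. The loop over clusters and everything beyond the quoted helmfile invocation are illustrative placeholders, not decided values.

```
#!/bin/bash
# Sketch only, based on the assumption above: apply the latest wmf/next image
# to a dedicated Group -1 Deployment in each cluster. The cluster list and any
# release-specific selection are hypothetical; nothing here is decided.
set -euo pipefail

for cluster in eqiad codfw; do
    # -i prompts for confirmation before applying changes; --context 5 controls
    # the amount of context shown in the rendered diff, as in the command above.
    helmfile -e "${cluster}" -i apply --context 5
done
```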

If that is correct, there may be a few things to work out there:

  1. What deployment locks (if any) are set.
  2. Will it be deployed from the deployment server? Manually or automatically? How often and by whom?
  3. How do we stop it from being deployed? Will we ever need to roll back?

I think we could pretty easily do a daily manual deployment. I don't think we would want to try to manually deploy more frequently than a couple of times per normal work day. We could also consider a "when someone in the QTE role wants it" manual cadence where we train the QTE folks to do this deployment as a self-service operation.

In the initial scope here I said "I was tempted to continue the tactical list to include automating the deployment as well, but on reflection I think the hypothesis model would be better used if we stop to reflect before forging on to that next more controversial topic." I am certainly willing to rethink this if hands-free automated deployment is not as controversial as I am assuming.

Design and initial implementation to address these problems are things we could work on while ServiceOps new (cc @akosiaris) works towards request routing for single-version images. When those two pieces come together, we'd both be ready for test2wiki (or test3wiki or officewiki or whatever wiki feels safe to try).

I am not exactly sure how we would word it at this point, but we could think about making the Q2 hypothesis focus on research and discussion of the known open questions: where the container will run, how the deployment will be triggered, how often the deployment will happen, and how traffic will be routed to the container. The desired outcome by the end of December 2024 (FY24/25 Q2) would be a written strategy and tactical plan for execution in the following quarter.

More "good question" questions:

  • Will we need to incorporate mediawiki-config changes into the wmf/next image in real-time?
  • Will we need to incorporate security patches into the wmf/next image in real-time?
  • Do the answers to the config and security questions change based on the deployment cadence? Is there a cadence that makes it easier/harder to wait for convergence?

Back channel discussions between @bd808 and @thcipriani are trending towards the next necessary step being work to define a strategy and tactics for the deployment and use of the wmf/next image that have consensus across RelEng, SRE, and QTE. There are a number of open questions at this point that will need discussion to resolve. There may also ultimately be a dependency on the work by SRE to define a system for handling single-version routing in the k8s cluster.

The October-December quarter is missing a lot of working days as a result of the US Thanksgiving holiday and the WMF global end of year holidays. The "Big English" fundraising campaign in December also typically comes with a freeze on deployments that might negatively impact that critical event.

A new question to ask: Is there a provable test coverage improvement when we test the combined wmf branch vs. the tests already run against each included hash?

This has arisen as I think about how we might speed up the image publication pipeline. Currently we prepare a git commit that bumps all of the submodules for wmf/next and send it through Gerrit to merge. This fires Zuul's gate-and-submit pipeline on mediawiki/core.git, which is taking about 30 minutes of wall-clock time to complete. Naively, it is unclear what new combination of code is tested by this pipeline, given that each extension, skin, and core hash involved in the freshly prepared wmf/next branch has already been through gate-and-submit before being merged into its related repo.
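
For illustration only, a rough sketch of the kind of submodule bump described above; the branch handling, tracked refs, and commit message in the real automation may well differ.

```
#!/bin/bash
# Illustrative sketch, not the actual automation: advance every submodule on
# the wmf/next branch of mediawiki/core to the tip of its tracked branch and
# send the resulting commit to Gerrit, which triggers gate-and-submit.
set -euo pipefail

git checkout wmf/next
git submodule update --init --recursive
# Move each submodule pointer to the latest commit on its configured branch.
git submodule update --remote
git add -A
git commit -m "Bump submodules for wmf/next"
# git-review pushes to refs/for/wmf/next for review and CI.
git review wmf/next
```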

bd808 changed the task status from Open to In Progress. Nov 7 2024, 6:32 PM