
[EPIC] Create Automated Supervision test automation for CI
Open, Needs Triage · Public

Description

For the QS Automation Team to intake and work on for FY26/27.

To support FY26/27 priorities for Developer Experience / SRE and help provide faster feedback cycles to developers, answer the following question:

"How do we know things are working properly in a fully automated deployment pipeline?"

Work will need to:

  • Define what "Critical User Journeys" are alongside a test automation engineer
  • Automate Critical User Journeys at an appropriate level (cURL to wikipedia.org? E2E UI automated test?)

This will likely be a large effort that requires input from Quality Leads (to determine what functionality is important to exercise for users), Core Experiences, and others.

Event Timeline

Restricted Application added a subscriber: Aklapper.

The Pretrain project, and the follow-up work to increase deployment frequency that we are planning for FY27 (July 2026-June 2027), would really benefit from critical user journey tests that can be targeted at "canary" servers during a scap deployment cycle in the WikiKube Kubernetes cluster. The general intent would be to validate that a newly prepared MediaWiki deployment is meeting real-time service level objectives (SLOs) when the code is exercised via critical user journey tests, as a gate in the deployment automation. If the SLO check succeeds, we would move forward with promoting the image to handle all traffic for the active deployment. If the check fails, we would roll back completely to the pre-deployment state and raise a signal for investigation of the SLO check failure.
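To make the gate described above concrete, here is a minimal sketch of the promote-or-roll-back decision. Everything named here (`GateAction`, `GateResult`, `run_canary_gate`, and the `slo_check` callable) is hypothetical and is not part of scap or any existing tooling. Treating a check that raises an exception as a failure is also an assumption, not something decided in this task.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class GateAction(Enum):
    PROMOTE = "promote"
    ROLLBACK = "rollback"


@dataclass(frozen=True)
class GateResult:
    action: GateAction
    alert: bool   # True when a signal should be raised for investigation
    reason: str


def run_canary_gate(slo_check: Callable[[], bool]) -> GateResult:
    """Decide what to do with a canary deployment based on one SLO check.

    `slo_check` is a hypothetical callable that exercises the critical user
    journeys against the canaries and returns True when the SLOs are met.
    """
    try:
        healthy = slo_check()
    except Exception as exc:
        # Assumption: a check that cannot complete counts as a failed check.
        return GateResult(GateAction.ROLLBACK, True, f"SLO check errored: {exc}")
    if healthy:
        return GateResult(GateAction.PROMOTE, False, "SLO check passed")
    return GateResult(GateAction.ROLLBACK, True, "SLO check failed")
```

In this sketch the caller would act on `result.action` and only notify people when `result.alert` is set, which keeps the decision logic separate from whatever actually performs the promotion or rollback.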

This is work of a slightly different scope than the original task description here, but I think it is directionally aligned.

Yeah, that's what I was thinking this would support. I think that's the environment that should be targeted, and these could run fast against it. Is there a tag that can be added to link the two together so we can keep aligned as work progresses?

Is there a tag that can be added to link the two together so we can keep aligned as work progresses?

We have just been hanging things off of T369112: Pretrain (née Group -1) QTE validation environment for Pretrain so far. I don't know that anyone has thought too much about how we will structure a task tree for the FY27 work yet. @CCiufo-WMF may or may not have opinions.

SLong-WMF renamed this task from [EPIC] Create Critical User Journey automation for CI to [EPIC] Create Automated Supervision test automation for CI. · Apr 9 2026, 9:21 PM
SLong-WMF updated the task description.

I don't have strong opinions. Some teams structure their tasks in Phab very explicitly around annual plan OKRs. I think structuring it around the top level (and non-OKR coded) goal task like we're doing for Pretrain makes sense.

An emerging need in the Pretrain project is an understanding of critical user journeys that we can exercise using httpbb checks during a scap deployment. The initial Pretrain launch will target testwiki only. That environment receives something like 1 request per second of organic traffic. That will not be enough traffic for us to collect significant error signals during the normal in-deploy pause. We would like to add synthetic traffic to increase our chances of noticing a serious regression triggered by the active deployment. The more parts of the MediaWiki deployment that traffic exercises, the better our chances of noticing a defect escape and rolling back to the prior deployment.
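A rough way to see why 1 request per second is too little: if requests fail independently at a fixed rate p, the chance of seeing at least one failure in n requests is 1 - (1 - p)^n. The sketch below computes that. The independence assumption, the 60-second pause, and the 1% failure rate used in the note afterwards are illustrative assumptions, not measured values.

```python
import math


def detection_probability(requests: int, error_rate: float) -> float:
    """Chance of seeing at least one failure in `requests` independent requests."""
    return 1.0 - (1.0 - error_rate) ** requests


def requests_needed(error_rate: float, confidence: float) -> int:
    """Smallest request count whose detection probability reaches `confidence`."""
    if not 0.0 < error_rate < 1.0:
        raise ValueError("error_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Solve 1 - (1 - p)^n >= confidence for n and round up.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - error_rate))
```

Under those assumptions, a hypothetical 60-second pause at 1 request per second gives `detection_probability(60, 0.01)` of roughly 0.45, and `requests_needed(0.01, 0.95)` comes out at 299 requests. Seeing a single failure is also a much weaker signal than a measurable error rate, so this is a lower bound on the traffic that would be needed.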

If the phase of this project that tries to enumerate and prioritize critical user journeys can start "soon" (May 2026?) then we may be able to reuse that list in the initial Pretrain work.

renamed this task from [EPIC] Create Critical User Journey automation for CI to [EPIC] Create Automated Supervision test automation for CI.

For Pretrain, and the follow-on work to generally decrease cycle time between a developer creating a patch and that patch being live on enwiki, we broadly need a Critical User Journey (CUJ) test suite that can be targeted at one or more wikis in a particular deployment group. It would run after we have some Kubernetes pods running the new software and before we decide to promote the deployment to additional wikis. "Automated Supervision" is a term of art that SREs involved in the project are using to describe some collection of expert systems that examine signals like logging output to determine whether we are staying within allowed error budgets or whether some SLO has been violated. A violation would trigger a halt and possible rollback of a deployment while humans are summoned to investigate further.
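As a deliberately simplified illustration of the error budget idea, the sketch below compares an observed error ratio against the allowance implied by an SLO target. The names `BudgetStatus` and `evaluate_error_budget` are hypothetical, and a real supervision system would likely evaluate budgets over time windows and across several signals rather than a single pair of counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetStatus:
    error_ratio: float    # observed errors / total requests
    allowed_ratio: float  # 1 - SLO target
    violated: bool        # True when the observed ratio exceeds the allowance


def evaluate_error_budget(total: int, errors: int, slo_target: float) -> BudgetStatus:
    """Compare an observed error ratio with the allowance implied by an SLO target.

    `slo_target` is the fraction of requests expected to succeed, e.g. 0.999.
    """
    if total <= 0:
        raise ValueError("need at least one observed request")
    if not 0 <= errors <= total:
        raise ValueError("errors must be between 0 and total")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed = 1.0 - slo_target
    observed = errors / total
    return BudgetStatus(observed, allowed, observed > allowed)
```

A `violated` result is the kind of signal that could feed a halt or rollback decision; where the counts come from (logs, metrics, test results) is left open here.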

Having tests that exercise CUJ workflows in CI isolation could be a nice signal for teams working towards merging code, but the big need I see is CUJ testing "live" in production as part of the deployment and operational monitoring tooling we use. Having the ability to target tests at a base URL, which may include non-standard ports and hostnames, would be ideal so that we can reuse the suite at various points in the deployment and operational lifecycle. Being able to divide the tests into things that are quick and contained enough to run in-line with the deployment vs things that require more runtime resources would also be nice.
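One possible shape for this, sketched with the Python standard library only: journeys are declared with a relative path and a tier, and the runner takes the base URL as a parameter so the same list can be pointed at any host and port. The journey list, the tier labels "inline" and "extended", and the function names are all hypothetical; the real journeys are still to be defined. Treating only HTTP 200 as a pass is also a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Callable, Sequence
from urllib.parse import urljoin
from urllib.request import urlopen


@dataclass(frozen=True)
class Journey:
    name: str
    path: str  # relative to the base URL
    tier: str  # hypothetical labels: "inline" (quick) or "extended" (heavier)


# Illustrative placeholders only, not an agreed list of critical user journeys.
JOURNEYS = (
    Journey("view main page", "wiki/Main_Page", "inline"),
    Journey("search", "w/index.php?search=test", "extended"),
)


def http_status(url: str) -> int:
    """Fetch a URL and return its HTTP status code."""
    with urlopen(url, timeout=10) as response:
        return response.status


def run_journeys(
    base_url: str,
    tier: str,
    journeys: Sequence[Journey] = JOURNEYS,
    fetch: Callable[[str], int] = http_status,
) -> dict:
    """Run every journey in `tier` against `base_url`; map journey name to pass/fail."""
    base = base_url.rstrip("/") + "/"  # ensure urljoin appends instead of replacing
    results = {}
    for journey in journeys:
        if journey.tier != tier:
            continue
        try:
            results[journey.name] = fetch(urljoin(base, journey.path)) == 200
        except OSError:
            # urlopen raises OSError subclasses for HTTP errors and network failures.
            results[journey.name] = False
    return results
```

For example, `run_journeys("http://localhost:8080", "inline")` would run only the quick tier against a local port, while a deployment step could pass a canary hostname and the heavier tier could run out of band. The injectable `fetch` parameter is there so the runner can be tested without network access.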