Page MenuHomePhabricator

Better prevent changes in MW core from breaking WMF production extensions
Open, Needs TriagePublic

Description

The problem

Currently, CI does not always prevent changes in MW core from breaking something in an extension, or changes in one extension from breaking something in another extension. This is because our existing mechanisms for cross-repo testing don't always cover these cases:

  • The "shared gate" job runs for "gated" repos. Whenever a change is proposed or +2ed in a gated repo, this job runs the unit tests for all other gated repos.
  • The regular CI job that runs for all repos only runs the tests for explicitly declared dependencies of that repo.

The set of gated repos currently includes MW core, 51 extensions and 2 skins. Because of the shared gate job, it's impossible for a change in any of these repos to break any of the others, and in particular it's impossible for a change in MW core to break one of the gated extensions/skins' PHPUnit or QUnit tests.

However, there are an additional 142 extensions and 6 skins that are used in WMF production but are not in the shared gate. The CI jobs for MW core do not run the tests for these repos, so it's possible for a change in MW core to break one of these repos without CI noticing(*). This happens with some frequency, see e.g T404229: ReadingLists test failures, T375543: CI failing for ReadingLists extension, T400549: CI failure: LogicException: Instances of MediaWiki\HookContainer\HookContainer are not serializable!, T376000: ArticlePlaceholder test fails in CI on empty GrowthExperiments change on master. In fact, there's a whole Phab project dedicated to these problems: ci-test-error (WMF-deployed Build Failure) (**), where 103 tasks were filed and resolved in the past year (though not all of these were caused by this issue). In addition, despite the shared gate job, it's also possible for a change in MW core to break phan in extensions, e.g. T401468: Phan failures in Wikibase after Status and StatusValue documented as generic (templated) classes. All of these things cause confusion among engineers working on the now-broken extensions, and even get in the way of fixing UBNs from time to time.

(*) It's also theoretically possible for a change in one extension to break another, but that's less common, and in cases where that's likely to happen in practice, a dependency relationship is usually already in place. This task focuses on the more common problem of MW core changes breaking deployed extensions/skins.

(**) Most of the open tasks in this project are about flaky Selenium tests. To see more examples of cross-repo unit test or Phan failures, look at resolved tasks in this project.

Why this is hard

Previously, the Release-Engineering-Team tried to fix this problem by adding all production extensions+skins to the shared gate (T249674: Have all Wikimedia production extensions and skins in the CI gate), but partway through that it became clear that that was not sustainable. Right now only about 1/4th of the production extensions+skins are gated, and already the shared gate jobs are pretty slow: ~16 minutes for PHPUnit+QUnit, and 23 minutes for Selenium (these two run in parallel). Adding even more repos to the gate would further slow these jobs, and also make them more fragile. We could try to invent different tools than the current concept of a shared gate, but they would still have to contend with the fact that running tests on 201 repos would be slower and more fragile than running them on the current 53.

We also don't actually have all of these extension installed on every production wiki. There are pairs of extensions that are never installed together, and there are extensions (like CentralAuth) that, when installed, make it impossible to test certain parts of MW core that we do run in production. So just installing all the extensions and running the tests is not necessarily a silver bullet, or even a good idea.

In the meantime, the recent discussions on T403560: ReadingLists tests not run in jenkins wmf-quibble-core-vendor-mysql-php81 and T404058: Add MetricsPlatform to gated extensions have exposed that there's no clear policy for which repos should be gated, and a number of repos that are currently in the gate probably wouldn't be let in if they requested it today. While it may make sense to decline those individual tasks in the current context, they're symptoms of a broader problem that this task attempts to describe.

Event Timeline

Some ideas that may or may not be helpful in trying to think about a better (or less bad) approach to this problem:

  • Run the tests against all deployed extensions as a nightly job, rather than on every change. This wouldn't provide immediate feedback, but we'd still find problems relatively quickly. This wouldn't have to be a single mega-job that runs all tests together either, but could instead run a regular CI job for each extension. Set a norm that MW core changes that break the nightly are reverted. (For bonus points: use git blame's automation features to have the job identify the change that broke it itself)
  • Make "test this change against all other deployed repos" an opt-in thing, invoked through a Gerrit comment like check deployed. That would be legitimately helpful for testing risky changes that the author/reviewer knows are risky (because testing locally against all other repos is hard), which is probably about 50% of the changes that break things. But it wouldn't help with the other 50% where nobody realized that the change could break another repo.
  • Make running the tests against all deployed repos faster by parallelizing them into 3-5 jobs that each install all extensions, but only run the tests for a subset of them. This could help with speed, but would not help with resource consumption
  • Don't run the Selenium tests cross-repo, only run the unit tests. This would help with speed (the shared gate Selenium job is slower than the shared gate unit test job) and reliability (most reports of flakiness are in the Selenium tests rather than the unit tests), but it would reduce our ability to find problems.
  • Analyze how much longer it would actually take to run the tests for most of the non-gated extensions, to quantify how much slower the gate would become.
  • Analyze how much each of the current gated extensions contributes to the running time of the gate, to quantify how much of it "death by a thousand cuts" vs a few slow extensions/tests. Then consider skipping slow tests (only in the gate, not when running CI on the extension itself).
  • Add a shared gate job for phan, to prevent MW core changes from breaking phan for gated extensions at least
  • Add a shared gate job that runs the unit tests for gated repos + CentralAuth (in addition to the existing job without CentralAuth). This would solve the "CentralAuth overrides too much MW core functionality" problem, but it would also double resource consumption.

A nightly job for all deployed extensions seems useful and check deployed sounds good to have also.

Another option could be to specify reverse dependencies for an extension (similar to dependencies.yaml or perhaps add a reverse_trigger flag when listing the dependencies:

ReadingLists:
  recurse: false
  dependencies:
    - name: BetaFeatures
      reverse_trigger: true

and/or we could have a reverse_dependencies.yaml that defines things that should be tested when a change is merged:

mediawiki/core:
  - ReadingLists

mediawiki/skins/MinervaNeue:
  - ReadingLists

mediawiki/skins/Vector:
  - ReadingLists

mediawiki/extensions/BetaFeatures:
  - ReadingLists

We could set this up as an experimental job and start with ReadingLists. If this works. well, then gradually add other production-deployed extensions that are not part of gate-and-submit and have this be a post-merge job or blocking.

Change #1268556 had a related patch set uploaded (by Aude; author: Aude):

[integration/config@master] POC - quibble-with-reverse-dependencies-vendor-mysql-php83 job

https://gerrit.wikimedia.org/r/1268556

I put together a proof of concept for having a reverse_dependencies job

https://gerrit.wikimedia.org/r/c/integration/config/+/1268556

I tested this flow locally, with generating the Jenkins config.xml for the new job and then ran the tests with quibble locally and verify phpunit tests for Vector + ReadingLists run as expected and pass.

For now, it runs only phpunit tests but could be expanded to run node/npm and other tests to give broader coverage.

A post-merge job doesn't meet the fundamental requirement of preventing changes landing that will break other things, it just creates an alert that we hope people will notice, care about, and react to. Is that sufficient, at least as a starting point?

I think post-merge or non-voting would be okay as a starting point, with the alerting.

Once we demonstrate the approach works well, we can have it be blocking similar to gate-and-submit.

Thanks for bringing this to my attention. From the QS side we've been doing a few things (and are planning on doing) that might help:

First off: I created a dashboard that tries to collect a ton of "quality" related metrics and centralize them. https://mw-qa-dashboard.toolforge.org/#/web

Additionally there are a few tickets my team is expecting to take on that should help provide some guidance:

Automated Supervision tests - Creating a suite of tests that run on deploy to ensure major functionality (to be defined) is functional. Likely high level UI tests (but few in number) that cover major pieces of functionality (read an article, edit an article, pictures display, etc). Very unlikely to break but also very impactful if it does.

Move tests from beta cluster to catalyst - Catalyst should be a more stable and efficient place to run UI automation, so the idea behind this epic is to start transitioning more tests into daily runs against a Catalyst environment set up for more isolated testing.

Neither of these are the level of granularity described here (I think that the above solution of running PHPUnit tests nightly to try to catch regressions is a fine start, although I am curious how many of those unit tests would catch useful failures and how much would be noise due to them actually being integration / end to end tests).

What can I do here to better help and support the work?

Ideally the jobs could be voting and not cause noise. It would be a lightweight alternative to gate-and-submit which tests everything for every patch.

As a start, I think it could be done as experimental and non-voting with a small set of extensions like ReadingLists.

Maybe once the tests for a particular extension are working well, then can be made voting while keep others experimental/non-voting.

Having daily selenium (or playwright etc) tests would also be great, running against catalyst and high-level UI tests are also great.

The purpose of this task is that we want to catch when the phpunit/integration tests break in extensions like ReadingLists because of a change in MediaWiki core. We don't care to have Wikibase etc. tests run for ReadingLists patches and vice versa, but only the things that are actually related or would impact ReadingLists.

Having daily selenium (or playwright etc) tests would also be great, running against catalyst and high-level UI tests are also great.

The purpose of this task is that we want to catch when the phpunit/integration tests break in extensions like ReadingLists because of a change in MediaWiki core. We don't care to have Wikibase etc. tests run for ReadingLists patches and vice versa, but only the things that are actually related or would impact ReadingLists.

This makes sense to me. My team is taking on crafting better test automation over the next quarter as part of our work so we can associate this with T422555, which will track the broader work adding critical user journeys.