The problem
Currently, CI does not always prevent changes in MW core from breaking something in an extension, or changes in one extension from breaking something in another extension. This is because our existing mechanisms for cross-repo testing don't always cover these cases:
- The "shared gate" job runs for "gated" repos. Whenever a change is proposed or +2ed in a gated repo, this job runs the unit tests for all other gated repos.
- The regular CI job that runs for all repos only runs the tests for explicitly declared dependencies of that repo.
The set of gated repos currently includes MW core, 51 extensions and 2 skins. Because of the shared gate job, it's impossible for a change in any of these repos to break any of the others, and in particular it's impossible for a change in MW core to break one of the gated extensions/skins' PHPUnit or QUnit tests.
However, there are an additional 142 extensions and 6 skins that are used in WMF production but are not in the shared gate. The CI jobs for MW core do not run the tests for these repos, so it's possible for a change in MW core to break one of these repos without CI noticing(*). This happens with some frequency, see e.g T404229: ReadingLists test failures, T375543: CI failing for ReadingLists extension, T400549: CI failure: LogicException: Instances of MediaWiki\HookContainer\HookContainer are not serializable!, T376000: ArticlePlaceholder test fails in CI on empty GrowthExperiments change on master. In fact, there's a whole Phab project dedicated to these problems: ci-test-error (WMF-deployed Build Failure) (**), where 103 tasks were filed and resolved in the past year (though not all of these were caused by this issue). In addition, despite the shared gate job, it's also possible for a change in MW core to break phan in extensions, e.g. T401468: Phan failures in Wikibase after Status and StatusValue documented as generic (templated) classes. All of these things cause confusion among engineers working on the now-broken extensions, and even get in the way of fixing UBNs from time to time.
(*) It's also theoretically possible for a change in one extension to break another, but that's less common, and in cases where that's likely to happen in practice, a dependency relationship is usually already in place. This task focuses on the more common problem of MW core changes breaking deployed extensions/skins.
(**) Most of the open tasks in this project are about flaky Selenium tests. To see more examples of cross-repo unit test or Phan failures, look at resolved tasks in this project.
Why this is hard
Previously, the Release-Engineering-Team tried to fix this problem by adding all production extensions+skins to the shared gate (T249674: Have all Wikimedia production extensions and skins in the CI gate), but partway through that it became clear that that was not sustainable. Right now only about 1/4th of the production extensions+skins are gated, and already the shared gate jobs are pretty slow: ~16 minutes for PHPUnit+QUnit, and 23 minutes for Selenium (these two run in parallel). Adding even more repos to the gate would further slow these jobs, and also make them more fragile. We could try to invent different tools than the current concept of a shared gate, but they would still have to contend with the fact that running tests on 201 repos would be slower and more fragile than running them on the current 53.
We also don't actually have all of these extension installed on every production wiki. There are pairs of extensions that are never installed together, and there are extensions (like CentralAuth) that, when installed, make it impossible to test certain parts of MW core that we do run in production. So just installing all the extensions and running the tests is not necessarily a silver bullet, or even a good idea.
In the meantime, the recent discussions on T403560: ReadingLists tests not run in jenkins wmf-quibble-core-vendor-mysql-php81 and T404058: Add MetricsPlatform to gated extensions have exposed that there's no clear policy for which repos should be gated, and a number of repos that are currently in the gate probably wouldn't be let in if they requested it today. While it may make sense to decline those individual tasks in the current context, they're symptoms of a broader problem that this task attempts to describe.