Based on past incidents caused by deployments consisting of multiple scap sync-file commands, I'd like to propose that effective immediately we no longer allow SWAT deployment of patches in multiple syncs.
Note: This isn't as extreme as it may sound. Allow me to explain :-)
- When we have a MediaWiki patch that touches multiple files in an extension, we typically deploy that through a single command, using scap sync-file php-$version/extensions/Example. This is fine.
- When we have a MediaWiki patch that touches multiple files in core that have a not-too-big common parent directory, we typically deploy that through a single command, using scap sync-file php-$version/resources/src/example. This is fine.
- When we have a config patch that touches a single file in wmf-config, we typically deploy that through a single command, using scap sync-file wmf-config/example.php. This is fine.
These are fine, because we test them in a way that is representative of the state we will create in production. Here is why:
Config patches in Gerrit have Jenkins tests run before merging. These tests do more than validate basic PHP syntax. They also validate run-time requirements, such as referenced classes, functions, variables and dblists existing. We actually have tests for this, and they work!
However, if we deploy a patch in two steps, then the tests were not representative of the state between those deployments. If a patch is intended to be deployed in two steps, I propose the patch MUST be rejected by the SWAT team and be split up first. That way, there is a one-to-one mapping between patches in Gerrit and actual deployments.
Doing so eliminates the entire class of outages caused by problems we have tests for.
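To make the "state between deployments" concrete, here is a minimal sketch in Python. It is only a toy model, not how Jenkins or Scap work internally: the file names a.php and b.php, the variable names, and both helper functions are hypothetical. Each file is reduced to the symbols it defines and the symbols it references, and a state counts as broken if something is referenced but defined nowhere.

```python
# Toy model (hypothetical): each file maps to (symbols it defines, symbols it references).

def missing_references(state):
    """Return the symbols that are referenced somewhere but defined nowhere."""
    defined = set()
    referenced = set()
    for defines, references in state.values():
        defined |= set(defines)
        referenced |= set(references)
    return referenced - defined


def apply_sync(state, patch, paths):
    """Return a new state in which only `paths` from `patch` have been deployed."""
    new_state = dict(state)
    for path in paths:
        new_state[path] = patch[path]
    return new_state


# What is live before the deployment (hypothetical files and variables).
production = {
    "wmf-config/a.php": (["wgExampleOld"], []),
    "wmf-config/b.php": ([], ["wgExampleOld"]),
}

# One Gerrit patch that renames the variable in both files.
patch = {
    "wmf-config/a.php": (["wgExampleNew"], []),
    "wmf-config/b.php": ([], ["wgExampleNew"]),
}

# The state the tests saw: the whole patch applied at once.
tested = apply_sync(production, patch, list(patch))

# The state after only the first of two syncs: b.php is new, a.php is still old.
in_between = apply_sync(production, patch, ["wmf-config/b.php"])

print(missing_references(tested))      # set()
print(missing_references(in_between))  # {'wgExampleNew'}
```

The tested state is consistent, but the in-between state references a variable that nothing defines yet. No test ever ran against that in-between state, which is the gap the one-to-one mapping closes.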
In addition to Jenkins tests, we have one more step before a patch is applied to servers with production traffic. Namely, the mwdebug sync. This environment is intended to be identical to a production web server. And for all intents and purposes, it is. However, we do not deploy code to mwdebug servers the same way we deploy code to production web servers. Scap does not support sync-file to a single server. Instead, we use scap pull from the mwdebug server which synchronises all code bases in their entirety.
Testing on mwdebug servers is not representative for patches that will be deployed in multiple steps. The state in-between is not validated on mwdebug. This means that where Jenkins test coverage is insufficient, and/or Scap is unable to detect the outage on canaries, our manual verification step is not meaningful either, unless we forbid partial deployment of patches.
I assume that we are not yet ready to require full scaps for all patch deployments. As such, I propose that we at least require all patches to be deployed with a single command. This can be sync-file or full scap.
This means that a patch changing multiple files that depend on each other, and that cannot be synced with a single command, must be split into separate patches, where each stage is valid and can be verified by Jenkins and through mwdebug.
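The single-command rule could be checked mechanically. The sketch below, again in Python, is a hypothetical helper and not part of Scap: given the files a patch touches, it returns the one path that a single sync-file would need to cover, or None when the files share no parent directory. It does not try to judge whether a common parent is "not too big"; that remains a human call.

```python
import posixpath


def single_sync_target(changed_files):
    """Return one path covering every changed file, or None if there is none.

    Hypothetical helper: None means the patch needs a full scap or a split.
    """
    if not changed_files:
        raise ValueError("patch changes no files")
    if len(changed_files) == 1:
        # A single file can be synced directly.
        return changed_files[0]
    # commonpath returns "" when relative paths share no leading directory.
    common = posixpath.commonpath(changed_files)
    return common or None


# Example file names under the paths used earlier are made up for illustration.
print(single_sync_target([
    "php-$version/extensions/Example/includes/Foo.php",
    "php-$version/extensions/Example/extension.json",
]))  # php-$version/extensions/Example

print(single_sync_target(["wmf-config/example.php"]))  # wmf-config/example.php

print(single_sync_target([
    "wmf-config/example.php",
    "php-$version/resources/src/example/a.js",
]))  # None
```

The last case is the one this proposal targets: a patch spanning wmf-config and a MediaWiki branch has no single sync-file path, so it would either go out as a full scap or be split into patches that are each valid on their own.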