Page MenuHomePhabricator

Proposal: Effective immediately, disallow multi-sync patch deployment
Closed, ResolvedPublic

Description

Based on past incidents caused by deployments consisting of multiple scap sync-file commands, I'd like to propose that effective immediately we no longer allow SWAT deployment of patches in multiple syncs.

Note: This isn't as extreme as it may sound, allow me to explain :-)

Preamble
  • When we have a MediaWiki patch that touches multiple files in an extension, we typically deploy that through a single command, using scap sync-file php-$version/extensions/Example. This is fine.
  • When we have a MediaWiki patch that touches multiple files in core that have a not-too-big common parent directory, we typically deploy that through a single command, using scap sync-file php-$version/resources/src/example. This is fine.
  • When we have a config patch that touches a single file in wmf-config, we typically deploy that through a single command, using scap sync-file wmf-config/example.php. This is fine.

These are fine, because we test them in a way that is representative of the state we will create in production. Here is why:

Amble

Config patches in Gerrit have Jenkins tests run before merging. These tests do more than validate basic PHP syntax. They also validate run-time requirements, such as referenced classes, functions, variables and dblists existing. We actually have tests for this, and they work!

However, if we deploy a patch in two steps, then the tests were not representative of the state between those deployments. If a patch is intended to be deployed in two steps, I propose the patch MUST be rejected by SWAT team and be split up first. That way, there is a one-to-one mapping between patches in Gerrit and actual deployments.

Doing so eliminates the entire concept of outages caused by problems we have tests for.

In addition to Jenkins tests, we have one more step before a patch is applied to servers with production traffic. Namely, the mwdebug sync. This environment is intended to be identical to a production web server. And for all intends and purposes, it is. However, we do not deploy code to mwdebug servers the same way we deploy code to production web servers. Scap does not support sync-file to a single server. Instead, we use scap pull from the mwdebug server which synchronises all code bases in their entirety.

Testing on mwdebug servers is unrepresentative for patches that will be deployed in multiple steps. The state in-between is not validated on mwdebug. This means that even if Jenkins test coverage is insufficient, and/or if Scap is unable to detect the outage on canaries, our manual verification step is also not meaningfull unless we forbid partial deployment of patches.

Proposal

I assume that we are not yet ready to require full scaps for all patch deployments. As such, I propose that at least require all patches to be deployed with a single command. This can be sync-file or full scap.

This means that a patch changing multiple files that depend on each other, must be split. E.g. separate patches where each stage is valid and can be verified by Jenkins, and through mwdebug.

Event Timeline

Krinkle renamed this task from Proposal: Effective immediately, disallow multi-file or multi-dir patch deployment to Proposal: Effective immediately, disallow multi-sync patch deployment.Feb 20 2018, 4:17 AM

(removed sync-dir references in description since it's just a deprecated back-compat alias to sync-file)

I agree with this proposal. In fact, let's just go ahead and make this Official™ -- all we'd need to do is make some doc changes and announce it.

The proposal I'm also pushing for extension deploys also helps avoid these race conditions (409750) when moving extensions from beta->prod.

How would one deploy 421093?

It changes static/images/project-logos/ and wmf-config/InitialiseSettings.php.

Like this?

zfilipin@tin:/srv/mediawiki-staging$ scap sync-file . 'message'

How would one deploy 421093?

It changes static/images/project-logos/ and wmf-config/InitialiseSettings.php.

Like this?

zfilipin@tin:/srv/mediawiki-staging$ scap sync-file . 'message'

Per IRC: no, you would not do that. That would trigger a lint on all files only to sync 4. The "correct" way (this bug notwithstanding):

scap sync-file static/images/project-logos/
scap sync-file wmf-config/InitialiseSettings.php

@zeljkofilipin In my opinion, the correct way to deploy that would indeed be an (inefficient) full sync as suggested.

Although given it is inefficient, it would make more sense to deploy it separately as @demon shows. The problem is, getting the order wrong can break things, and should not be a risk you as deployer should have to take.

Instead, this task mandates that you, as SWAT deployer, must refuse to deploy unsafe commits like this. Instead, the commit should be split in two. The order will naturally be enforced by the git-commit parent relationship (or manually, with Depends-On).

  1. Commit A: Add files to static/images/project-logos/. Merge, verify on mwdebug, deploy with minimal scap sync-file.
  2. Commit B (depends on A): Add reference to file in InitialiseSettings.php. Merge, verify on mwdebug, deploy with minimal scap sync-file.
greg triaged this task as Medium priority.Apr 23 2018, 4:58 PM
greg subscribed.

+1 let's document this...

For those not very tech-savy and whose language is not English, could you please provide an example of a patch that falls into this category (now forbidden) and another example of an allowed patch? Thanks.

@MarcoAurelio Okay. Here is a simpler version of "what" and "how", without "why".

It is forbidden to deploy a Gerrit change for wmf-config that creates something new and also uses it. It is also forbidden to deploy a Gerrit change that removes something and removes its use at the same time.

Example of bad changes:

  • Creating a new dblist (example.dblist) and changing PHP configuration to use 'wgUseExample' => [ 'example' => true ].
  • Removing a dblist (example.dblist) and also removing PHP configuration for 'wgUseExample' => [ 'example' => true ].
  • Adding a new static image (project-logos, favicon, ..) and changing PHP configuration to enable the file somewhere.

We ask the change author (you) to split the change. Example of good changes:

  • Gerrit change 1: Create new dblist. Gerrit change 2: Set wgUseExample.
  • Gerrit change 1: Remove wgUseExample. Gerrit change 2: Remove dblist.
  • Gerrit change 1: Add new static image. Gerrit change 2: Update config to use the image.

Both changes must be added to Deployments page for SWAT, changes will be verified (on mwdebug) and deployed one before the other.

Okay. Thanks for explaining @Krinkle

Does that mean that I have to squash https://gerrit.wikimedia.org/r/#/c/429385/?

Isn't this quite in contradiction to the new rule that forbids us to schedule more than 6 patches for SWAT windows (instead of 8?)

We have to submit more patches for the same changes but with less patches allocated for every SWAT window :-/

Okay. Thanks for explaining @Krinkle

Does that mean that I have to squash https://gerrit.wikimedia.org/r/#/c/429385/?

It seems this commit does not create a new dependency on dblist contents. As far as I know, if those config changes would apply first (without the dblist change), there would only be an unrecognised wiki ID for a few config variables in InitialiseSettings.php which is fine, and doesn't affect other wikis.

About the images, those should in a separate commit, deployed first. That would be consistent and always safe.

If we look more closely at this very specific commit for id.wikimedia.org, I think it would be fine to deploy at once. But that's because I'm spending 15 minutes looking very closely at some details. Namely: 1) It's a new wiki, so unlikely someone would mind it being broken for a few minutes. 2) It's a new image file, which means worse-case scenario if it gets requested too early, it will get 404 Not Found, which we cache in Varnish for 5 minutes and then automatically purge. 3) The image file is only used client-side, not server-side, so there would be no chance of cascading failure into other wikis.

But, this is the equivalent of walking a blurry line between safe and unsafe and asking for case-by-case exceptions, which we can't do for every patch, and might cause confusion if next time you also think it's safe for similar commits that are actually different in some way.

Isn't this quite in contradiction to the new rule that forbids us to schedule more than 6 patches for SWAT windows (instead of 8?)

I can see how it would seem that way, but as I understand it these are unrelated concerns. I can't find the reason right now, but from my memory, I believe the reason was that there was not enough time to do 8 patches in 1 hour (based on time to Jenkins test, Git merge, debug staging, debug verification, and actual scap deploy). It was shortened to avoid increasing the burden on SWAT deployers. I guess the conversation about this will continue at T193475.

Creating a new wiki need a full Scap in any case as well IIRC, so I am inclined to keep the patch as-is in this case (I think they want to avoid full Scaps after all during SWATs right? But this is not SWATable). The discussion it seems it won't happen.

I think they want to avoid full Scaps after all during SWATs right?

Nope, that's never been the policy. And considering we've (very) slowly been moving towards full scap every time, it's even less true :)