Implement MediaWiki pre-promote checks
Open, NormalPublic

Description

Problem

When deploying files in /srv/mediawiki to production, errors can be introduced due to any number of reasons. It is not feasible to catch all possible error scenarios in unit testing, integration testing, beta testing, or even mwdebug testing.

As final layer of defence, scap should perform a handful of pre-promote checks that must pass without which a deployment must not be allowed to proceed to production servers.

Existing work

We have php syntax linting as part of the merge pipeline in mediawiki/core, in extension repos, and in the operations/mediawiki-config repo. Yet, due to various reasons, a syntax error can still enter production. Due to differences in PHP versions, or due to the linter being invoked on a subdirectory and not all files, or due to the linter ignoring symlinks or otherwise files being excluded. There are also private files in the deployment host. As final check, Scap already performs one additional lint check.

We also have beta cluster, which should in theory catch all run-time issues aside from production-specific configuration overrides and private files. Although in practice:

  1. Don't check health of beta cluster during deployment.
  2. Aside from configuration, Beta Cluster runs on a different branch (master vs wmf). Issues specific to backports would not be detected there. And even if the final state after a deployment were the same on Beta and in production, for Beta we typically merge and deploy multiple commits at once. Whereas in production we deploy 1 commit and 1 file/directory at once, which means there is another area of things to go wrong. And the below list of incidents indicates that this is quite a common source of errors (whether to split a deploy, and if so, what the correct order is).

We also have debug servers, which in theory should catch all run-time issues, including anything production specific. Although in practice:

  1. Staging of changes on mwdebug happens through scap pull, which updates the entire deployment in one (nearly) atomic step. Whereas for production, each file and directory is synced separately. The same common source of errors mentioned above (Beta cluster, #2) also is not caught on mwdebug servers.
  2. Staging on mwdebug uses scap pull, which does not rebuild l10n locally - localisation-related issues are thus often missed on mwdebug servers.

We also have canary servers, which have been introduced after the creation of this task. Files are deployed there before the rest of production. The method of deployment and content matches 100%. However, the method of monitoring is insufficient. We currently monitor the canary servers through logstash, by querying various channels and do a before/after comparison. In theory this should catch all significant issues affecting production. Both that affect all requests, and even those that affect a large number requests. Although in practice, we still have incidents with fatal error or php notice affecting all pages on all wikis.

Various reasons at different times:

  • Due to php notices not being included in the logstash query.
  • Due to fatal errors not being logged.
  • Due to fatal errors not being included correctly in the logstash query.
  • Due to the before/after comparison not working as expected.

Solution

The idea is to write a simple maintenance script that simulates a small number of requests. We'd require that script to pass without any errors (notices/warnings) or exceptions from PHP.

It could be run on tin, e.g. sync to local /srv/mediawiki on tin first, run the script and then continue sync if successful.

Benefits

Eliminate a large class of errors:

  • Subtle errors in functional syntax. Not strict parse error, but fatal on runtime, such as the infamous arrray() typo.
  • PHP notices or warnings affecting all views (e.g. a mistyped variable in wmf-config, or in global MediaWiki code).
  • PHP fatal exceptions that happen on all views.
  • Any of the above, as result of files syncs not being split, or being synced in the wrong order.

The core idea was previously implemented in 2013 at https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/blob/f52b13b1/sanityCheck.php, we can further develop it by also catching warnings and exceptions.

A basic set of checks, run once, against the staging server (e.g. tin) would catch 99% of cases where a PHP notice or fatal error happens on a common web request.

Implementation ideas

  1. MediaWiki maintenance script: Simple no-op.
    • Invoke echo 1 | mwscript eval.php --wiki enwiki;, this would already exercise all wmf-config code, which is the most common source of errors. Run-time errors from MediaWiki core or extension code is rare, especially given Jenkins now catches these pre-merge.
    • Stderr will contain PHP notices, warnings, fatals, and MediaWiki warnings/errors.
  2. MediaWiki maintenance script: Render enwiki Main Page
    • Like the old sanityCheck.php script, we could instantiate MediaWiki programmatically to execute one ore more common actions. E.g. render Main Page view, Main Page history, and a basic Api query with FauxRequest.
  3. HTTP-based
    • If we can make sure Apache is installed and working on the deployment host, we can do a few simple HTTP requests as well. E.g. implemented as a python plugin for scap, that makes a set of HTTP requests to the local deployment host and verify the response. This has the benefit of catching all fatal errors no matter whether they come from apache/hhvm/mediawiki by simply checking the http status code and response body (unlike T154646, T142784 - which missed an obvious HTTP 5xx error no all pages).

The downside of the HTTP-based approach is that it wouldn't catch php notices/warnings that don't make the response fail, but would certainly flood the logs (unless we tail the local logs somehow).

My recommendation would be to start with point 1. And then in addition maybe point 2 or point 3.

Preventable incidents

  • 20151005-MediaWiki - Fatal exception everywhere due to undefined function - aka arrray() error in wmf-config
  • 20151026-MediaWiki - Fatal exception, from MediaWiki/ResourceLoader.
  • 20160212-MediaWiki - Fatal exception from PHP (file missing, synced in the wrong order)
  • 20160222-MediaWiki - HTTP 404 Not Found everywhere due to accidental removal of mediawiki docroot symlink
  • 20160407-MediaWiki - fatal due to invalid PHP expression caused by a bad config change
  • 20160601-MediaWiki - Fatal exception everywhere due to typo in extension name being loaded.
  • 20160713-MediaWiki - Fatal exception on certain (not all) page actions due to misnamed PHP file inclusion.
  • 20170104-MediaWiki - Fatal exception everywhere due to invalid return value from a callback in wmf-config.
  • 20170111-MediaWiki - Fatal exception from multiversion.
  • 20170124-MediaWiki - Fatal exception on all wikis, from WikibaseClient.
  • 20180129-MediaWiki - Fatal exception everywhere, from wmf-config.

See also

Related Objects

Krinkle created this task.Dec 15 2015, 11:31 PM
Krinkle updated the task description. (Show Details)
Krinkle raised the priority of this task from to Needs Triage.
Krinkle added projects: Deployments, Scap.
Krinkle added a subscriber: Krinkle.
Restricted Application added subscribers: StudiesWorld, Aklapper. · View Herald TranscriptDec 15 2015, 11:31 PM
greg edited projects, added scap2; removed Deployments.Feb 9 2016, 11:33 PM
dduvall renamed this task from Require sanity test to pass before syncing files to all web servers to Implement MediaWiki pre-promote checks.Feb 12 2016, 7:42 PM
dduvall triaged this task as Normal priority.
dduvall set Security to None.
dduvall moved this task from Needs triage to MediaWiki MVP on the Scap board.
Krinkle added a comment.EditedFeb 22 2016, 9:37 PM

Added some incidents to task description that a scap with sanity-check would've prevented.

mmodell edited projects, added Scap (Scap3-MediaWiki-MVP); removed Scap.
MaxSem added a subscriber: MaxSem.Jun 2 2016, 11:11 PM
Krinkle updated the task description. (Show Details)Jul 6 2016, 11:46 PM
Krinkle updated the task description. (Show Details)Jul 8 2016, 1:15 AM
greg added a subscriber: greg.Sep 29 2016, 7:41 PM

This follow-up task from an incident report has not been updated recently. If it is no longer valid, please add a comment explaining why. If it is still valid, please prioritize it appropriately relative to your other work. If you have any questions, feel free to ask me (Greg Grossmeier).

Krinkle updated the task description. (Show Details)Jan 6 2017, 10:09 PM
Krinkle updated the task description. (Show Details)Jan 31 2018, 1:04 AM
Krinkle updated the task description. (Show Details)Jan 31 2018, 1:37 AM
Krinkle updated the task description. (Show Details)Jan 31 2018, 2:10 AM
thcipriani closed this task as Resolved.May 8 2018, 7:58 PM
thcipriani claimed this task.
thcipriani added a subscriber: thcipriani.

Now that Scap 3.8.1-1 is out we are checking the swagger spec at https://en.wikipedia.org/spec.yaml via service-checker-swagger against the canary servers before deploying more widely. The current checks are very wikipedia-centric and should be expanded on; however, this change should provide a good starting point and a good-enough to call this task complete.

MaxSem added a comment.May 8 2018, 8:04 PM

https://en.wikipedia.org/spec.yaml

Already out of date: version: 1.31.0 :)

https://en.wikipedia.org/spec.yaml

Already out of date: version: 1.31.0 :)

good looking out :P

I guess the service checker isn't picky about versions, so that's good.

For a quick demo of what scap is doing, effectively:

#!/usr/bin/env bash                                                                                                                                   

while read host; do
        service-checker-swagger "$host" http://en.wikipedia.org -s "/spec.yaml" &
done < <(cat /etc/dsh/group/mediawiki-api-canaries \
        /etc/dsh/group/mediawiki-appserver-canaries | \
        grep '^mw')

wait

Running that on tin you get:

All endpoints are healthy                                                                                                                             
All endpoints are healthy                                                                                                                             
All endpoints are healthy
All endpoints are healthy
All endpoints are healthy                                                                                                                             
All endpoints are healthy                                                                                                                             
All endpoints are healthy                                                                                                                             
All endpoints are healthy                                                                                                                             
All endpoints are healthy                                                                                                                             
All endpoints are healthy                                                                                                                             
All endpoints are healthy

@thcipriani Does this mean that writing a patch replacing /w/index.php in operations/mediawiki-config with die(1); can be synced with scap sync-file in a way that (once that command finishes), no production traffic remains broken?

In other words, does it run the spec against localhost on the deployment host before syncing to canaries, or it somehow runs the spec on canaries without applying to production traffic, or it somehow undoes the patch on canaries?

For the completion of this task, I would say that (while suboptimal) it would be fine if it depends on breaking the canary traffic for a few seconds. We can obviously avoid that if we ran it on the deployment host (or a debug host) instead, but we can do that later.

But if we still leave the canaries in a broken state, requiring manual reverting and re-deploying of a patch, then the task should be re-opened.

thcipriani reopened this task as Open.May 10 2018, 3:02 PM
thcipriani removed thcipriani as the assignee of this task.

In other words, does it run the spec against localhost on the deployment host before syncing to canaries, or it somehow runs the spec on canaries without applying to production traffic, or it somehow undoes the patch on canaries?

Apache on the deployment hosts isn't configured as a MediaWiki host, so we'd have to update puppet config in order to run these checks against localhost.

This is a check against canary hosts and requires a manual revert.

Thanks for re-opening.

I'm a fan of Scap's logstash checks and catching any size regression that affects real requests, including cases we might not think to test ourselves. There's a lot of potential on top of that, as well. The set of rules can become smarter and stricter over time. I don't know if is planned to also prevent fatal errors, though. It's possible, but might require a few things:

  • Creation of a canary pool. (done)
  • Implement logstash query and check. (done, but at-risk: data sources and channels evolve; T154646, T162974, and soon https://gerrit.wikimedia.org/r/338911)
  • Ability to reliably compare before/after state from logstash. (done, but has issues, T183999; it's difficult to tolerate random fatals and also zero-tolerate new ones)
  • Ability to repool/depool. (done, but not yet used after detecting regressions in the canary pool)
  • Ability to rollback automatically. (todo, and difficult in current setup)
  • Ability to run basic tests without affecting production traffic (todo, e.g. against debug host, or local CLI)

It'd be awesome to have all this, but given its difficulties and risks (variable traffic, logstash query, comparison) the swagger specs would better fit the goal of prevention (not detection). But as @thcipriani points out, that requires MediaWiki app server roles on the deploy host. I don't know if that's a problem, but if so, it might be possible to first use an mwdebug server. We'll also need Scap to fetch error logs from Logstash after the swagger checks, and assert there were no warning/errors/fatals in any hhvm/mediawiki channel. (The urls in our swagger spec are known to pass without errors, enforced in Jenkins per T50002).

A shell out for echo 1 | mwscript eval.php that catches fatals and non-empty php-stderr on the deploy host (without app-server roles or Logstash queries) could've prevented "all-wiki" fatals and logspam incidents without affecting traffic (incl canaries).

Whether with mwscript, or with local swagger specs and a logstash query, or something else – I'm hoping we can soon prevent fatal errors :)

Krinkle added a comment.EditedSep 20 2018, 8:12 PM

Ping. The above one-liner would've prevented the following from affecting production traffic to canary hosts today.

https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/461707/
> PHP Notice: Undefined variable: wgSiteMatrixNonGlobalSites

This affected all requests/jobs on canaries. This is low-hanging fruit that can be easily and reliably caught without affecting production first (canaries are production).

The logstash-based queries should imho be considered as helping catch less common errors (errors affecting a subset of users that we can't anticipate through hardcoded pre-promote checks). It is not reliable as first layer of defence because it is inherently fragile (T204907, T200960, T183999, T154646, T142784, T165675), is not automatically rolled back, and we shouldn't use production to check for errors that affect every request.

Alternatively, if we prefer HTTP instead of CLI, we can do something like this:

  • Have swagger checks set a Header that wmf-config will use to turn all PHP errors into fatals.
  • Run swagger checks before serving production traffic (e.g. from deployment host or mwdebug).

But either way, I'd very much like to achieve what we set out in 2015, which is that "obvious errors that affect all requests to all wikis, must be impossible to deploy through normal means in a way that breaks production traffic".

Running as much as echo 1 | mwscript eval.php on the deployment host today would achieve that in a reliable way without dependencies. Running swagger checks (on non-prod hosts) with a way to make errors emit HTTP errors would also work but is more work to setup.

I'll take a stab at calling echo 1 | mwscript eval.php in scap.

This is now deployed to production with scap 3.8.7-1!

scap-book

Mentioned in SAL (#wikimedia-releng) [2018-10-09T18:19:40Z] <Krinkle> Messing with scap in beta to test T121597 / D1114

Looks like it might not be working as intended.

Here's what I did:

  • Disabled beta-scap-eqiad.
  • ssh deployment-deploy01, sudo -iu jenkins-deploy, cd /srv/mediawiki-staging
  • (jenkins-deploy) vim wmf-config/CommonSettings.php # add an empty if-block on line 2 that evaluates an undefined variable, like if ($myUndefined) {}.
  • (jenkin-deploy) scap sync-file wmf-config/CommonSettings.php 'this should fail'
attempt 1
jenkins-deploy@deployment-deploy01:/srv/mediawiki-staging$ vim wmf-config/CommonSettings.php     
jenkins-deploy@deployment-deploy01:/srv/mediawiki-staging$ scap sync-file wmf-config/CommonSettings.php 'this should fail'
           ___ ____
         ⎛   ⎛ ,----
          \  //==--'
     _//|,.·//==--'    ____________________________
    _OO≣=-  ︶ ᴹw ⎞_§ ______  ___\ ___\ ,\__ \/ __ \
   (∞)_, )  (     |  ______/__  \/ /__ / /_/ / /_/ /
     ¨--¨|| |- (  / ______\____/ \___/ \__^_/  .__/
         ««_/  «_/ jgs/bd808                /_/

18:35:40 Checking for new runtime errors locally
18:35:42 Copying to deployment-deploy01.deployment-prep.eqiad.wmflabs from deployment-deploy01.deployment-prep.eqiad.wmflabs
18:35:42 Started rsync common
18:35:42 Finished rsync common (duration: 00m 00s)
18:35:42 Started cache_git_info
18:36:28 Finished cache_git_info (duration: 00m 45s)
18:36:28 Started sync-masters
sync-masters: 100% (ok: 1; fail: 0; left: 0)                                    
18:36:37 Finished sync-masters (duration: 00m 08s)
18:36:37 Started sync-pull-masters
sync-pull-masters: 100% (ok: 1; fail: 0; left: 0)                               
18:36:39 Finished sync-pull-masters (duration: 00m 02s)
18:36:39 Started sync-check-canaries
check-canaries: 100% (ok: 1; fail: 0; left: 0)                                  
18:36:44 Finished Canaries Synced (duration: 00m 04s)
18:36:44 Executing check 'Check endpoints for deployment-mediawiki-07.deployment-prep.eqiad.wmflabs'
18:36:46 Finished Canary Endpoint Check Complete (duration: 00m 01s)
18:36:46 Waiting for canary traffic...
18:37:04 Executing check 'Logstash Error rate for deployment-mediawiki-07.deployment-prep.eqiad.wmflabs'
18:37:04 Finished sync-check-canaries (duration: 00m 24s)
18:37:04 Started sync-proxies
18:37:04 Job ['/usr/bin/scap', 'pull', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/CommonSettings.php', 'deployment-deploy01.deployment-prep.eqiad.wmflabs', 'deployment-deploy02.deployment-prep.eqiad.wmflabs', 'deployment-deploy01.deployment-prep.eqiad.wmflabs'] called with an empty host list.
18:37:04 Finished sync-proxies (duration: 00m 00s)
18:37:04 Started sync-apaches
sync-apaches: 100% (ok: 3; fail: 0; left: 0)                                    
18:37:08 Finished sync-apaches (duration: 00m 04s)
18:37:08 Synchronized wmf-config/CommonSettings.php: this should fail (duration: 01m 28s)

It deployed successfully, which is unexpected. I suspected maybe that the bug was that the local check is against /srv/mediawiki instead of /srv/mediawiki-wiki before the local self-sync, so I manually checked after the first attempt to see if the error shows up, and it does.

echo 1 | mwscript eval.php --wiki enwiki > /dev/null
[Tue Oct  9 18:43:42 2018] [hphp] [26341:7f35d5ecf3c0:0:000001] [] 
Notice: Undefined variable: krinkle2Undefined in /srv/mediawiki/wmf-config/CommonSettings.php on line 4

If the issue is due to the order of syncs and checks issue, then presumably, given I can see the issue manually, a second sync would catch it (whilst normally that'd be too late). However, even after a second attempt, it still passes.

attempt 2
jenkins-deploy@deployment-deploy01:/srv/mediawiki-staging$ scap sync-file wmf-config/CommonSettings.php 'the previous should have failed but this definitely'
           ___ ____
         ⎛   ⎛ ,----
          \  //==--'
     _//|,.·//==--'    ____________________________
    _OO≣=-  ︶ ᴹw ⎞_§ ______  ___\ ___\ ,\__ \/ __ \
   (∞)_, )  (     |  ______/__  \/ /__ / /_/ / /_/ /
     ¨--¨|| |- (  / ______\____/ \___/ \__^_/  .__/
         ««_/  «_/ jgs/bd808                /_/

18:43:59 Checking for new runtime errors locally
18:44:01 Copying to deployment-deploy01.deployment-prep.eqiad.wmflabs from deployment-deploy01.deployment-prep.eqiad.wmflabs
18:44:01 Started rsync common
18:44:01 Finished rsync common (duration: 00m 00s)
18:44:01 Started cache_git_info
18:44:45 Finished cache_git_info (duration: 00m 43s)
18:44:45 Started sync-masters
sync-masters: 100% (ok: 1; fail: 0; left: 0)                                    
18:44:53 Finished sync-masters (duration: 00m 08s)
18:44:53 Started sync-pull-masters
sync-pull-masters: 100% (ok: 1; fail: 0; left: 0)                               
18:44:55 Finished sync-pull-masters (duration: 00m 02s)
18:44:55 Started sync-check-canaries
check-canaries: 100% (ok: 1; fail: 0; left: 0)                                  
18:45:00 Finished Canaries Synced (duration: 00m 04s)
18:45:00 Executing check 'Check endpoints for deployment-mediawiki-07.deployment-prep.eqiad.wmflabs'
18:45:02 Finished Canary Endpoint Check Complete (duration: 00m 01s)
18:45:02 Waiting for canary traffic...
18:45:20 Executing check 'Logstash Error rate for deployment-mediawiki-07.deployment-prep.eqiad.wmflabs'
18:45:20 Finished sync-check-canaries (duration: 00m 24s)
18:45:20 Started sync-proxies
18:45:20 Job ['/usr/bin/scap', 'pull', '--no-update-l10n', '--include', 'wmf-config', '--include', 'wmf-config/CommonSettings.php', 'deployment-deploy01.deployment-prep.eqiad.wmflabs', 'deployment-deploy02.deployment-prep.eqiad.wmflabs', 'deployment-deploy01.deployment-prep.eqiad.wmflabs'] called with an empty host list.
18:45:20 Finished sync-proxies (duration: 00m 00s)
18:45:20 Started sync-apaches
sync-apaches: 100% (ok: 3; fail: 0; left: 0)                                    
18:45:25 Finished sync-apaches (duration: 00m 05s)
18:45:25 Synchronized wmf-config/CommonSettings.php: the previous should have failed but this definitely (duration: 01m 26s)

It still passed?

@Krinkle: I'm digging into this, trying to figure out what could cause it to miss the error.

Update: @thcipriani and I just did a pairing session to get to the bottom of this. Our conclusion: something in scap or the environment is eating stderr. The exact same code produces the expected results when we run it from outside of scap's environment. We ran out of time for our pairing session but I'm going to dig into scap's logging code a bit further this afternoon.

Looks like it might not be working as intended.

Here's what I did:

  • Disabled beta-scap-eqiad.
  • ssh deployment-deploy01, sudo -iu jenkins-deploy, cd /srv/mediawiki-staging
  • (jenkins-deploy) vim wmf-config/CommonSettings.php # add an empty if-block on line 2 that evaluates an undefined variable, like if ($myUndefined) {}.
  • (jenkin-deploy) scap sync-file wmf-config/CommonSettings.php 'this should fail'

Shouoldn't that be wmf-config/CommonSettings-labs.php?

@mmodell Both are actually loaded in Beta Cluster, and should be. The -labs file is a light layer of overrides atop the main file. See also file comment.

Had some time to play with this today.

Steps

  1. Disable beta-scap-eqiad
  2. Modifying /srv/mediawiki-staging/wmf-config/CommonSettings-labs.php adding the following to line 358:
if ($tylerIsDoingAThing) {}

Investigation/conclusion

We're not seeing the output on stderr in scap because there is no output on stderr using PHP=php7.0. Scap is using php7.0 to regenerate l10n files since hhvm takes 40 minutes (T191921)

[thcipriani@deployment-deploy01 ~]$ echo 'echo("hello");' | mwscript eval.php --wiki=enwiki
[Fri Oct 19 18:23:14 2018] [hphp] [12691:7fe71f9eb3c0:0:000001] []
Notice: Undefined variable: tylerIsDoingAThing in /srv/mediawiki-staging/wmf-config/CommonSettings-labs.php on line 359
hello

[thcipriani@deployment-deploy01 ~]$ echo 'echo("hello");' | PHP=php7.0 mwscript eval.php --wiki=enwiki
hello

Change 468626 had a related patch set uploaded (by 20after4; owner: 20after4):
[mediawiki/core@master] eval.php: write error messages to stderr

https://gerrit.wikimedia.org/r/468626

@thcipriani Per T121597#4652873, the errors were written to stderr over plain mwscript eval.php on the command-line (with HHVM). Today, this is no longer the case. Both in beta and in prod, both with PHP 7 and with HHVM, the errors are not written to standard out by default.

I believe this is because $wgPropagateErrors = false; (change), which prevent the default handler from running. When I evaluate $wgPropagateErrors = true; before if ($myUndefined) {}, I get the stderr again (with HHVM).

With PHP 7, it still didn't happen. That's likely because our PHP 7 install is very fresh and has mostly default or different INI settings from what we have with HHVM and previously had with PHP 5. In particular, error_reporting appears to be set in a way on PHP 7 to ignore most errors by default.

When evaluating $wgPropagateErrors = true; error_reporting(E_ALL) before if ($myUndefined) {}, I get the stderr again with PHP 7 as well.

In summary:

  • PHP 7's INI settings need to be check to better match our HHVM settings, in particular error_reporting appears to be partly disabled on PHP 7 currently.
  • $wgPropagateErrors was recently disabled on WMF, which is good for web requests (so that we don't report to Logstash the same error both on type:hhvm and on type:mediawiki channel:error). But not so good for manual use via eval.php. We should either override $wgPropagateErrors in eval.php, or find some other way to print them to stderr by default.
mmodell added a comment.EditedOct 19 2018, 9:05 PM

@Krinkle: nice find! That explains all of the inconsistencies that @thcipriani and I kept running into during our testing. It was thoroughly confusing, especially when we were suspecting scap, rather than php, as the culprit.