Implement MediaWiki pre-promote checks
Open, NormalPublic

Description

Problem

When deploying files in /srv/mediawiki to production, errors can be introduced due to any number of reasons. It is not feasible to catch all possible error scenarios in unit testing, integration testing, beta testing, or even mwdebug testing.

As final layer of defence, scap should perform a handful of pre-promote checks that must pass without which a deployment must not be allowed to proceed to production servers.

Existing work

We have php syntax linting as part of the merge pipeline in mediawiki/core, in extension repos, and in the operations/mediawiki-config repo. Yet, due to various reasons, a syntax error can still enter production. Due to differences in PHP versions, or due to the linter being invoked on a subdirectory and not all files, or due to the linter ignoring symlinks or otherwise files being excluded. There are also private files in the deployment host. As final check, Scap already performs one additional lint check.

We also have beta cluster, which should in theory catch all run-time issues aside from production-specific configuration overrides and private files. Although in practice:

  1. Don't check health of beta cluster during deployment.
  2. Aside from configuration, Beta Cluster runs on a different branch (master vs wmf). Issues specific to backports would not be detected there. And even if the final state after a deployment were the same on Beta and in production, for Beta we typically merge and deploy multiple commits at once. Whereas in production we deploy 1 commit and 1 file/directory at once, which means there is another area of things to go wrong. And the below list of incidents indicates that this is quite a common source of errors (whether to split a deploy, and if so, what the correct order is).

We also have debug servers, which in theory should catch all run-time issues, including anything production specific. Although in practice:

  1. Staging of changes on mwdebug happens through scap pull, which updates the entire deployment in one (nearly) atomic step. Whereas for production, each file and directory is synced separately. The same common source of errors mentioned above (Beta cluster, #2) also is not caught on mwdebug servers.
  2. Staging on mwdebug uses scap pull, which does not rebuild l10n locally - localisation-related issues are thus often missed on mwdebug servers.

We also have canary servers, which have been introduced after the creation of this task. Files are deployed there before the rest of production. The method of deployment and content matches 100%. However, the method of monitoring is insufficient. We currently monitor the canary servers through logstash, by querying various channels and do a before/after comparison. In theory this should catch all significant issues affecting production. Both that affect all requests, and even those that affect a large number requests. Although in practice, we still have incidents with fatal error or php notice affecting all pages on all wikis.

Various reasons at different times:

  • Due to php notices not being included in the logstash query.
  • Due to fatal errors not being logged.
  • Due to fatal errors not being included correctly in the logstash query.
  • Due to the before/after comparison not working as expected.

Solution

The idea is to write a simple maintenance script that simulates a small number of requests. We'd require that script to pass without any errors (notices/warnings) or exceptions from PHP.

It could be run on tin, e.g. sync to local /srv/mediawiki on tin first, run the script and then continue sync if successful.

Benefits

Eliminate a large class of errors:

  • Subtle errors in functional syntax. Not strict parse error, but fatal on runtime, such as the infamous arrray() typo.
  • PHP notices or warnings affecting all views (e.g. a mistyped variable in wmf-config, or in global MediaWiki code).
  • PHP fatal exceptions that happen on all views.
  • Any of the above, as result of files syncs not being split, or being synced in the wrong order.

The core idea was previously implemented in 2013 at https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/blob/f52b13b1/sanityCheck.php, we can further develop it by also catching warnings and exceptions.

A basic set of checks, run once, against the staging server (e.g. tin) would catch 99% of cases where a PHP notice or fatal error happens on a common web request.

Implementation ideas

  1. MediaWiki maintenance script: Simple no-op.
    • Invoke echo 1 | mwscript eval.php --wiki enwiki;, this would already exercise all wmf-config code, which is the most common source of errors. Run-time errors from MediaWiki core or extension code is rare, especially given Jenkins now catches these pre-merge.
    • Stderr will contain PHP notices, warnings, fatals, and MediaWiki warnings/errors.
  2. MediaWiki maintenance script: Render enwiki Main Page
    • Like the old sanityCheck.php script, we could instantiate MediaWiki programmatically to execute one ore more common actions. E.g. render Main Page view, Main Page history, and a basic Api query with FauxRequest.
  3. HTTP-based
    • If we can make sure Apache is installed and working on the deployment host, we can do a few simple HTTP requests as well. E.g. implemented as a python plugin for scap, that makes a set of HTTP requests to the local deployment host and verify the response. This has the benefit of catching all fatal errors no matter whether they come from apache/hhvm/mediawiki by simply checking the http status code and response body (unlike T154646, T142784 - which missed an obvious HTTP 5xx error no all pages).

The downside of the HTTP-based approach is that it wouldn't catch php notices/warnings that don't make the response fail, but would certainly flood the logs (unless we tail the local logs somehow).

My recommendation would be to start with point 1. And then in addition maybe point 2 or point 3.

Preventable incidents

  • 20151005-MediaWiki - Fatal exception everywhere due to undefined function - aka arrray() error in wmf-config
  • 20151026-MediaWiki - Fatal exception, from MediaWiki/ResourceLoader.
  • 20160212-MediaWiki - Fatal exception from PHP (file missing, synced in the wrong order)
  • 20160222-MediaWiki - HTTP 404 Not Found everywhere due to accidental removal of mediawiki docroot symlink
  • 20160407-MediaWiki - fatal due to invalid PHP expression caused by a bad config change
  • 20160601-MediaWiki - Fatal exception everywhere due to typo in extension name being loaded.
  • 20160713-MediaWiki - Fatal exception on certain (not all) page actions due to misnamed PHP file inclusion.
  • 20170104-MediaWiki - Fatal exception everywhere due to invalid return value from a callback in wmf-config.
  • 20170111-MediaWiki - Fatal exception from multiversion.
  • 20170124-MediaWiki - Fatal exception on all wikis, from WikibaseClient.
  • 20180129-MediaWiki - Fatal exception everywhere, from wmf-config.

See also

Krinkle created this task.Dec 15 2015, 11:31 PM
Krinkle updated the task description. (Show Details)
Krinkle raised the priority of this task from to Needs Triage.
Krinkle added projects: Deployments, Scap.
Krinkle added a subscriber: Krinkle.
Restricted Application added subscribers: StudiesWorld, Aklapper. · View Herald TranscriptDec 15 2015, 11:31 PM
greg edited projects, added scap2; removed Deployments.Feb 9 2016, 11:33 PM
dduvall renamed this task from Require sanity test to pass before syncing files to all web servers to Implement MediaWiki pre-promote checks.Feb 12 2016, 7:42 PM
dduvall triaged this task as Normal priority.
dduvall set Security to None.
dduvall moved this task from Needs triage to MediaWiki MVP on the Scap board.
Krinkle added a comment.EditedFeb 22 2016, 9:37 PM

Added some incidents to task description that a scap with sanity-check would've prevented.

mmodell edited projects, added Scap (Scap3-MediaWiki-MVP); removed Scap.
MaxSem added a subscriber: MaxSem.Jun 2 2016, 11:11 PM
Krinkle updated the task description. (Show Details)Jul 6 2016, 11:46 PM
Krinkle updated the task description. (Show Details)Jul 8 2016, 1:15 AM
greg added a subscriber: greg.Sep 29 2016, 7:41 PM

This follow-up task from an incident report has not been updated recently. If it is no longer valid, please add a comment explaining why. If it is still valid, please prioritize it appropriately relative to your other work. If you have any questions, feel free to ask me (Greg Grossmeier).

Krinkle updated the task description. (Show Details)Jan 6 2017, 10:09 PM
Krinkle updated the task description. (Show Details)Jan 31 2018, 1:04 AM
Krinkle updated the task description. (Show Details)Jan 31 2018, 1:37 AM
Krinkle updated the task description. (Show Details)Jan 31 2018, 2:10 AM
thcipriani closed this task as Resolved.May 8 2018, 7:58 PM
thcipriani claimed this task.
thcipriani added a subscriber: thcipriani.

Now that Scap 3.8.1-1 is out we are checking the swagger spec at https://en.wikipedia.org/spec.yaml via service-checker-swagger against the canary servers before deploying more widely. The current checks are very wikipedia-centric and should be expanded on; however, this change should provide a good starting point and a good-enough to call this task complete.

MaxSem added a comment.May 8 2018, 8:04 PM

https://en.wikipedia.org/spec.yaml

Already out of date: version: 1.31.0 :)

https://en.wikipedia.org/spec.yaml

Already out of date: version: 1.31.0 :)

good looking out :P

I guess the service checker isn't picky about versions, so that's good.

For a quick demo of what scap is doing, effectively:

#!/usr/bin/env bash                                                                                                                                   

while read host; do
        service-checker-swagger "$host" http://en.wikipedia.org -s "/spec.yaml" &
done < <(cat /etc/dsh/group/mediawiki-api-canaries \
        /etc/dsh/group/mediawiki-appserver-canaries | \
        grep '^mw')

wait

Running that on tin you get:

All endpoints are healthy                                                                                                                             
All endpoints are healthy                                                                                                                             
All endpoints are healthy
All endpoints are healthy
All endpoints are healthy                                                                                                                             
All endpoints are healthy                                                                                                                             
All endpoints are healthy                                                                                                                             
All endpoints are healthy                                                                                                                             
All endpoints are healthy                                                                                                                             
All endpoints are healthy                                                                                                                             
All endpoints are healthy

@thcipriani Does this mean that writing a patch replacing /w/index.php in operations/mediawiki-config with die(1); can be synced with scap sync-file in a way that (once that command finishes), no production traffic remains broken?

In other words, does it run the spec against localhost on the deployment host before syncing to canaries, or it somehow runs the spec on canaries without applying to production traffic, or it somehow undoes the patch on canaries?

For the completion of this task, I would say that (while suboptimal) it would be fine if it depends on breaking the canary traffic for a few seconds. We can obviously avoid that if we ran it on the deployment host (or a debug host) instead, but we can do that later.

But if we still leave the canaries in a broken state, requiring manual reverting and re-deploying of a patch, then the task should be re-opened.

thcipriani reopened this task as Open.May 10 2018, 3:02 PM
thcipriani removed thcipriani as the assignee of this task.

In other words, does it run the spec against localhost on the deployment host before syncing to canaries, or it somehow runs the spec on canaries without applying to production traffic, or it somehow undoes the patch on canaries?

Apache on the deployment hosts isn't configured as a MediaWiki host, so we'd have to update puppet config in order to run these checks against localhost.

This is a check against canary hosts and requires a manual revert.

Thanks for re-opening.

I'm a fan of Scap's logstash checks and catching any size regression that affects real requests, including cases we might not think to test ourselves. There's a lot of potential on top of that, as well. The set of rules can become smarter and stricter over time. I don't know if is planned to also prevent fatal errors, though. It's possible, but might require a few things:

  • Creation of a canary pool. (done)
  • Implement logstash query and check. (done, but at-risk: data sources and channels evolve; T154646, T162974, and soon https://gerrit.wikimedia.org/r/338911)
  • Ability to reliably compare before/after state from logstash. (done, but has issues, T183999; it's difficult to tolerate random fatals and also zero-tolerate new ones)
  • Ability to repool/depool. (done, but not yet used after detecting regressions in the canary pool)
  • Ability to rollback automatically. (todo, and difficult in current setup)
  • Ability to run basic tests without affecting production traffic (todo, e.g. against debug host, or local CLI)

It'd be awesome to have all this, but given its difficulties and risks (variable traffic, logstash query, comparison) the swagger specs would better fit the goal of prevention (not detection). But as @thcipriani points out, that requires MediaWiki app server roles on the deploy host. I don't know if that's a problem, but if so, it might be possible to first use an mwdebug server. We'll also need Scap to fetch error logs from Logstash after the swagger checks, and assert there were no warning/errors/fatals in any hhvm/mediawiki channel. (The urls in our swagger spec are known to pass without errors, enforced in Jenkins per T50002).

A shell out for echo 1 | mwscript eval.php that catches fatals and non-empty php-stderr on the deploy host (without app-server roles or Logstash queries) could've prevented "all-wiki" fatals and logspam incidents without affecting traffic (incl canaries).

Whether with mwscript, or with local swagger specs and a logstash query, or something else – I'm hoping we can soon prevent fatal errors :)

Krinkle added a comment.EditedThu, Sep 20, 8:12 PM

Ping. The above one-liner would've prevented the following from affecting production traffic to canary hosts today.

https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/461707/
> PHP Notice: Undefined variable: wgSiteMatrixNonGlobalSites

This affected all requests/jobs on canaries. This is low-hanging fruit that can be easily and reliably caught without affecting production first (canaries are production).

The logstash-based queries should imho be considered as helping catch less common errors (errors affecting a subset of users that we can't anticipate through hardcoded pre-promote checks). It is not reliable as first layer of defence because it is inherently fragile (T204907, T200960, T183999, T154646, T142784, T165675), is not automatically rolled back, and we shouldn't use production to check for errors that affect every request.

Alternatively, if we prefer HTTP instead of CLI, we can do something like this:

  • Have swagger checks set a Header that wmf-config will use to turn all PHP errors into fatals.
  • Run swagger checks before serving production traffic (e.g. from deployment host or mwdebug).

But either way, I'd very much like to achieve what we set out in 2015, which is that "obvious errors that affect all requests to all wikis, must be impossible to deploy through normal means in a way that breaks production traffic".

Running as much as echo 1 | mwscript eval.php on the deployment host today would achieve that in a reliable way without dependencies. Running swagger checks (on non-prod hosts) with a way to make errors emit HTTP errors would also work but is more work to setup.

I'll take a stab at calling echo 1 | mwscript eval.php in scap.