## Problem
When deploying files in `/srv/mediawiki` to production, errors can be introduced for any number of reasons. It is not feasible to catch all possible error scenarios in unit testing, integration testing, beta testing, or even mwdebug testing.
As a final layer of defence, Scap should perform a handful of pre-promote checks; unless these pass, a deployment must not be allowed to proceed to production servers.
## Existing work
We have PHP syntax **linting** as part of the `merge` pipeline in mediawiki/core, in extension repos, and in the operations/mediawiki-config repo. Yet a syntax error can still enter production for various reasons: differences in PHP versions, the linter being invoked on a subdirectory rather than on all files, or the linter ignoring symlinks or otherwise excluding files. There are also private files on the deployment host. As a final check, Scap already performs one additional lint check.
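The kind of final lint pass described above can be sketched as follows. This is an illustrative sketch, not Scap's actual implementation; the staging directory path and the decision to follow symlinks are assumptions:

```python
import subprocess
from pathlib import Path

def lint_staged_php(stage_dir="/srv/mediawiki-staging"):
    """Sketch of a pre-sync lint pass: run `php -l` on every staged
    PHP file. Hypothetical helper, not actual Scap code."""
    failures = []
    for path in Path(stage_dir).rglob("*.php"):
        # resolve() follows symlinks, one of the gaps that lets
        # syntax errors slip past repo-level linting
        result = subprocess.run(
            ["php", "-l", str(path.resolve())],
            capture_output=True, text=True)
        if result.returncode != 0:
            failures.append((str(path), result.stderr.strip()))
    return failures
```

Note that `php -l` only catches strict parse errors; as the incidents below show, many production-breaking mistakes (e.g. calls to undefined functions) are syntactically valid and pass this check.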
We also have **beta cluster**, which should in theory catch all run-time issues aside from production-specific configuration overrides and private files. Although in practice:
1. We don't check the health of the beta cluster during deployment.
2. Aside from configuration, Beta Cluster runs on a **different branch** (master vs wmf). Issues specific to backports would not be detected there. And even if the final state after a deployment were the same on Beta and in production, for Beta we typically merge and deploy multiple commits at once, whereas in production we deploy 1 commit and 1 file/directory at a time, which leaves another area for things to go wrong. The list of incidents below indicates that this is quite a common source of errors (whether to split a deploy, and if so, e.g. what the correct order is).
We also have **debug servers**, which in theory should catch all run-time issues, including anything production-specific. Although in practice:
1. Staging of changes on mwdebug happens through `scap pull`, which updates the entire deployment in one (nearly) atomic step, whereas for production each file and directory is **synced separately**. The same common source of errors mentioned above (Beta Cluster, `#2`) is thus also not caught on mwdebug servers.
2. Staging on mwdebug uses `scap pull`, which does not rebuild l10n locally. **Localisation-related issues** are thus often missed on mwdebug servers.
We also have **canary servers**, which were introduced after the creation of this task. Files are deployed there before the rest of production, and the method of deployment and the content match 100%. However, the method of monitoring is insufficient. We currently monitor the canary servers through Logstash, by querying various channels and doing a before/after comparison. In theory this should catch all significant issues affecting production, both those that affect all requests and those that affect a large number of requests. Although in practice, it has yet to prevent a deployment of a fatal error or PHP notice affecting all pages on all wikis.
Various reasons at different times:
* Due to PHP notices not being included in the Logstash query.
* Due to fatal errors not being logged.
* Due to fatal errors not being included correctly in the Logstash query.
* Due to the before/after comparison not working as expected.
## Solution
The idea is to write a simple maintenance script that simulates a small number of requests. We'd require that script to pass without any errors (notices/warnings) or exceptions from PHP.
It could be run on tin: sync to the local `/srv/mediawiki` on tin first, run the script, and then continue the sync if successful. A more elaborate canary check can run after this one, which would catch less common issues from a portion of actual production traffic.
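The stage-check-promote flow described above can be sketched as follows. This is a minimal illustration of the control flow only; the three callables are placeholders for the real Scap steps, not actual Scap internals:

```python
def deploy_with_precheck(sync_to_staging, run_sanity_check, sync_to_production):
    """Sketch of the proposed flow: stage locally, run the check
    script, and only continue the sync when it passes."""
    sync_to_staging()           # sync to local /srv/mediawiki on tin first
    if not run_sanity_check():  # e.g. invoke the maintenance script
        raise RuntimeError("pre-promote check failed; aborting sync")
    sync_to_production()        # canary checks could follow here
```

The key property is that the production sync step is unreachable unless the check passes, so a failing check can never be "forgotten" by the deployer.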
### Benefits
This would eliminate a large class of errors:
* Subtle errors in functional syntax: not a strict parse error, but fatal at runtime, such as the infamous `arrray()` typo.
* PHP notices or warnings affecting all views (e.g. a mistyped variable in wmf-config, or in global MediaWiki code).
* PHP fatal exceptions that happen on all views.
* Any of the above, as a result of file syncs not being split, or being synced in the wrong order.
The core idea was previously implemented in 2013 as [sanityCheck.php](https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/blob/f52b13b1/sanityCheck.php); we can further develop it by also catching warnings and exceptions.
A basic set of checks, run once against the staging server (e.g. tin), would catch 99% of cases where a PHP notice or fatal error happens on a common web request.
### Implementation ideas
1. MediaWiki maintenance script: Simple no-op.
* Invoke `echo 1 | mwscript eval.php --wiki enwiki;`. This would already exercise all wmf-config code, which is the most common source of errors. Run-time errors from MediaWiki core or extension code are rare, especially given Jenkins now catches these pre-merge.
* Stderr will contain PHP notices, warnings, fatals, and MediaWiki warnings/errors.
2. MediaWiki maintenance script: Render enwiki Main Page.
* Like the old `sanityCheck.php` script, we could instantiate MediaWiki programmatically to execute one or more common actions, e.g. render the Main Page view, Main Page history, and a basic API query with FauxRequest.
3. HTTP-based.
* If we can make sure Apache is installed and working on the deployment host, we can do a few simple HTTP requests as well, e.g. implemented as a Python plugin for scap that makes a set of HTTP requests to the local deployment host and verifies the responses. This has the benefit of catching all fatal errors regardless of whether they come from apache/hhvm/mediawiki, by simply checking the HTTP status code and response body (unlike T154646 and T142784, which missed an obvious HTTP 5xx error on all pages).
* The downside is that it wouldn't catch PHP notices/warnings that don't make the response fail, though these would certainly flood the logs (unless we tail the local logs somehow).
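Ideas 1 and 3 could be sketched as Scap-side Python helpers along the following lines. This is a hypothetical sketch: the `mwscript` invocation matches the command above, but the check URL, the response-body heuristic, and the function names are assumptions for illustration:

```python
import subprocess
import urllib.request

def check_eval_noop(wiki="enwiki"):
    """Idea 1: run a no-op through eval.php and require clean stderr,
    since PHP notices, warnings, and fatals all surface there."""
    result = subprocess.run(
        ["mwscript", "eval.php", "--wiki", wiki],
        input="1\n", capture_output=True, text=True)
    errors = result.stderr.strip()
    return result.returncode == 0 and not errors

def check_http(url="http://localhost/wiki/Main_Page"):
    """Idea 3: fetch a fixed URL on the local deployment host and
    verify the status code plus a minimal response-body check."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read()
            return resp.status == 200 and b"</html>" in body
    except Exception:
        return False
```

A combination of both covers each approach's blind spot: the stderr check catches notices and warnings that leave the response intact, while the HTTP check catches 5xx failures from any layer (Apache, HHVM, or MediaWiki).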
My recommendation would be to start with point 1, and then in addition maybe point 2 or point 3.
## Preventable incidents
Incidents that would have been avoided if we had even a single HTTP check against a fixed URL (e.g. enwiki/Main_Page):
* [20151005-MediaWiki](https://wikitech.wikimedia.org/wiki/Incident_documentation/20151005-MediaWiki) - Fatal exception everywhere due to undefined function - aka `arrray()` error in wmf-config
* [20151026-MediaWiki](https://wikitech.wikimedia.org/wiki/Incident_documentation/20151026-MediaWiki) - Fatal exception, from MediaWiki/ResourceLoader.
* [20160212-MediaWiki](https://wikitech.wikimedia.org/wiki/Incident_documentation/20160212-AllWikisOutage) - Fatal exception from PHP (file missing, synced in the wrong order)
* 20160222-MediaWiki - HTTP 404 Not Found everywhere due to accidental removal of mediawiki docroot symlink
* [20160407-MediaWiki](https://wikitech.wikimedia.org/wiki/Incident_documentation/20160407-Mediawiki) - fatal due to invalid PHP expression caused by a bad config change
* [20160601-MediaWiki](https://wikitech.wikimedia.org/wiki/Incident_documentation/20160601-MediaWiki) - Fatal exception everywhere due to typo in extension name being loaded.
* [20160713-MediaWiki](https://wikitech.wikimedia.org/wiki/Incident_documentation/20160713-MediaWiki) - Fatal exception on certain (not all) page actions due to misnamed PHP file inclusion.
* [20170104-MediaWiki](https://wikitech.wikimedia.org/wiki/Incident_documentation/20170104-MonologSpi) - Fatal exception everywhere due to invalid return value from a callback in wmf-config.
* [20170111-MediaWiki](https://wikitech.wikimedia.org/wiki/Incident_documentation/20170111-multiversion) - Fatal exception from multiversion.
* [20170124-MediaWiki](https://wikitech.wikimedia.org/wiki/Incident_documentation/20170124-WikibaseClient-InterwikiSorting) - Fatal exception on all wikis, from WikibaseClient.
* [20180129-MediaWiki](https://wikitech.wikimedia.org/wiki/Incident_documentation/20180129-MediaWiki) - Fatal exception everywhere, from wmf-config.
## See also
* {T173146}
* {T183952}
* {T183999}
* {T154646}
* ...