When deploying files in `/srv/mediawiki` to production, errors can be introduced due to any number of reasons. It is not feasible to catch all possible error scenarios in unit testing, integration testing, beta testing, or even mwdebug testing.
As a final layer of defence, scap should perform a handful of pre-promote checks; a deployment must not be allowed to proceed to production servers unless all of them pass.
## Existing work
We have PHP syntax **linting** as part of the `merge` pipeline in mediawiki/core, in extension repos, and in the operations/mediawiki-config repo. Yet a syntax error can still enter production for various reasons: differences in PHP versions, the linter being invoked on a subdirectory rather than on all files, or the linter skipping symlinks or otherwise excluded files. There are also private files on the deployment host. As a final check, Scap already performs one additional lint pass.
We also have **beta cluster**, which should in theory catch all run-time issues aside from production-specific configuration overrides and private files. Although in practice:
1. We don't check the health of the beta cluster during a deployment.
2. Aside from configuration, Beta Cluster runs on a **different branch** (master vs wmf), so issues specific to backports would not be detected there. And even if the final state after a deployment were the same on Beta and in production, on Beta we typically merge and deploy multiple commits at once, whereas in production we deploy one commit and one file/directory at a time, which leaves a whole extra area for things to go wrong. The list of incidents below indicates that this is quite a common source of errors (whether to split a deploy, and if so, in what order).
We also have **debug servers**, which in theory should catch all run-time issues, including anything production specific. Although in practice:
1. Staging of changes on mwdebug happens through `scap pull`, which updates the entire deployment in one (nearly) atomic step, whereas for production each file and directory is **synced separately**. The same common source of errors mentioned above (Beta Cluster, point 2) is therefore not caught on mwdebug servers either.
2. Staging on mwdebug uses `scap pull`, which does not rebuild l10n locally; **localisation-related issues** are thus often missed on mwdebug servers.
We also have **canary servers**, which were introduced after the creation of this task. Files are deployed there before the rest of production, and the method of deployment and the content match 100%. However, the method of monitoring is insufficient. We currently monitor the canary servers through logstash, by querying various channels and doing a before/after comparison. In theory this should catch all significant issues affecting production, both those that affect all requests and those that affect a large number of requests. In practice, however, it has yet to prevent a deployment of a fatal error or PHP notice affecting all pages on all wikis.
This has failed for various reasons at different times:
* PHP notices not being included in the logstash query.
* Fatal errors not being logged.
* Fatal errors not being included correctly in the logstash query.
* The before/after comparison not working as expected.
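At its core, the before/after comparison is a threshold check on error counts around the sync. A minimal sketch of that logic (the thresholds and parameter names are illustrative assumptions, not scap's actual implementation):

```python
def canary_check(before_count, after_count, request_count,
                 max_increase=10, max_error_rate=0.01):
    """Decide whether a canary deployment looks healthy.

    before_count:  errors logged in the window before the sync
    after_count:   errors logged in a same-length window after the sync
    request_count: requests served by the canaries in the after window

    Both an absolute and a relative threshold are applied, so a quiet
    log channel that suddenly emits a handful of fatals still fails.
    """
    increase = after_count - before_count
    error_rate = after_count / request_count if request_count else 1.0
    return increase <= max_increase and error_rate <= max_error_rate

# A jump from 2 to 400 errors fails; a small fluctuation passes.
assert not canary_check(before_count=2, after_count=400, request_count=10000)
assert canary_check(before_count=5, after_count=7, request_count=10000)
```

Getting the two counts right is exactly where the incidents above show the current implementation falling down: if a channel (e.g. PHP notices) is missing from the query, `after_count` is undercounted and the check passes trivially.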
## Proposal
The idea is to write a simple maintenance script that simulates a small number of requests. We would require that script to pass without any errors (notices/warnings) or exceptions from PHP.
It could be run on tin, e.g. sync to the local `/srv/mediawiki` on tin first, run the script, and continue the sync only if it passes.
This would eliminate a large class of errors:
* Subtle errors in function syntax that are not strict parse errors but fatal at runtime, such as the infamous `arrray()` typo.
* PHP notices or warnings affecting all views (e.g. a mistyped variable in wmf-config, or in global MediaWiki code).
* PHP fatal exceptions that happen on all views.
* Any of the above as a result of file syncs not being split, or being synced in the wrong order.
The core idea was previously implemented in 2013 at [https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/blob/f52b13b1/sanityCheck.php](https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/blob/f52b13b1/sanityCheck.php); we can develop it further by also catching warnings and exceptions.
A basic set of checks, run once, against the staging server (e.g. tin) would catch 99% of cases where a PHP notice or fatal error happens on a common web request.
### Implementation ideas
1. MediaWiki maintenance script: Simple no-op.
* Invoke `echo 1 | mwscript eval.php --wiki enwiki;`. This already exercises all wmf-config code, which is the most common source of errors; run-time errors from MediaWiki core or extension code are rare, especially given Jenkins now catches these pre-merge.
* Stderr will contain PHP notices, warnings, fatals, and MediaWiki warnings/errors.
2. MediaWiki maintenance script: Render enwiki Main Page
* Like the old `sanityCheck.php` script, we could instantiate MediaWiki programmatically to execute one or more common actions, e.g. render the Main Page view, the Main Page history, and a basic API query with FauxRequest.
3. Scap check: HTTP requests against the deployment host
* If we can make sure Apache is installed and working on the deployment host, we can make a few simple HTTP requests as well, e.g. implemented as a Python plugin for scap that makes a set of HTTP requests to the local deployment host and verifies the responses. This has the benefit of catching all fatal errors regardless of whether they come from apache/hhvm/mediawiki, by simply checking the HTTP status code and response body (unlike T154646 and T142784, which missed an obvious HTTP 5xx error on all pages).
* The downside of the HTTP-based approach is that it wouldn't catch PHP notices/warnings that don't make the response fail but would certainly flood the logs (unless we tail the local logs somehow).
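Point 1 could be sketched as a small wrapper that fails the deployment whenever anything reaches stderr. This is a sketch only: the default command assumes `mwscript` is on `PATH` as on the deployment host, and the timeout is an arbitrary choice.

```python
import subprocess

def check_noop_eval(wiki="enwiki", cmd=None):
    """Run a no-op MediaWiki eval and raise if anything reaches stderr.

    Merely loading eval.php exercises all of wmf-config. PHP notices,
    warnings, fatals, and MediaWiki warnings/errors all surface on
    stderr, so an empty stderr plus exit code 0 is the pass condition.
    """
    if cmd is None:
        cmd = ["mwscript", "eval.php", "--wiki", wiki]
    result = subprocess.run(cmd, input="1\n", capture_output=True,
                            text=True, timeout=60)
    if result.returncode != 0 or result.stderr.strip():
        raise RuntimeError("pre-promote check failed:\n" + result.stderr)
```

Treating *any* stderr output as failure is deliberately strict: it is exactly the class of "harmless" notice that the canary monitoring has repeatedly missed.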
My recommendation would be to start with point 1, and then additionally maybe point 2 or point 3.
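For the HTTP-based check (point 3), a scap plugin might look roughly like the sketch below. The URLs, `Host` header, and body markers are illustrative assumptions; a real implementation would take these from configuration.

```python
from urllib.request import urlopen, Request
from urllib.error import HTTPError, URLError

# Hypothetical check list: (url, marker the body must contain).
CHECK_URLS = [
    ("http://localhost/wiki/Main_Page", "<html"),
    ("http://localhost/w/api.php?action=query&format=json", '"query"'),
]

def response_ok(status, body, marker):
    """A response passes if it is HTTP 200 and contains the marker.

    Checking the status code catches any 5xx regardless of which layer
    (Apache/HHVM/MediaWiki) produced it, unlike log-based checks.
    """
    return status == 200 and marker in body

def run_http_checks(urls=CHECK_URLS, host_header="en.wikipedia.org"):
    """Return the list of URLs that failed; empty list means pass."""
    failures = []
    for url, marker in urls:
        try:
            req = Request(url, headers={"Host": host_header})
            with urlopen(req, timeout=10) as resp:
                body = resp.read().decode("utf-8", "replace")
                if not response_ok(resp.status, body, marker):
                    failures.append(url)
        except (HTTPError, URLError):
            failures.append(url)
    return failures
```

As noted above, this would not catch notices/warnings that leave the response intact; it complements rather than replaces the stderr-based maintenance-script check.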
## Preventable incidents
* [20151005-MediaWiki](https://wikitech.wikimedia.org/wiki/Incident_documentation/20151005-MediaWiki) - Fatal exception everywhere due to undefined function - aka `arrray()` error in wmf-config
* [20151026-MediaWiki](https://wikitech.wikimedia.org/wiki/Incident_documentation/20151026-MediaWiki) - Fatal exception, from MediaWiki/ResourceLoader.
* [20160212-MediaWiki](https://wikitech.wikimedia.org/wiki/Incident_documentation/20160212-AllWikisOutage) - Fatal exception from PHP (file missing, synced in the wrong order)
* 20160222-MediaWiki - HTTP 404 Not Found everywhere due to accidental removal of mediawiki docroot symlink
* [20160407-MediaWiki](https://wikitech.wikimedia.org/wiki/Incident_documentation/20160407-Mediawiki) - Fatal error due to invalid PHP expression caused by a bad config change.
* [20160601-MediaWiki](https://wikitech.wikimedia.org/wiki/Incident_documentation/20160601-MediaWiki) - Fatal exception everywhere due to typo in extension name being loaded.
* [20160713-MediaWiki](https://wikitech.wikimedia.org/wiki/Incident_documentation/20160713-MediaWiki) - Fatal exception on certain (not all) page actions due to misnamed PHP file inclusion.
* [20170104-MediaWiki](https://wikitech.wikimedia.org/wiki/Incident_documentation/20170104-MonologSpi) - Fatal exception everywhere due to invalid return value from a callback in wmf-config.
* [20170111-MediaWiki](https://wikitech.wikimedia.org/wiki/Incident_documentation/20170111-multiversion) - Fatal exception from multiversion.
* [20170124-MediaWiki](https://wikitech.wikimedia.org/wiki/Incident_documentation/20170124-WikibaseClient-InterwikiSorting) - Fatal exception on all wikis, from WikibaseClient.
* [20180129-MediaWiki](https://wikitech.wikimedia.org/wiki/Incident_documentation/20180129-MediaWiki) - Fatal exception everywhere, from wmf-config.
## See also