- T171371: Investigate 30x increase in Jobrunner errors
- https://wikitech.wikimedia.org/wiki/Incident_documentation/20170718-JobQueue / T129148
- Ensure that at least one job runner is included (if it is not already) in the list of canary servers that Scap uses when deploying MediaWiki code. This alone would be an improvement: any hits for mediawiki/exception, mediawiki/error, or hhvm that occur only in job runner context would then be caught early.
- Include ERROR (and higher) severity messages from the mediawiki/runJobs channel in the Logstash query for canary monitoring.
- Once the jobrunner and jobchron service logs are indexed by Logstash, include their ERROR (and higher) severity messages in the Logstash query for canary monitoring as well.
Note that the jobrunner and jobchron services are independent PHP CLI programs (not MediaWiki CLI scripts), so their logs will have a different type and are not presently included anywhere else.
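
As a rough illustration of the three proposals above, the following Python sketch builds an Elasticsearch-style query body that a canary check could send to Logstash's backing store. Everything concrete in it is an assumption made for illustration, not the actual Scap check: the field names (`host`, `type`, `channel`, `level`, `@timestamp`), the reading of "mediawiki/exception" as type `mediawiki` with channel `exception`, the type names `jobrunner` and `jobchron`, the severity names, and the host names in the usage example.

```python
# Severity names at ERROR and above (PSR-3 style naming; assumed field values).
ERROR_AND_HIGHER = ["ERROR", "CRITICAL", "ALERT", "EMERGENCY"]


def build_canary_error_query(canary_hosts, window="10m"):
    """Return a query body matching error-like log events from canary hosts.

    All field names below are hypothetical and would need to be checked
    against the real Logstash index mapping.
    """
    # Clauses assumed to be covered already: any hit counts, regardless of level.
    existing = [
        {"bool": {"filter": [{"term": {"type": "mediawiki"}},
                             {"terms": {"channel": ["exception", "error"]}}]}},
        {"term": {"type": "hhvm"}},
    ]
    # Proposed additions: only ERROR (and higher) severity messages count.
    proposed = [
        # mediawiki/runJobs channel
        {"bool": {"filter": [{"term": {"type": "mediawiki"}},
                             {"term": {"channel": "runJobs"}},
                             {"terms": {"level": ERROR_AND_HIGHER}}]}},
        # jobrunner and jobchron service logs, once they are indexed
        {"bool": {"filter": [{"terms": {"type": ["jobrunner", "jobchron"]}},
                             {"terms": {"level": ERROR_AND_HIGHER}}]}},
    ]
    return {
        "query": {
            "bool": {
                # Restrict to canary servers and to a recent time window.
                "filter": [
                    {"terms": {"host": sorted(canary_hosts)}},
                    {"range": {"@timestamp": {"gte": f"now-{window}"}}},
                ],
                # At least one of the error-like clauses must match.
                "should": existing + proposed,
                "minimum_should_match": 1,
            }
        }
    }
```

For example, `build_canary_error_query(["canary-app-1", "canary-jobrunner-1"])` (both host names are made up) yields a body whose `host` filter includes a job runner, which is what the first proposal relies on: without a job runner among the canaries, the runJobs, jobrunner, and jobchron clauses would never match.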