<https://grafana.wikimedia.org/dashboard/db/job-queue-health?refresh=1m&orgId=1&from=1498946272666&to=1500370361611>
{F8809026}
The spike starts abruptly at exactly 2017-07-11 16:50.
The rate went from roughly 30-50 errors per minute to about 1000 errors per minute. The errors come from Jobrunner, so the counts are available in statsd/Graphite, but the error messages themselves are only written to stderr (stored in /var/log/mediawiki/jobrunner on the job runners).
Source:
* <https://github.com/wikimedia/mediawiki-services-jobrunner/blob/4e8e09cd8db5bfd747105906d8a4b47fca225da9/src/JobRunnerPipeline.php>
------
Action items:
* [ ] Fix the problem by stopping the job runner from trying to run jobs for the deleted wiki.
* [ ] Figure out why it was still running in the first place. (The wiki was deleted at least 3 weeks ago.)
* [ ] Update <https://wikitech.wikimedia.org/wiki/Delete_a_wiki> if needed.
* [ ] Index errors from jobrunner.log in Logstash?
* [ ] Once the logs are indexed, make sure they are included in Scap's Logstash monitor so that obvious problems such as this one (a 30x increase) are detected at deploy time and automatically rolled back.
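
For the last item, the kind of check meant here could look roughly like the sketch below. It is only an illustration of comparing a pre-deploy error rate against a post-deploy one; the function name `error_rate_spiked` and the thresholds `max_ratio` and `min_errors` are made up for this example and are not taken from Scap or its Logstash monitor.

```python
def error_rate_spiked(baseline_per_min, current_per_min,
                      max_ratio=10.0, min_errors=100.0):
    """Return True when the current error rate looks like a regression.

    Hypothetical helper: both thresholds are assumptions for illustration,
    not values used by any existing monitor.
    """
    # Ignore small absolute rates so a quiet log does not trigger on noise.
    if current_per_min < min_errors:
        return False
    # Guard against a zero baseline before dividing.
    floor = max(baseline_per_min, 1.0)
    return current_per_min / floor >= max_ratio


# Numbers in the range seen in this incident: ~40/min before, ~1000/min after.
print(error_rate_spiked(40, 1000))
# A small fluctuation stays below both thresholds.
print(error_rate_spiked(40, 55))
```

With a check along these lines, a jump like the one on 2017-07-11 would be flagged, while ordinary fluctuation in the 30-50 range would not.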