Page MenuHomePhabricator

Investigate 30x increase in Jobrunner errors
Closed, ResolvedPublic

Description

https://grafana.wikimedia.org/dashboard/db/job-queue-health?refresh=1m&orgId=1&from=1498946272666&to=1500370361611

Screen Shot 2017-07-21 at 21.00.45.png (712×1 px, 103 KB)

The spike starts very rapidly at 2017-07-11 16:50 exactly.

We went from under 30-50 errors per minute to 1000 errors per minute. They're from Jobrunner, which means the counts are in statsd/Graphite, but the errors are only written to stderr (stored in /var/log/mediawiki/jobrunner on the job runners).

Source:


Action items:

  • Fix problem by stopping the job runner from trying to run jobs for the deleted wiki.
  • Figure out why it was still running in the first place. (The wiki was deleted at least 3 weeks ago.)
  • Update https://wikitech.wikimedia.org/wiki/Delete_a_wiki if needed.
  • Index errors from jobrunner.log in Logstash?
  • Once the logs are index, make sure they are included in Scap's our Logstash monitor so that any obvious problems such as these (30x increase) do get detected when deployed, and automatically rolled back.

Event Timeline

Checking the local logs on one of the servers (mw1303), the error is quite easily found. The last two log files are 1000x bigger (19G instead of 8M).

And at exactly, the flood of errors begin. They are all the same:

/var/log/mediawiki/jobrunner.log.1
Jul 11 16:53:25 mw1303 jobrunner[3826]: 2017-07-11T16:53:25+0000 ERROR: Runner loop 1 process in slot 5 gave status '0':
Jul 11 16:53:25 mw1303 jobrunner[3826]: json_decode() error (4): Syntax error: @todo more info
Jul 11 16:53:25 mw1303 jobrunner[3826]: curl -XPOST -s -a 'http://127.0.0.1:9005/rpc/RunJobs.php?wiki=ukwikimedia&type=htmlCacheUpdate&maxtime=30&maxmem=300M' STDOUT:
Jul 11 16:53:25 mw1303 jobrunner[3826]: <!DOCTYPE html>
Jul 11 16:53:25 mw1303 jobrunner[3826]: <html lang="en" dir="ltr">
Jul 11 16:53:25 mw1303 jobrunner[3826]: <head>
Jul 11 16:53:25 mw1303 jobrunner[3826]: <title>No wiki found</title>
Jul 11 16:53:25 mw1303 jobrunner[3826]: <style type="text/css">
Jul 11 16:53:25 mw1303 jobrunner[3826]: /* <![CDATA[ */
Jul 11 16:53:25 mw1303 jobrunner[3826]: * {
Jul 11 16:53:25 mw1303 jobrunner[3826]: font-family: 'Gill Sans MT', 'Gill Sans', sans-serif;
Jul 11 16:53:25 mw1303 jobrunner[3826]: margin: 0;
Jul 11 16:53:25 mw1303 jobrunner[3826]: padding: 0;
Jul 11 16:53:25 mw1303 jobrunner[3826]: }
Jul 11 16:53:25 mw1303 jobrunner[3826]: body {
Jul 11 16:53:25 mw1303 jobrunner[3826]: background: #fff url('//upload.wikimedia.org/wikipedia/commons/9/96/Errorbg.png') repeat-x;
Jul 11 16:53:25 mw1303 jobrunner[3826]: color: #333;
Jul 11 16:53:25 mw1303 jobrunner[3826]: margin: 0;
Jul 11 16:53:25 mw1303 jobrunner[3826]: padding: 0;
Jul 11 16:53:25 mw1303 jobrunner[3826]: }
Jul 11 16:53:25 mw1303 jobrunner[3826]: #page {
Jul 11 16:53:25 mw1303 jobrunner[3826]: background: url('//upload.wikimedia.org/wikipedia/commons/thumb/8/81/Wikimedia-logo.svg/300px-Wikimedia-logo.svg.png') center left no-repeat;
Jul 11 16:53:25 mw1303 jobrunner[3826]: height: 300px;
Jul 11 16:53:25 mw1303 jobrunner[3826]: left: 50%;
Jul 11 16:53:25 mw1303 jobrunner[3826]: margin: -150px 0 0 -360px;
Jul 11 16:53:25 mw1303 jobrunner[3826]: overflow: visible;
Jul 11 16:53:25 mw1303 jobrunner[3826]: position: absolute;
Jul 11 16:53:25 mw1303 jobrunner[3826]: top: 50%;
Jul 11 16:53:25 mw1303 jobrunner[3826]: width: 720px;
Jul 11 16:53:25 mw1303 jobrunner[3826]: }
Jul 11 16:53:25 mw1303 jobrunner[3826]: #message {
Jul 11 16:53:25 mw1303 jobrunner[3826]: background: url('//upload.wikimedia.org/wikipedia/commons/9/97/Errorline.png') center left no-repeat;
Jul 11 16:53:25 mw1303 jobrunner[3826]: margin-left: 300px;
Jul 11 16:53:25 mw1303 jobrunner[3826]: padding-left: 15px;
Jul 11 16:53:25 mw1303 jobrunner[3826]: }
Jul 11 16:53:25 mw1303 jobrunner[3826]: h1, h2, p {
Jul 11 16:53:25 mw1303 jobrunner[3826]: margin-bottom: 1em;
Jul 11 16:53:25 mw1303 jobrunner[3826]: }
Jul 11 16:53:25 mw1303 jobrunner[3826]: a:link, a:visited {
Jul 11 16:53:25 mw1303 jobrunner[3826]: color: #005b90;
Jul 11 16:53:25 mw1303 jobrunner[3826]: }
Jul 11 16:53:25 mw1303 jobrunner[3826]: a:hover, a:active {
Jul 11 16:53:25 mw1303 jobrunner[3826]: color: #900;
Jul 11 16:53:25 mw1303 jobrunner[3826]: }/* ]]> */
Jul 11 16:53:25 mw1303 jobrunner[3826]: </style>
Jul 11 16:53:25 mw1303 jobrunner[3826]: </head>
Jul 11 16:53:25 mw1303 jobrunner[3826]: <body>
Jul 11 16:53:25 mw1303 jobrunner[3826]: <div id="page">
Jul 11 16:53:25 mw1303 jobrunner[3826]: <div id="message">
Jul 11 16:53:25 mw1303 jobrunner[3826]: <h1>No wiki found</h1>
Jul 11 16:53:25 mw1303 jobrunner[3826]: <p>Sorry, we were not able to work out what wiki you were trying to view.
Jul 11 16:53:25 mw1303 jobrunner[3826]: Please specify a valid Host header.</p>
Jul 11 16:53:25 mw1303 jobrunner[3826]: <p style="font-size: smaller;">A&nbsp;project of the <a href="//wikimediafoundation.org" title="Wikimedia Foundation">Wikimedia Foundation</a></p>
Jul 11 16:53:25 mw1303 jobrunner[3826]: </div>
Jul 11 16:53:25 mw1303 jobrunner[3826]: </div>
Jul 11 16:53:25 mw1303 jobrunner[3826]: </body>
Jul 11 16:53:25 mw1303 jobrunner[3826]: </html>

The problem of ukwikimedia not existing was most likely caused by commit rOMWC30606caf3fb6: Move ukwikimedia to deleted.dblist / https://gerrit.wikimedia.org/r/360564 (@bd808, @demon).

Strangely, I can't find anything in the Server Admin Log about the deployment of this patch.

Server Admin Log (2017-07-11)
16:32 bblack: restarting varnish backend on cp1074 (mailbox lag)
17:19 thcipriani: starting branch cut for 1.30.0-wmf.9 T167893
17:20 mutante: mw2201, mw2202 - depool appservers for T169360 (drain flea power)

Which probably means it was merged without deploying, and then accidentally rolled out as part of preparing 1.30.0-wmf.9 (T167893, @thcipriani).

(Added action items to the task description)

Krinkle renamed this task from Investigate Jobrunner error increase to Investigate 100x increase in Jobrunner errors.Jul 22 2017, 4:15 AM
Krinkle renamed this task from Investigate 100x increase in Jobrunner errors to Investigate 30x increase in Jobrunner errors.Jul 22 2017, 4:30 AM
Krinkle updated the task description. (Show Details)

Which probably means it was merged without deploying, and then accidentally rolled out as part of preparing 1.30.0-wmf.9

It was deployed by @demon:

2017-06-27T23:05 <demon@tin> Synchronized dblists/: ukwikimedia swapped from closed to deleted (duration: 00m 46s)

Broadly this looks a whole lot like T87360: Bad LocalRenameUserJob stuck in jobrunner for vewikimedia from 2.5 years ago. There are cleanup steps documented in that task for how we made the jobrunner finally stop spamming the logs about a missing wiki.

Which probably means it was merged without deploying, and then accidentally rolled out as part of preparing 1.30.0-wmf.9

It was deployed by @demon:

2017-06-27T23:05 <demon@tin> Synchronized dblists/: ukwikimedia swapped from closed to deleted (duration: 00m 46s)

The follow up at rOMWC070d84a7bea6: Remove ukwikimedia from config was also deployed by @demon:

2017-07-06T19:05 <demon@tin> Started scap: Forcing l10n rebuild for James_F, plus some wmf-config cleanup

It was deployed by @demon:

2017-06-27T23:05 <demon@tin> Synchronized dblists/: ukwikimedia swapped from closed to deleted (duration: 00m 46s)

The follow up at rOMWC070d84a7bea6: Remove ukwikimedia from config was also deployed by @demon:

2017-07-06T19:05 <demon@tin> Started scap: Forcing l10n rebuild for James_F, plus some wmf-config cleanup

Per T171371#3462648, it was not until 2017-07-11 16:53 that HTTP requests from jobchron to /rpc/runJobs.php for ukwikimedia started failing. Up to that point, Apache's were presumably still responding to those requests. Which suggests that this sync was either a no-op or it didn't apply to job runners, and that something on June 11 did make jobrunners get the latest code.

Change 368116 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/extensions/WikimediaMaintenance@master] Create new wmfManageJobs.php for deleting queues from deleted wikis

https://gerrit.wikimedia.org/r/368116

Krinkle moved this task from Inbox, needs triage to Radar on the Performance-Team board.
Krinkle edited projects, added Release-Engineering-Team; removed Patch-For-Review, SRE.

Change 368116 merged by jenkins-bot:
[mediawiki/extensions/WikimediaMaintenance@master] Create new wmfManageJobs.php for deleting queues from deleted wikis

https://gerrit.wikimedia.org/r/368116

Krinkle added a subscriber: aaron.

Change 368116 merged by jenkins-bot:
[mediawiki/extensions/WikimediaMaintenance@master] Create new wmfManageJobs.php for deleting queues from deleted wikis

https://gerrit.wikimedia.org/r/368116

@aaron Could you take a look at this? This script is now in the production branches but I've not been able to run it with success. After trying to delete any and all queues using this version of the script, the error rates have not changed, and the logs show that jobrunner is still being invoked for the same non-existent wiki:

2017-08-07T21:18:11+0000 ERROR: Runner loop 1 process in slot 5 gave status '0':
Aug  7 21:18:11 mw1299 jobrunner[40736]: json_decode() error (4): Syntax error: @todo more info
Aug  7 21:18:11 mw1299 jobrunner[40736]: curl -XPOST -s -a 'http://127.0.0.1:9005/rpc/RunJobs.php?wiki=ukwikimedia&type=htmlCacheUpdate&maxtime=60&maxmem=300M' STDOUT:

I guess this is because JobQueue::delete() doesn't notify the aggregator. Although even if it would, it seems in the case of Redis, this is unsupported as for Redis aggregation, we assume the service (JobChron?) to do it instead.

I suppose it's possible for JobChron to manually iterate and aggregate, but it doesn't seem to do that, as otherwise it would've updated the map to remove ukwikimania after I ran wmfManageJobs.php.

Did you check redis? Are their jobs back in the queues? It's possible some remote wiki is inserting jobs via JobSpecification or something.

Looking at https://github.com/wikimedia/mediawiki-services-jobrunner/blob/master/redisJobRunnerService it throws away the old in-memory map and replaces it with the new one based on global-jobqueue-s-queuesWithJobs-* from each server. There is no persistent version of that JobRunnerService in-memory map. There must be global-jobqueue-s-queuesWithJobs-* keys for the queues anyway. Assuming delete() actually removed them, the only thing that can (rather than remove) global-jobqueue-s-queuesWithJobs-* entries is doPush().

aaron@terbium:~$ mwscript extensions/WikimediaMaintenance/wmfManageJobs.php aawiki --target ukwikimedia
refreshLinks: 0 queued; (empty or doesn't exist)
htmlCacheUpdate: 22 queued; 0 claimed (0 active, 0 abandoned); 0 delayed

Odd, it's gone from $wgLocalDatabases, so I wonder what is remote enqueueing html update jobs using a different list or DB table (wikidata?).

Change 370978 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Disallow job pushes from JobQueueGroup to bogus wikis

https://gerrit.wikimedia.org/r/370978

Change 370978 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Disallow job pushes from JobQueueGroup to bogus wikis

https://gerrit.wikimedia.org/r/370978

Nice catch. So these must be new jobs then, and this'll help catch where they are coming from. I assume the problem was with the queue deletion script (e.g. maybe the global/aggregator entry persisted). But that would only explain the RPC requests being made, it does not explain wmfManageJobs.php showing a non-empty queue (again).

Change 370978 merged by jenkins-bot:
[mediawiki/core@master] Disallow job pushes from JobQueueGroup to bogus wikis

https://gerrit.wikimedia.org/r/370978

Change 371643 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@wmf/1.30.0-wmf.13] Disallow job pushes from JobQueueGroup to bogus wikis

https://gerrit.wikimedia.org/r/371643

Change 371643 merged by Krinkle:
[mediawiki/core@wmf/1.30.0-wmf.13] Disallow job pushes from JobQueueGroup to bogus wikis

https://gerrit.wikimedia.org/r/371643

Mentioned in SAL (#wikimedia-operations) [2017-08-12T20:00:52Z] <krinkle@tin> Synchronized php-1.30.0-wmf.13/includes/jobqueue/JobQueueGroup.php: T171371 - Log job pushes to bogus wikis (duration: 00m 53s)

Max and I also noticed backlinks to this wiki from commons in global usage tables.

Change 371587 had a related patch set uploaded (by Aaron Schulz; owner: MaxSem):
[mediawiki/extensions/WikimediaMaintenance@master] Add deleteWiki.php

https://gerrit.wikimedia.org/r/371587

Mentioned in SAL (#wikimedia-operations) [2017-08-12T20:00:52Z] <krinkle@tin> Synchronized php-1.30.0-wmf.13/includes/jobqueue/JobQueueGroup.php: T171371 - Log job pushes to bogus wikis (duration: 00m 53s)

Thanks. I just delete those jobs now.

Error rate went from 500-1000/s to 50-80/s.

aaron updated the task description. (Show Details)
aaron updated the task description. (Show Details)

Closing. The two logging-related improvement action item left have their own tasks.

Change 371587 merged by jenkins-bot:
[mediawiki/extensions/WikimediaMaintenance@master] Add deleteWiki.php

https://gerrit.wikimedia.org/r/371587