
The refreshLinks jobs enqueue rate is 10 times the normal rate
Closed, ResolvedPublic

Description

Since this was first reported, the job queue grew to 32 million jobs; it is now steadily shrinking and is currently under 26 million thanks to a series of tweaks we performed. The really worrying fact, though, is that the rate of inserted jobs has grown by almost 1000% compared to last week.

The last time we had something like this it was T124194, but that was a job queue performance issue; at the moment our job queues are operating at throughputs we've never seen before.

This has consequences for the databases (see T129830), puts a general strain on our infrastructure, and should be treated as a serious production issue.

Grafana view: https://grafana.wikimedia.org/dashboard/db/job-queue-health?panelId=15&fullscreen

Event Timeline


Subscribing people who I believe were investigating this earlier.

Not sure if errors like these are related: Redis exception connecting to "rdb1005.eqiad.wmnet:6381" with exception message: read error on connection. They appear in logstash at a rate of 10k/10s. Thrown on job runners from requests like /rpc/RunJobs.php?wiki=cawiki&type=cirrusSearchLinksUpdatePrioritized&maxtime=30&maxmem=300M. Backtrace:

/srv/mediawiki/php-1.27.0-wmf.16/includes/clientpool/RedisConnectionPool.php:258
/srv/mediawiki/php-1.27.0-wmf.16/includes/jobqueue/JobQueueRedis.php:748
/srv/mediawiki/php-1.27.0-wmf.16/includes/jobqueue/JobQueueRedis.php:315
/srv/mediawiki/php-1.27.0-wmf.16/includes/jobqueue/JobQueue.php:359
/srv/mediawiki/php-1.27.0-wmf.16/includes/jobqueue/JobQueueFederated.php:290
/srv/mediawiki/php-1.27.0-wmf.16/includes/jobqueue/JobQueue.php:359
/srv/mediawiki/php-1.27.0-wmf.16/includes/jobqueue/JobQueueGroup.php:192
/srv/mediawiki/php-1.27.0-wmf.16/includes/jobqueue/JobRunner.php:158
/srv/mediawiki/rpc/RunJobs.php:47

@demon @elukey @jcrespo @ori: you all were investigating this issue earlier, can you please summarize what you found on this task? The issue was severe enough in the morning to block the SWAT deploys so it deserves some level of report back. As it is, I need to know if we are going to also block the MW train deploy.

https://logstash.wikimedia.org/#dashboard/temp/AVNiKpq8O3D718AOM4Cg

The errors have not subsided, but as you can see in the graph I pasted above ^ the error level before and after yesterday's train deployment was steady; it wasn't until a bit later that the errors started climbing.

hashar lowered the priority of this task from Unbreak Now! to High.Mar 10 2016, 8:33 PM
hashar subscribed.
[20:13:22]  <greg-g>	ori: do you want to block the train or no?
[20:13:25]  <ori>	no

Solved, summary to come up later.

Change 276653 had a related patch set uploaded (by Ori.livneh):
job{cron,runner}: don't attempt JSON validation of config file

https://gerrit.wikimedia.org/r/276653

Change 276653 merged by Ori.livneh:
job{cron,runner}: don't attempt JSON validation of config file

https://gerrit.wikimedia.org/r/276653

meh, it is still growing... currently 23 mil jobs

[20:13:22]  <greg-g>	ori: do you want to block the train or no?
[20:13:25]  <ori>	no

Solved, summary to come up later.

Given https://gerrit.wikimedia.org/r/276744, I'm doubtful that this issue is solved. With a total job queue size of 25 million jobs currently, I fail to see how this task is not "Unbreak Now!"-level priority.

Screenshot from https://grafana.wikimedia.org/dashboard/db/job-queue-health:

Screen Shot 2016-03-11 at 9.49.00 AM.png (801×2 px, 84 KB)

demon raised the priority of this task from High to Unbreak Now!.Mar 11 2016, 2:53 PM

Change 276747 had a related patch set uploaded (by Elukey):
Increase the mediawiki::jobrunner::runners_basic concurrency parameter for mw116[3-6] as temporary measure to consume more refreshLinks jobs.

https://gerrit.wikimedia.org/r/276747

Change 276747 merged by Elukey:
Increase mediawiki::jobrunner::runners_basic concurrency for mw116[3-6]

https://gerrit.wikimedia.org/r/276747

Change 276759 had a related patch set uploaded (by Elukey):
Increase mediawiki::jobrunner::runners_basic concurrency for mw116[123789].

https://gerrit.wikimedia.org/r/276759

Change 276759 merged by Elukey:
Increase mediawiki::jobrunner::runners_basic concurrency for mw116[123789].

https://gerrit.wikimedia.org/r/276759

All the mw116[1-9] job runners are running with mediawiki::jobrunner::runners_basic set to 30 (was 20 - basic contains refreshLinks). The CPU utilization went up nicely to ~50%

https://ganglia.wikimedia.org/latest/?c=Jobrunners%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2

No major signs of improvement..

Other ideas to solve this? I guess only specific types of jobs are affected, aren't they? Otherwise users on the wikis would have noticed, but I didn't see any reaction there. Last time there was a reaction, because we had 6 million jobs and page links etc. were not being updated, but I guess in this case it is just one job type, so increasing the resources may not solve the problem?

Other ideas to solve this? I guess only specific types of jobs are affected, aren't they? Otherwise users on the wikis would have noticed, but I didn't see any reaction there.

My reaction: it is impossible to add a newly uploaded image into a page with the VisualEditor, because VE does not find the new file.

What @aaron said.

This appears to be resolved. Redis errors are back down to nearly nothing.

mmodell lowered the priority of this task from Unbreak Now! to Medium.Mar 12 2016, 10:22 AM

Note that while the job queue no longer seems to be increasing (https://grafana.wikimedia.org/dashboard/db/job-queue-health?panelId=12&fullscreen&from=now-1d&to=now for the last couple of hours) it's not decreasing either.

Very interesting thing from https://grafana.wikimedia.org/dashboard/db/job-queue-health: the enqueue rate is equal to the processing rate, which is probably why the queue depth is not decreasing. Maybe forcing the job runners to process more refreshLinks jobs could help tip that ratio?

This is not resolved at all; the main issue being we're enqueueing jobs at almost 10 times the rate that we normally see.

The jobs whose enqueue rate has risen are the refreshLinks jobs, as can be seen from

https://grafana.wikimedia.org/dashboard/db/job-queue-rate

97% of presently enqueued jobs are of that type, up from 10-20% in normal conditions.
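(For anyone who wants to reproduce that breakdown on a single wiki, here is a minimal sketch using the job queue API, runnable from eval.php; getQueueTypes()/getSize() are the MediaWiki 1.27-era methods, and this only approximates what Grafana aggregates across all wikis.)

// Sketch: per-type queue sizes and each type's share for the current wiki.
$group = JobQueueGroup::singleton();
$sizes = [];
$total = 0;
foreach ( $group->getQueueTypes() as $type ) {
    $size = $group->get( $type )->getSize();
    $sizes[$type] = $size;
    $total += $size;
}
arsort( $sizes );
foreach ( $sizes as $type => $size ) {
    $share = $total ? round( 100 * $size / $total, 1 ) : 0;
    print "$type\t$size\t{$share}%\n";
}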

A jobqueue bug seems unlikely, as the number of enqueued jobs seems to be very wiki-specific; the main offenders are:

$ mwscript extensions/WikimediaMaintenance/getJobQueueLengths.php | sort -n -k2 | tail -n 11

enwiki 1432497
ptwiki 1860722
frwiktionary 1967558
itwiki 2023771
frwiki 2291454
ruwiki 2327745
svwiki 3291448
enwiktionary 3299588
commonswiki 3322845
srwiki 3602164
Total 25419913

As you can see, enwiki, despite being by far the busiest wiki, has the smallest backlog of the ten wikis listed, which suggests this has nothing to do with the actual edit traffic on the sites.

Next thing I'll do is cross-reference the major offenders with the list of installed extensions, maybe we find some specific extension that is common to all those wikis and might be causing this.

Joe renamed this task from Job queue is growing and growing to The refreshLinks jobs enqueue rate is 10 times the normal rate.Mar 14 2016, 8:46 AM
Joe raised the priority of this task from Medium to Unbreak Now!.

The analysis of enabled extensions does not suggest any immediate correlation with the behaviour we're seeing.

The fact that most of those refreshLinks jobs are bogus is confirmed by the edge cache purge rate, which has not been influenced by this surge as I would have expected it to be if the jobs were legitimate.

Other findings:

  • Terbium ran refreshLinks cron jobs between Mar 1st and Mar 8th, so this is not caused by those cron jobs.
  • The job runners are suffering due to the removal of persistent connections, and their conntrack tables are nearly full on some hosts. I am not applying any corrective measure for now, as it does not yet seem needed.

As a backpressure measure, I proposed to move the refreshLinks Jobs to a separate, dedicated queue, as it has been done for htmlCacheUpdate jobs. I am very cautious moving on with this as it might have unforeseen effects and the databases are already under stress.

An alternative that I would prefer is to drop all refreshLinks jobs from the most affected wikis as an exceptional measure to recover the queue's job processing times while we find the root cause; one must assume that most of those are bogus jobs anyway.

Also a smaller, more manageable queue could be easier to debug.
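(For reference, at the MediaWiki configuration level a single job type can be routed to its own queue via $wgJobTypeConf. The sketch below is hypothetical only — the real change would live in wmf-config and/or the jobrunner service configuration, and the server name and values are placeholders.)

// Hypothetical wmf-config/LocalSettings.php sketch; values are placeholders.
$wgJobTypeConf['refreshLinks'] = [
    'class'       => 'JobQueueRedis',
    'redisServer' => 'rdb-refreshlinks.example.eqiad.wmnet:6379', // placeholder host
    'redisConfig' => [ 'password' => null ],                      // placeholder auth
    'claimTTL'    => 3600, // seconds before an abandoned claim may be retried
    'daemonized'  => true,
];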

As a backpressure measure, I proposed to move the refreshLinks Jobs to a separate, dedicated queue, as it has been done for htmlCacheUpdate jobs. I am very cautious moving on with this as it might have unforeseen effects and the databases are already under stress.

+1. Isolating these jobs seems like a good way forward.

An alternative that I would prefer is to drop all refreshLinks jobs from the most affected wikis as an exceptional measure to recover the queue's job processing times while we find the root cause; one must assume that most of those are bogus jobs anyway.

Note that dropping the whole queue might also kill the root cause of the problem, although, by now, it is likely that the job that started this chain of events is long gone.

Also a smaller, more manageable queue could be easier to debug.

I have added a graph to the Grafana job queue health dashboard that shows the per-hour addition/removal of jobs: https://grafana.wikimedia.org/dashboard/db/job-queue-health?panelId=15&fullscreen

Red/green bars show the delta:

Capture d’écran 2016-03-14 à 14.16.13.png (514×997 px, 78 KB)

In short, from 2016-03-10 ~07:00 to 2016-03-12 00:00, more jobs were inserted into the queue than the job runners could process (red bars).

We then had a couple of huge drops in queue size, represented by the downward-facing green bars.

I launched a forcerecursivelinkupdate purge API action on 03/10 at 02:40 UTC affecting all pages on plwiktionary; a smaller peak is clearly visible on Grafana right after that. The jobs counter for this project kept growing over the following days, whereas it usually stops after a few hours. Could this have any connection?

I have been looking at a refreshLinks job on frwiki. It was triggered by an edit on the page Module:Suivi_des_biographies/données and has Robert_James_Waller as its page parameter. The job is somehow stuck in the queue :-/

Using eval.php --wiki=frwiki:

$group = JobQueueGroup::singleton();
$queue = $group->get( 'refreshLinks' );
$claimed = $queue->getAllAcquiredJobs();
foreach ( $claimed as $job ) {
    print $job->toString();
    print json_encode( $job->getParams(), JSON_PRETTY_PRINT );
    print json_encode( $job->metadata, JSON_PRETTY_PRINT );
}

That yields:

refreshLinks Module:Suivi_des_biographies/données pages=array(1) rootJobSignature=45868e99bba89064e4483743ebb9b682ef95c1a7 rootJobTimestamp=20160309110158 masterPos=db1023-bin.001288/308257743 triggeredRecursive=1 (uuid=5aa2235025984a76b70f1fa18ec1fd42,timestamp=1457962830,attempts=1)
// params
{
    "pages": {
        "932737": [
            0,
            "Robert_James_Waller"
        ]
    },
    "rootJobSignature": "45868e99bba89064e4483743ebb9b682ef95c1a7",
    "rootJobTimestamp": "20160309110158",
    "masterPos": {
        "file": "db1023-bin.001288",
        "pos": "308257743",
        "asOfTime": 1457521464.3814
    },
    "triggeredRecursive": true
}
// metadata
{
    "uuid": "5aa2235025984a76b70f1fa18ec1fd42",
    "timestamp": 1457962830,
    "attempts": "1"
}

The root job is from a few days ago (2016/03/09). This job has a timestamp of 1457962830, i.e. March 14th 13:40:30 UTC, which is more than one hour ago. I am not sure why it stays in claimed status.

Could that be holding up the other refreshLinks jobs that have the same triggering title and target page? There are a lot of refreshLinks jobs in the queue for the title Module:Suivi_des_biographies/données.


I am having a look at the logs. I found a few wikis that show up at the top of the list these last few days but not earlier, so I decided to focus on them. svwiki, srwiki, enwiktionary are the lucky winners. So, looking at svwiki:

I checked entries for svwiki in today's log. Most of them look like this:

2016-03-15 08:27:09 mw1167 svwiki 1.27.0-wmf.16 runJobs DEBUG: refreshLinks Modul:Wikidata2 pages=array(1) rootJobSignature=4162dfc34af3de00b7569698bb71e461e51e63ce rootJobTimestamp=20160309192938 masterPos=db1018bin.001122/434766410 triggeredRecursive=1(uuid=ef16a7a2733848359074f7b0da38ca1c,timestamp=1458023800,QueuePartition=rdb3-6381) STARTING

(or ending in t=num good)

I checked the first 300k and the last 300k messages in today's runJobs log on fluorine, and almost all of the entries are Modul:Wikidata2 with this same rootJobTimestamp. That covers 08:30 til 14:30 today. This seems suspicious to me.

I notice all the masterPos entries are identical too: masterPos=db1018bin.001122/434766410 for all of them.

This same root job with the same masterPos makes its appearance in the same exact fashion in runJobs.log-20160314.gz (first 3 million log lines checked for svwiki entries), in 0313, in 0312. Its first occurrence is in runJobs.log-20160311.gz fwiw.

Is it normal for a root job to take so many days to pass through?

Also I noticed in class RefreshLinksJob that there's this stanza about dedups in the constructor:

		$this->removeDuplicates = (
                        // Master positions won't match
                        !isset( $params['masterPos'] ) && ...

Does this mean that none of these child jobs are eligible for deduping? Or is that irrelevant here? @aaron I could use your wisdom right about now :-)
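To illustrate why the masterPos check matters, here is a toy example only (not the actual MediaWiki deduplication code): if the ever-changing master position is part of what gets hashed for duplicate detection, two otherwise-identical per-title jobs will never be seen as duplicates.

// Toy illustration only; MediaWiki's real dedup logic lives in Job/JobQueue.
function dedupFingerprint( array $params, $includeMasterPos ) {
    if ( !$includeMasterPos ) {
        unset( $params['masterPos'] );
    }
    ksort( $params );
    return sha1( serialize( $params ) );
}

$jobA = [
    'pages' => [ 932737 => [ 0, 'Robert_James_Waller' ] ],
    'masterPos' => 'db1023-bin.001288/308257743',
];
$jobB = [
    'pages' => [ 932737 => [ 0, 'Robert_James_Waller' ] ],
    'masterPos' => 'db1023-bin.001288/999999999', // a later position
];

var_dump( dedupFingerprint( $jobA, true ) === dedupFingerprint( $jobB, true ) );   // bool(false)
var_dump( dedupFingerprint( $jobA, false ) === dedupFingerprint( $jobB, false ) ); // bool(true)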

This certainly does support the notion that root jobs get 'stuck' somehow. And it might be very helpful indeed to flush a handful of them. It would be easy enough to find them in the logs.

Also I note that Aaron's patch https://gerrit.wikimedia.org/r/#/c/277353/ is still pending merge....

I haven't really looked at the issue, but both svwiki and frwiki show a very similar pattern for almost all of their jobs. I wrote a very basic Python filter to be fed with input from the runJobs.log:

fluorine:/home/hashar/refresh.py

import sys

# Collapse each runJobs.log line down to the job title and its key parameters,
# so that consecutive identical jobs can be counted with `uniq -c`.
for line in sys.stdin:
        b=line.split(' ')
        try:
                if len(b) == 18:
                        print b[8], b[9], b[13]
                else:
                        print b[8], b[9], b[10], b[11], b[15]
        except IndexError:
                print "ERR> " + line

Then grep and pipe:

tail -F runJobs.log |grep svwiki.*refreshLinks.*Modul:Wikidata2.*STARTING|python ~/refresh.py|uniq -c
     66 Modul:Wikidata2 pages=array(1) triggeredRecursive=1
      1 Modul:Wikidata2 recursive=1 table=templatelinks range=array(4) triggeredRecursive=1
    379 Modul:Wikidata2 pages=array(1) triggeredRecursive=1
      1 Modul:Wikidata2 recursive=1 table=templatelinks range=array(4) triggeredRecursive=1
    883 Modul:Wikidata2 pages=array(1) triggeredRecursive=1
      1 Modul:Wikidata2 recursive=1 table=templatelinks range=array(4) triggeredRecursive=1
    392 Modul:Wikidata2 pages=array(1) triggeredRecursive=1
      1 Modul:Wikidata2 recursive=1 table=templatelinks range=array(4) triggeredRecursive=1
     11 Modul:Wikidata2 pages=array(1) triggeredRecursive=1
      1 Modul:Wikidata2 recursive=1 table=templatelinks range=array(4) triggeredRecursive=1

And if you feed it the whole list of svwiki refreshLinks jobs, the output is almost entirely made of "Modul:Wikidata2".

frwiki currently loops with:

      1 Module:Suivi_des_biographies/données recursive=1 table=templatelinks range=array(4) triggeredRecursive=1
then lot of:
      1 Module:Suivi_des_biographies pages=array(1) triggeredRecursive=1
      2 Module:Suivi_des_biographies/données pages=array(1) triggeredRecursive=1
goto start

So, first: it would be VERY useful to find out what is in the pages argument. It only shows up as an array in the logs, since the key is the page id and the value is itself an array of (namespace, pagename). We should hack that in eventually. The root job has a uuid, and I guess we can retrieve its parameters directly from Redis; I haven't found out how to do it yet though :(
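(For the record, the raw payload can be pulled out of Redis once you have the uuid. Here is a minimal sketch using the phpredis client, assuming the <dbname>:jobqueue:<type>:h-data hash layout that the HGET command further down this thread confirms; the host, port and hard-coded uuid are placeholders.)

// Sketch: fetch and decode one job's payload by uuid. Placeholders throughout.
$redis = new Redis();
$redis->connect( 'rdb1005.eqiad.wmnet', 6381 ); // placeholder host/port
// $redis->auth( '...' );                       // if the server requires a password

$blob = $redis->hGet( 'frwiki:jobqueue:refreshLinks:h-data', '0c81398fd654461abc60407fa6a8c7a7' );
if ( $blob === false ) {
    print "No payload found for that uuid\n";
} else {
    $job = @unserialize( $blob ); // payloads are PHP-serialized arrays
    print $job === false
        ? "Payload is corrupt and cannot be unserialized\n"
        : json_encode( $job['params'], JSON_PRETTY_PRINT ) . "\n";
}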

What I suspect is that we have refreshLinks jobs that recursively trigger themselves. I also found out that RefreshLinks actually has three variants:

* This job comes in a few variants:
*   - a) Recursive jobs to update links for backlink pages for a given title.
*        These jobs have (recursive:true,table:<table>) set.
*   - b) Jobs to update links for a set of pages (the job title is ignored).
*        These jobs have (pages:(<page ID>:(<namespace>,<title>),...) set.
*   - c) Jobs to update links for a single page (the job title)
*        These jobs need no extra fields set.
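
(A hedged sketch of what enqueuing each variant looks like, with the parameter shapes taken from the doc comment above; the titles, page ID and table name are illustrative, and the Module namespace assumes Scribunto is installed.)

// Sketch: constructing the three RefreshLinksJob variants described above.
$title = Title::newFromText( 'Module:Suivi_des_biographies/données' );

// a) Recursive: refresh every page that transcludes/links to $title.
$recursive = new RefreshLinksJob( $title, [
    'recursive' => true,
    'table'     => 'templatelinks',
] );

// b) Page set: refresh a specific batch of pages (the job title is ignored).
$batch = new RefreshLinksJob( $title, [
    'pages' => [
        932737 => [ 0, 'Robert_James_Waller' ], // page ID => [ namespace, dbkey ]
    ],
] );

// c) Single page: refresh the job title itself; no extra parameters needed.
$single = new RefreshLinksJob( $title, [] );

JobQueueGroup::singleton()->push( [ $recursive, $batch, $single ] );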

For the record:

mwscript extensions/WikimediaMaintenance/getJobQueueLengths.php | sort -n -k2 | tail -n10

itwiki 3007
enwiki 1377410
frwiktionary 1964069
frwiki 2280332
ruwiki 2343680
commonswiki 3091008
enwiktionary 3251065
svwiki 3271156
srwiki 3640149
Total 21227748

One possible loop:

RefreshLinksJob::runForTitle() → ParserCache::singleton()->save() → Hooks::run() → Wikibase\Client\Hooks\DataUpdateHookHandlers->doParserCacheSaveComplete() → JobQueueGroup->lazyPush()

One possible loop:

RefreshLinksJob::runForTitle() → ParserCache::singleton()->save() → Hooks::run() → Wikibase\Client\Hooks\DataUpdateHookHandlers->doParserCacheSaveComplete() → JobQueueGroup->lazyPush()

While that is not nice, we don't do anything in there which could trigger a refreshlinks job as far as I can tell.

Mentioned in SAL [2016-03-16T10:50:59Z] <moritzm> rolling reboot of mw* servers in eqiad (except job runners to not interfere with T129517)

Been looking on frwiki for the UUIDs of claimed jobs via:

mwscript showJobs.php --wiki=frwiki --type=refreshLinks --status=claimed --list|grep -o -P "uuid=\w+"

The UUIDs that stick around are:

97 uuid=04c5e091bb6b4a4aa8821b7ed50d8859
97 uuid=0c81398fd654461abc60407fa6a8c7a7
97 uuid=573e0bc273784616af95023812317087
97 uuid=bd8f5e26ae7e456da81190e4d9c71d2a

I retrieved each of them from the Redis servers via HGET frwiki:jobqueue:refreshLinks:h-data <UUID> (though I lost one in the process). The payloads are:

UUID 0c81398fd654461abc60407fa6a8c7a7

'a:8:{s:4:"type";s:12:"refreshLinks";s:9:"namespace";i:828;s:5:"title";s:21:"Suivi_des_biographies";s:6:"params";a:5:{s:5:"pages";a:1:{i:1104639;a:2:{i:0;i:0;i:1;s:11:"Fimfxc3xa1rum_2";}}s:16:"rootJobSignature";s:40:"47f6b8861fa4744da0ac8df81823cf8de4e75b90";s:16:"rootJobTimestamp";s:14:"20160309102617";s:9:"masterPos";O:14:"MySQLMasterPos":3:{s:4:"file";s:17:"db1023-bin.001288";s:3:"pos";s:9:"175311471";s:8:"asOfTime";d:1457519307.6145029;}s:18:"triggeredRecursive";b:1;}s:10:"rtimestamp";i:0;s:4:"uuid";s:32:"0c81398fd654461abc60407fa6a8c7a7";s:4:"sha1";s:0:"";s:9:"timestamp";i:1458118294;}'

That one cannot be unserialize()d because of s:11:"Fimfxc3xa1rum_2": the declared string length does not match the actual length of the string.

Json for UUID 04c5e091bb6b4a4aa8821b7ed50d8859:

{
    "type": "refreshLinks",
    "namespace": 828,
    "title": "Suivi_des_biographies",
    "params": {
        "pages": {
            "1168990": [
                0,
                "Abomination:_The_Nemesis_Project"
            ]
        },
        "rootJobSignature": "47f6b8861fa4744da0ac8df81823cf8de4e75b90",
        "rootJobTimestamp": "20160309102617",
        "masterPos": {
            "__PHP_Incomplete_Class_Name": "MySQLMasterPos",
            "file": "db1023-bin.001288",
            "pos": "175311471",
            "asOfTime": 1457519307.6145
        },
        "triggeredRecursive": true
    },
    "rtimestamp": 0,
    "uuid": "04c5e091bb6b4a4aa8821b7ed50d8859",
    "sha1": "",
    "timestamp": 1458118277 // 16 Mar 2016 08:51:17 GMT
}

Json for UUID bd8f5e26ae7e456da81190e4d9c71d2a:

{
    "type": "refreshLinks",
    "namespace": 828,
    "title": "Suivi_des_biographies",
    "params": {
        "pages": {
            "1128445": [
                0,
                "Linas_Kleiza"
            ]
        },
        "rootJobSignature": "47f6b8861fa4744da0ac8df81823cf8de4e75b90",
        "rootJobTimestamp": "20160309102617",
        "masterPos": {
            "__PHP_Incomplete_Class_Name": "MySQLMasterPos",
            "file": "db1023-bin.001288",
            "pos": "175311471",
            "asOfTime": 1457519307.6145
        },
        "triggeredRecursive": true
    },
    "rtimestamp": 0,
    "uuid": "bd8f5e26ae7e456da81190e4d9c71d2a",
    "sha1": "",
    "timestamp": 1458115909 // 16 Mar 2016 08:11:49 GMT
}

All four UUIDs eventually disappeared from claimed status.

ori closed this task as Resolved.EditedMar 17 2016, 9:47 AM
ori claimed this task.

The code path which was inserting the vast majority of jobs is this one, in RefreshLinksJob.php:

RefreshLinksJob.php
// Convert this into no more than $wgUpdateRowsPerJob RefreshLinks per-title
// jobs and possibly a recursive RefreshLinks job for the rest of the backlinks
$jobs = BacklinkJobUtils::partitionBacklinkJob(
	$this,
	$wgUpdateRowsPerJob,
	1, // job-per-title
	[ 'params' => $extraParams ]
);
JobQueueGroup::singleton()->push( $jobs );

I can't pinpoint it more exactly than that just yet. Tailing the log of articles being refreshed showed a large but finite set of jobs cycling endlessly through the queue. One way this could happen is if there was a cycle somewhere, such that refreshing some page set off a sequence of surrogate jobs which eventually comes back around to scheduling a refresh of the page that initiated the sequence. If that is indeed the case, we may be able to isolate the problem by making jobs keep track of their lineage.
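A purely hypothetical sketch of what that lineage tracking could look like — neither the 'lineage' parameter nor the helper below exist in MediaWiki:

// Hypothetical helper: each surrogate job would carry its ancestors' titles,
// and we would log loudly whenever a title re-enters its own ancestry.
class JobLineage {
    public static function newLineageParams( Job $parent, Title $target ) {
        $lineage = isset( $parent->params['lineage'] ) ? $parent->params['lineage'] : [];
        $key = $target->getPrefixedDBkey();
        if ( in_array( $key, $lineage, true ) ) {
            wfDebugLog( 'JobCycles',
                "Possible job cycle for $key: " . implode( ' -> ', $lineage ) . " -> $key" );
        }
        $lineage[] = $key;
        return [ 'lineage' => array_slice( $lineage, -20 ) ]; // cap the chain length
    }
}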

A few hours ago I concluded that other job types have been so badly delayed by this issue for so long that the user impact of dropping RefreshLinks jobs, while substantial, would be smaller than letting the problem continue to fester. So I kludged the condition in line 84 of RefreshLinksJob.php, if ( !empty( $this->params['recursive'] ) ) { ... }, to always be false, figuring that this would still allow RefreshLinks jobs for updating a particular title or titles to run, but would not recurse into backlinks. Note that the overwhelming majority of recursive refreshes that were dropped had already been executing endlessly, so the number of genuinely stale backlinks resulting from this kludge should be small.
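Reconstructed from the description above (not the actual deployed diff), the hot-patched condition would have looked roughly like this:

// Forcing the branch to be unreachable stops recursive partitioning into
// backlink jobs, while per-title refreshLinks jobs keep running normally.
if ( false && !empty( $this->params['recursive'] ) ) {
    // recursive partitioning via BacklinkJobUtils::partitionBacklinkJob() — now skipped
}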

Deploying this change made the queue size shrink rapidly, going from 20m items to less than 100k in about an hour. Once the RefreshLinks jobs had drained, I reverted the change and re-enabled recursive jobs.

We're going to continue to work on isolating this problem and fixing it properly. (T73853 is the task to follow.) In the interim, we are going to minimize the user impact of any recurrences of this issue by processing RefreshLinks jobs in a separate queue. This should prevent a spike in RefreshLinks jobs from causing the processing of other job types to be delayed.

Thank you for your patience, and sorry for any inconvenience this issue has caused. The job queue has been flexed and stretched over the years in response to scalability challenges, and it has gradually become one of the hardest parts of the stack to reason about. We plan to address this by adding additional metrics, alerts, and logging, and then using those to guide further refactoring work.

@ori awesome really, thank you for the summary. I was trying to figure out whether it was a cycle and which cycle it might be but never managed to extract the proper data :-(

The job queue is all fine now and shows a nice drop in size. The green bars below show the one-hour delta in queue size:

Capture d’écran 2016-03-17 à 12.15.48.png (580×917 px, 63 KB)

The times are all fine as well (axis is log 10):

Capture d’écran 2016-03-17 à 12.17.44.png (537×840 px, 76 KB)