Job queue is increasing non-stop
Open, High, Public

Description

This doesn't sound good:

(Graph screenshot from August 21.)

Current: https://grafana.wikimedia.org/dashboard/db/job-queue-health?refresh=1m&orgId=1

Mitigation

  • 6898d06f8f Disable rebound CDN purges for backlinks in HTMLCacheUpdateJob. @aaron, https://gerrit.wikimedia.org/r/373705

    In MediaWiki 1.27 (2015) we introduced $wgCdnReboundPurgeDelay (more information at https://www.mediawiki.org/wiki/Manual:$wgCdnReboundPurgeDelay). This means that effectively every purge is actually two purges, spaced out by a few seconds, to account for race conditions between database replication and cache population. However, the common case of recursive updates (e.g. from templates) is already delayed through the JobQueue, so we can omit the rebound purge in that case.

    Fixing this improved job execution throughput, and decreased job queue growth.
  • cb7c910ba Fix old regression in HTMLCacheUpdate de-duplication. @aaron, https://gerrit.wikimedia.org/r/373979

    Following a refactor in mid-2016, the duplication logic for recursive updates started to be miscalculated (based on current time, instead of time from the initial placeholder/root job).

    Fixing this significantly decreased job queue growth.
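The de-duplication fix above keys on the root job's signature and timestamp: a newer root job supersedes leaf jobs left over from older runs of the same root. A minimal sketch of the idea in Python (hypothetical names and a toy in-memory store; the actual MediaWiki logic lives in the JobQueue/Job classes):

```python
# Map of rootJobSignature -> timestamp of the newest root job already processed.
# In MediaWiki this lives in the job queue's de-duplication store, not a dict.
last_root_done = {}

def is_duplicate(root_sig, root_timestamp):
    """A leaf job is superseded if a root job with the same signature and a
    timestamp >= its own rootJobTimestamp has already completed."""
    done = last_root_done.get(root_sig)
    return done is not None and done >= root_timestamp

def mark_root_done(root_sig, root_timestamp):
    last_root_done[root_sig] = max(last_root_done.get(root_sig, 0), root_timestamp)

# Two edits to the same (hypothetical) template enqueue two recursive root
# jobs; once the newer one completes, leaves from the older one can be skipped.
mark_root_done("Template:Cite", 20170826142341)
print(is_duplicate("Template:Cite", 20170826140000))  # older root: superseded
print(is_duplicate("Template:Cite", 20170826150000))  # newer root: must run
```

The regression meant the comparison effectively used the current time instead of the leaf's rootJobTimestamp, so superseded leaf jobs were no longer skipped.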
Dashboard: Job Queue Health
Job queue size is down from 10M to a steady ~2M. Before the regression it held steady between 100K and 1M.
Dashboard: Varnish stats
Purge rate from production Varnish servers reduced by 2-3X, from 75-100K/s to ~30K/s.
Dashboard: Job Queue Rate for htmlCacheUpdate
Queue rate of htmlCacheUpdate is back to normal. The deduplication/superseding optimisation is now working. Execution speed has increased.


Related Objects


I updated https://gerrit.wikimedia.org/r/#/c/295027/ to apply on current master. This removes CDN purges from HTMLCacheUpdate, and only performs them after RefreshLinks, and only if nothing else caused a re-render since.

With this patch applied, we should be able to reduce the throttling for HTMLCacheUpdate jobs without endangering the CDN infrastructure with bursts of purges.

aaron added a comment. Edited Aug 31 2017, 9:05 PM

Could we always bump page_touched, but only send the purges to varnish if the timestamp is within the past four days? Would that let us run the older jobs faster, since (if I understand correctly) the throttling is there to avoid overloading varnish with purges?

Unfortunately the throttling still happens regardless of page_touched. Throttling isn't based on the actual purges performed but on the number of work items in a job. Work items are a simple count of the pages in the job, not of how many pages will actually be purged. Changing this behavior would basically increase the number of purges we send to varnish.

Seems simple enough to make a Job::getAffectedWorkItems(), defaulting to 0 but updated by the backlink jobs during run(). The getBackoffTimeToWait() call in JobRunner could use the affected count and be moved down a bit, after the Job::run() call.
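A sketch of what that proposal amounts to, in Python for brevity (getAffectedWorkItems/getBackoffTimeToWait are the method names from the comment; the class and all data here are hypothetical, not the real JobRunner code):

```python
class BacklinkJob:
    """Toy model of a backlink job: many listed pages, few needing a purge."""
    def __init__(self, pages, recently_touched):
        self.pages = pages
        self.recently_touched = recently_touched
        self.affected = 0

    def run(self):
        # Only pages not already touched recently produce real CDN purges.
        self.affected = sum(1 for p in self.pages if p not in self.recently_touched)

    def get_affected_work_items(self):
        return self.affected

def get_backoff_time_to_wait(affected_items, items_per_second=100):
    # Throttle proportionally to work actually performed, not work listed,
    # and compute it after run() so the affected count is known.
    return affected_items / items_per_second

job = BacklinkJob(pages=list(range(50)), recently_touched=set(range(40)))
job.run()
print(job.get_affected_work_items())                      # 10 of 50 pages purged
print(get_backoff_time_to_wait(job.get_affected_work_items()))  # 0.1s, vs 0.5s for 50
```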

Mentioned in SAL (#wikimedia-operations) [2017-09-04T07:02:33Z] <_joe_> starting additional runJobs instance for htmlcacheupdate on commons T173710

Mentioned in SAL (#wikimedia-operations) [2017-09-05T07:03:25Z] <_joe_> launching manually 3 workers for refreshLinks jobs on commons, T173710

Joe added a comment. Sep 5 2017, 7:09 AM

We still have around 1.4 million items in queue for commons, evenly divided between htmlCacheUpdate jobs and refreshLinks jobs.

I've started a few runs of the refreshLinks job and since yesterday most jobs are just processing the same root job from August 26th.

Those jobs take a huge amount of time to be processed, see for example

2017-09-05 07:04:23 refreshLinks Special:Badtitle/Wikibase\Client\Changes\WikiPageUpdater pages=array(50) rootJobTimestamp=20170826142341 requestId=34338ba1225a45c9867136ae (uuid=71290d0f9a7d44818f7871f992df7c47,timestamp=1503757432,QueuePartition=rdb1-6380) STARTING
2017-09-05 07:05:46 refreshLinks Special:Badtitle/Wikibase\Client\Changes\WikiPageUpdater pages=array(50) rootJobTimestamp=20170826142341 requestId=34338ba1225a45c9867136ae (uuid=71290d0f9a7d44818f7871f992df7c47,timestamp=1503757432,QueuePartition=rdb1-6380) t=82755 good

Now the fact that we've been working on the same root job for two days, and that a single leaf job takes 83 seconds even when launched with --nothrottle, is very bad, IMHO.

Why do we keep running jobs that are 11 days old? Does it even make sense? Why do these jobs take so long to execute?

@aaron do you have any suggestions about what's going on there?

Change 375819 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/extensions/Wikibase@master] Pass root job params through WikiPageUpdater

https://gerrit.wikimedia.org/r/375819

aaron added a comment. Sep 5 2017, 8:12 PM

Those refreshLinks jobs (from wikibase) are the only ones that use multiple titles per job, so they will be a lot slower (seems to be 50 pages/job) than the regular ones from MediaWiki core. That is a bit on the slow side for the run time of a non-rare job type (e.g. TMH or GWT).

Emijrp removed a subscriber: Emijrp. Sep 5 2017, 8:15 PM
Joe added a comment. Sep 6 2017, 9:08 AM

Those refreshLinks jobs (from wikibase) are the only ones that use multiple titles per job, so they will be a lot slower (seems to be 50 pages/job) than the regular ones from MediaWiki core. That is a bit on the slow side for the run time of a non-rare job type (e.g. TMH or GWT).

Even with three unthrottled jobs running on terbium, I can't seem to reduce the backlog of refreshLinks jobs on commons. This reminds me of past cases where refreshLinks jobs seemed to be processed over and over. However, I ran @aaron's sanity-check script on these jobs, and in no case did I find a job executed more than 3 times, so it simply seems to be a huge number of submitted jobs that need processing.

This at least shows we need way more capacity to handle large refreshLinks jobs, or that we might need to change the way wikidata links get updated in some way.

In general, I think throttling is important for jobs that impact external entities greatly, typically the database, not for parsing-heavy jobs that are mostly cpu-bound.

As a side comment: this is one of the cases where I would've loved to have an elastic environment to run MediaWiki-related applications: I could've spun up 10 instances of jobrunner dedicated to refreshlinks (or, ideally, the system could have done it automagically), for example.

As a side comment: this is one of the cases where I would've loved to have an elastic environment to run MediaWiki-related applications: I could've spun up 10 instances of jobrunner dedicated to refreshlinks (or, ideally, the system could have done it automagically), for example.

Yep! Very true. When I first read about Borg and whenever I read about similar principles since, the job queue always comes to mind as a great use case. Of course it would benefit app server maintenance too, but the job queue pressure tends to vary more than app server pressure. A Borg-like system would allow us to make the most of the idle time on all (app) servers and gracefully fill it up with job runners.

Of course, that doesn't apply to cases that are limited by a common resource (e.g. database). But the idea is still very attractive. Permanently setting up more job runners remains a difficult calculation for us, because in the end we must prioritise app servers for site availability. On the other hand, given how idle most app servers are most of the time, it seems like a royal waste to not put it to use.

Of course, that doesn't apply to cases that are limited by a common resource (e.g. database).

If I could add to the ideal scenario, the jobqueue would have dedicated slaves AND would write with a different domain id (allowing parallelism) than the rest of the writes, so we effectively have multiple database resources with potential elasticity.

Joe added a comment. Sep 6 2017, 3:06 PM

As a side comment: this is one of the cases where I would've loved to have an elastic environment to run MediaWiki-related applications: I could've spun up 10 instances of jobrunner dedicated to refreshlinks (or, ideally, the system could have done it automagically), for example.

Yep! Very true. When I first read about Borg and whenever I read about similar principles since, the job queue always comes to mind as a great use case. Of course it would benefit app server maintenance too, but the job queue pressure tends to vary more than app server pressure. A Borg-like system would allow us to make the most of the idle time on all (app) servers and gracefully fill it up with job runners.

Of course, that doesn't apply to cases that are limited by a common resource (e.g. database). But the idea is still very attractive. Permanently setting up more job runners remains a difficult calculation for us, because in the end we must prioritise app servers for site availability. On the other hand, given how idle most app servers are most of the time, it seems like a royal waste to not put it to use.

The waste of resources we have in general on our cluster, and the pains we have with jobrunners and esp. videoscaling capacity were my prime motivators for looking into kubernetes, in fact. Even if it lacks the scheduler that borg has, it would be an enormous advantage for us.

Joe added a comment. Sep 7 2017, 12:49 PM

I did some more number crunching on the instances of runJobs.php I'm running on terbium, and found the following:

Wikibase refreshlinks jobs might benefit from being in smaller batches, as many of those are taking a long time to execute. Out of 33.4k wikibase jobs, we had the following distribution of timings:

oblivian@terbium:~$ fgrep Wikibase refreshlinks.log.* | awk '{ if ($NF == "good") split($(NF-1),res,"="); if (res[2] > 50000) print res[2] }' | wc -l                        
3418
oblivian@terbium:~$ fgrep Wikibase refreshlinks.log.* | awk '{ if ($NF == "good") split($(NF-1),res,"="); if (res[2] > 30000) print res[2] }' | wc -l                        
10814
oblivian@terbium:~$ fgrep Wikibase refreshlinks.log.* | awk '{ if ($NF == "good") split($(NF-1),res,"="); if (res[2] > 20000) print res[2] }' | wc -l                        
13430
oblivian@terbium:~$ fgrep Wikibase refreshlinks.log.* | awk '{ if ($NF == "good") split($(NF-1),res,"="); if (res[2] > 10000) print res[2] }' | wc -l                        
16949
oblivian@terbium:~$ fgrep Wikibase refreshlinks.log.* | awk '{ if ($NF == "good") split($(NF-1),res,"="); if (res[2] > 5000) print res[2] }' | wc -l
21394

As you can see, about 10% of jobs take 50 seconds or more to execute, and about 64% take more than 5 seconds, while I would expect 99% of jobs to finish within 5 seconds. These jobs can also easily exceed the maxtime of Jobrunner::Run, which is set to 30 seconds in production.
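For reference, the quoted percentages follow directly from the counts above (out of roughly 33.4k Wikibase jobs):

```python
total = 33400      # "33.4k wikibase jobs"
over_50s = 3418    # jobs with t > 50000 ms
over_5s = 21394    # jobs with t > 5000 ms

print(round(100 * over_50s / total, 1))  # ~10.2% take 50 seconds or more
print(round(100 * over_5s / total, 1))   # ~64.1% take more than 5 seconds
```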

Also, the current jobqueue has no way to schedule more jobs for wikis with a larger backlog than others. For example, yesterday jobrunner submitted only 250 refreshLinks batches/server for commons, fewer than the number for itwiki (around 350-400 per server), whose queue is 4k elements long right now, versus the 680k elements in the queue for commonswiki. This resulted in a total of 37.1k refreshLinks jobs for commonswiki being executed by the jobrunner infrastructure in a full day.

For comparison, my three threads on terbium completed a total of 35.7k jobs in the same interval.

It is pretty clear that unless we have a better scheduler or manual ways to control jobqueue processing priority, there is no way we can recover a 700k-item lag anytime soon.

I'll let my threads keep working to fight this specific fire, but either we fix things in the jobqueue, or we should expect such things to keep happening until we have fully migrated to the new change-propagation-backed transport, which should make some of these controls easier to work with.
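For illustration only: a backlog-aware scheduler could be as simple as picking the next wiki to service by weighted sampling on queue length. This is a hypothetical sketch (made-up numbers, and not how the redis-based jobrunner actually selects wikis):

```python
import random

def pick_wiki(queues, rng=random.Random(42)):
    """Pick a wiki to run jobs for, weighted by its backlog size."""
    wikis = list(queues)
    weights = [queues[w] for w in wikis]
    return rng.choices(wikis, weights=weights, k=1)[0]

# Backlog sizes roughly matching the figures quoted in the comment above.
queues = {"commonswiki": 680_000, "itwiki": 4_000, "enwiki": 80_000}
picks = [pick_wiki(queues) for _ in range(1000)]
# commonswiki should dominate the schedule instead of getting equal turns.
print(picks.count("commonswiki") > picks.count("itwiki"))
```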

Could at least that part have something to do with T164173, either as a problem from the same cause or as a consequence of the fix? I also remember some tuning of some wikidata crons or job sizes, but I'm not sure whether it was upwards or downwards, or whether it's related. Aaron, Ladsgroup or Daniel may know whether that could have had a positive or negative impact, or whether there are further actionables.

One thing I mentioned architecture-wise is that "change tracking" (for purges, invalidations, references, shared data refresh, etc.) could be, to some extent, functionally separated from the rest of the real database metadata in some cases (rather than sharing resources with it).

Wikibase refreshlinks jobs might benefit from being in smaller batches

+1 on this. Now that all jobs are emitted to EventBus as well, we have had Kafka reject a portion of them because they were larger than 4MB each. Upon inspection, we saw that these were endless lists of titles sent as a single refreshLinks job by Wikibase.

I reduced the batch size from 100 to 50, and I can take it down to 20. Let me make a patch.
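What a smaller, configurable batch size amounts to is simply chunking the title list before enqueueing, so no single job (or its serialized event) grows unboundedly. A sketch with hypothetical names (the real knob is wikiPageUpdaterDbBatchSize; this is not the Wikibase code):

```python
def split_into_jobs(titles, batch_size=20):
    """Yield refreshLinks-style job payloads of at most batch_size titles each."""
    for i in range(0, len(titles), batch_size):
        yield {"pages": titles[i:i + batch_size]}

titles = [f"Page_{n}" for n in range(105)]
jobs = list(split_into_jobs(titles, batch_size=20))
print(len(jobs))               # 6 jobs instead of one oversized job
print(len(jobs[-1]["pages"]))  # the last job holds the 5 leftover titles
```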

Change 376562 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/mediawiki-config@master] Reduce wikiPageUpdaterDbBatchSize to 20

https://gerrit.wikimedia.org/r/376562

This patch can go in when commons is on wmf.17; any sooner and it's useless. (See T174422: Make dbBatchSize in WikiPageUpdater configurable)

Change 377046 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/extensions/Wikibase@master] Allow batch sizes for different jobs to be defined separately.

https://gerrit.wikimedia.org/r/377046

Change 376562 merged by jenkins-bot:
[operations/mediawiki-config@master] Reduce wikiPageUpdaterDbBatchSize to 20

https://gerrit.wikimedia.org/r/376562

Change 377458 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/mediawiki-config@master] Reduce wikiPageUpdaterDbBatchSize to 20

https://gerrit.wikimedia.org/r/377458

Change 377458 merged by jenkins-bot:
[operations/mediawiki-config@master] Reduce wikiPageUpdaterDbBatchSize to 20

https://gerrit.wikimedia.org/r/377458

Mentioned in SAL (#wikimedia-operations) [2017-09-12T13:12:13Z] <hashar@tin> Synchronized wmf-config/Wikibase-production.php: Reduce wikiPageUpdaterDbBatchSize to 20 - T173710 (duration: 00m 45s)

Change 377811 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/extensions/Wikibase@master] Split page set before constructing InjectRCRecordsJob

https://gerrit.wikimedia.org/r/377811

Harej removed a subscriber: Harej. Sep 13 2017, 6:17 PM
Tbayer added a subscriber: Tbayer. Sep 13 2017, 9:09 PM

Change 378719 had a related patch set uploaded (by Thiemo Mättig (WMDE); owner: Thiemo Mättig (WMDE)):
[mediawiki/extensions/Wikibase@master] Refactor possibly fragile ChangeHandler/WikiPageUpdater hash calculations

https://gerrit.wikimedia.org/r/378719

Change 375819 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Pass root job params through WikiPageUpdater

https://gerrit.wikimedia.org/r/375819

Joe added a comment. Sep 19 2017, 12:55 PM

FWIW we're seeing another almost-uncontrollable growth of jobs on commons and probably other wikis. I might decide to raise the concurrency of those jobs.

YOUR1 added a subscriber: YOUR1. Sep 20 2017, 7:18 AM

Change 377046 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Allow batch sizes for different jobs to be defined separately.

https://gerrit.wikimedia.org/r/377046

Change 378719 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Refactor possibly fragile ChangeHandler/WikiPageUpdater hash calculations

https://gerrit.wikimedia.org/r/378719

Today on it.wiki I noticed a massive increase in search results for some queries related to errors that I'm currently trying to fix. This search: https://it.wikipedia.org/w/index.php?search=insource%3A%2F%27%27parlate+prego%27%27%5C%3C%5C%2F%2F&title=Speciale:Ricerca&profile=all&fulltext=1&searchToken=8w6a8h4kmdl0csochsal7380n
now has 6 results, but they were all fixed since yesterday. The weird thing is that today at 11 AM the search returned only about 4 results, while the other (already fixed) pages showed up at around 4 PM. We suppose this is still due to trouble with the job queue; is that right?

Today on it.wiki I noticed a massive increase in search results for some queries related to errors that I'm currently trying to fix. This search: https://it.wikipedia.org/w/index.php?search=insource%3A%2F%27%27parlate+prego%27%27%5C%3C%5C%2F%2F&title=Speciale:Ricerca&profile=all&fulltext=1&searchToken=8w6a8h4kmdl0csochsal7380n
now has 6 results, but they were all fixed since yesterday. The weird thing is that today at 11 AM the search returned only about 4 results, while the other (already fixed) pages showed up at around 4 PM. We suppose this is still due to trouble with the job queue; is that right?

Delays in pushing updates into search could potentially be related to the job queue. More than 12 hours is pretty exceptional for processing these, but the refreshLinks job has to run first, and on completion it triggers the search index update jobs. refreshLinks is one of the job types we've been seeing back up from time to time.

The jobqueue size just jumped to 12M in two days, and it's not going down. I don't know whether it's related to wikidata or not, but it's something people need to look into.

Joe added a comment. Sep 29 2017, 1:53 PM
oblivian@terbium:~$ /usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/group1.dblist showJobs.php --group | awk '{if ($3 > 10000) print $_}'
cawiki:  refreshLinks: 104355 queued; 3 claimed (3 active, 0 abandoned); 0 delayed
commonswiki:  refreshLinks: 2073193 queued; 44 claimed (21 active, 23 abandoned); 0 delayed
commonswiki:  htmlCacheUpdate: 1583627 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
commonswiki:  cirrusSearchLinksUpdate: 5311248 queued; 2 claimed (2 active, 0 abandoned); 0 delayed
oblivian@terbium:~$ /usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/group2.dblist showJobs.php --group | awk '{if ($3 > 10000) print $_}'
arwiki:  refreshLinks: 94729 queued; 3 claimed (3 active, 0 abandoned); 0 delayed
cewiki:  refreshLinks: 128373 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
cewiki:  htmlCacheUpdate: 25677 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
enwiki:  refreshLinks: 83152 queued; 2 claimed (2 active, 0 abandoned); 0 delayed
enwiki:  htmlCacheUpdate: 33670 queued; 1 claimed (1 active, 0 abandoned); 0 delayed
frwiki:  refreshLinks: 18401 queued; 2 claimed (2 active, 0 abandoned); 0 delayed
hywiki:  refreshLinks: 91297 queued; 1 claimed (1 active, 0 abandoned); 0 delayed
itwiki:  refreshLinks: 94906 queued; 3 claimed (3 active, 0 abandoned); 0 delayed
ruwiki:  refreshLinks: 1102450 queued; 1 claimed (1 active, 0 abandoned); 0 delayed
ruwiki:  htmlCacheUpdate: 518089 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
svwiki:  refreshLinks: 1083039 queued; 3 claimed (3 active, 0 abandoned); 0 delayed
svwiki:  htmlCacheUpdate: 144734 queued; 1 claimed (1 active, 0 abandoned); 0 delayed
ukwiki:  refreshLinks: 14833 queued; 3 claimed (3 active, 0 abandoned); 0 delayed
zhwiki:  refreshLinks: 23192 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
zhwiki:  htmlCacheUpdate: 19334 queued; 0 claimed (0 active, 0 abandoned); 0 delayed

It is pretty clear to me that one of the triggers was a namespace move we had on commons, but the underlying problem is that the number of refreshLinks and htmlCacheUpdate jobs has spun out of control.

I think we might be able to add some capacity for processing those jobs on Monday, but we probably need either to rethink our approach to the problem or to throw more hardware at it.

I think we might be able to add some capacity for processing those jobs on Monday, but we probably need either to rethink our approach to the problem or to throw more hardware at it.

I'm not sure we need more hardware so much as more effective use of the current hardware. The cirrus jobs in particular are almost entirely bound by network latency, and can be run at significantly higher rates than they are now. Over the course of an hour I ramped up the speed at which these jobs were processing (with some bare hhvm processes on 9 eqiad job runners using runJobs.php) to about 200 extra job runners. Total job queue throughput has increased significantly, from 60k jobs/minute to 100k jobs/minute, and the job runners themselves are still at ~40% idle. This is hard to generalize to other jobs, though, as they use remote resources that may or may not be available. I happen to know what this specific job does and how it should behave, but with a blanket increase in the number of job runners per server across the fleet it is much harder to predict what will happen.

mxn added a subscriber: mxn. Oct 14 2017, 7:31 PM
debt added a subscriber: debt.

Taking off the Discovery-Search tag, as there isn't much we can do, but we'll continue to monitor using the Discovery tag.

elukey added a subscriber: elukey. Oct 20 2017, 11:58 AM

Updated list (showjob1.txt contains group1, showjob.txt group2)

elukey@terbium:~$ awk '{if ($3 > 100000) print $_}' showjob1.txt
commonswiki:  refreshLinks: 1629991 queued; 5 claimed (2 active, 3 abandoned); 0 delayed
commonswiki:  htmlCacheUpdate: 2470968 queued; 0 claimed (0 active, 0 abandoned); 0 delayed

elukey@terbium:~$ awk '{if ($3 > 100000) print $_}' showjob.txt
arwiki:  refreshLinks: 142867 queued; 3 claimed (3 active, 0 abandoned); 0 delayed
itwiki:  refreshLinks: 198068 queued; 5 claimed (5 active, 0 abandoned); 0 delayed
ruwiki:  refreshLinks: 807144 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
ruwiki:  htmlCacheUpdate: 916539 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
svwiki:  refreshLinks: 1582067 queued; 4 claimed (4 active, 0 abandoned); 0 delayed
svwiki:  htmlCacheUpdate: 186069 queued; 0 claimed (0 active, 0 abandoned); 0 delayed

The job queue is currently around 8-9M elements, and the trend doesn't seem to be improving.

I think we might be able to add some capacity for processing those jobs on Monday, but we probably need either to rethink our approach to the problem or to throw more hardware at it.

Just as an FYI: we added new capacity in T165519 (we now have 19 jobrunners running in eqiad), but in theory we'd eventually need to decom mw[1161-1167] for T177387 to free some rack space and complete T165519 (other jobrunners will eventually be added as part of that task, but we need to free rack space first).

I think one of the reasons contributing to the problem is the same one we had with T171027: "Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis. We stopped emitting injectRCRecord jobs, but we still emit refreshLinks jobs to commonswiki. People are trying to make the whole thing more efficient, but I guess it takes some time. We can spin up more job runners, but that's not my call to make.

I have thought a bit about ways to mitigate this. Here are three things I think could help:

I think one of the reasons contributing to the problem is the same one we had with T171027: "Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis. We stopped emitting injectRCRecord jobs, but we still emit refreshLinks jobs to commonswiki. People are trying to make the whole thing more efficient, but I guess it takes some time. We can spin up more job runners, but that's not my call to make.

Hello, I'm a technician at ruwiki, and our wiki is one of those that were experiencing the T171027 problem the most. At the same time, I've noticed in the stats presented in the comments above that the numbers for ruwiki are constantly among the highest. Could it be connected?

RP88 added a subscriber: RP88. Tue, Oct 24, 2:06 PM

Updated status:

elukey@terbium:~$ /usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/group1.dblist showJobs.php --group | awk '{if ($3 > 10000) print $_}'
cawiki:  refreshLinks: 13566 queued; 6 claimed (6 active, 0 abandoned); 0 delayed
commonswiki:  refreshLinks: 1991671 queued; 1 claimed (1 active, 0 abandoned); 0 delayed
commonswiki:  htmlCacheUpdate: 3760683 queued; 0 claimed (0 active, 0 abandoned); 0 delayed

elukey@terbium:~$ /usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/group2.dblist showJobs.php --group | awk '{if ($3 > 10000) print $_}'
arwiki:  refreshLinks: 120524 queued; 4 claimed (4 active, 0 abandoned); 0 delayed
bewiki:  refreshLinks: 34551 queued; 3 claimed (3 active, 0 abandoned); 0 delayed
cewiki:  refreshLinks: 142590 queued; 5 claimed (5 active, 0 abandoned); 0 delayed
cewiki:  htmlCacheUpdate: 150593 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
dewiki:  htmlCacheUpdate: 11027 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
enwiki:  refreshLinks: 69933 queued; 4 claimed (4 active, 0 abandoned); 0 delayed
enwiki:  htmlCacheUpdate: 127930 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
frwiki:  refreshLinks: 41595 queued; 5 claimed (5 active, 0 abandoned); 0 delayed
hywiki:  refreshLinks: 95960 queued; 3 claimed (3 active, 0 abandoned); 0 delayed
itwiki:  refreshLinks: 240479 queued; 1 claimed (1 active, 0 abandoned); 0 delayed
itwiki:  htmlCacheUpdate: 70493 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
ruwiki:  refreshLinks: 985639 queued; 2 claimed (2 active, 0 abandoned); 0 delayed
ruwiki:  htmlCacheUpdate: 1928674 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
svwiki:  refreshLinks: 679490 queued; 8 claimed (8 active, 0 abandoned); 0 delayed

We could try to increase the number of runners dedicated to refreshLinks and htmlCacheUpdate and see if we manage to process the backlog.

Change 386636 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::mediawiki::jobrunner: raise temporarily runners for refreshLinks/hmtlCacheUpdate

https://gerrit.wikimedia.org/r/386636

thiemowmde updated the task description. Fri, Oct 27, 12:41 PM

On ruwiki, many editors are complaining about slow updating of pages after their templates change. We have a huge job queue, and it keeps growing day by day, even though no top-used templates/modules have been changed in the last few days.

Please tell us: is there any advice that could be given to us, as well as to other local communities suffering from this?

Change 386636 merged by Elukey:
[operations/puppet@production] role::mediawiki::jobrunner: inc runners for refreshLinks/htmlCacheUpdate

https://gerrit.wikimedia.org/r/386636

Mentioned in SAL (#wikimedia-operations) [2017-10-30T08:42:42Z] <elukey> raised priority of refreshlink and htmlcacheupdate job execution on jobrunners (https://gerrit.wikimedia.org/r/#/c/386636/) - T173710

On ruwiki, many editors are complaining about slow updating of pages after their templates change. We have a huge job queue, and it keeps growing day by day, even though no top-used templates/modules have been changed in the last few days.

Please tell us: is there any advice that could be given to us, as well as to other local communities suffering from this?

Hi! We are trying to solve the issue from two sides: producing fewer jobs, and prioritizing consumption of the current backlog (mostly htmlCacheUpdate and refreshLinks jobs). At the moment I don't think there is any good advice for local communities; we hope to reduce the backlog soon, but it might take a while :(

Thanks for the reply. It just surprises me that on enwiki the job queue is very lightweight, while on ruwiki it's 2/3 of the overall page count, even though enwiki is much more active. Is it because of the wide use of Wikidata in ruwiki?

Thanks for the reply. It just surprises me that on enwiki the job queue is very lightweight, while on ruwiki it's 2/3 of the overall page count, even though enwiki is much more active. Is it because of the wide use of Wikidata in ruwiki?

Yes, that's the reason.

We had some relief after the last change to the jobrunner configs, namely the queue started shrinking, but then we slipped back into the bad pattern of constantly having more jobs enqueued than completed:

I am currently seeing some big root jobs with timestamps around Oct 27th that keep having jobs executed, but I have failed to track down what generated them. If anybody has any idea about what procedure to follow to track down the root cause of this job queue increase, please come forward :)

All jobs have a requestId parameter, which is passed down through the execution chain. This is the same as the reqId field in logstash. Basically this means that if the originating request logged anything to logstash, you should be able to find it with the query type:mediawiki reqId:xxxxx and look for the very first message. That assumes, of course, that the initial request logged anything.

All jobs have a requestId parameter, which is passed down through the execution chain. This is the same as the reqId field in logstash. Basically this means that if the originating request logged anything to logstash, you should be able to find it with the query type:mediawiki reqId:xxxxx and look for the very first message. That assumes, of course, that the initial request logged anything.

Thanks! I tried some spot checks in logstash, but I am only able to see the request that starts from the jobrunner (the one executing the job), not much more .. :(

https://gerrit.wikimedia.org/r/#/c/385248 is really promising; I'm not sure when it will be deployed, but it would surely help in quickly finding a massive template change or similar.

elukey moved this task from Backlog to In Progress on the User-Elukey board. Thu, Nov 2, 12:49 PM

It was perhaps noted before, but because of the recursive nature of the refreshLinks and htmlCacheUpdate jobs, even if the backlog is being processed it may not look like it, because the jobs are just enqueuing new jobs. It will probably take some time to really know what effect things are having.
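To illustrate the fan-out: a recursive root job typically enqueues leaf jobs for one partition of pages plus another recursive job for the remainder, so the total number of job executions far exceeds the single job initially visible in the queue. A toy model (hypothetical sizes, not the actual BacklinkJobUtils logic):

```python
from collections import deque

def process_root(total_pages, leaf_size=50, partition_size=500):
    """Each 'recursive' job claims one partition: it enqueues leaf jobs for
    its own range plus a new recursive job for the remainder."""
    queue = deque([("recursive", 0, total_pages)])
    executed = 0
    while queue:
        kind, start, end = queue.popleft()
        executed += 1
        if kind == "recursive":
            split = min(start + partition_size, end)
            for i in range(start, split, leaf_size):
                queue.append(("leaf", i, min(i + leaf_size, split)))
            if split < end:
                queue.append(("recursive", split, end))
    return executed

# 10k pages -> 20 recursive jobs + 200 leaf jobs, from the one job enqueued.
print(process_root(10_000))  # prints 220
```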

elukey added a comment. Thu, Nov 2, 5:47 PM

It was perhaps noted before, but because of the recursive nature of the refreshLinks and htmlCacheUpdate jobs, even if the backlog is being processed it may not look like it, because the jobs are just enqueuing new jobs. It will probably take some time to really know what effect things are having.

We are basically lagging 2-3 days in executing jobs, but the queue keeps growing, and I have no idea whether we are facing something like T129517#2128754 or a 'genuine' (recursive) job-enqueue explosion due to a template modification or similar.

https://gerrit.wikimedia.org/r/#/c/385248 should already be working for commons, but from mwlog1001's runJobs.log I can only see stuff like causeAction=unknown causeAgent=unknown (which probably only confirms that no authenticated user/bot is triggering these jobs iteratively).

elukey added a comment. Fri, Nov 3, 8:13 AM

Status:

elukey@terbium:~$ mwscript extensions/WikimediaMaintenance/getJobQueueLengths.php |sort -n -k2 | tail -n 20
euwiki 237
tgwiki 3759
cawiki 4822
enwiktionary 17148
zhwiki 19958
nowiki 21167
wikidatawiki 28257
bewiki 110296
arwiki 132139
ukwiki 132246
dewiki 155322
svwiki 179250
frwiki 214327
hywiki 504377
itwiki 512539
cewiki 593156
enwiki 654998
ruwiki 5274159
commonswiki 8059943

Total 16619065

https://gerrit.wikimedia.org/r/#/c/385248 should already be working for commons, but from mwlog1001's runJobs.log I can only see stuff like causeAction=unknown causeAgent=unknown (which probably only confirms that no authenticated user/bot is triggering these jobs iteratively).

The unknown causes may also stem from the fact that the patch was not active when the initial job was executed, so its descendants can't know the cause.

Change 388416 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] Increase concurrency for htmlCacheUpdate

https://gerrit.wikimedia.org/r/388416

Change 388416 merged by Giuseppe Lavagetto:
[operations/mediawiki-config@master] Increase concurrency for htmlCacheUpdate

https://gerrit.wikimedia.org/r/388416

Mentioned in SAL (#wikimedia-operations) [2017-11-03T10:39:07Z] <oblivian@tin> Synchronized wmf-config/CommonSettings.php: Increase concurrency of htmlCacheUpdate jobs T173710 (duration: 00m 48s)

XXN removed a subscriber: XXN. Fri, Nov 3, 6:17 PM

Change 389427 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] jobrunner: make refreshlinks jobs low-priority

https://gerrit.wikimedia.org/r/389427

Change 389427 merged by Giuseppe Lavagetto:
[operations/puppet@production] jobrunner: make refreshlinks jobs low-priority

https://gerrit.wikimedia.org/r/389427

Mentioned in SAL (#wikimedia-operations) [2017-11-06T09:37:49Z] <_joe_> manually running htmlCacheUpdate for commonswiki and ruwiki on terbium, T173710

elukey moved this task from In Progress to Stalled on the User-Elukey board. Tue, Nov 7, 10:32 AM
Krinkle updated the task description. Wed, Nov 8, 9:03 AM