Job queue is increasing non-stop
Open, High, Public

Description

This doesn't sound good:

August 21

Current: https://grafana.wikimedia.org/dashboard/db/job-queue-health?refresh=1m&orgId=1

Mitigation

  • 6898d06f8f Disable rebound CDN purges for backlinks in HTMLCacheUpdateJob. @aaron, https://gerrit.wikimedia.org/r/373705

    In MediaWiki 1.27 (2015) we introduced $wgCdnReboundPurgeDelay (more information at https://www.mediawiki.org/wiki/Manual:$wgCdnReboundPurgeDelay). This means that effectively every purge is actually two purges, spaced out by a few seconds, to account for race conditions between database replication and cache population. However, the common case of recursive updates (e.g. from templates) is already delayed through the JobQueue, so we can omit the rebound purge in that case.

    Fixing this improved job execution throughput, and decreased job queue growth.
  • cb7c910ba Fix old regression in HTMLCacheUpdate de-duplication. @aaron, https://gerrit.wikimedia.org/r/373979

    Following a refactor in mid-2016, the de-duplication logic for recursive updates was being miscalculated (based on the current time, instead of the time from the initial placeholder/root job).

    Fixing this significantly decreased job queue growth.
Dashboard: Job Queue Health
Job queue size down from 10M to a steady ~2M. Before the regression it was steady between 100K and 1M.
Dashboard: Varnish stats
Purge rate from production Varnish servers reduced by 2-3X, from 75-100K/s to ~30K/s.
Dashboard: Job Queue Rate for htmlCacheUpdate
Queue rate of htmlCacheUpdate back to normal. Deduplication/Superseding optimisation is now working. Execution speed has increased.

See also:


The jobqueue has slowed down but is still increasing, and cirrusSearchIncomingLinkCount is still adding to the jobqueue at a rate of 100 jobs/second.

Cirrus link counting jobs are probably just a symptom of the backlog of refresh links jobs. Basically, every time a refresh links job finishes it queues up a cirrus job to propagate the link updates from mysql to cirrussearch. Note also that the count of refresh links jobs in the queue is an undercount, because an individual refresh links job may re-queue itself many times as it recursively distributes the work contained within it. So a single refresh links root job will turn into quite a few cirrus link counting jobs.
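To make the undercounting concrete, here is a toy model only; it is not the real BacklinkJobUtils logic, and the backlink count and batch size are made up for the example:

<?php
// Toy model: one refreshLinks "root" job for a heavily backlinked template
// re-queues itself batch by batch, and each finished batch queues a cirrus
// link-count job. The queue length therefore badly undercounts the pending
// work and the cirrus jobs that will eventually be created.
$backlinks = 10000; // pages linking to the edited template (made up)
$batchSize = 300;   // pages handled per job run (made up)

$queuedRootJobs = 1; // what a naive count of the queue would show
$jobRuns = 0;
$cirrusJobsQueued = 0;

for ( $remaining = $backlinks; $remaining > 0; $remaining -= $batchSize ) {
    $jobRuns++;          // process one batch, re-queue the remainder
    $cirrusJobsQueued++; // finishing a batch queues a cirrus link-count job
}

printf(
    "queued root jobs: %d, eventual job runs: %d, cirrus jobs generated: %d\n",
    $queuedRootJobs, $jobRuns, $cirrusJobsQueued
);
// => queued root jobs: 1, eventual job runs: 34, cirrus jobs generated: 34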

aaron added a comment.Thu, Aug 24, 7:07 PM

In other words, base jobs for entities that will divide up and purge all backlinks to the given entity. Note that each job has two entries.

Wait - each job has two entries? You mean, there are duplicates inserted, and not pruned?...

No, just two entries in the screen output of that script (STARTING and DONE).

EBernhardson added a subscriber: BBlack.EditedThu, Aug 24, 8:02 PM

I've been doing some testing with the HTMLCacheUpdate jobs against relatively low-traffic wikis with high numbers of jobs (viwiki and srwiki, with 2M and 600k page views/day respectively). It seems like almost all of the time taken is just to allow for throttling. The throttling for this is currently set to 30 'work items' per second. I'm not entirely sure how that works out, but it looks like that's 30 work items per server running jobs (of which I count 15 in eqiad)? That gives us roughly 450 items per second, 27k per minute, 1.6M per hour, and 38M per day.
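A quick sanity check of that arithmetic (a standalone sketch; the 30/s throttle and 15-server count are the figures quoted above, not values read from configuration):

<?php
// Back-of-the-envelope throughput implied by the htmlCacheUpdate throttle.
$workItemsPerSecondPerServer = 30; // throttle setting quoted above
$jobRunnerServers = 15;            // eqiad runners counted above

$perSecond = $workItemsPerSecondPerServer * $jobRunnerServers;
$perDay = $perSecond * 86400;

printf( "%d items/s, %d/min, %.2fM/hour, %.1fM/day\n",
    $perSecond, $perSecond * 60, $perSecond * 3600 / 1e6, $perDay / 1e6 );
// => 450 items/s, 27000/min, 1.62M/hour, 38.9M/day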

Talking to @BBlack, there is also a coinciding increase in purges running on the varnish hosts: https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?orgId=1&var-site=All&var-cache_type=upload&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&var-status_type=5&from=now-90d&to=now

Basically around 8/8, the same time the job queue started growing, purge rates increased dramatically and stayed there. Somehow we are creating significantly more purges, and they are probably linked to something deployed around 8/8.

Some background from bblack about the cache purge pipeline:

A) Sometime in the distant past, the way it worked is that when an edit/delete POST request came in, at the end of the request (after sending the response) there was some kind of post-response hook for async processing (before handling another request), and that's where we sent out an immediate, singular multicast purge towards varnish.

B) Sometime later, at least some of the purging was moved to going through JobQueue.  Relatedly, sometime later, someone decided to patch over the rare "lost purge" issue by having MediaWiki send out 3x copies of a given purge, spaced out by several seconds.  I think this had to be via the jobqueue (as we wouldn't stall post-response handling to do that)

C) I'm not sure, once those things were in play, whether an immediate post-response purge packet was still being sent in addition to firing off a jobqueue thing for the delayed repeats, or whether all purges now went via the jobqueue only and none were immediate.

D) In addition, sometime since 2015 we've started doing a much "better" job of handling templates/transclusions/etc., where the invalidation of a template goes into the jobqueue, and then the jobqueue recurses through all the dependent objects that got touched and purges those too. We may have gone through multiple steps of "improvement" on this front, generating far larger volumes of purge traffic at each step.

And a suggestion from jynus about something that changed around that same time period, not sure how to check if this is related:

13:07 < jynus> interesting, that day is when commons article definition was changed
13:07 < jynus> https://commons.wikimedia.org/wiki/Commons:Village_pump/Archive/2017/07#Should_content_pages_consist_of_galleries_only_or_also_include_File_pages.3F

The increase in purges happens between 8/7 08:00 and 8/8 08:00. Looking at the SAL log, there is really nothing interesting there; the train didn't roll forward till later (and then only to group0). Something other than a deployment seems to have triggered this increase.

Well, it's dropped by ~1.5M jobs in the last couple of hours and seems to be now more slowly draining the pool.

Related previous tickets about purge problems/increases: T124418 T133821

EBernhardson added a comment.EditedThu, Aug 24, 8:40 PM

Well, it's dropped by ~1.5M jobs in the last couple of hours and seems to be now more slowly draining the pool.

That's because I ran all the htmlCacheUpdate jobs on srwiki (~2M) with throttling disabled to see what kind of effect it would have. The effect was to increase the global request rate to varnish by 3x, which isn't sustainable as a way to clear out the queues.

Perhaps interestingly as well, this resulted in approximately 2 billion purge requests to varnish, which after de-duplicating is approximately 9k "actual page" purges per second over the 1.5 hours it took to run the jobs. That means for a wiki with 1.5M pages in the page table we had jobs to purge 43M pages.

Secondary purges were for dealing with replication lag scenarios, not lost purges. That was one extra purge (2X).

One easy change I can see is to not use CdnCacheUpdate from HtmlCacheUpdateJob (but still use it for the pages directly being edited). There is already a processing delay anyway (and if there is none, replication lag is less likely, though not guaranteed), so there is less "de facto" use for a secondary purge for backlinks.

Change 373705 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Disable rebound CDN purges for backlinks in HTMLCacheUpdateJob

https://gerrit.wikimedia.org/r/373705

Not necessarily a cause, but while looking into viwiki's backlog, I noticed this bot, which seems to be creating an incredible number of purge jobs: https://vi.wikipedia.org/wiki/%C4%90%E1%BA%B7c_bi%E1%BB%87t:%C4%90%C3%B3ng_g%C3%B3p/TuanminhBot?uselang=en

Secondary purges were for dealing with replication lag scenarios, not lost purges. That was one extra purge (2X).

One easy change I can see is to not use CdnCacheUpdate from HtmlCacheUpdateJob (but still use it for the pages directly being edited). There is already a processing delay anyway (and if there is none, replication lag is less likely, though not guaranteed), so there is less "de facto" use for a secondary purge for backlinks.

That said, there is still some extension or user pattern that I suspect is the underlying cause. Sacrificing rebound purges will cut purges in half, and it's easy to do, hence my patch above.

Change 373705 merged by jenkins-bot:
[mediawiki/core@master] Disable rebound CDN purges for backlinks in HTMLCacheUpdateJob

https://gerrit.wikimedia.org/r/373705

EBernhardson added a comment.EditedThu, Aug 24, 11:34 PM

Parsing through the existing jobs with a script to measure the "real" number of purges that will be issued (P5916), I'm finding that on the wikis with the largest queues the majority of the jobs do nothing. The first 50k jobs for a few wikis:

wikidatawiki: Empty: 44166 Jobs: 5834 Real: 12224 Per real job: 2.0953033938978
viwiki: Empty: 49271 Jobs: 729 Real: 4658 Per real job: 6.3895747599451
enwiki: Empty: 15407 Jobs: 19593 Real: 45418451 Per real job: 2318.0957995202
(only the first 35k on enwiki, because the avg purges per job is a bit crazy)

Now, I suppose the recent change to not count recursive jobs as work items should help resolve this, but since these jobs are already created via deferred updates after the request that caused them has been closed, maybe we could drastically cut down on the number of jobs inserted by resolving the first level of recursion directly when creating the job? That would skip inserting quite a few jobs and prevent them from being backlogged by the throttling. On the other hand, this should have no actual effect on purge rates, other than reducing the number of jobs in the queue that do nothing.

Additionally, as further evidence that stronger de-duplication of some form would probably be useful, here is, for the first 35k htmlCacheUpdate jobs on enwiki (out of 500k), the number of pages (first column) against the number of times each of those pages will be purged (second column):

281202 1
 34050 2
 52655 3
  9963 4
 17371 5
 67062 6
292181 7
 16691 8
 47153 9
 38499 10
 28976 11
379366 12
 35656 13
 19942 14
 12421 15
 27241 16
 19461 17
 10800 18
 37016 19
 26505 20
 11661 21
 19081 22
 14047 23
 17688 24
 13908 25
 29007 26
  7133 27
  6557 28
  9027 29
  8828 30
  5671 31
 22547 32
  6216 33
 13846 34
  3830 35
  5553 36
 15783 37
 37035 38
  6487 39
 31538 40
   969 41
   611 42
 16303 43
 31132 44
  5635 45
 15577 46
 34817 47
 53171 48
  1775 49
 45819 50
 49742 51
228543 52
  8991 53
    11 54
     1 56

This can be read as "228,543 pages will be purged 52 times each". Note this is only for the first 1k jobs on enwiki; there are just shy of 500k jobs currently queued.
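For reference, a purge-frequency histogram like the one above can be produced with something along these lines (a simplified sketch, not the actual P5916 script; jobs are modeled as plain arrays of titles):

<?php
// Sketch: given each queued htmlCacheUpdate job as a list of page titles it
// would purge, count how many times each title appears across all jobs and
// then bucket pages by that purge count.
$jobs = [
    [ 'Template:Foo', 'Main_Page', 'Help:Bar' ],
    [ 'Main_Page', 'Help:Bar' ],
    [ 'Main_Page' ],
]; // stand-in for the decoded queue contents

$purgesPerTitle = [];
foreach ( $jobs as $titles ) {
    foreach ( $titles as $title ) {
        $purgesPerTitle[$title] = ( $purgesPerTitle[$title] ?? 0 ) + 1;
    }
}

// Histogram: how many pages will be purged N times.
$histogram = array_count_values( $purgesPerTitle );
ksort( $histogram );
foreach ( $histogram as $purgeCount => $pages ) {
    printf( "%7d %d\n", $pages, $purgeCount );
}
// For the toy input: 1 page purged once, 1 page purged twice, 1 page purged 3 times.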

This is probably a symptom and not a cause, but I wanted to mention it anyway in case it is interesting:

There seem to be more HHVM exceptions than usual:
https://logstash.wikimedia.org/goto/80fa5708f0a5e9da4be9f4630969b72e
Most of those, at least the ones that are known, seem to be coming from the job queue (ChangeNotification and RefreshLinks, among others). They may be getting better lately.

aaron added a comment.Fri, Aug 25, 8:03 PM

Note that for de-duplication, as long as the job has rootJobTimestamp set, it will ignore rows already touched (page_touched) to a higher/equal value, and likewise not send purges to the corresponding pages. So the CDN aspects *should* already have lots of de-duplication, the job spam notwithstanding.

EBernhardson added a comment.EditedFri, Aug 25, 8:22 PM

Note that for de-duplication, as long as the job has rootJobTimestamp set, it will ignore rows already touched (page_touched) to a higher/equal value, and likewise not send purges to the corresponding pages. So the CDN aspects *should* already have lots of de-duplication, the job spam notwithstanding.

I'm not sure that is working correctly. When I cleared out the queue for srwiki by ignoring the throttle, varnish recorded an increase of about 2B purges over 90 minutes. There are 32 text varnishes, so that is around 62.5M URLs purged for a wiki that has only 1.5M rows in the page table. It looks like Title::getCdnUrls() returns ~10 things, so that should have only been 15M or so URLs to purge.
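Re-running that arithmetic (all figures are the ones quoted in this comment):

<?php
// ~2B PURGE requests observed across 32 text varnishes during the unthrottled run.
$purgeRequestsSeen = 2e9;
$textVarnishes     = 32;
$urlsPerTitle      = 10;     // rough Title::getCdnUrls() fan-out
$pagesInPageTable  = 1.5e6;

$urlsPurged   = $purgeRequestsSeen / $textVarnishes; // each URL purge goes to every text varnish
$expectedUrls = $pagesInPageTable * $urlsPerTitle;   // if each existing page were purged once

printf( "%.1fM URLs purged vs ~%.0fM expected (~%.1fx over)\n",
    $urlsPurged / 1e6, $expectedUrls / 1e6, $urlsPurged / $expectedUrls );
// => 62.5M URLs purged vs ~15M expected (~4.2x over)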

I suppose even if that were working, though, the ignored purges would still count as work items, so they would still delay further purges?

aaron added a comment.EditedFri, Aug 25, 9:09 PM

Ignored purges still count as work items, yes.

Rebound purges could explain some of the number. Redirects/variants also cause multiple purges per title.

Given the backlog, lots of them probably actually had different rootJobTimestamps. MediaWiki can de-duplicate those when it's the same backlinked page X being edited several times, by ignoring the ones with older timestamps. It's trickier when templates A and B are edited and their backlinks overlap. Sometimes that gets caught; other times both purges to the same page happen.

If the htmlCacheUpdate queue were LIFO instead of FIFO, then the higher-timestamp purges would run first more often and the lower ones would no-op given the SELECT query... that might be where the most de-duplication opportunities are missed. It mostly relies on non-parallel execution of jobs causing the range->root job division, and on leaf job execution for different template/file edits being intertwined. Whether the job with the higher rootJobTimestamp runs first or vice versa is down to luck. When it's the former, the purge is de-duplicated at the DB/CDN layer. Making that queue LIFO would nullify the rootJobSignature/timestamp de-duplication, however (e.g. for several edits to template A).

I guess visually, the limitation on per-page de-duplication looks like this:

Edit to A (t1):
Queue: JobA, <prior jobs> [tail: left, head:right]
Edit to B (t2):
Queue: JobB, <prior jobs>, JobA, <prior jobs>
As jobs run:
Queue: JobAremnant,JobAleaf1, ..., JobAleaf500, <prior jobs>, <jobB>, <prior jobs>
Queue: JobBremnant,JobBleaf1, ..., JobBleaf500, <prior jobs>, JobAremnant,JobAleaf1, ..., JobAleaf500, <prior jobs>

So the page A jobs from t1 run and *then* later the B jobs from t2. This tends to repeat as the remnant jobs divide up into leaf jobs. Leaf jobs are de-duplicated at the queue level (prior job wins), but that only applies to overlap between the current leaf job batches in the queue. The queue doesn't "know" that a later job will touch some of the same pages after some remnant job sub-divides.

EDIT: this isn't really a problem since we already set page_touched to the current timestamp at job run-time if it's set at all. The order of whether the A or B jobs run for any given title X shouldn't matter.

Though this bit is problematic:

"page_touched < " . $dbw->addQuotes( $dbw->timestamp( $touchTimestamp ) )

...seems like that comparison should use rootJobTimestamp if present.
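A self-contained sketch of what that suggested comparison change amounts to (plain PHP, not the actual HTMLCacheUpdateJob code; the shape of the job params is assumed):

<?php
// Pick the timestamp used in the "page_touched < X" comparison: preferring the
// job's rootJobTimestamp (the time of the originating edit) over a freshly
// generated touch timestamp means rows already touched after that edit get
// skipped, both for the DB update and for the corresponding CDN purges.
function chooseCasTimestamp( array $jobParams, string $freshTouchTimestamp ): string {
    return $jobParams['rootJobTimestamp'] ?? $freshTouchTimestamp;
}

// Hypothetical job: the template edit happened at 03:00; the job only runs at 04:00.
$params = [ 'rootJobTimestamp' => '20170825030000' ];
$fresh  = '20170825040000';

$cas = chooseCasTimestamp( $params, $fresh );

// A row touched at 03:30 (e.g. by a later edit's job) no longer needs work:
$pageTouched = '20170825033000';
var_dump( $cas, $pageTouched < $cas ); // "20170825030000", bool(false): row is skipped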

Change 373984 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@wmf/1.30.0-wmf.15] Disable rebound CDN purges for backlinks in HTMLCacheUpdateJob

https://gerrit.wikimedia.org/r/373984

Esc3300 updated the task description. (Show Details)Sat, Aug 26, 5:00 PM

Change 373984 merged by jenkins-bot:
[mediawiki/core@wmf/1.30.0-wmf.15] Disable rebound CDN purges for backlinks in HTMLCacheUpdateJob

https://gerrit.wikimedia.org/r/373984

Mentioned in SAL (#wikimedia-operations) [2017-08-28T14:28:13Z] <zfilipin@tin> Synchronized php-1.30.0-wmf.15/includes/jobqueue/jobs/HTMLCacheUpdateJob.php: SWAT: [[gerrit:373984|Disable rebound CDN purges for backlinks in HTMLCacheUpdateJob (T173710)]] (duration: 00m 45s)

Krinkle triaged this task as High priority.Wed, Aug 30, 6:44 PM
GWicke added a subscriber: GWicke.Wed, Aug 30, 10:56 PM

A possible contribution to the backlog building could be the infinite retry / immortal job problem described in T73853. Looking for old htmlCacheUpdate root jobs from April still executing over four months later (!) via grep htmlCacheUpdate runJobs.log | grep -c 'rootJobTimestamp=201704' in mwlog1001:/srv/mw-log yields 9208 executions, just today. Interestingly, jobs from May, June, and July are much less common (hundreds). Considering that HTMLCacheUpdateJob basically only updates touched timestamps in the DB, and then quickly fires off CDN purges, seeing anything but zero ancient jobs might mean that T73853 is not actually resolved yet. To actually establish whether this significantly contributes to the current backlog, we would need to look at the distribution of rootJobTimestamp values for htmlCacheUpdates from July, especially for the period since the backlog growth really started around the 8th.
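A rough sketch of how a distribution like the one posted further down can be pulled out of runJobs.log (the log line format is assumed from the samples quoted in this task):

<?php
// Tally htmlCacheUpdate executions by the day part of their rootJobTimestamp.
// Usage (hypothetical): php root-job-days.php < /srv/mw-log/runJobs.log
$counts = [];
while ( ( $line = fgets( STDIN ) ) !== false ) {
    if ( strpos( $line, 'htmlCacheUpdate' ) === false ) {
        continue;
    }
    if ( preg_match( '/rootJobTimestamp=(\d{8})/', $line, $m ) ) {
        $counts[$m[1]] = ( $counts[$m[1]] ?? 0 ) + 1;
    }
}
ksort( $counts );
foreach ( $counts as $day => $n ) {
    printf( "%8d %s\n", $n, $day );
}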

The general HTMLCacheUpdate / purge volume problem was previously discussed in T124418. At the time, I posted https://gerrit.wikimedia.org/r/#/c/295027/, which would move the vast majority of CDN purges to RefreshLinksJob and make purges less bursty. I think we could dust this off quite easily.

Finally, since EventBus and Kafka were brought up, let me clarify: no jobs are executed via EventBus / Kafka so far. We did start writing copies of job specs to EventBus on August 2nd (phase0), and then enabled this for phase1 on the 16th. The timing does not align with the backlog rise, so it seems unlikely that the double production contributes significantly to this issue.

HTMLCacheUpdate root job timestamp distribution, jobs executed within the last 15 hours:

   1233 20170407
   8237 20170408
     18 20170423
     18 20170426
     20 20170429
     50 20170430
     18 20170502
     18 20170504
     20 20170509
     10 20170512
     18 20170513
     16 20170523
     22 20170528
     10 20170529
     40 20170606
     20 20170617
     18 20170622
     21 20170625
     16 20170627
     10 20170628
     10 20170630
     36 20170701
     20 20170705
     28 20170708
     18 20170712
     10 20170715
     16 20170717
     18 20170724
     42 20170725
     20 20170726
     20 20170728
     17 20170729
     34 20170803
     46 20170804
     30 20170805
     50 20170807
     54 20170808
    260 20170809
    137 20170810
     16 20170811
     17 20170812
     84 20170813
     36 20170814
     10 20170815
     72 20170816
    445 20170817
     82 20170818
     67 20170819
  21452 20170820
   1825 20170821
     81 20170822
    176 20170823
   4810 20170824
   9773 20170825
  21842 20170826
 218770 20170827
   8087 20170828
 183142 20170829
3805398 20170830
Krinkle updated the task description. (Show Details)EditedWed, Aug 30, 11:26 PM

Added a mitigation section to the task description. Also a summary of the impact of the mitigations so far (based on input from @aaron).

Dashboard: Job Queue Health
Job queue size down from 10M to a steady ~2M. Before the regression it was steady between 100K and 1M.
Dashboard: Varnish stats
Purge rate from production Varnish servers reduced by 2-3X, from 75-100K/s to ~30K/s.
Dashboard: Job Queue Rate for htmlCacheUpdate
Queue rate of htmlCacheUpdate back to normal (better than previous norm, actually). Deduplication/Superseding optimisation is now working. Run rate has improved.
Krinkle updated the task description. (Show Details)Wed, Aug 30, 11:27 PM

As far as retries go, the attempts hash for wikidatawiki:htmlCacheUpdate has only a few entries, with run counts no greater than 3. The only incrementing code is doPop() in MediaWiki, the same code that made them go up to 3 to begin with. If the same job ran many times, I'd expect there to be very high values there.

> aaron@terbium:~$ mwscript eval.php wikidatawiki
> error_reporting( E_ALL );
> require("/home/aaron/eval_job_check.php");
> foreach ( $wmfLocalServices['jobqueue_redis'] as $tag => $host ) { sanityCheckJQHost( $host, wfWikiId(), 'htmlCacheUpdate' ); }
array(6) {
  ["743f54ce7b8843d8b6e4ec081f633508"]=>
  string(1) "3"
  ["ee20490772484aae905592ce6a4bc22c"]=>
  string(1) "3"
  ["a45d1c46edc8450a90da89668cbe1924"]=>
  string(1) "3"
  ["0083c49d9dec492d99ee7ea95ab25403"]=>
  string(1) "3"
  ["b1f4cb9f1b9c4402b9f8da2348d6a46f"]=>
  string(1) "3"
  ["2edd120f3b1a42edb3645d2dd777bf81"]=>
  string(1) "3"
}
array(3) {
  ["65d41242504d4e4198b1213da1d3536c"]=>
  string(1) "3"
  ["c2ceaffe86274a56b3b491899e3e3594"]=>
  string(1) "3"
  ["f38d9c0116e7438b9c8d9a8ae6f9430e"]=>
  string(1) "3"
}
array(3) {
  ["720afb9160b542b896820a8d069910c2"]=>
  string(1) "3"
  ["3407d8dd224840c2bf79c36b55bc311a"]=>
  string(1) "3"
  ["1f67fd5e59914a4686bee0877c4b935f"]=>
  string(1) "3"
}
array(2) {
  ["9aa931c3f3444cc0bd9bfa8ff3097062"]=>
  string(1) "3"
  ["a5bc6d9346f84a87ad4829edf096b977"]=>
  string(1) "3"
}
array(1) {
  ["46677062e9e74d048541f1b8dab3c63a"]=>
  string(1) "3"
}
array(3) {
  ["45b63ee504dd4f798956f6900079f452"]=>
  string(1) "3"
  ["7992a032bebf45b9a686991dc29a24b4"]=>
  string(1) "3"
  ["935f44b5c3d64dd392f29e8e8e94963b"]=>
  string(1) "3"
}
array(3) {
  ["ef36284c42da45cfa667419c820d17c6"]=>
  string(1) "3"
  ["6803b9c714b545a59d1830c5ab55ec60"]=>
  string(1) "3"
  ["832aa8f83c1f475dabc56e256f22ea84"]=>
  string(1) "3"
}
array(2) {
  ["dc12814a0b214c6d94f054aca4201115"]=>
  string(1) "3"
  ["60fc4b1e6b354189982add7dfabccf25"]=>
  string(1) "3"
}
array(4) {
  ["935610fad21c4d2eb8336cb594f57afb"]=>
  string(1) "3"
  ["0a4bfacccd8d48258f8b7689b99f3180"]=>
  string(1) "3"
  ["a381aed77ce94bec872aaebf8b96016b"]=>
  string(1) "3"
  ["e8b9c16c9d3848c38ab0c44556a7d2e4"]=>
  string(1) "3"
}
array(4) {
  ["6f4dd16a084d486dab52658a4ea54c37"]=>
  string(1) "3"
  ["0e7f6a92e6eb4bb8a121047f869c3f6e"]=>
  string(1) "3"
  ["fd18564d792f4d9f82f45a1e42c46973"]=>
  string(1) "3"
  ["3b786e6a7dfd4f2fb4a9f924f160fcba"]=>
  string(1) "3"
}
array(4) {
  ["0b13d238fa554706a08a5b2160a66e1e"]=>
  string(1) "3"
  ["8c35814276f04985b7158081acfb8dbf"]=>
  string(1) "3"
  ["1c3accd2123a4159aa7ee2e95628ad29"]=>
  string(1) "3"
  ["d5a4d5bb391d4192ab1af5a9caee9f46"]=>
  string(1) "3"
}
array(3) {
  ["ce5df11aaecc4bf9a641787c9bc41e9e"]=>
  string(1) "3"
  ["16bd168b60e348bfab39e7a8921a99a1"]=>
  string(1) "3"
  ["93ff062e5b00463e9efcda7604274112"]=>
  string(1) "3"
}

A page with 1 million backlinks would have (job divisions) x (leaf jobs per division) + (jobs that just divide into other jobs) = (1e6 / 300) * 3 + (1e6 / 300) ≈ 13,334 job runs (if none failed), and they would all have the same rootJobTimestamp. The number of jobs with the same minute prefix would be higher (with different rootJobSignature values, though). The only thing odd about the table @GWicke posted is how old some root job descendants are.
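The same arithmetic as a snippet (just the quoted formula with the figures from this comment, not the actual BacklinkJobUtils division logic):

<?php
$backlinks           = 1e6; // backlinks to the edited page
$titlesPerDivision   = 300; // batch taken per division step
$leafJobsPerDivision = 3;   // leaf jobs queued per division step

$divisionRuns = $backlinks / $titlesPerDivision;      // jobs that only divide further
$leafRuns     = $divisionRuns * $leafJobsPerDivision; // leaf jobs doing the touches/purges

printf( "~%d job runs, all with the same rootJobTimestamp\n",
    round( $leafRuns + $divisionRuns ) ); // ~13333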

Since job divisions go to the end of the queue (like any other pushed job), this makes it trickier to reason about timing. The oldest job in the queue might be for a page with a lot of backlinks. Each division puts the leaf and remnant (the one to divide further) jobs at the end of the queue, and the runners have to burn through the queue to get back to the remnant job. This cycle repeats until it's done. When the queue has any serious length, this means it might take a long time to finish some old template backlink refresh/purge. During the increase, jobs kept piling up, meaning each iteration of an old many-backlink job would take a long time to even get to the next division, stretching it out further than a continuous one-off backlog of back-to-back jobs would.

In any case, if there were a loop it would probably be in the job division itself. The code for that is largely in BacklinkJobUtils, which both htmlCacheUpdate and refreshLinks use.

Change 373521 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Decrease dbBatchSize in WikiPageUpdater

https://gerrit.wikimedia.org/r/373521

Joe added a subscriber: Joe.Thu, Aug 31, 2:10 PM

@aaron so you're saying that when someone edits a lot of pages with a lot of backlinks, we will see the jobqueue growing for quite a long time, since the divided jobs get executed later; and as long as the queue is long enough, we'll keep seeing new jobs divided/inserted into the queue whenever division jobs are executed.

Now, given that the max TTL on our caches is hard-capped at 1 day, I think we might want to drop any job whose rootJobTimestamp is more than 1 day in the past.

Does this make sense?

Joe added a subscriber: ema.Thu, Aug 31, 2:16 PM

Correcting myself after a discussion with @ema: since we have up to 4 cache layers (at most), we should only process jobs whose root timestamp is newer than 4 times the cache TTL cap ago. So anything older than 4 days should be safely discardable.

This would account for about 1% of jobs according to @GWicke's sampling, but I suspect that under heavy pressure the distribution could get significantly worse.
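A minimal sketch of the discard rule being proposed here for htmlCacheUpdate (thresholds are the figures from this comment; see the caveats about refreshLinks and page_touched raised in the replies that follow):

<?php
// Is a CDN purge rooted at $rootJobTimestamp still capable of evicting
// anything, given a 1-day TTL cap and at most 4 stacked cache layers?
function purgeStillUseful( string $rootJobTimestamp, int $now ): bool {
    $cacheTtlCap = 86400; // 1 day hard cap on cached object TTL
    $cacheLayers = 4;     // worst-case number of cache layers
    $rootTs = DateTime::createFromFormat(
        'YmdHis', $rootJobTimestamp, new DateTimeZone( 'UTC' )
    )->getTimestamp();
    return ( $now - $rootTs ) <= $cacheLayers * $cacheTtlCap;
}

// The August 26 root job still being processed on September 5 would be dropped:
var_dump( purgeStillUseful( '20170826142341', strtotime( '2017-09-05 07:04:23 UTC' ) ) );
// => bool(false)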

@Joe, that might be true for the htmlCacheUpdate jobs, but not for the refreshLinks jobs. From my understanding, the refreshLinks jobs should be processed even if they are older than the max TTL, because discarding those jobs just because they are old would make categories, backlinks, etc. less accurate.

aaron added a comment.Thu, Aug 31, 6:15 PM

Correcting myself after a discussion with @ema: since we have up to 4 cache layers (at most), we should only process jobs whose root timestamp is newer than 4 times the cache TTL cap ago. So anything older than 4 days should be safely discardable.

This would account for about 1% of jobs according to @GWicke's sampling, but I suspect that under heavy pressure the distribution could get significantly worse.

I'd be careful about the parser cache and any extensions using page_touched to validate cached values. Discarding jobs might break some assumptions there.

Could we always bump page_touched, but only send the purges to varnish if the timestamp is within the past four days? Would that let us run the older jobs faster, since (if I understand correctly) the throttling is there to avoid overloading varnish with purges?

EBernhardson added a comment.EditedThu, Aug 31, 6:26 PM

With the cache update problem looking mostly resolved, here are the remaining top queues in the job queue (as of Aug 31, 1am UTC):

commonswiki:  htmlCacheUpdate: 809453 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
commonswiki:  refreshLinks: 532823 queued; 5492 claimed (8 active, 5484 abandoned); 0 delayed
ruwiki:  refreshLinks: 253178 queued; 3 claimed (3 active, 0 abandoned); 0 delayed
enwiki:  refreshLinks: 100935 queued; 7 claimed (7 active, 0 abandoned); 0 delayed
enwiki:  htmlCacheUpdate: 83638 queued; 1 claimed (1 active, 0 abandoned); 0 delayed
svwiki:  refreshLinks: 76763 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
itwiki:  refreshLinks: 46213 queued; 3 claimed (3 active, 0 abandoned); 0 delayed
frwiki:  refreshLinks: 17155 queued; 14 claimed (14 active, 0 abandoned); 0 delayed

commonswiki is still backlogged on html cache updates, and refreshLinks jobs seem oversized across a few wikis. RefreshLinks is mostly CPU-bound, IIUC, running the wikitext parser. The flamegraph for this job looks like:
[flamegraph image]

Could we always bump page_touched, but only send the purges to varnish if the timestamp is within the past four days? Would that let us run the older jobs faster, since (if I understand correctly) the throttling is there to avoid overloading varnish with purges?

Unfortunately the throttling still happens regardless of page_touched. Throttling isn't based on the actual purges performed but on the number of work items in a job, and work items are a simple count of the pages in the job rather than of how many pages will actually be purged. Changing this behavior would basically increase the number of purges we send to varnish.

I updated https://gerrit.wikimedia.org/r/#/c/295027/ to apply on current master. This removes CDN purges from HTMLCacheUpdate, and only performs them after RefreshLinks, and only if nothing else caused a re-render since.

With this patch applied, we should be able to reduce the throttling for HTMLCacheUpdate jobs without endangering the CDN infrastructure with bursts of purges.

aaron added a comment.EditedThu, Aug 31, 9:05 PM

Could we always bump page_touched, but only send the purges to varnish if the timestamp is within the past four days? Would that let us run the older jobs faster, since (if I understand correctly) the throttling is there to avoid overloading varnish with purges?

Unfortunately the throttling still happens regardless of page_touched. Throttling isn't based on the actual purges performed but on the number of work items in a job, and work items are a simple count of the pages in the job rather than of how many pages will actually be purged. Changing this behavior would basically increase the number of purges we send to varnish.

Seems simple enough to add a Job::getAffectedWorkItems(), defaulting to 0 but updated by the backlink jobs during run(). The getBackoffTimeToWait() call in JobRunner could use the affected count and be moved down a bit, after the Job::run() call.
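A rough sketch of that idea (hypothetical class and method names loosely modeled on Job/JobRunner; this is not MediaWiki code):

<?php
// Let a job report how many work items it actually affected, and compute the
// backoff from that figure *after* run(), instead of from the raw page count
// known before the job runs.
abstract class SketchJob {
    protected $affectedWorkItems = 0;

    abstract public function run();

    public function getAffectedWorkItems() {
        return $this->affectedWorkItems; // defaults to 0 for jobs that never bump it
    }
}

class SketchHtmlCacheUpdateJob extends SketchJob {
    private $pages;
    private $alreadyTouched;

    public function __construct( array $pages, array $alreadyTouched ) {
        $this->pages = $pages;
        $this->alreadyTouched = $alreadyTouched;
    }

    public function run() {
        foreach ( $this->pages as $pageId ) {
            if ( in_array( $pageId, $this->alreadyTouched, true ) ) {
                continue; // de-duplicated: no purge sent, no work item charged
            }
            // ...touch the row and send the CDN purge here...
            $this->affectedWorkItems++;
        }
        return true;
    }
}

// Runner side: call run() first, then throttle on the affected count.
function backoffSeconds( SketchJob $job, $itemsPerSecond ) {
    return $job->getAffectedWorkItems() / $itemsPerSecond;
}

$job = new SketchHtmlCacheUpdateJob( [ 1, 2, 3, 4 ], [ 2, 3 ] );
$job->run();
printf( "%.3f s backoff\n", backoffSeconds( $job, 30 ) ); // 2 affected / 30 per s ≈ 0.067 s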

Mentioned in SAL (#wikimedia-operations) [2017-09-04T07:02:33Z] <_joe_> starting additional runJobs instance for htmlcacheupdate on commons T173710

Mentioned in SAL (#wikimedia-operations) [2017-09-05T07:03:25Z] <_joe_> launching manually 3 workers for refreshLinks jobs on commons, T173710

Joe added a comment.Tue, Sep 5, 7:09 AM

We still have around 1.4 million items in queue for commons, evenly divided between htmlCacheUpdate jobs and refreshLinks jobs.

I've started a few runs of the refreshLinks job, and since yesterday most jobs have just been processing the same root job from August 26th.

Those jobs take a huge amount of time to be processed; see for example:

2017-09-05 07:04:23 refreshLinks Special:Badtitle/Wikibase\Client\Changes\WikiPageUpdater pages=array(50) rootJobTimestamp=20170826142341 requestId=34338ba1225a45c9867136ae (uuid=71290d0f9a7d44818f7871f992df7c47,timestamp=1503757432,QueuePartition=rdb1-6380) STARTING
2017-09-05 07:05:46 refreshLinks Special:Badtitle/Wikibase\Client\Changes\WikiPageUpdater pages=array(50) rootJobTimestamp=20170826142341 requestId=34338ba1225a45c9867136ae (uuid=71290d0f9a7d44818f7871f992df7c47,timestamp=1503757432,QueuePartition=rdb1-6380) t=82755 good

Now, the fact that it's been two days we've been working on the same root job, and that a single leaf job for it takes 83 seconds even when launched with --nothrottle, is very bad, IMHO.

Why do we keep running jobs that are 11 days old? Does it even make sense? Why do these jobs take so long to execute?

@aaron do you have any suggestions about what's going on there?

Change 375819 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/extensions/Wikibase@master] Pass root job params through WikiPageUpdater

https://gerrit.wikimedia.org/r/375819

aaron added a comment.Tue, Sep 5, 8:12 PM

Those refreshLinks jobs (from Wikibase) are the only ones that use multiple titles per job, so they will be a lot slower (it seems to be 50 pages/job) than the regular ones from MediaWiki core. Still, that is a bit on the slow side for the run time of a non-rare job type (e.g. TMH or GWT).

Emijrp removed a subscriber: Emijrp.Tue, Sep 5, 8:15 PM
Joe added a comment.Wed, Sep 6, 9:08 AM

Those refreshLinks jobs (from Wikibase) are the only ones that use multiple titles per job, so they will be a lot slower (it seems to be 50 pages/job) than the regular ones from MediaWiki core. Still, that is a bit on the slow side for the run time of a non-rare job type (e.g. TMH or GWT).

Even with three unthrottled jobs running on terbium, I can't seem to reduce the backlog of refreshLinks jobs on commons. This reminds me of past cases where refreshLinks jobs seemed to be processed over and over. However, I ran @aaron's sanity check script on these jobs and in no case did I find a job executed more than 3 times, so it just seems to be a huge number of submitted jobs that need processing.

This at least shows we need way more capacity to handle large refreshLinks jobs, or that we might need to change the way wikidata links get updated.

In general, I think throttling is important for jobs that heavily impact external entities, typically the database, not for parsing-heavy jobs that are mostly CPU-bound.

As a side comment: this is one of the cases where I would've loved to have an elastic environment to run MediaWiki-related applications: I could've spun up 10 instances of jobrunner dedicated to refreshlinks (or, ideally, the system could have done it automagically), for example.

As a side comment: this is one of the cases where I would've loved to have an elastic environment to run MediaWiki-related applications: I could've spun up 10 instances of jobrunner dedicated to refreshlinks (or, ideally, the system could have done it automagically), for example.

Yep! Very true. When I first read about Borg and whenever I read about similar principles since, the job queue always comes to mind as a great use case. Of course it would benefit app server maintenance too, but the job queue pressure tends to vary more than app server pressure. A Borg-like system would allow us to make the most of the idle time on all (app) servers and gracefully fill it up with job runners.

Of course, that doesn't apply to cases that are limited by a common resource (e.g. database). But the idea is still very attractive. Permanently setting up more job runners remains a difficult calculation for us, because in the end we must prioritise app servers for site availability. On the other hand, given how idle most app servers are most of the time, it seems like a royal waste to not put it to use.

Of course, that doesn't apply to cases that are limited by a common resource (e.g. database).

If I could add to the ideal scenario: the jobqueue would have dedicated slaves AND would write with a different domain id (allowing parallelism) than the rest of the writes, so we would effectively have multiple database resources with potential elasticity.

Joe added a comment.Wed, Sep 6, 3:06 PM

As a side comment: this is one of the cases where I would've loved to have an elastic environment to run MediaWiki-related applications: I could've spun up 10 instances of jobrunner dedicated to refreshlinks (or, ideally, the system could have done it automagically), for example.

Yep! Very true. When I first read about Borg and whenever I read about similar principles since, the job queue always comes to mind as a great use case. Of course it would benefit app server maintenance too, but the job queue pressure tends to vary more than app server pressure. A Borg-like system would allow us to make the most of the idle time on all (app) servers and gracefully fill it up with job runners.

Of course, that doesn't apply to cases that are limited by a common resource (e.g. database). But the idea is still very attractive. Permanently setting up more job runners remains a difficult calculation for us, because in the end we must prioritise app servers for site availability. On the other hand, given how idle most app servers are most of the time, it seems like a royal waste to not put it to use.

The waste of resources we have in general on our cluster, and the pains we have with jobrunners and especially videoscaling capacity, were my prime motivators for looking into Kubernetes, in fact. Even if it lacks the scheduler that Borg has, it would be an enormous advantage for us.

Joe added a comment.Thu, Sep 7, 12:49 PM

I did some more number crunching on the instances of runJobs.php I'm running on terbium, and found the following:

Wikibase refreshLinks jobs might benefit from being in smaller batches, as many of them are taking a long time to execute. Out of 33.4k Wikibase jobs, we had the following distribution of timings:

oblivian@terbium:~$ fgrep Wikibase refreshlinks.log.* | awk '{ if ($NF == "good") split($(NF-1),res,"="); if (res[2] > 50000) print res[2] }' | wc -l                        
3418
oblivian@terbium:~$ fgrep Wikibase refreshlinks.log.* | awk '{ if ($NF == "good") split($(NF-1),res,"="); if (res[2] > 30000) print res[2] }' | wc -l                        
10814
oblivian@terbium:~$ fgrep Wikibase refreshlinks.log.* | awk '{ if ($NF == "good") split($(NF-1),res,"="); if (res[2] > 20000) print res[2] }' | wc -l                        
13430
oblivian@terbium:~$ fgrep Wikibase refreshlinks.log.* | awk '{ if ($NF == "good") split($(NF-1),res,"="); if (res[2] > 10000) print res[2] }' | wc -l                        
16949
oblivian@terbium:~$ fgrep Wikibase refreshlinks.log.* | awk '{ if ($NF == "good") split($(NF-1),res,"="); if (res[2] > 5000) print res[2] }' | wc -l
21394

As you can see, about 10% of the jobs take 50 seconds or longer to execute, and about 64% take more than 5 seconds, while I think we should expect 99% of jobs to execute within 5 seconds. Also, I could see these jobs easily exceeding the maxtime of Jobrunner::Run, which is set to 30 seconds in production.

Also, there is no way with the current jobqueue to give scheduling priority to wikis with more backlog than others; for example, the jobrunners submitted only 250 batches/server for refreshLinks on commons yesterday, way fewer than the number posted for itwiki (around 350-400 per server), whose queue is 4k elements long right now, versus the 680k elements in the queue for commonswiki. This resulted in a total of 37.1k refreshLinks jobs for commonswiki being executed by the jobrunner infrastructure in a full day.

For comparison, my three threads on terbium completed a total of 35.7k jobs in the same interval.

It is pretty clear that unless we have a better scheduler, or manual ways to control jobqueue processing priority, there is no way we can recover from a 700k-item lag anytime soon.

I'll let my threads keep working to fight this specific fire, but either we fix things in the jobqueue, or we should expect such things to keep happening until we have fully migrated to the new change-propagation-backed transport, which should make some of these controls easier to work with.

Could that part, at least, have something to do with T164173, either as a problem with the same cause or as a consequence of the fix? I also remember some tuning of some wikidata crons or job sizes, but I'm not sure if it was upwards or downwards, and not sure if it is related. Aaron, Ladsgroup or Daniel may know whether that could have had a positive or negative impact, or whether there are further actionables.

One thing I mentioned architecture-wise is that "change tracking" (for purges, invalidations, references, shared data refresh, etc.) could be, to some extent, functionally separated from the rest of the real database metadata in some cases (rather than sharing resources with it).

Wikibase refreshLinks jobs might benefit from being in smaller batches

+1 on this. Now that we have all jobs being emitted to EventBus as well, we have had Kafka reject a portion of the jobs because they were larger than 4 MB each. Upon inspection, we saw that these were endless lists of titles sent as one refreshLinks job by Wikibase.

I reduced the batch size from 100 to 50, and I can take it down to 20. Let me make a patch.
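For reference, the shape of the change being prepared; this is a hypothetical config excerpt, and the exact settings array name is assumed rather than taken from wmf-config:

<?php
// Hypothetical wmf-config excerpt: cap how many client pages one Wikibase
// page-updater job may carry (the wikiPageUpdaterDbBatchSize setting
// referenced by the patches below).
$wgWBClientSettings['wikiPageUpdaterDbBatchSize'] = 20; // was 100, then 50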

Change 376562 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/mediawiki-config@master] Reduce wikiPageUpdaterDbBatchSize to 20

https://gerrit.wikimedia.org/r/376562

This patch can go in when commons is on wmf.17; any sooner and it's useless. (See T174422: Make dbBatchSize in WikiPageUpdater configurable.)

Change 377046 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/extensions/Wikibase@master] Allow batch sizes for different jobs to be defined separately.

https://gerrit.wikimedia.org/r/377046

Change 376562 merged by jenkins-bot:
[operations/mediawiki-config@master] Reduce wikiPageUpdaterDbBatchSize to 20

https://gerrit.wikimedia.org/r/376562

Change 377458 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/mediawiki-config@master] Reduce wikiPageUpdaterDbBatchSize to 20

https://gerrit.wikimedia.org/r/377458

Change 377458 merged by jenkins-bot:
[operations/mediawiki-config@master] Reduce wikiPageUpdaterDbBatchSize to 20

https://gerrit.wikimedia.org/r/377458

Mentioned in SAL (#wikimedia-operations) [2017-09-12T13:12:13Z] <hashar@tin> Synchronized wmf-config/Wikibase-production.php: Reduce wikiPageUpdaterDbBatchSize to 20 - T173710 (duration: 00m 45s)

Change 377811 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/extensions/Wikibase@master] Split page set before constructing InjectRCRecordsJob

https://gerrit.wikimedia.org/r/377811

Harej removed a subscriber: Harej.Wed, Sep 13, 6:17 PM
Tbayer added a subscriber: Tbayer.Wed, Sep 13, 9:09 PM

Change 378719 had a related patch set uploaded (by Thiemo Mättig (WMDE); owner: Thiemo Mättig (WMDE)):
[mediawiki/extensions/Wikibase@master] Refactor possibly fragile ChangeHandler/WikiPageUpdater hash calculations

https://gerrit.wikimedia.org/r/378719

Change 375819 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Pass root job params through WikiPageUpdater

https://gerrit.wikimedia.org/r/375819

Joe added a comment.Tue, Sep 19, 12:55 PM

FWIW, we're seeing another almost-uncontrollable growth of jobs on commons and probably other wikis. I might decide to raise the concurrency of those jobs.

YOUR1 added a subscriber: YOUR1.Wed, Sep 20, 7:18 AM

Change 377046 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Allow batch sizes for different jobs to be defined separately.

https://gerrit.wikimedia.org/r/377046

Change 378719 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Refactor possibly fragile ChangeHandler/WikiPageUpdater hash calculations

https://gerrit.wikimedia.org/r/378719

Today on it.wiki I noticed a massive increase in search results for some queries related to errors that I'm currently trying to fix. This search: https://it.wikipedia.org/w/index.php?search=insource%3A%2F%27%27parlate+prego%27%27%5C%3C%5C%2F%2F&title=Speciale:Ricerca&profile=all&fulltext=1&searchToken=8w6a8h4kmdl0csochsal7380n
now has 6 results, but they have all been fixed since yesterday. The weird thing is that today at 11 AM the search only returned something like 4 results, while the other (already fixed) pages were added at around 4 PM. We suppose that this is still due to troubles with the job queue; is that right?

Today on it.wiki I noticed a massive increase in search results for some queries related to errors that I'm currently trying to fix. This search: https://it.wikipedia.org/w/index.php?search=insource%3A%2F%27%27parlate+prego%27%27%5C%3C%5C%2F%2F&title=Speciale:Ricerca&profile=all&fulltext=1&searchToken=8w6a8h4kmdl0csochsal7380n
now has 6 results, but they have all been fixed since yesterday. The weird thing is that today at 11 AM the search only returned something like 4 results, while the other (already fixed) pages were added at around 4 PM. We suppose that this is still due to troubles with the job queue; is that right?

Delays with pushing updates into search could potentially be related to the job queue. More than 12 hours is pretty exceptional for processing these, but the refreshLinks job has to run first, and on completion it triggers the search index update jobs. refreshLinks is one of the ones that we've been seeing back up from time to time.