
Migrate htmlCacheUpdate job to Kafka
Closed, Resolved · Public

Description

After the initial migration of several simple jobs and of some high-traffic but still simple jobs, we think we're ready for the htmlCacheUpdate job.

That job uses deduplication extensively and has root/leaf jobs and recursion, so it will exercise change-prop functionality that has so far only been tested for RESTBase update events, not for jobs. The mechanism is exactly the same, so we can be fairly confident it will work.

The issue with htmlCacheUpdate is that we cannot double-process it for any significantly long period of time. If we double-process a big root job, we double the number of leaf jobs; but the real issue is that if the job is recursive, both copies will post a recurring job, both recurring jobs will get double-processed and create 4 times as many leaf jobs as normal, then 8, 16, etc., so the duplication has the potential to grow to an astronomical number. Deduplication should stop this, but since we have not yet battle-tested it in production with the job queue, it has the potential to explode.
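
To make the failure mode concrete, here is a tiny Python model of what happens when every recurring job is picked up by both queues. It is purely illustrative, not the actual runner code, and the leaf fan-out number is made up:

```
# Purely illustrative model of double-processing a recursive job.
# Each cycle, every copy of the recurring job is processed by both the
# Redis and the Kafka consumers, so the number of copies doubles, and each
# copy emits its own batch of leaf jobs (the fan-out of 100 is made up).

def duplicated_jobs(cycles, leaf_fanout=100):
    recurring = 1
    for cycle in range(1, cycles + 1):
        recurring *= 2  # both consumers re-enqueue the recurring job
        print(f"cycle {cycle}: {recurring} recurring jobs, "
              f"~{recurring * leaf_fanout} leaf jobs emitted")

duplicated_jobs(4)
# cycle 1: 2 recurring jobs, ~200 leaf jobs emitted
# ...
# cycle 4: 16 recurring jobs, ~1600 leaf jobs emitted
```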

If we instantly switch off all runners for htmlCacheUpdate, we will skip whatever backlog of those jobs the Redis-based queue might still have.

So the only option is to switch off production of these jobs to Redis immediately after enabling them in the Kafka-based queue, and to do that for a single wiki for some testing period.

In the worst-case scenario, if this job somehow turns out to be completely broken in the Kafka-based queue, we always have the log of the jobs in the Kafka topics, so we can reprocess them.

I'm wondering which wiki would be a good test. I propose wiktionary, because they use a lot of templates, including many highly used ones, which will help test the deduplication a lot.

Event Timeline

Restricted Application added a subscriber: Aklapper.

A few things to note:

  • htmlCacheUpdate job frequency varies a lot between wikis. Even a moderately large wiki like dewiki can have relatively few updates compared to the heavy Wikidata users like commons or ruwiki
  • these jobs are the only ones currently throttled for concurrency. I think we allow a maximum concurrency of 400 jobs/s at the moment, so we should tune the concurrency across the two systems during the transition (see the sketch after this list)
  • I think it's ok to use a small wiki to test the functionality without touching the concurrency configs, though. Which wiktionary did you have in mind?
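
To illustrate what tuning the split could look like, here is a rough back-of-the-envelope helper. The function name, the numbers and the 0.15 s average job duration are assumptions for illustration, not actual change-prop or jobrunner configuration; the duration is inferred from the "concurrency 30 gives ~200 jobs/s" observation later in this task:

```
# Back-of-the-envelope helper, not real configuration. It splits a global
# htmlCacheUpdate rate budget between the Redis and Kafka queues in
# proportion to the share of traffic already migrated, and converts the
# Kafka share into a change-prop concurrency using
#     rate ~= concurrency / avg_job_duration
# (the 0.15 s average job duration is an assumption, not a measured value).

def split_budget(total_rate, migrated_share, avg_job_duration=0.15):
    kafka_rate = total_rate * migrated_share
    redis_rate = total_rate - kafka_rate
    return {
        "redis_rate_jobs_per_s": redis_rate,
        "kafka_rate_jobs_per_s": kafka_rate,
        "changeprop_concurrency": round(kafka_rate * avg_job_duration),
    }

print(split_budget(total_rate=400, migrated_share=0.10))
# => Redis keeps ~360 jobs/s, Kafka gets ~40 jobs/s,
#    which needs a change-prop concurrency of roughly 6
```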

Set a deployment window for the migration: 2017-12-06 17:30 UTC.

> I think it's ok to use a small wiki to test the functionality without touching the concurrency configs, though. Which wiktionary did you have in mind?

I've estimated the number of htmlCacheUpdate jobs for wiktionaries compared to all wikis, and the proportion is very low: only about 0.1% belong to wiktionaries, so I guess we could go with all of them at once. I also inspected the jobs, and my theory was correct: there are some pretty long recursion chains for wiktionary jobs.
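
For reference, the estimate boils down to a tally like the following sketch. The `database` field name and the sample events are assumptions; the real job events may be shaped differently:

```
# Rough sketch of the kind of tally behind the "~0.1% of htmlCacheUpdate
# jobs belong to wiktionaries" estimate. Assumption: a sample of job events
# as dicts with a `database` field naming the target wiki.

from collections import Counter

def wiktionary_share(events):
    counts = Counter(
        "wiktionary" if ev.get("database", "").endswith("wiktionary") else "other"
        for ev in events)
    total = sum(counts.values()) or 1
    return counts["wiktionary"] / total

sample = [{"database": "enwiki"}, {"database": "frwiktionary"},
          {"database": "wikidatawiki"}, {"database": "commonswiki"}]
print(f"{wiktionary_share(sample):.1%}")   # 25.0% for this toy sample
```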

Change 395615 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] [Config] Enable htmlCacheUpdate jobs for wiktionary

https://gerrit.wikimedia.org/r/395615

Change 395616 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] Disable producing htmlCacheUpdate to redis for wiktionaries

https://gerrit.wikimedia.org/r/395616

Change 395615 merged by Mobrovac:
[mediawiki/services/change-propagation/jobqueue-deploy@master] [Config] Enable htmlCacheUpdate jobs for wiktionary

https://gerrit.wikimedia.org/r/395615

Mentioned in SAL (#wikimedia-operations) [2017-12-06T17:42:01Z] <ppchelko@tin> Started deploy [cpjobqueue/deploy@3281df1]: Switch htmlCacheUpdate for wiktionaries T182023

Mentioned in SAL (#wikimedia-operations) [2017-12-06T17:44:58Z] <ppchelko@tin> Finished deploy [cpjobqueue/deploy@3281df1]: Switch htmlCacheUpdate for wiktionaries T182023 (duration: 02m 57s)

Mentioned in SAL (#wikimedia-operations) [2017-12-06T17:50:58Z] <ppchelko@tin> Started deploy [cpjobqueue/deploy@df72b34]: Switch htmlCacheUpdate for wiktionaries, attempt 2 T182023

Mentioned in SAL (#wikimedia-operations) [2017-12-06T17:51:30Z] <ppchelko@tin> Finished deploy [cpjobqueue/deploy@df72b34]: Switch htmlCacheUpdate for wiktionaries, attempt 2 T182023 (duration: 00m 32s)

Change 395616 merged by jenkins-bot:
[operations/mediawiki-config@master] Disable producing htmlCacheUpdate to redis for wiktionaries

https://gerrit.wikimedia.org/r/395616

Mentioned in SAL (#wikimedia-operations) [2017-12-06T18:12:10Z] <mobrovac@tin> Synchronized wmf-config/InitialiseSettings.php: Switch htmlCacheUpdate jobs for wiktionaries to EventBus, file 1/2 - T182023 (duration: 00m 48s)

Mentioned in SAL (#wikimedia-operations) [2017-12-06T18:13:19Z] <mobrovac@tin> Synchronized wmf-config/jobqueue.php: Switch htmlCacheUpdate jobs for wiktionaries to EventBus, file 2/2 - T182023 (duration: 00m 48s)

For the record, next time we migrate recursive jobs we need to switch off Redis queue production before switching on Kafka consumption. The scenario I described, with recursive jobs being exponentially cloned, played out, and we got unlucky enough to create quite a lot of clones.

There were 2 reasons the deduplication didn't work:

  1. We had a bug in it that is now fixed and deployed: https://github.com/wikimedia/change-propagation/pull/217
  2. htmlCacheUpdate jobs set removeDuplicates to false for range jobs, so even without the bug, deduplication wouldn't have worked the way we expected it to (see the sketch below).
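
For context, this is roughly how the deduplication decision works, sketched under the assumption that every job carries the signature and timestamp of its root job and is dropped when a newer root job with the same signature has already been seen. The field names are illustrative, not the exact MediaWiki/change-prop ones:

```
# Illustrative sketch of the deduplication decision, not the actual
# MediaWiki/change-prop code. Assumptions: every job carries the signature
# and timestamp of its root job, and a job is dropped as superseded when a
# newer root job with the same signature has already been seen. If the job
# was enqueued with removeDuplicates=False, the check is skipped entirely,
# which is why the htmlCacheUpdate range jobs were not protected.

latest_root_seen = {}  # root signature -> newest root timestamp observed

def register_root(signature, timestamp):
    latest_root_seen[signature] = max(latest_root_seen.get(signature, 0), timestamp)

def is_superseded(job):
    if not job.get("remove_duplicates", True):
        return False  # dedup explicitly disabled for this job
    newest = latest_root_seen.get(job["root_signature"])
    return newest is not None and job["root_timestamp"] < newest

register_root("sig-A", timestamp=200)
leaf = {"root_signature": "sig-A", "root_timestamp": 100}
range_job = {"root_signature": "sig-A", "root_timestamp": 100,
             "remove_duplicates": False}
print(is_superseded(leaf))       # True  -- a newer root job exists
print(is_superseded(range_job))  # False -- check skipped, job runs anyway
```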

I've been monitoring the situation and it didn't really create any issues downstream (at the runners/MySQL level), and I could clearly see that we were progressing through the clones and clearing the backlog, so I decided not to roll anything back but to wait until the clones clear up naturally.

Lessons learned:

  1. Switch off the old queue before switching on the new one; this exponential cloning can explode quite quickly.
  2. With a change-prop concurrency of 30, the actual job rate maxes out at about 200/s because, at least for wiktionary, the job is fairly quick.
  3. The interleaving of different wikis works as we expected. A very long-tailed job from one wiki does delay smaller jobs from other wikis, but currently the average delay is 3 seconds, even though we have tens of clones with really long tails (almost the whole of en and fr wiktionary).
  4. The backlog in messages doesn't say much about queue health for recursive jobs; it hovers around ~1000. A really interesting metric, at least for htmlCacheUpdate, would be how far the start parameter has progressed compared to the end of the range. Having that metric would have been insanely useful. I've made some half-baked scripts to get it, though; a sketch of the idea follows this list.
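
The half-baked scripts aside, the metric itself is simple. Here is a minimal sketch, assuming each backlogged recursive job carries a range with numeric `start` and `end` parameters (the exact parameter layout of the real htmlCacheUpdate jobs is an assumption here):

```
# Minimal sketch of the range-progress metric described in point 4 above.
# Assumption: each backlogged recursive htmlCacheUpdate job has params with
# a "range" dict containing numeric "start" and "end"; the real parameter
# layout may differ.

def remaining_range(params):
    """How much of the range is still ahead of the job."""
    rng = params["range"]
    return max(0, rng["end"] - rng["start"])

def total_remaining(jobs):
    """Sum of remaining range across all backlogged recursive jobs.
    Unlike the raw message backlog, this trends to zero as work completes."""
    return sum(remaining_range(job["params"]) for job in jobs)

backlog = [{"params": {"range": {"start": 7500, "end": 10000}}},
           {"params": {"range": {"start": 1000, "end": 9000}}}]
print(total_remaining(backlog))  # 10500 units of range still to process
```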

The backlog has now been cleared; everything seems in good shape.

After a day of running the jobs for wiktionaries I don't see any issues at all; on the other hand, I don't really see any deduplication either: everything happens so fast that jobs just don't get superseded by another root job. Also, during yesterday's backlog processing we battle-tested concurrency limits and load, so I propose going a bit further and enabling something more high-traffic.

As @Joe suggested, we can enable ceb.wikipedia.org, but by my estimate it accounts for only about 2% of the jobs. Russian Wikipedia is not that heavy either, at 2.3% of the jobs.

The largest users are Wikidata (25%), commons (13%), enwiki (12%), and pt and fa wiki (10% each). There's a bunch of big wikis averaging around 2-5%. So overall, all wikipedias make up 56% of the jobs, Wikidata 25%, and commons about 14%; the wiktionaries we already switched account for 1.9%, which leaves only about 3% for all the other projects.

So I'm not sure how to proceed here before the holidays. If it weren't for the holidays, I'd propose being bold and switching everything except the 5 biggest ones, giving the new queue about 40% of the traffic; but given the holidays I'd scale that down and propose switching all small non-wikipedia projects plus all wikipedias below 1%, which should give us ~10% of the traffic.

Switching all small non-WP projects should be a no-brainer, so I'd vote for switching them plus cebwiki and ruwiki. This should be safe enough given that (i) the emission of these jobs overall is rather stable (~50 jobs/s); (ii) we are currently processing less than 1 job/s; and (iii) monitoring and inspecting these jobs over the no-deployments period will give us enough confidence to start switching them more aggressively afterwards.

Change 397581 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] Disable Redis queue for small projects and ru and ceb wiki

https://gerrit.wikimedia.org/r/397581

Change 397585 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] [Config] Enable htmlCacheUpdate for ceb and ru wiki plus all small projects

https://gerrit.wikimedia.org/r/397585

Change 397585 merged by Mobrovac:
[mediawiki/services/change-propagation/jobqueue-deploy@master] [Config] Enable htmlCacheUpdate for ceb and ru wiki plus all small projects

https://gerrit.wikimedia.org/r/397585

Change 397581 merged by jenkins-bot:
[operations/mediawiki-config@master] Disable Redis queue for small projects and ru and ceb wiki

https://gerrit.wikimedia.org/r/397581

Mentioned in SAL (#wikimedia-operations) [2017-12-11T18:54:41Z] <ppchelko@tin> Started deploy [cpjobqueue/deploy@e1075af]: Enable htmlCacheUpdate for ceb and ru wiki and small projects T182023

Mentioned in SAL (#wikimedia-operations) [2017-12-11T18:55:15Z] <ppchelko@tin> Finished deploy [cpjobqueue/deploy@e1075af]: Enable htmlCacheUpdate for ceb and ru wiki and small projects T182023 (duration: 00m 34s)

Mentioned in SAL (#wikimedia-operations) [2017-12-11T18:56:46Z] <mobrovac@tin> Synchronized wmf-config/InitialiseSettings.php: Switch cebwiki, ruwiki and small projects to Kafka for htmlCacheUpdate - T182023 (duration: 00m 57s)

> Switching all small non-WP projects should be a no-brainer, so I'd vote for switching them plus cebwiki and ruwiki.

These domains have been switched and we now observe around 5 jobs being processed per second.

I've looked over all the logs and graphs we've gathered over the past several weeks and found no indication of issues. This makes me believe it should be safe to switch more jobs. However, we should be conservative regarding the load, so I think we should aim for about 50% of all jobs on the new infrastructure this time, meaning we can switch everything but Wikidata, commons and enwiki.

Change 403703 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/mediawiki-config@master] JobQueue: Use EventBus for HTMLCacheUpdate except en, commons, wikidata

https://gerrit.wikimedia.org/r/403703

Change 403704 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/change-propagation/jobqueue-deploy@master] Config: Switch htmlCacheUpdates for all but en, commons, wikidata

https://gerrit.wikimedia.org/r/403704

Change 403704 merged by Mobrovac:
[mediawiki/services/change-propagation/jobqueue-deploy@master] Config: Switch htmlCacheUpdates for all but en, commons, wikidata

https://gerrit.wikimedia.org/r/403704

Change 403703 merged by jenkins-bot:
[operations/mediawiki-config@master] JobQueue: Use EventBus for HTMLCacheUpdate except en, commons, wikidata

https://gerrit.wikimedia.org/r/403703

Mentioned in SAL (#wikimedia-operations) [2018-01-16T22:39:23Z] <ppchelko@tin> Started deploy [cpjobqueue/deploy@19b9bdd]: Switch htmlCacheUpdates for all but en, commons, wikidata T182023

Mentioned in SAL (#wikimedia-operations) [2018-01-16T22:39:58Z] <ppchelko@tin> Finished deploy [cpjobqueue/deploy@19b9bdd]: Switch htmlCacheUpdates for all but en, commons, wikidata T182023 (duration: 00m 35s)

Mentioned in SAL (#wikimedia-operations) [2018-01-16T22:40:23Z] <mobrovac@tin> Synchronized wmf-config/InitialiseSettings.php: Use EventBus for htmlCacheUpdate jobs for all wikis but en, commons and wikidata - T182023 (duration: 01m 12s)

Change 404596 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] [Config] Enable htmlCacheUpdate job processing for all wikis

https://gerrit.wikimedia.org/r/404596

Change 404598 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] [JobQueue] Enable htmlCacheUpdate on new infrastructure for all projects.

https://gerrit.wikimedia.org/r/404598

Change 404596 merged by Ppchelko:
[mediawiki/services/change-propagation/jobqueue-deploy@master] [Config] Enable htmlCacheUpdate job processing for all wikis

https://gerrit.wikimedia.org/r/404596

Change 404598 merged by jenkins-bot:
[operations/mediawiki-config@master] [JobQueue] Enable htmlCacheUpdate on new infrastructure for all projects.

https://gerrit.wikimedia.org/r/404598

Mentioned in SAL (#wikimedia-operations) [2018-02-05T21:45:09Z] <ppchelko@tin> Started deploy [cpjobqueue/deploy@aebfded]: Enble htmlCacheUpdate job for all wikis T182023

Mentioned in SAL (#wikimedia-operations) [2018-02-05T21:47:35Z] <ppchelko@tin> Finished deploy [cpjobqueue/deploy@aebfded]: Enble htmlCacheUpdate job for all wikis T182023 (duration: 02m 27s)

Mentioned in SAL (#wikimedia-operations) [2018-02-05T22:45:18Z] <mobrovac@tin> Synchronized wmf-config/jobqueue.php: EventBus: Enable htmlCacheUpdate jobs for all projects - T182023 (duration: 00m 56s)

Mentioned in SAL (#wikimedia-operations) [2018-02-05T22:46:25Z] <mobrovac@tin> Synchronized wmf-config/InitialiseSettings.php: EventBus: Enable htmlCacheUpdate jobs for all projects - T182023 (duration: 00m 55s)

Seems like the migration is complete with no issues. Resolving.

Change 408576 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/puppet@production] Remove jobrunner config specific to htmlCacheUpdate.

https://gerrit.wikimedia.org/r/408576

Change 408576 abandoned by Ppchelko:
Remove jobrunner config specific to htmlCacheUpdate.

Reason:
Superseded by Ie9eebe1c32cf4cff938669ecd3e066e9befb3557

https://gerrit.wikimedia.org/r/408576