
Rethink pacing the cirrusSearchCheckerJob
Closed, ResolvedPublic

Description

The method used to control the rate of the cirrusSearchCheckerJob is fairly complex and was designed with the Redis queue in mind. After the switch to the Kafka-based queue it might no longer be needed, and the job can be simplified.

Here's how the job works right now, as I understand it: the job uses the delayed execution feature, and every 2 hours a cron script runs foreachwiki for the Sanitizer maintenance script. The Sanitizer script takes a chunk of pages for the wiki, spreads the delay times over the 2 hours, and posts the jobs. As I understand it, this was done because the old queue had no way of controlling GLOBAL cross-wiki concurrency of job execution, so we needed this approach to spread the load.

In the new queue, we only have a way to control global, cross-wiki concurrency. So, just by finding a good number for concurrency, we will be able to spread the load over the 2-hour period much more easily.

Additionally, the current approach creates issues for the Kafka model. Imagine a number of small wikis going one after another in foreachwiki. Wiki 1 posts a hundred jobs spread over the 2-hour period, then wiki 2 posts its jobs. Although the job delays are sorted within a wiki, at the border between one wiki and the next they reset back to the beginning of the 2-hour window. Because of Kafka's FIFO model, this creates a very uneven load. At the beginning of the 2-hour window we execute the jobs according to their delays, but the closer we get to the end of the period, the spikier the execution rate becomes, and the only thing that saves us from huge spikes in rate is change-prop concurrency control.
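To make the reset effect concrete, here is a rough Python sketch of it under simplifying assumptions: a single FIFO consumer that waits until each job's release time, zero execution time, and no concurrency limit. The helper names (`spread_delays`, `fifo_start_times`) are made up for illustration and are not part of the Sanitizer script or ChangeProp.

```python
WINDOW_S = 2 * 60 * 60  # the 2-hour window, in seconds


def spread_delays(n_jobs, window_s):
    """Release offsets (seconds) that spread n_jobs evenly over the window."""
    return [i * window_s / n_jobs for i in range(n_jobs)]


def fifo_start_times(release_times):
    """Start time of each job when one consumer reads the queue strictly in
    order and blocks until the job at the head is due (hypothetical model)."""
    now = 0.0
    starts = []
    for release in release_times:
        now = max(now, release)  # wait for the head of the queue if needed
        starts.append(now)
    return starts


# Three small wikis post 100 jobs each, one wiki after another.
queue = []
for _wiki in range(3):
    queue.extend(spread_delays(100, WINDOW_S))

starts = fifo_start_times(queue)

# Wiki 1 is paced across the window; wikis 2 and 3 are already overdue
# by the time the consumer reaches them, so they all start at once.
burst_time = starts[99]
burst_size = sum(1 for s in starts if s == burst_time)
```

In this toy model the first wiki's jobs are paced evenly, while every job from the later wikis becomes runnable at the same instant near the end of the window, which is the kind of spike that only the concurrency limit smooths out.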

I propose an experiment to see how standard change-prop concurrency control works for this job. To do so, we just need to disable delayed execution support in ChangeProp for this particular job and set the concurrency to some reasonable number. According to Kafka, we post about 120000 jobs within the 2-hour period, meaning we need to execute about 17 jobs/s. With a median execution time of 600ms for the checker job, we need to set concurrency to around 10 to get an even load distribution.
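The arithmetic behind those numbers can be sketched as below. It uses the figures quoted above and treats the median execution time as if it were the mean, which is an approximation; the function name `required_concurrency` is made up for illustration.

```python
JOBS_PER_WINDOW = 120_000   # jobs posted per window, per Kafka
WINDOW_S = 2 * 60 * 60      # 2 hours in seconds
MEDIAN_EXEC_S = 0.6         # median checker job execution time (600ms)


def required_concurrency(jobs, window_s, exec_s):
    """Average number of jobs in flight needed to finish `jobs` within
    `window_s` seconds: rate * time per job (Little's law)."""
    rate = jobs / window_s   # jobs per second
    return rate * exec_s


rate = JOBS_PER_WINDOW / WINDOW_S
concurrency = required_concurrency(JOBS_PER_WINDOW, WINDOW_S, MEDIAN_EXEC_S)

print(f"required rate: {rate:.1f} jobs/s")
print(f"required concurrency: {concurrency:.1f}")
```

Read the other way around, a concurrency of N with the same execution time caps throughput at roughly N / 0.6 jobs/s, so the number is easy to re-derive if the job count per window changes.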

@EBernhardson @dcausse What do you think about this experiment?

Event Timeline

I'm perfectly OK with dropping delayed execution for this job. It worked more or less with Redis, but now it creates a sawtooth graph and adds unnecessary complexity to ChangeProp.
Also, the 120000 jobs figure may vary depending on some stats stored in elastic itself (e.g. we do not allow a wiki to loop again in less than 2 weeks).
It seems that we just restarted a loop, so we should be at the point where we need the highest rate (it usually lasts for 2-3 days, and then small wikis start to go idle).

I'm totally fine with starting at 17 jobs/s and adjusting it gradually.

Ok, cool @dcausse. We need some small modifications in ChangeProp to be able to disable delayed execution. I think we can deploy it on Monday, then wait for some time and see how the graphs/stats look. I don't think this can be dangerous in any way.

Change 443072 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] Disable delayed execution for cirrusSearchCheckerJob.

https://gerrit.wikimedia.org/r/443072

Change 443072 merged by Ppchelko:
[mediawiki/services/change-propagation/jobqueue-deploy@master] Disable delayed execution for cirrusSearchCheckerJob.

https://gerrit.wikimedia.org/r/443072

Mentioned in SAL (#wikimedia-operations) [2018-07-02T08:59:57Z] <ppchelko@deploy1001> Started deploy [cpjobqueue/deploy@dfdd362]: Disable delayed execution for cirrusSearchCheckerJob T198462

Mentioned in SAL (#wikimedia-operations) [2018-07-02T09:00:47Z] <ppchelko@deploy1001> Finished deploy [cpjobqueue/deploy@dfdd362]: Disable delayed execution for cirrusSearchCheckerJob T198462 (duration: 00m 50s)

The proposed change is deployed. Let's wait for some time before making any judgements on the effect it has had.

Change 443586 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] Decrease checker concurrency to 6.

https://gerrit.wikimedia.org/r/443586

Change 443586 merged by Ppchelko:
[mediawiki/services/change-propagation/jobqueue-deploy@master] Decrease checker concurrency to 6.

https://gerrit.wikimedia.org/r/443586

Mentioned in SAL (#wikimedia-operations) [2018-07-03T10:12:17Z] <ppchelko@deploy1001> Started deploy [cpjobqueue/deploy@ba672a3]: Decrease checker job concurrency T198462

Mentioned in SAL (#wikimedia-operations) [2018-07-03T10:13:09Z] <ppchelko@deploy1001> Finished deploy [cpjobqueue/deploy@ba672a3]: Decrease checker job concurrency T198462 (duration: 00m 52s)

EBjune triaged this task as Medium priority. Jul 5 2018, 5:07 PM
EBjune moved this task from needs triage to watching / waiting on the Discovery-Search board.

Change 450150 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/jobqueue-deploy@master] Increase checker concurrency to 10.

https://gerrit.wikimedia.org/r/450150

Change 450150 merged by Mobrovac:
[mediawiki/services/change-propagation/jobqueue-deploy@master] Increase checker concurrency to 10.

https://gerrit.wikimedia.org/r/450150

Mentioned in SAL (#wikimedia-operations) [2018-08-06T09:32:21Z] <mobrovac@deploy1001> Started deploy [cpjobqueue/deploy@62716a5]: CirrusSearch jobs: Increase checker concurrency to 10 - T198462

Mentioned in SAL (#wikimedia-operations) [2018-08-06T09:33:05Z] <mobrovac@deploy1001> Finished deploy [cpjobqueue/deploy@62716a5]: CirrusSearch jobs: Increase checker concurrency to 10 - T198462 (duration: 00m 44s)

mobrovac assigned this task to Pchelolo.
mobrovac subscribed.

Things have been stable for a while with the current concurrency and execution rate limiting, so we'll keep it as-is for now.