The method used for controlling the execution rate of the cirrusSearchCheckerJob is fairly complex and was designed with the Redis queue in mind. After switching to the Kafka-based queue, it might not be needed any more, and the job can be simplified.
Here's how, in my understanding, the job works right now: it uses the delayed-execution feature, and every 2 hours a cron script runs foreachwiki for the sanitizer maintenance script. The sanitizer script takes a chunk of pages for that wiki, spreads the delays across the 2-hour window, and posts the jobs. I believe this was done because the old queue had no way of controlling GLOBAL, cross-wiki concurrency of job execution, so this approach was needed to spread the load.
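To make the mechanism concrete, here is a minimal sketch of the delay-spreading step described above. The function name and job shape are hypothetical, not the real sanitizer API; it only illustrates assigning each job an evenly spaced delay inside the 2-hour window.

```python
# Hypothetical sketch of the sanitizer's delay spreading (illustrative
# names; not the actual CirrusSearch code).
WINDOW = 2 * 60 * 60  # the 2-hour cron period, in seconds

def post_sanitize_jobs(page_ids):
    """Spread one wiki's checker jobs evenly across the 2-hour window."""
    step = WINDOW / max(len(page_ids), 1)
    jobs = []
    for i, page_id in enumerate(page_ids):
        # jobReleaseTimestamp-style delay: job i becomes due i*step
        # seconds from now
        jobs.append({"page_id": page_id, "delay": i * step})
    return jobs

jobs = post_sanitize_jobs(list(range(4)))
print([j["delay"] for j in jobs])  # 4 jobs -> delays 0, 1800, 3600, 5400
```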
In the new queue, the concurrency we can control is exactly that global, cross-wiki concurrency. So, just by finding a good concurrency number, we can spread the load over the 2-hour period much more easily.
Additionally, the current approach creates issues for the Kafka model. Imagine a number of small wikis going one after another in foreachwiki: wiki 1 posts a hundred jobs spread over the 2-hour period, then wiki 2 posts its jobs. Although the job delays are sorted within a wiki, at each wiki-to-wiki border they reset back to the beginning of the 2-hour window. Because of Kafka's FIFO model, this creates a very uneven load: at the beginning of the 2-hour window we execute jobs according to their delays, but the closer we get to the end of the period, the spikier the execution rate becomes, and the only thing that saves us from huge rate spikes is change-prop's concurrency control.
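The delay reset at wiki borders can be seen in a toy model. This sketch (an assumed, simplified picture of the topic; real change-prop behavior is more involved) enqueues two wikis back to back, as foreachwiki does, and shows that the delays in the resulting FIFO topic are not monotonic: a consumer cannot reach wiki 2's delay-0 job until it has drained wiki 1's latest job, so wiki 2's early jobs all become due at once.

```python
# Toy illustration of the delay reset at wiki boundaries in a FIFO topic
# (assumed model for illustration only).
WINDOW = 7200  # seconds

def wiki_jobs(wiki, n):
    step = WINDOW / n
    return [(wiki, i * step) for i in range(n)]

# Two wikis enqueued back to back, as foreachwiki does
topic = wiki_jobs("wiki1", 3) + wiki_jobs("wiki2", 3)
print([delay for _, delay in topic])  # 0, 2400, 4800, 0, 2400, 4800
# The delays reset to 0 at the wiki1 -> wiki2 border: by the time a FIFO
# consumer passes wiki1's delay-4800 job, wiki2's delay-0 and delay-2400
# jobs are already overdue and fire in a burst.
```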
I propose an experiment to see how the standard change-prop concurrency control works for this job. To do so, we just need to disable delayed-execution support in ChangeProp for this particular job and set the concurrency to some reasonable number. According to Kafka, we post about 120000 jobs within the 2-hour period, meaning we need to execute about 17 jobs/s; with the checker job's median execution time of 600 ms, a concurrency of around 10 should give an even load distribution.
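The concurrency estimate above is just Little's law (concurrency = arrival rate × service time). The numbers below come straight from the figures in this comment:

```python
# Back-of-the-envelope concurrency estimate via Little's law: L = lambda * W
jobs = 120_000              # jobs posted per 2-hour window (from Kafka)
window_s = 2 * 60 * 60      # the 2-hour window, in seconds
rate = jobs / window_s      # required throughput, jobs/s
median_exec_s = 0.6         # median checker-job execution time, 600 ms
concurrency = rate * median_exec_s
print(round(rate, 1), round(concurrency))  # 16.7 10
```

Note this uses the median rather than the mean service time, so 10 is a rough starting point to tune from, not an exact answer.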
@EBernhardson @dcausse What do you think about this experiment?