After the initial migration of several simple jobs, including some high-traffic but still structurally simple ones, we think we're ready for the htmlCacheUpdate job.
That job uses deduplication extensively and has root/leaf jobs and recursion, so it will exercise change-prop functionality that has so far only been tested with RESTBase update events, not with jobs. The mechanism is exactly the same, so we can be fairly confident it will work.
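To make the root/leaf/recursion pattern concrete, here is a minimal sketch (TypeScript; the field names, helper, and batch size are illustrative assumptions, not the actual MediaWiki implementation) of how a recursive root job over a large title set expands into a batch of leaf jobs plus one recurring continuation job:

```typescript
// Hypothetical shape of a recursive job; real jobs carry more fields.
interface Job {
  type: string;
  titles: string[];
  rootJobSignature: string; // identifies the original root job
  rootJobTimestamp: string; // when the root job was enqueued
}

const BATCH_SIZE = 300; // assumed batch size, for illustration only

function expand(job: Job): Job[] {
  if (job.titles.length <= BATCH_SIZE) {
    return [job]; // small enough: this is a leaf job, process it directly
  }
  // Split off one leaf batch; the recurring job keeps the same root
  // signature/timestamp, which is what deduplication keys on.
  const leaf: Job = { ...job, titles: job.titles.slice(0, BATCH_SIZE) };
  const recurring: Job = { ...job, titles: job.titles.slice(BATCH_SIZE) };
  return [leaf, recurring];
}
```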
The issue with htmlCacheUpdate is that we cannot double-process it for any significant period of time. If we double-process a big root job, we double the number of leaf jobs, but the real problem is recursion: if both copies of a recursive job post a recurring job, both recurring jobs get double-processed in turn, producing 4 times the normal number of leaf jobs, then 8, then 16, and so on, so the duplication can grow exponentially. Deduplication should stop this, but since we have not yet battle-tested it in production with the job queue, it has the potential to explode.
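Deduplication is what is supposed to cut off that exponential growth. A sketch of the guard, assuming a MediaWiki-style scheme that keys on the root job's signature and timestamp (the names mirror the sketch above and are illustrative):

```typescript
// Minimal shape needed for deduplication decisions.
type DedupInfo = { rootJobSignature: string; rootJobTimestamp: string };

// signature -> newest root job timestamp already executed
const newestRun = new Map<string, string>();

function shouldRun(job: DedupInfo): boolean {
  const seen = newestRun.get(job.rootJobSignature);
  // Timestamps are assumed lexicographically comparable (e.g. ISO 8601).
  if (seen !== undefined && seen >= job.rootJobTimestamp) {
    return false; // duplicate of an already-executed root job: drop it
  }
  newestRun.set(job.rootJobSignature, job.rootJobTimestamp);
  return true;
}
```

Once one copy of a duplicated recurring job runs, every other copy carrying the same root signature and an equal or older timestamp gets dropped, so the 2x/4x/8x cascade never materializes as long as this guard actually works.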
However, if we instantly switch off all runners for htmlCacheUpdate, we will skip whatever backlog of those jobs the Redis-based queue still holds.
So the only option is to switch off production of these jobs to Redis immediately after enabling them in the Kafka-based queue, and to do that for a single wiki for a testing period.
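As an illustration of that cut-over, a sketch of per-wiki routing on the producer side; every name here (the helper functions, the topic, the test wiki dbname) is a hypothetical placeholder, not the real production configuration:

```typescript
// Placeholders standing in for the real producers on each queue backend.
declare function enqueueToKafka(topic: string, job: object): void;
declare function enqueueToRedis(queue: string, job: object): void;

const KAFKA_WIKIS = new Set(['testwiki']); // hypothetical allowlist for the trial

function enqueueHtmlCacheUpdate(wiki: string, job: object): void {
  if (KAFKA_WIKIS.has(wiki)) {
    // New path: produce to the Kafka-based queue only for the test wiki.
    enqueueToKafka('mediawiki.job.htmlCacheUpdate', job);
  } else {
    // Legacy path: everything else keeps going to Redis for now.
    enqueueToRedis('htmlCacheUpdate', job);
  }
}
```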
Worst case, if this job were somehow completely broken in the Kafka-based queue, we would still have the log of the jobs in the Kafka topics, so we could reprocess them.
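For completeness, a sketch of what such a reprocessing run could look like with the kafkajs client; the broker address, topic name, and processJob handler are assumptions for illustration:

```typescript
import { Kafka } from 'kafkajs';

// Placeholder for whatever executor the decoded jobs get handed back to.
declare function processJob(job: object): Promise<void>;

async function replayJobs(): Promise<void> {
  const kafka = new Kafka({ clientId: 'job-replay', brokers: ['kafka-broker:9092'] });
  const consumer = kafka.consumer({ groupId: 'htmlCacheUpdate-replay' });
  await consumer.connect();
  // fromBeginning: re-read the whole retained log for this topic.
  await consumer.subscribe({ topic: 'mediawiki.job.htmlCacheUpdate', fromBeginning: true });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const job = JSON.parse(message.value!.toString());
      await processJob(job);
    },
  });
}
```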
I'm wondering which wiki would make a good test. I propose wiktionary, because those projects use many templates, including many highly-used ones, which will exercise deduplication heavily.