On 03/25, between roughly 11:21 and 11:33, htmlCacheUpdate job concurrency rose from 2.5 to 7.5 jobs. The increase was caused by an edit on enwiki to the page https://en.wikipedia.org/wiki/Module:Language/data/iana_scripts, which triggered a long sequence of recursive updates. Because the batchSize for the htmlCacheUpdate job is 300, this overloaded MySQL replication.
To avoid this, we need to decrease the htmlCacheUpdate job concurrency. However, it is better to also partition the htmlCacheUpdate topic according to MySQL replicas, just as we do for the refreshLinks job. Given that we have 8 partitions and the current overall concurrency is 10, a concurrency of 2 htmlCacheUpdate jobs per partition should be enough.
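A rough sketch of the partitioning idea, for illustration only. The group names, the wiki-to-group mapping, the default group, and the helper names below are hypothetical and are not taken from the actual configuration; only the 8 partitions and the per-partition concurrency of 2 come from this description.

```python
# Illustrative only: group names and the mapping are assumptions, not real config.
REPLICA_GROUPS = [f"group{i}" for i in range(8)]   # 8 partitions, one per replica group
GROUP_OF_WIKI = {"enwiki": "group0"}               # hypothetical wiki -> group mapping
DEFAULT_GROUP = "group2"                           # hypothetical fallback group
PER_PARTITION_CONCURRENCY = 2


def partition_for(wiki: str) -> int:
    """Return the partition index for a job, based on the wiki's replica group."""
    group = GROUP_OF_WIKI.get(wiki, DEFAULT_GROUP)
    return REPLICA_GROUPS.index(group)


def max_total_concurrency() -> int:
    """Upper bound on concurrent jobs across all partitions."""
    return len(REPLICA_GROUPS) * PER_PARTITION_CONCURRENCY
```

Under these assumptions, a burst of jobs from a single wiki can occupy at most 2 slots, all on its own partition, while the upper bound across all 8 partitions is 16 concurrent jobs.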
Prior to deploying the change, the existing topics for htmlCacheUpdate must be edited manually to add partitions.