The jobqueue has exploded in size, starting on April 25th at around 5 PM. On top of that, the job consume rate has spiked from ~150 jobs/s to ~1600 jobs/s. This is endangering quite a few parts of the infrastructure, including breaking the Redis replica for the jobqueue: a single db has grown to over 3.8 GB.
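For reference, the per-db growth can be tracked from the "Keyspace" section of `redis-cli INFO keyspace` output. A minimal parsing sketch (the sample output below is illustrative, not taken from the incident):

```python
# Parse the "Keyspace" section of `redis-cli INFO keyspace` to get the
# key count per db; the sample text is illustrative only.
sample = """# Keyspace
db0:keys=1203,expires=45,avg_ttl=0
db1:keys=9876543,expires=12,avg_ttl=0
"""

def parse_keyspace(info: str) -> dict:
    """Return {db name: key count} from `INFO keyspace` output."""
    dbs = {}
    for line in info.splitlines():
        if line.startswith("db"):
            name, stats = line.split(":", 1)
            fields = dict(kv.split("=") for kv in stats.split(","))
            dbs[name] = int(fields["keys"])
    return dbs

print(parse_keyspace(sample))
```

Watching these counts over time (or `INFO memory` for `used_memory_human`) would show which db is the one ballooning.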
At the same time, the rate of failed/timed-out jobs has gone from ~0 to ~220 per minute.
All of the jobs queued, as far as I can see, are cirrusSearchElasticaWrite; looking at logstash I see:
Dropping delayed ElasticaWrite job for DataSender::sendData in cluster codfw after waiting 14210s
Unless I'm misinterpreting this, it seems the jobs are failing to execute. The only slightly related item I can find in the SAL is:
12:55 gehel: starting elasticsearch codfw rolling restart for plugin update and NUMA config - T191543 / T191236