Recently we've had reports where Quarry was misbehaving (T247978 T246970).
We have decided to reimage labsdb1011 to Debian Buster and MariaDB 10.4 to see if that helps with the CPU usage.
We depooled labsdb1011 and placed labsdb1010 on the Analytics role (which also serves Quarry): https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/592516/
Before the reimage, we noticed that labsdb1011 had a huuuuuge backlog of pending InnoDB purges (https://planet.mysql.com/entry/?id=5991415):
We placed labsdb1010 to replace it temporarily, and we observed how its pending purges started to increase as soon as it took over the analytics role:
We don't know whether that endlessly growing purge lag was the cause of labsdb1011's slowness and crashes, but it is definitely not a good sign and probably not something we can afford long-term.
It seems very specific to this slow role and/or the fact that they serve Quarry, as we can see with labsdb1010.
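
For anyone who wants to track the purge backlog outside of the graphs, here is a minimal sketch. It assumes the backlog is read from the "History list length" line in the text output of SHOW ENGINE INNODB STATUS, and that samples are collected periodically by some other means. The function names are hypothetical and this is not the tooling we actually use.

```python
import re
from typing import List, Optional


def history_list_length(innodb_status: str) -> Optional[int]:
    """Extract the purge backlog from SHOW ENGINE INNODB STATUS text.

    Returns None when the "History list length" line is not present.
    """
    match = re.search(r"^History list length\s+(\d+)", innodb_status, re.MULTILINE)
    return int(match.group(1)) if match else None


def is_steadily_growing(samples: List[int]) -> bool:
    """True when every sample is larger than the previous one.

    Needs at least two samples; a single reading says nothing about the trend.
    """
    if len(samples) < 2:
        return False
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))
```

A backlog that only ever grows across samples, as observed on labsdb1010 after it took over the role, would make is_steadily_growing return True.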
For what it's worth, the query killer on labsdb1010 is now set to 3600 seconds - the normal query killer on the analytics role (labsdb1011) is 14400 seconds.
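
To illustrate what the two limits mean in practice, here is a rough sketch of the selection step only. It assumes process list rows are given as (id, command, seconds running) tuples; the names are hypothetical and this is not how the real query killer is implemented.

```python
from typing import Iterable, List, Tuple

# Limits quoted above, in seconds.
ANALYTICS_LIMIT_S = 14400  # normal limit on the analytics role (4 hours)
TEMPORARY_LIMIT_S = 3600   # limit currently set on labsdb1010 (1 hour)

# Hypothetical row shape: (connection id, command, seconds running)
ProcessRow = Tuple[int, str, int]


def queries_over_limit(processlist: Iterable[ProcessRow], limit_s: int) -> List[int]:
    """Return the ids of running queries that have exceeded limit_s seconds."""
    return [
        conn_id
        for conn_id, command, seconds in processlist
        if command == "Query" and seconds > limit_s
    ]
```

With the temporary limit, a query that has been running for 4000 seconds would already be selected, while under the normal analytics limit it would be left alone.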