Since about noon UTC today (May 29th), wdqs1002 has been lagging behind on replication. Updates are still being applied, just not fast enough to catch up. Restarting the updater and Blazegraph did not solve the issue. The logs don't show anything suspicious, and load on that machine is reasonable.
Looking at dmesg, there are a lot of warnings about CPU temperature and throttling:
[9098037.343804] CPU23: Package temperature above threshold, cpu clock throttled (total events = 52647618)
This has been going on since at least May 22, but seems to be happening more often lately. It may or may not be related to the replication lag.
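To check whether the throttling warnings really are becoming more frequent, something like the following could bucket them by day of uptime. This is only a rough sketch: the regex is based on the single dmesg line quoted above, the function names are made up for illustration, and the kernel may rate-limit these messages, so the per-day counts are best treated as a lower bound (the "total events" counter is kept alongside for that reason).

```python
import re
from collections import defaultdict

# Pattern derived from the one dmesg line quoted in this task; other kernel
# versions may word the message differently (assumption).
THROTTLE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+CPU(?P<cpu>\d+): "
    r"(?:Package|Core) temperature above threshold, "
    r"cpu clock throttled \(total events = (?P<events>\d+)\)"
)

SECONDS_PER_DAY = 86400


def parse_throttle_events(lines):
    """Return (seconds_since_boot, cpu, total_events) for each matching line."""
    samples = []
    for line in lines:
        match = THROTTLE_RE.search(line)
        if match:
            samples.append(
                (float(match["ts"]), int(match["cpu"]), int(match["events"]))
            )
    return samples


def warnings_per_day(samples):
    """Count warning lines per 24-hour bucket of uptime (day 0 = first day after boot)."""
    buckets = defaultdict(int)
    for ts, _cpu, _events in samples:
        buckets[int(ts // SECONDS_PER_DAY)] += 1
    return dict(sorted(buckets.items()))
```

With the dmesg output saved to a file, `warnings_per_day(parse_throttle_events(open(path)))` would give a day-by-day count; a rising trend over the last week would support the "more often lately" impression.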
If it only happens on a single server, it's not a load issue. Combined with the temperature warnings, it looks like a hardware problem. I'll make a pass through the logs tomorrow to see whether it's still some software bug, but so far it looks like the machine may need some care - maybe a fan dropped dead or something?
Taking wdqs1002 out of LVS seems to have given it enough breathing space to catch up on replication. I've added it back and it seems stable so far. I still don't trust it entirely...