root@cloudmetrics1003:~# uptime
 17:44:41 up 13:58, 3 users, load average: 230.34, 229.37, 226.33
This happens reliably anytime I make 1003 the primary cloudmetrics host. It seems not to happen when I make 1003 the secondary.
Here's an approximate graph of the problem appearing:
Note that those metrics are way off because the system quickly becomes too sick to reliably update metrics.
It's especially mysterious that this happens 'under load', since the main thing this host is doing is aggregating Prometheus metrics, which it is also doing as the secondary.
There's a periodic rsync from the primary to the secondary which seems to trigger the problem, but that would mean that disk /reads/ cause the issue, which seems very unlikely.
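One thing that might help narrow this down: on Linux the load average counts tasks in uninterruptible sleep (state D, usually waiting on I/O) as well as runnable tasks, so a load of 230 does not by itself mean 230 things are competing for CPU. Below is a minimal sketch, assuming a standard Linux /proc layout, that reads the load average and lists processes currently in state D. The function names are made up for illustration and are not part of any existing tooling on these hosts.

```python
import os


def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into (1min, 5min, 15min) floats."""
    fields = text.split()
    return tuple(float(x) for x in fields[:3])


def state_from_stat(stat_line):
    """Return the single-letter state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the last ')' before reading the state.
    return stat_line.rsplit(")", 1)[1].split()[0]


def blocked_processes(proc_root="/proc"):
    """List (pid, comm) for processes in uninterruptible sleep (state D)."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                line = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        if state_from_stat(line) == "D":
            comm = line[line.index("(") + 1 : line.rindex(")")]
            blocked.append((int(entry), comm))
    return blocked


def report(proc_root="/proc"):
    """Print the load averages followed by every D-state process."""
    with open(os.path.join(proc_root, "loadavg")) as f:
        print("load:", parse_loadavg(f.read()))
    for pid, comm in blocked_processes(proc_root):
        print(pid, comm)
```

Calling report() on 1003 while the rsync is running would show whether the load is mostly tasks stuck in state D. Note that this only looks at the main thread of each process; per-thread state lives under /proc/<pid>/task/ and is not covered here.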