Intro
A number of our VMs showed weird behavior at times. The symptoms were:
* Neon saying services on the host are down, but not the host
* Indeed the host would ping and most networking would still work, but no SSH
* Ganglia would show huge IO wait like in https://ganglia.wikimedia.org/latest/graph.php?h=alsafi.wikimedia.org&m=cpu_report&r=custom&s=by%20name&hc=4&mc=2&cs=01%2F25%2F2016%2014%3A00&ce=01%2F26%2F2016%2000%3A00&st=1453818571&g=cpu_report&z=medium&c=Miscellaneous%20codfw
* Connecting to the console via sudo gnt-instance console <vm_hostname> and hitting a single enter would fix it
* Sometimes just connecting via ssh would fix it
Probable cause
KVM/QEMU bug. It is very rare and seems to depend on VM load. No way to reproduce it has yet been found. It would usually trigger in newly created VMs. VMs with any kind of load would not show this symptom, hence the relatively low priority; idling VMs were the most likely to display the problem. Talking to other people who had experienced the bug (I know of exactly 2) seemed to yield a workaround. The bug has NOT been filed upstream to my knowledge, mostly due to the difficulty of reproducing it.
An effort for a workaround
Setting disk_aio to native on a couple of VMs yielded promising results: the issue did not reproduce on them, and there were no other side effects. Unfortunately, after migrating all the hosts to use the setting, the issue is still present.
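For reference, the disk_aio change described above could be applied to a single VM roughly as follows. This is a minimal sketch, not the exact procedure that was used: the function name set_native_aio and the GNT override variable are hypothetical, and it assumes the new hypervisor parameter only takes effect once Ganeti restarts the instance.

```shell
#!/bin/sh
# Sketch: switch one Ganeti KVM instance to native AIO.
# GNT is a hypothetical override so the commands can be printed instead of run,
# e.g. GNT="echo gnt-instance" for a dry run.
GNT="${GNT:-sudo gnt-instance}"

# set_native_aio is a hypothetical helper name, not part of Ganeti.
set_native_aio() {
    vm="$1"
    # Change the KVM hypervisor parameter for this instance only.
    $GNT modify -H disk_aio=native "$vm" || return 1
    # Assumption: the new value is only picked up when Ganeti restarts
    # the instance, so reboot it through Ganeti rather than from inside the VM.
    $GNT reboot "$vm"
}
```

Calling set_native_aio <vm_hostname> with GNT="echo gnt-instance" only prints the two commands, which is a safe way to review them first. Afterwards, sudo gnt-instance info <vm_hostname> should list disk_aio among the hypervisor parameters.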
Other stuff
There is one old bug, dating back to 2012, that looked worrying but does NOT apply to our case. It involved sparse files being used on ext4 or xfs volumes (https://access.redhat.com/articles/40643) and has been fixed since then.