I have a rough timeline and @Andrew can fill in the rest.
Sometime on 7/5/2016 labvirt1011 stopped responding to icinga and to admins. We were able to get on the console, but nothing seemed wrong performance-wise. Puppet runs failed fairly consistently with messages about gethostbyname() failing. ssh as a user or as root would sometimes succeed for brief periods and sometimes fail. When it fails, it looks as if SSH is presenting a secondary host key or is generally unresponsive. We surmised at the time that there was possibly a networking issue, but a reboot seemed to resolve it. Today, about a day later, the server started exhibiting the same behavior:
- ssh sessions fail or are too short lived to do anything
- icinga can only intermittently contact the server
- getting on console shows puppet failing to run
- no other performance issues that I can see in top, etc.
- outbound connections initiated from the host (e.g. from the console) suffer no interruption
- traffic on eth1 (used for instance traffic) seems unaffected
Note that each datapoint needs to be taken with a grain of salt since everything is intermittent.
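Since single observations are unreliable here, one option is to sample the same checks repeatedly and compare success rates rather than individual results. The following is only a rough sketch, not something that was run during this incident: the helper names (`tcp_probe`, `dns_probe`, `sample`) are made up for illustration, and port 22 and the timeout are assumptions.

```python
import socket
import time
from typing import Callable, Dict


def tcp_probe(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refusals and resolution errors
        return False


def dns_probe(name: str) -> bool:
    """Return True if gethostbyname() resolves the name."""
    try:
        socket.gethostbyname(name)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def sample(probes: Dict[str, Callable[[], bool]],
           rounds: int,
           interval: float = 0.0) -> Dict[str, float]:
    """Run every probe `rounds` times and return its success rate (0.0 to 1.0)."""
    if rounds <= 0:
        raise ValueError("rounds must be positive")
    successes = {name: 0 for name in probes}
    for i in range(rounds):
        for name, probe in probes.items():
            if probe():
                successes[name] += 1
        if interval and i < rounds - 1:
            time.sleep(interval)
    return {name: successes[name] / rounds for name in probes}


# Hypothetical usage (the bare hostname is a placeholder, adjust as needed):
# rates = sample(
#     {
#         "ssh": lambda: tcp_probe("labvirt1011", 22),
#         "dns": lambda: dns_probe("labvirt1011"),
#     },
#     rounds=60,
#     interval=5.0,
# )
# print(rates)
```

Running the DNS probe from the console of the affected host and the ssh probe from another machine at the same time would show whether the gethostbyname() failures and the inbound connection failures line up in time.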