
labvirt1011 periodically unavailable
Closed, Resolved, Public

Description

I have a rough timeline and @Andrew can fill in the rest.

Sometime on 7/5/2016, labvirt1011 stopped responding reliably to icinga and to admins. We were able to get on the console, but nothing seemed wrong performance-wise. Puppet runs failed fairly consistently with messages about gethostbyname() failures. ssh as user or root would sometimes succeed for brief periods and sometimes fail. When it fails, it looks as if SSH is presenting a secondary host key or is generally unresponsive. We surmised at the time that there was possibly a networking issue, but a reboot seemed to resolve it. Today, around a day later, the server started exhibiting the same behavior.

  • ssh sessions fail or are too short lived to do anything
  • icinga can intermittently contact the server
  • getting on console shows puppet failing to run
  • no other performance issues that I can see in top etc
  • outbound connections initiated from the host (e.g. from the console) suffer no interruption
  • traffic on eth1 (used for instance traffic) seems unaffected

Note that each datapoint needs to be taken with a grain of salt since everything is intermittent.

Event Timeline

paste from @Andrew in chat https://phabricator.wikimedia.org/P3353

I noted:

[ 893.986233] init: nova-compute main process (6597) killed by KILL signal

and nova-compute was stopped. I started it without issue, but I'm not sure why it was killed.

More background: I did a dist-upgrade on that system right before putting it into service. That was on 2016-06-26. The system behaved well until 2016-07-05, when alarms started firing all over the place.

The alarms started firing right when we were in the middle of setting up networking for labvirt1012, 1013, and 1014.

This is almost certainly fixed by

https://gerrit.wikimedia.org/r/#/c/297783/

we'll know soon enough.

Andrew claimed this task.

So here's the story:

  • A typo in the dhcpd config resulted in 1012, 1013, and 1014 wanting the same IP as 1011.
  • This shouldn't have mattered, since those boxes didn't have an OS installed yet. But when I was trying to set up the RAIDs for those boxes, the BIOS-launched RAID tool somehow did a DHCP lookup and grabbed the IP anyway.
  • Result: four boxes grabbing the same IP. This probably meant that all four were broken, but of course no one would notice or care if 1012-1014 were broken.
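
For anyone wanting to catch this class of typo before it goes live, here is a minimal sketch of a duplicate-address check over ISC dhcpd-style host blocks. It is not the check used on this task: the function name, the sample config, the MAC addresses, and the 192.0.2.x addresses are all made up for illustration, and the parsing is deliberately naive (it ignores comments and nested blocks).

```python
import re
from collections import defaultdict

# Naive patterns for ISC dhcpd-style declarations: host NAME { ... fixed-address X; }
HOST_BLOCK = re.compile(r"\bhost\s+(\S+)\s*\{([^}]*)\}", re.DOTALL)
FIXED_ADDR = re.compile(r"fixed-address\s+([^;]+);")


def find_duplicate_addresses(config_text):
    """Return {address: [host names]} for addresses claimed by more than one host."""
    owners = defaultdict(list)
    for name, body in HOST_BLOCK.findall(config_text):
        for match in FIXED_ADDR.finditer(body):
            # fixed-address may carry a comma-separated list of addresses
            for addr in match.group(1).split(","):
                owners[addr.strip()].append(name)
    return {addr: names for addr, names in owners.items() if len(names) > 1}


# Hypothetical sample config; MACs and IPs are placeholders, not the real values.
SAMPLE = """
host labvirt1011 { hardware ethernet 00:00:5e:00:53:11; fixed-address 192.0.2.11; }
host labvirt1012 { hardware ethernet 00:00:5e:00:53:12; fixed-address 192.0.2.11; }
host labvirt1013 { hardware ethernet 00:00:5e:00:53:13; fixed-address 192.0.2.13; }
"""

if __name__ == "__main__":
    for addr, names in find_duplicate_addresses(SAMPLE).items():
        print(f"{addr} is assigned to: {', '.join(names)}")
```

Run against the real config as a pre-merge check, something like this would flag two host entries sharing an address, assuming the entries use literal addresses rather than hostnames that resolve to the same IP.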

Now the typo is fixed and 1012-1014 are shut down, so all is well. As soon as I start working on 1012 again I'll keep an ear out for alarms, but things /should/ be resolved thanks to the typo fix.

Thanks to faidon for almost instantly diagnosing this!