labvirt1011 periodically unavailable
Closed, Resolved · Public

Authored By

	• chasemp
	Jul 7 2016, 2:40 AM

Description

I have a rough timeline and @Andrew can fill in the rest.

Sometime on 7/5/2016 labvirt1011 began failing to respond to icinga or admins. We were able to get on the console, but nothing seemed wrong performance-wise. Puppet runs failed fairly consistently with messages about gethostbyname() failures. ssh as a user or as root would sometimes succeed for brief periods and sometimes fail. When it failed, it looked as if SSH was presenting a secondary host key or was generally unresponsive. We surmised at the time that there was possibly a networking issue, but a reboot seemed to resolve it. Today, around a day later, the server started exhibiting the same behavior.

ssh sessions fail or are too short-lived to do anything
icinga can intermittently contact the server
getting on console shows puppet failing to run
no other performance issues that I can see in top, etc.
outbound connections initiated from the host (e.g. from the console) suffer no interruption
traffic on eth1 (used for instance traffic) seems unaffected

Note that each datapoint needs to be taken with a grain of salt since everything is intermittent.

Event Timeline

• chasemp created this task. Jul 7 2016, 2:40 AM

Restricted Application added subscribers: Zppix, Aklapper. Jul 7 2016, 2:40 AM

• chasemp triaged this task as High priority. Jul 7 2016, 2:41 AM

paste from @Andrew in chat https://phabricator.wikimedia.org/P3353

I noted:

[ 893.986233] init: nova-compute main process (6597) killed by KILL signal

and nova-compute was stopped. I started it without issue, but I'm not sure why it was killed.

Andrew updated the task description. Jul 7 2016, 11:44 AM

Paladox subscribed. Jul 7 2016, 11:45 AM

More background: I did a dist-upgrade on that system right before putting it into service. That was on 2016-06-26. The system behaved well until 2016-07-05, when alarms started firing all over the place.

The alarms started firing right when we were in the middle of setting up networking for labvirt1012, 1013, and 1014.

This is almost certainly fixed by

https://gerrit.wikimedia.org/r/#/c/297783/

we'll know soon enough.

So here's the story:

A typo in the dhcpd config resulted in 1012, 1013, and 1014 wanting the same IP as 1011.
This shouldn't have mattered, since those boxes didn't have an OS installed yet. But when I was trying to set up the RAIDs for those boxes, the BIOS-launched RAID tool somehow did a DHCP lookup and grabbed the IP anyway.
Result: four boxes grabbing for the same IP. This probably meant that all four were broken, but of course no one would notice or care if 1012-1014 were broken.
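
For anyone wanting to catch this class of typo before it goes live, here is a rough sketch of a duplicate-address check. It assumes the config uses ISC dhcpd-style `host NAME { ... fixed-address X; }` blocks; the function name, the sample hosts' MACs, and the addresses (documentation ranges) are all made up for illustration and are not taken from the real config or from the gerrit change.

```python
import re
from collections import defaultdict

# Assumption: flat (non-nested) ISC dhcpd-style host blocks, e.g.
#   host NAME { hardware ethernet ...; fixed-address ...; }
HOST_BLOCK = re.compile(r"host\s+(\S+)\s*\{(.*?)\}", re.DOTALL)
FIXED_ADDR = re.compile(r"fixed-address\s+([^;\s]+)\s*;")


def find_duplicate_addresses(config_text):
    """Return {address: [host names]} for addresses claimed by more than one host."""
    # Drop '#' comments so commented-out entries are not counted.
    text = re.sub(r"#.*", "", config_text)
    claims = defaultdict(list)
    for name, body in HOST_BLOCK.findall(text):
        for addr in FIXED_ADDR.findall(body):
            claims[addr].append(name)
    return {addr: names for addr, names in claims.items() if len(names) > 1}


# Hypothetical sample config: MACs and IPs are placeholders, not real values.
SAMPLE = """
host labvirt1011 {
    hardware ethernet 00:00:5e:00:53:0b;
    fixed-address 192.0.2.11;
}
host labvirt1012 {
    hardware ethernet 00:00:5e:00:53:0c;
    fixed-address 192.0.2.11;  # copy-paste typo: same address as 1011
}
host labvirt1013 {
    hardware ethernet 00:00:5e:00:53:0d;
    fixed-address 192.0.2.13;
}
"""

if __name__ == "__main__":
    for addr, names in find_duplicate_addresses(SAMPLE).items():
        print(f"{addr} is claimed by: {', '.join(names)}")
```

Something like this could run as a pre-merge lint on the dhcpd config. It only compares the literal `fixed-address` values, so two different hostnames that resolve to the same IP would slip through.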

Now the typo is fixed and 1012-1014 are shut down, so all is well. As soon as I start working on 1012 again I'll keep an ear out for alarms, but things /should/ be resolved thanks to the typo fix.

Thanks to faidon for almost instantly diagnosing this!
