
Replace kernel and reboot labvirt1015, 1016, 1017, 1018
Closed, ResolvedPublic

Description

labvirt1015-1018 are running the 4.4.0-83-generic kernel.

We have seen some REALLY BAD behavior from this kernel in labtest. Specifically, three hosts running that kernel lost all network connectivity when rebooted and had to be re-imaged. A more modern (-93) kernel didn't show this problem.

So, let's get those four hosts upgraded. 1015, 1017 and 1018 aren't currently holding any user-owned VMs, so they can be rebooted at any time. 1016 will require a scheduled and pre-announced reboot as it holds quite a lot of real, actively used VMs.

Event Timeline

Andrew triaged this task as High priority. Sep 16 2017, 3:23 PM

Change 378397 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] nova: depool labvirt1016

https://gerrit.wikimedia.org/r/378397

Change 378397 merged by Andrew Bogott:
[operations/puppet@production] nova: depool labvirt1016

https://gerrit.wikimedia.org/r/378397

Mentioned in SAL (#wikimedia-operations) [2017-09-16T15:34:17Z] <andrewbogott> rebooting labvirt1015 for T176044

I just upgraded labvirt1015 and 1017 to -93 and rebooted, and both lost network config just like we saw with -83. So something very bad is going on here. I'm going to re-image 1015 and see where I get.

I re-imaged labvirt1015 and 1017. They're now running 4.4.0-93-generic and rebooting fine. Do I know what just happened here? I do not.

Linux labvirt1001 4.4.0-81-generic #104~14.04.1-Ubuntu SMP Wed Jun 14 12:45:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Linux labvirt1002 4.4.0-81-generic #104~14.04.1-Ubuntu SMP Wed Jun 14 12:45:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Linux labvirt1003 4.4.0-81-generic #104~14.04.1-Ubuntu SMP Wed Jun 14 12:45:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Linux labvirt1004 4.4.0-81-generic #104~14.04.1-Ubuntu SMP Wed Jun 14 12:45:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Linux labvirt1005 4.4.0-81-generic #104~14.04.1-Ubuntu SMP Wed Jun 14 12:45:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Linux labvirt1006 4.4.0-81-generic #104~14.04.1-Ubuntu SMP Wed Jun 14 12:45:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Linux labvirt1007 4.4.0-81-generic #104~14.04.1-Ubuntu SMP Wed Jun 14 12:45:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Linux labvirt1008 4.4.0-81-generic #104~14.04.1-Ubuntu SMP Wed Jun 14 12:45:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Linux labvirt1009 4.4.0-81-generic #104~14.04.1-Ubuntu SMP Wed Jun 14 12:45:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Linux labvirt1010 4.4.0-81-generic #104~14.04.1-Ubuntu SMP Wed Jun 14 12:45:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Linux labvirt1011 4.4.0-81-generic #104~14.04.1-Ubuntu SMP Wed Jun 14 12:45:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Linux labvirt1012 4.4.0-81-generic #104~14.04.1-Ubuntu SMP Wed Jun 14 12:45:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Linux labvirt1013 4.4.0-81-generic #104~14.04.1-Ubuntu SMP Wed Jun 14 12:45:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Linux labvirt1014 4.4.0-81-generic #104~14.04.1-Ubuntu SMP Wed Jun 14 12:45:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Linux labvirt1015 4.4.0-93-generic #116~14.04.1-Ubuntu SMP Mon Aug 14 16:07:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Linux labvirt1016 4.4.0-83-generic #106~14.04.1-Ubuntu SMP Mon Jun 26 18:10:19 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Linux labvirt1017 4.4.0-93-generic #116~14.04.1-Ubuntu SMP Mon Aug 14 16:07:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Linux labvirt1018 4.4.0-83-generic #106~14.04.1-Ubuntu SMP Mon Jun 26 18:10:19 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
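For reference, grouping a `uname -a` survey like the one above by kernel release makes the outliers obvious. The helper below is illustrative (not a tool from this task); the sample lines are taken from the survey output.

```python
from collections import defaultdict

def kernels_by_host(uname_lines):
    """Group hostnames by kernel release from `uname -a` output lines."""
    groups = defaultdict(list)
    for line in uname_lines:
        parts = line.split()
        # `uname -a` format: Linux <hostname> <kernel-release> ...
        if len(parts) >= 3 and parts[0] == "Linux":
            groups[parts[2]].append(parts[1])
    return dict(groups)

sample = [
    "Linux labvirt1014 4.4.0-81-generic #104~14.04.1-Ubuntu SMP Wed Jun 14 12:45:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux",
    "Linux labvirt1015 4.4.0-93-generic #116~14.04.1-Ubuntu SMP Mon Aug 14 16:07:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux",
    "Linux labvirt1016 4.4.0-83-generic #106~14.04.1-Ubuntu SMP Mon Jun 26 18:10:19 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux",
]
for release, hosts in sorted(kernels_by_host(sample).items()):
    print(release, "->", ", ".join(hosts))
```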

Should we be sticking at 4.4.0-81-generic for 15,16,17,18?

Any errors shown for the non-working ones?

@chasemp mentioned this odd issue at the meeting today. If there are no (useful?) logs, are there perhaps any hosts that exhibit the non-working behavior or can be easily triggered to? Let me know (here or on IRC) if you reboot and get the broken behavior, and I can attempt to debug or gather more information from the live (broken) system.

Was network connectivity lost to the server at large or to the VMs running on that labvirt instance? If it's the latter I'm wondering whether this might be related to br_netfilter not being loaded in time (see the change that was made in e56e21857c64). Maybe we're hitting a race that we haven't run into before.
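If the br_netfilter race theory were in play, one mitigation would be forcing the module to load at boot rather than relying on load order. A minimal config fragment, assuming systemd's modules-load.d mechanism (the filename here is illustrative):

```
# /etc/modules-load.d/br_netfilter.conf  (filename illustrative)
# Load br_netfilter at boot so bridged traffic is subject to iptables
# before any VM interfaces come up.
br_netfilter
```

As it turned out, the loss of connectivity affected the host itself (see below), so this path wasn't pursued.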

> Was network connectivity lost to the server at large or to the VMs running on that labvirt instance?

It was the host itself. For the most part these systems aren't hosting any VMs.

root@labtestvirt2002:~# ifconfig up eth0
eth0: Host name lookup failure
ifconfig: `--help' gives usage information.

I'll see if I can make it happen on labvirt1018.

> Was network connectivity lost to the server at large or to the VMs running on that labvirt instance?
>
> It was the host itself. For the most part these systems aren't hosting any VMs.
>
> root@labtestvirt2002:~# ifconfig up eth0
> eth0: Host name lookup failure
> ifconfig: `--help' gives usage information.

That's because you typed ifconfig up eth0, not ifconfig eth0 up (ifconfig takes the interface name first, so it tried to treat "up" as the interface), which makes that a legitimate but unrelated error, FWIW :)

A summary from the IRC conversation I had with @RobH on 2017-09-12

chasemp: I rebooted labtestvirt2001 and it never came back for SSH. So I got on the console.

ip link show

shows only the lo

lshw -class network

shows nics eth0 and eth1 that are 'unclaimed'

<chasemp> labtestvirt2002 now did the same thing.

<robh> someone could have done some kind of kernel update and not rebooted and caused it maybe

<chasemp> I had a moment of wtf and rebooted labvirt1018 in prod just to see if it would come back and it did


<robh> so yeah... i wonder if its a kernel issue
<chasemp> robh: try rebooting labtestvirt2003 see what it does
<robh> ok, its rebooting
<robh> if it comes back with a newer version, its pretty telling someone rolled update and didnt reboot on that small fleet
confirmed it does not come up and is broken in the same way
<robh> chasemp: Linux labtestvirt2003 4.4.0-83-generic #106~14.04.1-Ubuntu SMP Mon Jun 26 18:10:19 UTC 2017 x86_64
<robh> so kernel change happend to them that someone didnt reboot
<chasemp> robh: what was the kernel before?
<robh> Linux labtestvirt2003 4.4.0-81-generic
<robh> but the fact it broke both hardware types seems very odd
<robh> chasemp: yep, all unclaimed under lshw -class network


<robh> so they could be the same broadcom chipset on both, its quite feasible
<robh> NetXtreme BCM5719 Gigabit Ethernet PCIe
<robh> is whats on labtestvirt2003

<robh> whats lshw -class network say the chipset is on 1018?
<chasemp> NetXtreme BCM5720 Gigabit Ethernet PCIe
<robh> slightly different
<robh> is labtestvirt2001 back up?
<chasemp> labtestvirt2001 is NetXtreme BCM5720 Gigabit Ethernet PCI
<chasemp> huh
<robh> well, bye bye theory
<robh> so same chipset on labtestvirt2001 and labvirt1018


<robh> chasemp: ok, so if these are fully test boxes and we dont care about data
<robh> wanna try reimaging one fresh?
<chasemp> sure, labtestvirt2003 is best candiate it's not yet taking instances at all
<robh> cool, lemme reimage it
<robh> chasemp: rebooted fine post install, pre-puppet run
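The "unclaimed" state reported by lshw above means the kernel has no driver bound to the device, which matches NICs vanishing after a kernel change. A small parser for spotting that in `lshw -class network` output; the sample text below is a hand-written approximation of the lshw format, not captured from these hosts:

```python
def unclaimed_nics(lshw_output):
    """Return product strings of network devices lshw marks UNCLAIMED."""
    devices, current, claimed = [], None, True
    for line in lshw_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("*-network"):
            # Flush the previous device before starting a new one.
            if current and not claimed:
                devices.append(current)
            current, claimed = None, "UNCLAIMED" not in stripped
        elif stripped.startswith("product:"):
            current = stripped.split(":", 1)[1].strip()
    if current and not claimed:
        devices.append(current)
    return devices

sample = """\
  *-network UNCLAIMED
       description: Ethernet controller
       product: NetXtreme BCM5719 Gigabit Ethernet PCIe
  *-network
       description: Ethernet interface
       product: NetXtreme BCM5720 Gigabit Ethernet PCIe
       logical name: eth1
"""
print(unclaimed_nics(sample))
```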

I've moved labvirt1018 to 4.4.0-83 but can't reproduce this issue.

I've rebuilt labvirt1015, 1017 and 1018 (and the labtestvirts) with 4.4.0-81. So now all of our virt nodes are running that kernel except for 1016, which needs an evacuation before I mess with it.
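To actually stick at -81 until the NIC issue is understood, the kernel packages would need to be held so unattended upgrades can't pull in -83/-93 again. A dry-run sketch (the `echo` prefixes only print the commands; remove them to really hold the packages, and note the package names assume standard Ubuntu trusty HWE naming):

```shell
# Dry run: show the apt-mark commands that would pin the running kernel.
echo apt-mark hold linux-image-4.4.0-81-generic
echo apt-mark hold linux-image-extra-4.4.0-81-generic
```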

Change 382213 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] nova: add labvirt1017 to the scheduling pool

https://gerrit.wikimedia.org/r/382213

Change 382213 merged by Andrew Bogott:
[operations/puppet@production] nova: add labvirt1017 to the scheduling pool

https://gerrit.wikimedia.org/r/382213

Every labvirt is now running 4.4.0-81-generic, e.g.: Linux labvirt1008 4.4.0-81-generic #104~14.04.1-Ubuntu SMP Wed Jun 14 12:45:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux