
labtestvirt2003 does not survive reboot on normal labvirt kernel of 4.4.0-81-generic
Closed, Invalid (Public)

Description

We previously consolidated on 4.4.0-81-generic #104~14.04.1-Ubuntu SMP Wed Jun 14 12:45:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux because of issues with varying kernel versions. That seemed like a safe choice, and I thought we ran through every labvirt at the time. But now I'm not so sure.

Reproduce:

ssh sarin.codfw.wmnet
screen /home/rush/wmf-auto-reimage-stop-after-puppet -d labtestvirt2003.codfw.wmnet labtestvirt2003.mgmt.codfw.wmnet

(This is a slightly hacked version of the reimage script that stops before the puppet run. Wait for it to exit.)

ssh puppetmaster1001.eqiad.wmnet
install-console labtestvirt2003.codfw.wmnet
uname -a (see 3.13 kernel)
aptitude install linux-image-4.4.0-81-generic
passwd

(Set the root password now: without a puppet run to set it, and with no network, the host will otherwise come back with no way to log in.)

/sbin/reboot

Once it boots into 4.4, the host will be unreachable except via console.

login via console
ip link show

(has no network interfaces)

aptitude remove --purge linux-image-4.4.0-81-generic

(removes 4.4 after which it will reboot into 3.13 and be back on the network)
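The "no network interfaces" check in the steps above can be scripted when comparing kernels from the console. This is a minimal sketch, not part of the original reproduction: `count_non_lo_ifaces` is a hypothetical helper name, and it assumes the one-line-per-interface format printed by `ip -o link show`.

```shell
# Count non-loopback interfaces from `ip -o link show` output read on stdin.
# Each line of that output looks like "2: eth0: <BROADCAST,...> mtu 1500 ...",
# so splitting on ": " puts the interface name in field 2.
count_non_lo_ifaces() {
    awk -F': ' 'NF > 1 && $2 != "lo" { n++ } END { print n + 0 }'
}

# Usage on the affected host (run from the console):
#   ip -o link show | count_non_lo_ifaces
```

A result of 0 matches the broken state described above, where only the loopback device (or nothing at all) is present.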

https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/HP_DL3N0#Common_Actions

Event Timeline

chasemp updated the task description.

I rebooted labtestvirt2002, which is on 4.4.0-81-generic, and it came back fine. However, racktables says this is totally and completely different hardware. labtestvirt2003 is actually fairly new (T166237).

Handing to @Andrew to follow in my footsteps and repro, so as to prove I'm not crazy :)

From @Andrew on IRC

[16:19] chasemp: I can confirm that that kernel breaks networking.  I briefly thought that maybe our other hosts were downgraded to that kernel in some incomplete way so tried upgrading it to the latest Xenial kernel and then downgrading but it still broke.
[16:48] chasemp: in case you want to pretend today never happened I'm going to leave that host with a working (but different) 4.4 kernel so you can just forge ahead.  I'm letting puppet do its thing now.

which is...

rush@labtestvirt2003>uname -a
Linux labtestvirt2003 4.4.0-104-generic #127~14.04.1-Ubuntu SMP Mon Dec 11 12:44:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Here is where things get weird: linux-image-4.4.0-104-generic seems fine on labtestvirt2003, but so far it seems to have a similar no-network effect on labtestvirt2002.

aptitude install linux-image-4.4.0-81-generic; aptitude install linux-image-extra-4.4.0-81-generic

Took a step back to work on another problem for a day and the solution jumps out at me :)
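The task does not spell out why installing linux-image-extra-4.4.0-81-generic helps. One plausible reading, stated here as an assumption, is that the NIC driver module for this hardware ships in the extra package rather than in the base image. A rough sketch for checking which Ethernet driver modules exist on disk for a given kernel version follows; `list_net_modules` is a hypothetical helper, and the `kernel/drivers/net/ethernet` path is the conventional module layout rather than something taken from this task.

```shell
# List Ethernet driver modules installed for a kernel version.
# $1: kernel version string, e.g. 4.4.0-81-generic
# $2: modules root (optional, defaults to /lib/modules)
list_net_modules() {
    ver=$1
    root=${2:-/lib/modules}
    # A missing directory just yields empty output.
    find "$root/$ver/kernel/drivers/net/ethernet" -name '*.ko' 2>/dev/null | sort
}

# Usage: compare the listing before and after installing the extra package.
#   list_net_modules 4.4.0-81-generic
```

If the listing is empty or lacks the driver the hardware needs before the extra package is installed, that would be consistent with the missing interfaces seen on the console.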