
releases2002 ganeti VM not getting IP after reboot
Closed, Resolved (Public)

Description

I rebooted releases2002, a ganeti VM.

It did not come back from the reboot. I connected to the console and saw it was up, though, and logged in with the root password.

Then I saw that ferm had failed to start and there were no iptables rules. Trying to start ferm failed on a DNS lookup.

Then I saw that it did not get any IP on the interface at all and networking was simply down. I rebooted it one more time, with no change.

I still have to find out what is happening here. Until then we should be careful rebooting any ganeti VMs; it may be a global issue.

Event Timeline

I got here through alerts that backups of releases2002 have not been working since 2021-01-21. I will disable alerts for this host until it is fixed (please ping me to re-enable them once it is).

Change 657783 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] bacula: Ignore releases2002 backup errors until vm issues are fixed

https://gerrit.wikimedia.org/r/657783

Change 657783 merged by Jcrespo:
[operations/puppet@production] bacula: Ignore releases2002 backup errors until vm issues are fixed

https://gerrit.wikimedia.org/r/657783

I think the following explains it:

root@releases2002:~# ip addr ls
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens6: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether aa:00:00:c1:bf:5d brd ff:ff:ff:ff:ff:ff

and

root@releases2002:~# cat /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

source /etc/network/interfaces.d/*

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
allow-hotplug ens5
iface ens5 inet static
	address 10.192.16.180/22
	gateway 10.192.16.1
	# dns-* options are implemented by the resolvconf package, if installed
	dns-nameservers 10.3.0.1
	dns-search codfw.wmnet
   pre-up /sbin/ip token set ::10:192:16:180 dev ens5
   up ip addr add 2620:0:860:102:10:192:16:180/64 dev ens5

So, somehow the PCI slot for the network card changed, which renamed the card from ens5 to ens6, because systemd now implements persistent interface device names. [1]
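The mismatch between the two outputs above (the config names ens5, the kernel shows ens6) can also be checked mechanically. The following is only a minimal sketch, not something that was run on the host: the function names are hypothetical, and it assumes ifupdown-style config text and the usual `ip addr ls` output format.

```python
import re


def configured_interfaces(interfaces_text):
    """Return interface names declared in ifupdown-style config text."""
    names = set()
    for line in interfaces_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comment lines
        words = stripped.split()
        if words[0] == "iface" and len(words) >= 2:
            names.add(words[1])
        elif words[0] in ("auto", "allow-hotplug"):
            names.update(words[1:])
    return names


def present_interfaces(ip_text):
    """Return interface names from `ip addr ls` style output."""
    # Interface lines look like "2: ens6: <BROADCAST,...>"; address lines are indented.
    return set(re.findall(r"^\d+:\s+([^:@\s]+)", ip_text, flags=re.MULTILINE))


def missing_interfaces(interfaces_text, ip_text):
    """Names the config refers to that the kernel does not know about."""
    return sorted(configured_interfaces(interfaces_text) - present_interfaces(ip_text))


# Shortened versions of the two outputs pasted in this task.
config = "auto lo\niface lo inet loopback\nallow-hotplug ens5\niface ens5 inet static\n"
links = "1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536\n2: ens6: <BROADCAST,MULTICAST> mtu 1500\n"
print(missing_interfaces(config, links))  # ['ens5']
```

Since the name only changes on the next boot, a check like this only helps after the reboot, for example from the console.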

Now as to why, my guess is https://phabricator.wikimedia.org/T272092#6762844, which the following lspci output points to:

akosiaris@releases2002:~$ sudo lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Device 1234:1111 (rev 02)
00:03.0 Unclassified device [00ff]: Red Hat, Inc Virtio memory balloon
00:04.0 SCSI storage controller: Red Hat, Inc Virtio block device
00:05.0 SCSI storage controller: Red Hat, Inc Virtio block device
00:06.0 Ethernet controller: Red Hat, Inc Virtio network device

[1] https://www.freedesktop.org/software/systemd/man/systemd.net-naming-scheme.html
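To make the link between the lspci output and the interface name explicit, here is a rough sketch that guesses the `ens<N>` name from the PCI address of each Ethernet controller. It assumes the slot index systemd uses equals the PCI device number, which is what this VM suggests (00:06.0 became ens6) but is not guaranteed on other hardware. The function name is hypothetical, and addresses with a domain prefix (as printed by `lspci -D`) are not handled.

```python
import re


def expected_ens_names(lspci_text):
    """Guess ens<N> names for the Ethernet controllers listed by lspci.

    Assumption: hotplug slot index == PCI device number (hex in lspci,
    decimal in the interface name).
    """
    names = []
    for line in lspci_text.splitlines():
        match = re.match(r"^[0-9a-f]{2}:([0-9a-f]{2})\.[0-7]\s+Ethernet controller:", line)
        if match:
            names.append("ens%d" % int(match.group(1), 16))
    return names


# With two Virtio block devices the NIC sits at 00:06.0.
after_new_disk = (
    "00:04.0 SCSI storage controller: Red Hat, Inc Virtio block device\n"
    "00:05.0 SCSI storage controller: Red Hat, Inc Virtio block device\n"
    "00:06.0 Ethernet controller: Red Hat, Inc Virtio network device\n"
)
print(expected_ens_names(after_new_disk))  # ['ens6']
```

With only one block device, the network device would presumably sit one slot lower and the same function would return ens5, matching the name still present in /etc/network/interfaces.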

akosiaris claimed this task.

Anyway, I did s/ens5/ens6/ in /etc/network/interfaces and the issue is fixed. I was wondering whether it makes sense to invest time to "fix" this properly, but having met one instance of it in the 5-6 years that we have had ganeti around, I am going to say it's not worth it. That being said, let's document this.
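For the documentation, the substitution can be sketched as a whole-word rename, so that a name like ens50 would not be touched by accident. This is only an illustration with a hypothetical helper name; the actual fix here was a manual edit of the file.

```python
import re


def rename_interface(interfaces_text, old, new):
    """Replace whole-word occurrences of an interface name in config text."""
    pattern = re.compile(r"(?<!\w)%s(?!\w)" % re.escape(old))
    return pattern.sub(new, interfaces_text)


# A few lines taken from the /etc/network/interfaces shown in this task.
stanza = (
    "allow-hotplug ens5\n"
    "iface ens5 inet static\n"
    "   pre-up /sbin/ip token set ::10:192:16:180 dev ens5\n"
)
print(rename_interface(stanza, "ens5", "ens6"))
```

Note that the name appears in the `pre-up` and `up` commands as well as in the stanza header, so all occurrences have to change together.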

Let me reopen this to re-enable backup monitoring (even though the main issue has been fixed).

Thanks @akosiaris for the help here.

jcrespo reassigned this task from jcrespo to akosiaris.

Backing up 0 bytes was unsurprisingly fast :-) Thanks again to both of you.

Mentioned in SAL (#wikimedia-operations) [2021-01-22T17:57:43Z] <mutante> releases1002 (releases.wm.org active backend) - rebooting - hopefully it does not run into T272555 but if it does now it's known how to fix

releases1002 had the exact same issue, which confirms it was caused by adding the new disk.

The same fix (ens5 -> ens6) resolved it there too.