Page MenuHomePhabricator

Errors for ifup@ens5.service after rebooting Ganeti VMs
Open, LowPublic

Description

I noticed that sometimes when we (Analytics) reboot our Ganeti VMs we get the following service error:

elukey@kafka-test1006:~$ sudo journalctl -u ifup@ens5.service
-- Logs begin at Tue 2021-01-26 16:40:37 UTC, end at Wed 2021-01-27 07:27:28 UTC. --
Jan 26 16:40:38 kafka-test1006 systemd[1]: Started ifup for ens5.
Jan 26 16:40:38 kafka-test1006 sh[487]: RTNETLINK answers: File exists
Jan 26 16:40:38 kafka-test1006 sh[487]: ifup: failed to bring up ens5
Jan 26 16:40:38 kafka-test1006 systemd[1]: ifup@ens5.service: Main process exited, code=exited, status=1/FAILURE
Jan 26 16:40:38 kafka-test1006 systemd[1]: ifup@ens5.service: Failed with result 'exit-code'.

The interface is up and running fine:

elukey@kafka-test1006:~$ ip link show ens5
2: ens5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether aa:00:00:d2:81:d8 brd ff:ff:ff:ff:ff:ff

elukey@kafka-test1006:~$ ip addr
[..]
2: ens5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether aa:00:00:d2:81:d8 brd ff:ff:ff:ff:ff:ff
    inet 10.64.16.146/22 brd 10.64.19.255 scope global ens5
       valid_lft forever preferred_lft forever
    inet6 2620:0:861:102:10:64:16:146/64 scope global dynamic mngtmpaddr 
       valid_lft 2591992sec preferred_lft 604792sec
    inet6 fe80::a800:ff:fed2:81d8/64 scope link 
       valid_lft forever preferred_lft forever

Restarting the service doesn't work (same error), the only thing that I found working (to clear the error) is ip addr flush followed by either a restart of the service or ifup ens5. Any idea why this happening?

I left kafka-test1006 unfixed to let people check :)

Event Timeline

This is weird. I don't think we have encountered this before.

ExecStop in the systemd unit file runs ifdown ens5 but running that on the host returns

root@kafka-test1006:~# ifdown ens5
ifdown: interface ens5 not configured

At which point, all the rest is fully expected. ifup (in ExecStart) is trying to bring up an already up interface and gets back the RTNETLINK answers: File exists and is marked as failed.

What I don't know is why it fails in the first place. Logs point out that the interface is renamed 1 sec before the systemd unit fails. Interestingly some services (e.g. rsyslog-exporter) report having started and knowing the IP before systemd reports the error, but that's probably because it's passed in their configuration. Others actually fail, e.g. nrpe says Bind to port 5666 on 10.64.16.146 failed: Cannot assign requested address which makes more sense.

@elukey is this reproducible reliably? I can't shake the feeling it's a race condition.

@akosiaris not reliably, but today I rebooted the 4 schema VMs and one of them got back with the same issue..

This is weird. I don't think we have encountered this before.

ExecStop in the systemd unit file runs ifdown ens5 but running that on the host returns

root@kafka-test1006:~# ifdown ens5
ifdown: interface ens5 not configured

At which point, all the rest is fully expected. ifup (in ExecStart) is trying to bring up an already up interface and gets back the RTNETLINK answers: File exists and is marked as failed.

What I don't know is why it fails in the first place. Logs point out that the interface is renamed 1 sec before the systemd unit fails. Interestingly some services (e.g. rsyslog-exporter) report having started and knowing the IP before systemd reports the error, but that's probably because it's passed in their configuration. Others actually fail, e.g. nrpe says Bind to port 5666 on 10.64.16.146 failed: Cannot assign requested address which makes more sense.

@elukey is this reproducible reliably? I can't shake the feeling it's a race condition.

Yeah, it's definitely a race of sort. I've seen it a few times times before, maybe 1 on 50 reboots or less.

This is weird. I don't think we have encountered this before.

ExecStop in the systemd unit file runs ifdown ens5 but running that on the host returns

root@kafka-test1006:~# ifdown ens5
ifdown: interface ens5 not configured

At which point, all the rest is fully expected. ifup (in ExecStart) is trying to bring up an already up interface and gets back the RTNETLINK answers: File exists and is marked as failed.

What I don't know is why it fails in the first place. Logs point out that the interface is renamed 1 sec before the systemd unit fails. Interestingly some services (e.g. rsyslog-exporter) report having started and knowing the IP before systemd reports the error, but that's probably because it's passed in their configuration. Others actually fail, e.g. nrpe says Bind to port 5666 on 10.64.16.146 failed: Cannot assign requested address which makes more sense.

@elukey is this reproducible reliably? I can't shake the feeling it's a race condition.

Yeah, it's definitely a race of sort. I've seen it a few times times before, maybe 1 on 50 reboots or less.

Do you by any chance remember if it was on VMs only? Or was it physical hosts too?

I recall VMs only from my past experience, I encountered this problem a couple of times before this one.

Do you by any chance remember if it was on VMs only? Or was it physical hosts too?

From my memory only VMs. I've checked my IRC logs for cases where people pinged me or where it was logged in SAL and it only returns VMs as well (puppetdb1002, idp1001, air-airflow1001, now the schema host)

It's most probably some long-standing technical debt in ifupdown which we're seeing on VMs because virtio interfaces are quicker than real hardware. It seems unlikely we'll find a fix here (other than migrating to systemd-networkd in the future) and there's no harm other than alert noise. As such, my proposal would be to add a class similar to toil::systemd_scope_cleanup to all VMs which simply cleans this up automatically?

I 'll take your word for it. +1 on the cleanup thing.

Cleaned up with sudo ip addr flush ens5; sudo systemctl restart ifup@ens5 in tmux kafka-test1010, kafka-test1009 and kafka-test1007

Mentioned in SAL (#wikimedia-operations) [2021-02-04T08:22:03Z] <moritzm> reset failed ifup@ens5 on xhgui2001/xhgui1001 T273026

Mentioned in SAL (#wikimedia-operations) [2021-02-19T13:41:40Z] <godog> reset-failed ifup@ens13 on prometheus5001 - T273026

Mentioned in SAL (#wikimedia-operations) [2021-02-22T13:31:52Z] <godog> reset-failed ifup@ens14 on prometheus3001 - T273026

Mentioned in SAL (#wikimedia-operations) [2021-02-25T11:22:46Z] <moritzm> reset-failed ifup@ens5.service on otrs1001 T273026

Mentioned in SAL (#wikimedia-operations) [2021-06-11T05:47:17Z] <elukey> run systemctl reset-failed ifup@en5.service on doh1001 - T273026

Mentioned in SAL (#wikimedia-operations) [2021-07-13T06:53:05Z] <elukey> systemctl reset-failed ifup@ens5 on gitlab2001 - T273026

Mentioned in SAL (#wikimedia-operations) [2021-09-04T09:00:31Z] <elukey> systemctl reset-failed ifup@ens6.service on puppetdb2002 - T273026