Errors for ifup@ens5.service after rebooting Ganeti VMs
Open, Low, Public

Description

I noticed that sometimes when we (Analytics) reboot our Ganeti VMs we get the following service error:

elukey@kafka-test1006:~$ sudo journalctl -u ifup@ens5.service
-- Logs begin at Tue 2021-01-26 16:40:37 UTC, end at Wed 2021-01-27 07:27:28 UTC. --
Jan 26 16:40:38 kafka-test1006 systemd[1]: Started ifup for ens5.
Jan 26 16:40:38 kafka-test1006 sh[487]: RTNETLINK answers: File exists
Jan 26 16:40:38 kafka-test1006 sh[487]: ifup: failed to bring up ens5
Jan 26 16:40:38 kafka-test1006 systemd[1]: ifup@ens5.service: Main process exited, code=exited, status=1/FAILURE
Jan 26 16:40:38 kafka-test1006 systemd[1]: ifup@ens5.service: Failed with result 'exit-code'.

The interface is up and running fine:

elukey@kafka-test1006:~$ ip link show ens5
2: ens5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether aa:00:00:d2:81:d8 brd ff:ff:ff:ff:ff:ff

elukey@kafka-test1006:~$ ip addr
[..]
2: ens5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether aa:00:00:d2:81:d8 brd ff:ff:ff:ff:ff:ff
    inet 10.64.16.146/22 brd 10.64.19.255 scope global ens5
       valid_lft forever preferred_lft forever
    inet6 2620:0:861:102:10:64:16:146/64 scope global dynamic mngtmpaddr 
       valid_lft 2591992sec preferred_lft 604792sec
    inet6 fe80::a800:ff:fed2:81d8/64 scope link 
       valid_lft forever preferred_lft forever

Restarting the service doesn't work (same error); the only thing I found that clears the error is an ip addr flush followed by either a restart of the service or ifup ens5. Any idea why this is happening?
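
For reference, the cleanup sequence described above looks roughly like this (the interface name varies per host):

# flush the stale address state that confuses ifup, then bring the unit back up
sudo ip addr flush dev ens5
sudo systemctl restart ifup@ens5.service
# or, instead of restarting the unit, re-run ifup directly:
# sudo ifup ens5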

I left kafka-test1006 unfixed to let people check :)

Event Timeline

This is weird. I don't think we have encountered this before.

ExecStop in the systemd unit file runs ifdown ens5, but running that on the host returns:

root@kafka-test1006:~# ifdown ens5
ifdown: interface ens5 not configured

At that point, all the rest is fully expected: ifup (in ExecStart) tries to bring up an already-up interface, gets back RTNETLINK answers: File exists, and the unit is marked as failed.
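
(For context, the RTNETLINK answers: File exists message is just the kernel's EEXIST error surfaced by ip; for example, re-adding an address that is already configured reproduces it. Illustrative only:)

# adding an address that is already present yields the same kernel reply
ip addr add 10.64.16.146/22 dev ens5
# RTNETLINK answers: File exists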

What I don't know is why it fails in the first place. Logs show that the interface is renamed 1 sec before the systemd unit fails. Interestingly, some services (e.g. rsyslog-exporter) report having started and knowing the IP before systemd reports the error, but that's probably because the address is passed in their configuration. Others actually fail, e.g. nrpe says Bind to port 5666 on 10.64.16.146 failed: Cannot assign requested address, which makes more sense.

@elukey is this reproducible reliably? I can't shake the feeling it's a race condition.

@akosiaris not reliably, but today I rebooted the 4 schema VMs and one of them came back with the same issue.

Yeah, it's definitely a race of some sort. I've seen it a few times before, maybe 1 in 50 reboots or less.

Do you by any chance remember if it was on VMs only? Or was it physical hosts too?

I recall VMs only from my past experience, I encountered this problem a couple of times before this one.

From my memory only VMs. I've checked my IRC logs for cases where people pinged me or where it was logged in SAL, and it only returns VMs as well (puppetdb1002, idp1001, an-airflow1001, and now the schema host).

It's most probably some long-standing technical debt in ifupdown which we're seeing on VMs because virtio interfaces are quicker than real hardware. It seems unlikely we'll find a fix here (other than migrating to systemd-networkd in the future) and there's no harm other than alert noise. As such, my proposal would be to add a class similar to toil::systemd_scope_cleanup to all VMs which simply cleans this up automatically?

I'll take your word for it. +1 on the cleanup thing.

Cleaned up with sudo ip addr flush ens5; sudo systemctl restart ifup@ens5 on kafka-test1010, kafka-test1009 and kafka-test1007

Mentioned in SAL (#wikimedia-operations) [2021-02-04T08:22:03Z] <moritzm> reset failed ifup@ens5 on xhgui2001/xhgui1001 T273026

Mentioned in SAL (#wikimedia-operations) [2021-02-19T13:41:40Z] <godog> reset-failed ifup@ens13 on prometheus5001 - T273026

Mentioned in SAL (#wikimedia-operations) [2021-02-22T13:31:52Z] <godog> reset-failed ifup@ens14 on prometheus3001 - T273026

Mentioned in SAL (#wikimedia-operations) [2021-02-25T11:22:46Z] <moritzm> reset-failed ifup@ens5.service on otrs1001 T273026

Mentioned in SAL (#wikimedia-operations) [2021-06-11T05:47:17Z] <elukey> run systemctl reset-failed ifup@en5.service on doh1001 - T273026

Mentioned in SAL (#wikimedia-operations) [2021-07-13T06:53:05Z] <elukey> systemctl reset-failed ifup@ens5 on gitlab2001 - T273026

Mentioned in SAL (#wikimedia-operations) [2021-09-04T09:00:31Z] <elukey> systemctl reset-failed ifup@ens6.service on puppetdb2002 - T273026

Mentioned in SAL (#wikimedia-operations) [2021-09-27T09:29:43Z] <moritzm> systemctl reset-failed networking T273026

Mentioned in SAL (#wikimedia-operations) [2021-11-16T16:22:27Z] <moritzm> systemctl reset-failed ifup@esn13 on durum5001 after restart T273026

Mentioned in SAL (#wikimedia-operations) [2021-11-16T16:23:34Z] <herron> systemctl reset-failed ifup@ens13 on prometheus5001 T273026

Mentioned in SAL (#wikimedia-operations) [2021-11-23T18:00:43Z] <moritzm> systemctl reset-failed ifup@ens5.service on durum2001 T273026

Mentioned in SAL (#wikimedia-operations) [2021-11-24T11:49:21Z] <moritzm> systemctl reset-failed ifup@ens5.service on poolcounter2003 T273026

Mentioned in SAL (#wikimedia-operations) [2021-11-24T14:28:53Z] <godog> systemctl reset-failed ifup@ens5.service on logstash2024 T273026

Mentioned in SAL (#wikimedia-operations) [2021-11-24T16:00:09Z] <btullis> systemctl reset-failed ifup@ens5.service on schema2004 T273026

Mentioned in SAL (#wikimedia-operations) [2022-01-11T15:22:44Z] <arnoldokoth> systemctl reset-failed ifup@ens5.service on otrs1001 T273026

Mentioned in SAL (#wikimedia-operations) [2022-01-12T19:17:13Z] <mutante> zookeeper-test1002 - CRITICAL - degraded: The following units failed: ifup@ens5.service - for this issue see T273026 (T268074)

Mentioned in SAL (#wikimedia-operations) [2022-01-13T10:27:25Z] <moritzm> systemctl reset-failed ifup@ens5.service on lists1001 T273026

Mentioned in SAL (#wikimedia-operations) [2022-01-13T14:47:05Z] <moritzm> systemctl reset-failed ifup@ens5.service on idp1001 T273026

Mentioned in SAL (#wikimedia-operations) [2022-01-14T09:00:57Z] <moritzm> systemctl reset-failed ifup@ens5.service on an-tool1005 T273026

Mentioned in SAL (#wikimedia-operations) [2022-01-14T10:50:52Z] <moritzm> systemctl reset-failed ifup@ens5.service on an-test-ui1001 T273026

Mentioned in SAL (#wikimedia-operations) [2022-01-14T11:00:39Z] <moritzm> systemctl reset-failed ifup@ens5.service on archiva1002 T273026

Mentioned in SAL (#wikimedia-operations) [2022-01-17T11:00:04Z] <moritzm> systemctl reset-failed ifup@ens5.service on kubetcd1005 T273026

Mentioned in SAL (#wikimedia-operations) [2022-01-17T14:41:51Z] <moritzm> systemctl reset-failed ifup@ens5.service on an-airflow1001 T273026

This is starting to show up rather frequently, so I am wondering whether it is consuming enough time to warrant solving it somehow. Finding the race might prove to be "interesting" (i.e. time-consuming), so we might also just want to silence this alert?

There is indeed a race condition between networking.service and ifup@ens5.service. I checked a couple of VMs that did not exhibit this problem as well as some that did, and what is clear from the journalctl output is that the error shows up when the two units run in parallel (when they run sequentially, it doesn't happen). Both units call ifup.

networking.service has:

ExecStart=/sbin/ifup -a --read-environment

and ifup@.service has:

ExecStart=/bin/sh -ec 'ifup --allow=hotplug %I; ifquery --state %I'

ens5 is marked as allow-hotplug in /etc/network/interfaces, so ifup@ will try to raise the interface. The command 'ifup --allow=hotplug %I; ifquery --state %I' initially gives the impression that ifquery will be executed anyway, since a semicolon (;) is used rather than a double ampersand (&&), but the -e argument to /bin/sh negates that: if ifup fails, the entire ExecStart fails. ifup isn't really meant to be called simultaneously by multiple processes. It doesn't even do the dirty work itself; it executes ip for that. So there is plenty of room for race conditions and errors to show up. The man page is pretty clear that ifupdown's internal state database can become inconsistent and cause issues:

The program keeps records of whether network interfaces are up or down. Under exceptional circumstances these records can become inconsistent with the real states of the interfaces. For example, an interface that was brought up using ifup and later deconfigured using ifconfig will still be recorded as up.
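
(Quick aside on the -e mentioned above; a toy illustration of the shell semantics, not the actual unit:)

sh -ec 'false; echo after'   # prints nothing and exits 1: -e aborts the list at the first failure
sh -c  'false; echo after'   # prints "after" and exits 0: without -e the list continues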

This probably only happens on VMs because they are faster to initialize than hardware, but that's only speculation right now.
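
For anyone wanting to confirm the ordering on a specific VM after a reboot, the relative timing of the two units can be checked with standard systemd tooling, e.g.:

# start timestamps of both units for the current boot
systemctl show -p ExecMainStartTimestamp networking.service ifup@ens5.service
# interleaved log of the two units since boot
journalctl -b -u networking.service -u ifup@ens5.service --no-pager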

So, ways to fix this that come to mind (not an exhaustive list):

  • Override ExecStart for ifup@.service and remove the -e from the sh call. That should allow the command list to continue normally, and from its syntax this does seem to have been the original intent anyway. However, the commit introducing it is 96348b338 and gives the impression that this form was used to emulate a successful, finished execution of ExecStart, which did not work with Type=simple. So it's probably intentional.
  • Remove allow-hotplug for ens5 from /etc/network/interfaces and switch it to auto ens5. That will make ifup@.service a no-op, as the --allow=hotplug flag to ifup will no longer match ens5. The thing is, unless I am mistaken, this is configured by d-i, and up to now we have purposefully not touched /etc/network/interfaces in puppet. So we would need to see why d-i does that and whether it makes sense to change it, and even then we would need to manually fix all existing VMs.
  • Stop executing ifup@.service. Interestingly, ifup@.service would be better named ifupdown-hotplug@.service; see Debian bug 809171 for the long story, but it is executed by udev when interfaces are hotplugged/unplugged, via /lib/udev/rules.d/80-ifupdown.rules and the ifupdown-hotplug script in /lib/udev. While hotplugging network interfaces is not a thing in our setup (Ganeti VMs require a reboot to see new hardware and we haven't needed to hotplug anything yet), messing with udev rules isn't something I am keen on just to fix a race condition.
  • Mess a bit with the ordering of ifup@.service to make sure it always executes after networking.service. Right now, they have no ordering or dependency relationship between them; see the dot graph of the output of systemd-analyze dot ifup@.service networking.service:
    (Attachment: a.png, a dot graph of the two units. Color legend: black = Requires, red = Conflicts, green = After.)

Adding an After=networking.service would solve the issue (a minimal sketch of such a drop-in is included after this list). It might also introduce consequences that are not clear right now, like making this more difficult to forward-port.

  • Get rid of ifupdown and /etc/network/interfaces and get a proper and modern network interface manager. See T234207. That's what I would prefer most, but it's a big project.
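
For the record, a minimal sketch of what the ordering drop-in from the fourth option could look like (untested, only to make the idea concrete):

# create a drop-in that orders ifup@.service after networking.service
mkdir -p /etc/systemd/system/ifup@.service.d
cat > /etc/systemd/system/ifup@.service.d/after-networking.conf <<'EOF'
[Unit]
After=networking.service
EOF
systemctl daemon-reload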

Nice detective work! (I think the rate has stayed pretty much the same, it's just that we now track the occurrences more systematically).

I'm planning to add support for systemd-networkd this (calendar) year, exact quarter TBD. This will fix this issue (and several others) for good, so I'm a little reluctant to spend time on this given that the impact is just some alert spam (and every fix has a chance to cause regressions, which we typically only notice on future reboots). One other option would be to simply add a reconciliation systemd timer which runs 2 minutes after boot and runs "systemctl reset-failed" on the ifup service if present.

I'm planning to add support for systemd-networkd this (calendar) year, exact quarter TBD. This will fix this issue (and several others) for good, so I'm a little reluctant to spend time on this given that the impact is just some alert spam (and every fix has a chance to cause regressions, which we typically only notice on future reboots).

😍

One other option would be to simply add a reconciliation systemd timer which runs 2 minutes after boot and runs "systemctl reset-failed" on the ifup service if present.

That would work just fine as a stopgap if we don't want the alert spam while we migrate to systemd-networkd.

Mentioned in SAL (#wikimedia-operations) [2022-03-08T10:51:01Z] <btullis> btullis@datahubsearch1001:~$ sudo systemctl reset-failed ifup@ens13.service T273026

Mentioned in SAL (#wikimedia-operations) [2022-05-11T09:27:54Z] <jayme> systemctl reset-failed ifup@ens5.service on registry2003 - T273026

Mentioned in SAL (#wikimedia-operations) [2022-05-25T06:20:38Z] <elukey> elukey@an-tool1011:~$ sudo systemctl reset-failed ifup@ens13.service - T273026

Mentioned in SAL (#wikimedia-analytics) [2022-05-25T06:20:46Z] <elukey> elukey@an-tool1011:~$ sudo systemctl reset-failed ifup@ens13.service - T273026

Mentioned in SAL (#wikimedia-analytics) [2022-06-13T13:20:14Z] <btullis> btullis@datahubsearch1001:~$ sudo systemctl reset-failed ifup@ens13.service T273026

Mentioned in SAL (#wikimedia-operations) [2022-08-18T10:22:10Z] <jayme> kubernetes2005:~$ sudo systemctl status ifup@ens13.service - T273026

Mentioned in SAL (#wikimedia-operations) [2022-08-18T10:37:43Z] <jayme> kubernetes2006:~$ sudo systemctl reset-failed ifup@ens13.service - T273026

Mentioned in SAL (#wikimedia-operations) [2022-08-18T10:55:13Z] <jayme> kubernetes2016:~$ sudo systemctl reset-failed ifup@ens13.service - T273026

Mentioned in SAL (#wikimedia-operations) [2022-08-18T11:00:54Z] <jayme> kubernetes2015:~$ sudo systemctl reset-failed ifup@ens13.service - T273026

Mentioned in SAL (#wikimedia-operations) [2022-08-22T12:20:06Z] <moritzm> fix up network config for ldap-replica2006 T273026

Mentioned in SAL (#wikimedia-operations) [2022-08-22T12:20:56Z] <jayme> kubernetes1016:~$ sudo systemctl reset-failed ifup@ens13.service - T273026

Mentioned in SAL (#wikimedia-operations) [2022-10-21T12:35:01Z] <claime> rebooted kubernetes2006.codfw.wmnet manually - root cause T273026

Mentioned in SAL (#wikimedia-operations) [2022-10-21T13:32:16Z] <claime> kubernetes1005:~$ sudo systemctl reset-failed ifup@ens13.service T273026

Mentioned in SAL (#wikimedia-operations) [2022-10-24T08:26:59Z] <claime> kubernetes1006:~$ sudo systemctl reset-failed ifup@ens13.service T273026

Mentioned in SAL (#wikimedia-operations) [2022-10-24T08:37:02Z] <claime> kubernetes1015:~$ sudo systemctl reset-failed ifup@ens13.service T273026

Mentioned in SAL (#wikimedia-operations) [2022-10-24T09:17:22Z] <claime> kubernetes2015:~$ sudo systemctl reset-failed ifup@ens13.service T273026

Removing the Ganeti tag, this is unrelated to Ganeti and only caused by ifupdown (and will eventually be solved by systemd-networkd). It's only that the Ganeti VM bootup is much faster, so we're seeing the race far more often.

Change 854553 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] toil: add bandaid for ifupdown race

https://gerrit.wikimedia.org/r/854553

Change 854553 merged by Filippo Giunchedi:

[operations/puppet@production] toil: add bandaid for ifupdown race

https://gerrit.wikimedia.org/r/854553

I have deployed the bandaid above, so ifup failures will be reset (only once) on Ganeti VMs two minutes after boot. We're of course papering over the problem, though given the previous context in the task this seems the lesser evil, at least until we have replaced ifupdown.
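
For readers without access to the change above: conceptually the bandaid boils down to a pair of units along these lines (a rough sketch, not the actual Puppet-managed version; names are illustrative):

cat > /etc/systemd/system/ifup-reset-failed.service <<'EOF'
[Unit]
Description=Clear failed ifup@ units left behind by the ifupdown boot race

[Service]
Type=oneshot
# adjust the systemctl path if needed for the distro in use
ExecStart=/bin/systemctl reset-failed 'ifup@*.service'
EOF

cat > /etc/systemd/system/ifup-reset-failed.timer <<'EOF'
[Unit]
Description=Run ifup-reset-failed once, two minutes after boot

[Timer]
OnBootSec=2min

[Install]
WantedBy=timers.target
EOF

systemctl daemon-reload && systemctl enable --now ifup-reset-failed.timer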

fgiunchedi claimed this task.

I'll optimistically resolve the task, though of course reopen if something is amiss!

Change 854936 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] toil::ganeti_ifupdown: fix systemctl path

https://gerrit.wikimedia.org/r/854936

Change 854936 merged by Elukey:

[operations/puppet@production] toil::ganeti_ifupdown: fix systemctl path

https://gerrit.wikimedia.org/r/854936

@aborrero do you have more details on what happened with those?

I'm not sure the symptoms are the same. In the Ganeti case the hypervisor is changing the PCIe layout of the virtual machine, which causes Linux to assign a different netdev name when it reboots.

Looking at cloudservices2005-dev, it still seems to have its primary IP on device name 'eno1'? Did it reboot with a different name initially or something? How did you resolve it?

I don't know exactly what happened.

My hunch is that the systemd service has been in a failed state for months but somehow the task was created only yesterday (because of some monitoring/alerting change that I couldn't trace).

I solved them with the same reset-failed command that others have been using in the past.

Sorry, my bad, I should have looked closer at this task. I was mixing it up with T272555; I see this is a different set of symptoms, not related to the netdev name changing.

Must be a race condition of some kind, I'm guessing, but I'm not sure what it might be. Probably another sign that ifupdown is past its sell-by date and we should investigate alternatives.

Yes, this is a long-standing race in ifupdown and will only be fixed for good once we have phased it out. The race is much easier to trigger on VMs (since they boot faster) than on bare metal, and for VMs we apply a bandaid via a systemd timer to clean these up: https://gerrit.wikimedia.org/r/c/operations/puppet/+/854553/

If any baremetal servers have a complex boot where this is also triggered often, we can also add it there.

(Adding Data-Engineering project tag for re-triage, as Analytics-Radar is an inactive project tag after Analytics got archived)