
Network unreachable after network-online.target is brought up
Open, Medium, Public

Description

On various systems in our fleet the following can be observed at boot:

  1. network link goes up
  2. network card gets configured
  3. network-online.target is reached
  4. link goes down
  5. services configured to start After=network-online.target fail to start properly
  6. link comes back up

Journal excerpt from cp3059:
Oct 24 11:36:32 cp3059 kernel: bnxt_en 0000:3b:00.0 enp59s0f0: NIC Link is Up, 10000 Mbps full duplex, Fl
Oct 24 11:36:33 cp3059 systemd[1]: Reached target Network is Online.
Oct 24 11:36:33 cp3059 kernel: bnxt_en 0000:3b:00.0 enp59s0f0: NIC Link is Down
[...]
Oct 24 11:36:33 cp3059 lldpd[1117]: error while receiving frame on enp59s0f1d1 (retry: 0): Network is down
Oct 24 11:36:33 cp3059 lldpcli[1116]: lldpd should resume operations
Oct 24 11:36:33 cp3059 lldpd[1084]: 2019-10-24T11:36:33 [INFO/lldpctl] lldpd should resume operations
[...]
Oct 24 11:36:33 cp3059 kernel: bnxt_en 0000:3b:00.0 enp59s0f0: NIC Link is Up, 10000 Mbps full duplex, Flow control: none
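The sequence described above can be checked mechanically against a boot journal. The following is a minimal sketch, not a tool used in this task: the function name `interfaces_down_after_online` is hypothetical, and the two patterns are assumptions derived only from the message formats quoted in this description (the bnxt_en "NIC Link is Down" kernel line and the lldpd "Network is down" error). Other drivers may log link changes differently or not at all.

```python
import re

# Marker logged by systemd when network-online.target is reached.
ONLINE_MARKER = "Reached target Network is Online"

# Patterns assumed from the log lines quoted in this task.
LINK_DOWN_PATTERNS = [
    # kernel: <driver> <pci address> <interface>: NIC Link is Down
    re.compile(r"kernel: \S+ \S+ (?P<iface>\S+): NIC Link is Down"),
    # lldpd: error while receiving frame on <interface> (retry: N): Network is down
    re.compile(r"error while receiving frame on (?P<iface>\S+) "
               r"\(retry: \d+\): Network is down"),
]


def interfaces_down_after_online(lines):
    """Return interfaces reported down after network-online.target was reached.

    Relies on line order rather than timestamps, because the journal
    timestamps shown here only have one-second resolution.
    Interfaces are returned once each, in order of first appearance.
    """
    online_seen = False
    down = []
    for line in lines:
        if ONLINE_MARKER in line:
            online_seen = True
            continue
        if not online_seen:
            continue
        for pattern in LINK_DOWN_PATTERNS:
            match = pattern.search(line)
            if match and match.group("iface") not in down:
                down.append(match.group("iface"))
    return down
```

The function takes any iterable of journal lines, for example the output of `journalctl -b` split on newlines. A non-empty result means that at least one interface was reported down after the target was reached.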

The issue became particularly clear after reimaging cp5011 to the text-ats role (T227432), when multiple important services failed to start at boot.

Note that other hosts have similar boot problems without the network card driver logging much (unlike bnxt_en above). For example, on db1075:

Oct 09 13:31:32 db1075 systemd[1]: Reached target Network is Online.
[... lots of normal-looking output]
Oct 09 13:31:32 db1075 lldpd[649]: error while receiving frame on eno2 (retry: 0): Network is down
Oct 09 13:31:32 db1075 lldpd[641]: 2019-10-09T13:31:32 [WARN/interfaces] error while receiving frame on eno2 (retry: 0): Network is down
Oct 09 13:31:32 db1075 lldpd[641]: 2019-10-09T13:31:32 [WARN/interfaces] error while receiving frame on eno3 (retry: 0): Network is down
Oct 09 13:31:32 db1075 lldpd[641]: 2019-10-09T13:31:32 [WARN/interfaces] error while receiving frame on eno4 (retry: 0): Network is down
Oct 09 13:31:32 db1075 systemd[1]: Started Login Service.
Oct 09 13:31:32 db1075 lldpd[649]: error while receiving frame on eno3 (retry: 0): Network is down
Oct 09 13:31:32 db1075 lldpd[641]: 2019-10-09T13:31:32 [INFO/lldpctl] lldpd should resume operations
Oct 09 13:31:32 db1075 lldpd[649]: error while receiving frame on eno4 (retry: 0): Network is down
Oct 09 13:31:32 db1075 lldpcli[648]: lldpd should resume operations
[...]
Oct 09 13:31:37 db1075 rsyslogd[605]: cannot resolve hostname 'centrallog1001.eqiad.wmnet' [v8.1901.0 try https://www.rsyslog.com/e/2027 ]

Event Timeline

ema created this task. Nov 4 2019, 11:18 AM
Restricted Application added a subscriber: Aklapper. Nov 4 2019, 11:18 AM

The lldpd unit (lldpd.service) only depends on network.target, not on network-online.target. Per systemd.special(7), only the latter postpones startup until the network interface is fully set up.
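To see which units order themselves after network-online.target, the `After=` lines of a unit file can be inspected. The sketch below is an illustration only: the function name `is_ordered_after` is hypothetical, it reads only the `[Unit]` section of a single file, and it ignores drop-ins and dependencies that systemd adds implicitly. The unit text used in the usage note is made up for the example and is not the packaged lldpd.service.

```python
def is_ordered_after(unit_text, target="network-online.target"):
    """Return True if the [Unit] section lists `target` in an After= line.

    After= may appear several times and may hold several
    space-separated unit names; all of them are accumulated.
    """
    in_unit_section = False
    after = []
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith(("#", ";")):
            continue
        # Track which section we are in.
        if line.startswith("[") and line.endswith("]"):
            in_unit_section = (line == "[Unit]")
            continue
        if in_unit_section and "=" in line:
            key, _, value = line.partition("=")
            if key.strip() == "After":
                after.extend(value.split())
    return target in after
```

For a hypothetical unit containing only `After=network.target` in its `[Unit]` section, `is_ordered_after(text)` returns `False`, which corresponds to the situation described for lldpd above. A unit with `After=network.target network-online.target` returns `True`.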

ema added a project: netops. Nov 4 2019, 1:02 PM
ema updated the task description.
MoritzMuehlenhoff triaged this task as Medium priority. Nov 4 2019, 4:05 PM
ema moved this task from Triage to General on the Traffic board. Nov 5 2019, 3:35 PM