
HAProxy must start after network is really up
Closed, Resolved (Public)

Description

A recent restart of some cp hosts showed that HAProxy (more precisely, the ExecStartPre=/usr/local/sbin/update-ocsp-all script) fails to start because the network is unreachable.

Looking at the journal, the NIC is reported as "UP" about 2 seconds after systemd states that the network is up. The current haproxy unit has:

After=network.target

Apparently this is not sufficient to handle this kind of condition.

The documentation on this suggests using network-online.target instead, but the log timeline suggests that this may not be sufficient either:

Jun  4 13:35:01 cp7010 systemd[1]: Reached target Network is Online.
Jun  4 13:35:01 cp7010 systemd[1]: Starting HAProxy Load Balancer...
Jun  4 13:35:03 cp7010 kernel: [   25.257411] bnxt_en 0000:4b:00.0 eno12399np0: NIC Link is Up, 25000 Mbps full duplex, Flow control: none
Jun  4 13:35:03 cp7010 kernel: [   25.257413] bnxt_en 0000:4b:00.0 eno12399np0: FEC autoneg off encoding: Clause 74 BaseR
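
As a quick check of the gap in the excerpt above, the sketch below parses the syslog-style timestamps and measures the time between "Reached target Network is Online" and the first "NIC Link is Up" message. The helper names are hypothetical, and the year is an assumption (the journal prefix does not carry one).

```python
from datetime import datetime

# Lines copied from the journal excerpt above.
JOURNAL = [
    "Jun  4 13:35:01 cp7010 systemd[1]: Reached target Network is Online.",
    "Jun  4 13:35:01 cp7010 systemd[1]: Starting HAProxy Load Balancer...",
    "Jun  4 13:35:03 cp7010 kernel: [   25.257411] bnxt_en 0000:4b:00.0 "
    "eno12399np0: NIC Link is Up, 25000 Mbps full duplex, Flow control: none",
]


def timestamp(line, year=2024):
    """Parse the 'Mon DD HH:MM:SS' prefix; the year is assumed, not logged."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def first_match(lines, needle):
    """Return the timestamp of the first line that contains `needle`."""
    for line in lines:
        if needle in line:
            return timestamp(line)
    raise ValueError(f"no line contains {needle!r}")


def link_up_delay(lines):
    """Seconds between the network-online target and the NIC link-up message."""
    online = first_match(lines, "Reached target Network is Online")
    link_up = first_match(lines, "NIC Link is Up")
    return (link_up - online).total_seconds()


if __name__ == "__main__":
    print(link_up_delay(JOURNAL))
```

A positive result means HAProxy was started before the kernel reported the link as up, which matches the failure described here.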

We should investigate whether:

  • This issue is reproducible every time the host is rebooted
  • This issue is only related to magru or affects other hosts in other DCs too
  • Changing After=network.target to After=network-online.target fixes this, or a better way to check that the network is actually reachable is needed
  • Other services could benefit from this change
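
One possible shape for "a better way to check if network is actually reachable" is an explicit probe that retries until a TCP connection succeeds. This is only a sketch: `wait_for_tcp` is a hypothetical helper, and the endpoint that update-ocsp-all needs is not stated in this task, so the host and port are placeholders.

```python
import socket
import time


def wait_for_tcp(host, port, timeout=30.0, interval=1.0):
    """Return True once a TCP connection to (host, port) succeeds,
    or False if none succeeds within roughly `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # `interval` also bounds each individual connection attempt.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            pass  # unreachable, refused, or timed out: retry until the deadline
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    # Placeholder endpoint; replace with whatever the pre-start script contacts.
    ok = wait_for_tcp("192.0.2.1", 80, timeout=5.0)
    raise SystemExit(0 if ok else 1)
```

A probe like this tests actual reachability rather than unit ordering, at the cost of coupling service startup to one specific endpoint.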

Event Timeline

Host rebooted by fabfur@cumin1002 with reason: Test haproxy dependencies

Host rebooted by fabfur@cumin1002 with reason: Test haproxy dependencies

Change #1038872 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] hiera: add profile::cache::base::use_noflow_iface_preup for magru cp nodes

https://gerrit.wikimedia.org/r/1038872

Change #1038872 merged by Ssingh:

[operations/puppet@production] hiera: add profile::cache::base::use_noflow_iface_preup for magru cp nodes

https://gerrit.wikimedia.org/r/1038872

ssingh claimed this task.
ssingh subscribed.

On investigation, we found the following on cp7001:

[Tue Jun  4 16:15:46 2024] bnxt_en 0000:4b:00.0 eno12399np0: NIC Link is Up, 25000 Mbps full duplex, Flow control: ON - receive
[Tue Jun  4 16:15:46 2024] bnxt_en 0000:4b:00.0 eno12399np0: FEC autoneg off encoding: Clause 74 BaseR
[Tue Jun  4 16:15:46 2024] Process accounting resumed
[Tue Jun  4 16:15:47 2024] bnxt_en 0000:4b:00.0 eno12399np0: NIC Link is Down
[Tue Jun  4 16:15:50 2024] bnxt_en 0000:4b:00.0 eno12399np0: NIC Link is Up, 25000 Mbps full duplex, Flow control: none
[Tue Jun  4 16:15:50 2024] bnxt_en 0000:4b:00.0 eno12399np0: FEC autoneg off encoding: Clause 74 BaseR

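The link comes up with receive flow control enabled, drops one second later, and comes back up three seconds after that with flow control off. This is similar to the issue in T344604, so we resolved it by disabling flow control in pre-up (rather than up) for eno12399np0.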
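
For checking other hosts for the same pattern, the Up/Down/Up sequence in the cp7001 log above can be picked out mechanically. The sketch below is an illustration only: the regular expression and helper names are assumptions based on the bnxt_en message format shown here, not an existing tool.

```python
import re

# Lines copied from the cp7001 kernel log above.
DMESG = [
    "[Tue Jun  4 16:15:46 2024] bnxt_en 0000:4b:00.0 eno12399np0: NIC Link is Up, 25000 Mbps full duplex, Flow control: ON - receive",
    "[Tue Jun  4 16:15:46 2024] bnxt_en 0000:4b:00.0 eno12399np0: FEC autoneg off encoding: Clause 74 BaseR",
    "[Tue Jun  4 16:15:46 2024] Process accounting resumed",
    "[Tue Jun  4 16:15:47 2024] bnxt_en 0000:4b:00.0 eno12399np0: NIC Link is Down",
    "[Tue Jun  4 16:15:50 2024] bnxt_en 0000:4b:00.0 eno12399np0: NIC Link is Up, 25000 Mbps full duplex, Flow control: none",
    "[Tue Jun  4 16:15:50 2024] bnxt_en 0000:4b:00.0 eno12399np0: FEC autoneg off encoding: Clause 74 BaseR",
]

# Matches "<iface>: NIC Link is Up|Down", optionally capturing the flow control mode.
LINK_EVENT = re.compile(
    r"(?P<iface>\S+): NIC Link is (?P<state>Up|Down)"
    r"(?:.*Flow control: (?P<flow>.+))?"
)


def link_events(lines):
    """Return (iface, state, flow_control) tuples for link state messages."""
    events = []
    for line in lines:
        m = LINK_EVENT.search(line)
        if m:
            events.append((m["iface"], m["state"], m["flow"]))
    return events


def find_flaps(events):
    """Return (iface, flow_before, flow_after) for each Up -> Down -> Up run."""
    flaps = []
    for prev, mid, nxt in zip(events, events[1:], events[2:]):
        same_iface = prev[0] == mid[0] == nxt[0]
        if same_iface and (prev[1], mid[1], nxt[1]) == ("Up", "Down", "Up"):
            flaps.append((prev[0], prev[2], nxt[2]))
    return flaps


if __name__ == "__main__":
    for iface, before, after in find_flaps(link_events(DMESG)):
        print(f"{iface}: link flapped, flow control {before!r} -> {after!r}")
```

On the excerpt above this reports a single flap on eno12399np0, with flow control going from "ON - receive" to "none".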