HAProxy must start after network is really up
Closed, Resolved (Public)

Authored By

	Fabfur
	Tue, Jun 4, 2:27 PM

Description

A recent restart of some cp hosts showed that HAProxy (more precisely, its ExecStartPre=/usr/local/sbin/update-ocsp-all script) fails to start because the network is not yet reachable.

Looking at the journal, the NIC is reported as "UP" about 2 seconds after systemd states that the network is up. The current haproxy unit has:

After=network.target

Apparently this is not sufficient to handle this kind of condition.

The documentation suggests using network-online.target instead, but the log timeline suggests that this might not be sufficient either:

Jun  4 13:35:01 cp7010 systemd[1]: Reached target Network is Online.
Jun  4 13:35:01 cp7010 systemd[1]: Starting HAProxy Load Balancer...
Jun  4 13:35:03 cp7010 kernel: [   25.257411] bnxt_en 0000:4b:00.0 eno12399np0: NIC Link is Up, 25000 Mbps full duplex, Flow control: none
Jun  4 13:35:03 cp7010 kernel: [   25.257413] bnxt_en 0000:4b:00.0 eno12399np0: FEC autoneg off encoding: Clause 74 BaseR
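
To make the gap in the timeline above explicit, here is a minimal Python sketch that computes the delay between "Reached target Network is Online" and the first "NIC Link is Up" message. The function names and the assumed year are illustrative only (syslog timestamps carry no year), and this is not a tool used in this task.

```python
import re
from datetime import datetime

# Assumption: syslog-style timestamps carry no year, so one is supplied here.
ASSUMED_YEAR = 2024

# "<Mon> <day> <HH:MM:SS> <host> <message>", tolerant of padded day numbers.
LINE_RE = re.compile(r"^(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s+(\S+)\s+(.*)$")


def parse_line(line):
    """Return (timestamp, message) for a syslog-style line, or None."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    month, day, clock, _host, message = m.groups()
    ts = datetime.strptime(
        f"{ASSUMED_YEAR} {month} {day} {clock}", "%Y %b %d %H:%M:%S"
    )
    return ts, message


def link_up_delay(lines):
    """Seconds between 'Network is Online' and the next 'NIC Link is Up'.

    Returns None if either event is missing.
    """
    online = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, message = parsed
        if online is None:
            if "Reached target Network is Online" in message:
                online = ts
        elif "NIC Link is Up" in message:
            return (ts - online).total_seconds()
    return None


JOURNAL = [
    "Jun  4 13:35:01 cp7010 systemd[1]: Reached target Network is Online.",
    "Jun  4 13:35:01 cp7010 systemd[1]: Starting HAProxy Load Balancer...",
    "Jun  4 13:35:03 cp7010 kernel: [   25.257411] bnxt_en 0000:4b:00.0 "
    "eno12399np0: NIC Link is Up, 25000 Mbps full duplex, Flow control: none",
]

delay = link_up_delay(JOURNAL)  # 2.0 seconds for the excerpt above
```

Since the journal has one-second resolution here, the result is only accurate to within about a second.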

We should investigate whether:

This issue is reproducible every time a host is rebooted
This issue is specific to magru or also affects hosts in other DCs
Changing After=network.target to After=network-online.target fixes this, or a better way to check that the network is actually reachable is needed
Other services could benefit from this change
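
As one possible shape for a "better way to check" the link, here is a hedged Python sketch that polls the kernel's operstate file for an interface until it reports "up" or a timeout expires. The helper name, the timeout values and the idea of running it before HAProxy starts are assumptions for illustration. It is not what was deployed for this task.

```python
import time
from pathlib import Path


def wait_for_operstate_up(iface, timeout=10.0, interval=0.2,
                          sysfs_root="/sys/class/net"):
    """Poll <sysfs_root>/<iface>/operstate until it reads "up".

    Returns True once the interface reports "up", or False if the timeout
    expires first. A missing or unreadable file counts as "not up yet".
    `sysfs_root` is a parameter only so the helper can be tested against
    a temporary directory.
    """
    state_file = Path(sysfs_root) / iface / "operstate"
    deadline = time.monotonic() + timeout
    while True:
        try:
            if state_file.read_text().strip() == "up":
                return True
        except OSError:
            pass  # interface not present yet, keep waiting
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical usage: wait_for_operstate_up("eno12399np0", timeout=15.0)
```

A check like this only confirms link state, not that any remote endpoint is reachable, so it would complement the unit ordering rather than replace it.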

Details

	Subject	Repo	Branch	Lines +/-
	hiera: add profile::cache::base::use_noflow_iface_preup for magru cp nodes	operations/puppet	production	+1 -0


Related Objects

Mentioned Here: T344604: NIC autonegotiation takes 4s in esams

Event Timeline

Fabfur created this task. Tue, Jun 4, 2:27 PM

Restricted Application added a subscriber: Aklapper. Tue, Jun 4, 2:27 PM

Host rebooted by fabfur@cumin1002 with reason: Test haproxy dependencies

Change #1038872 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] hiera: add profile::cache::base::use_noflow_iface_preup for magru cp nodes

https://gerrit.wikimedia.org/r/1038872

gerritbot added a project: Patch-For-Review. Tue, Jun 4, 5:13 PM

Change #1038872 merged by Ssingh:

[operations/puppet@production] hiera: add profile::cache::base::use_noflow_iface_preup for magru cp nodes

https://gerrit.wikimedia.org/r/1038872

Maintenance_bot removed a project: Patch-For-Review. Tue, Jun 4, 5:30 PM

On investigation, we found the following (on cp7001):

[Tue Jun  4 16:15:46 2024] bnxt_en 0000:4b:00.0 eno12399np0: NIC Link is Up, 25000 Mbps full duplex, Flow control: ON - receive
[Tue Jun  4 16:15:46 2024] bnxt_en 0000:4b:00.0 eno12399np0: FEC autoneg off encoding: Clause 74 BaseR
[Tue Jun  4 16:15:46 2024] Process accounting resumed
[Tue Jun  4 16:15:47 2024] bnxt_en 0000:4b:00.0 eno12399np0: NIC Link is Down
[Tue Jun  4 16:15:50 2024] bnxt_en 0000:4b:00.0 eno12399np0: NIC Link is Up, 25000 Mbps full duplex, Flow control: none
[Tue Jun  4 16:15:50 2024] bnxt_en 0000:4b:00.0 eno12399np0: FEC autoneg off encoding: Clause 74 BaseR

This is similar to the issue in T344604, so we resolved it by disabling flow control for eno12399np0 in pre-up (rather than up).
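
The link flap in the cp7001 excerpt can be quantified with a short Python sketch that extracts "NIC Link is Up/Down" events from human-readable dmesg output and reports how long each interface stayed down. The function names are illustrative, and the regex assumes the bnxt_en message layout shown in the excerpt.

```python
import re
from datetime import datetime

# Matches e.g. "[Tue Jun  4 16:15:47 2024] bnxt_en 0000:4b:00.0 eno12399np0: NIC Link is Down"
DMESG_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+\S+\s+\S+\s+(?P<iface>\S+): "
    r"NIC Link is (?P<state>Up|Down)"
)


def link_events(lines):
    """Return a list of (timestamp, interface, state) link events."""
    events = []
    for line in lines:
        m = DMESG_RE.match(line)
        if m is None:
            continue
        # Collapse padded whitespace ("Jun  4") before parsing.
        stamp = " ".join(m["ts"].split())
        ts = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
        events.append((ts, m["iface"], m["state"]))
    return events


def down_intervals(events):
    """Return (interface, seconds_down) for each Down -> Up transition."""
    went_down = {}
    intervals = []
    for ts, iface, state in events:
        if state == "Down":
            went_down[iface] = ts
        elif iface in went_down:
            started = went_down.pop(iface)
            intervals.append((iface, (ts - started).total_seconds()))
    return intervals


DMESG = [
    "[Tue Jun  4 16:15:46 2024] bnxt_en 0000:4b:00.0 eno12399np0: NIC Link is Up, 25000 Mbps full duplex, Flow control: ON - receive",
    "[Tue Jun  4 16:15:46 2024] Process accounting resumed",
    "[Tue Jun  4 16:15:47 2024] bnxt_en 0000:4b:00.0 eno12399np0: NIC Link is Down",
    "[Tue Jun  4 16:15:50 2024] bnxt_en 0000:4b:00.0 eno12399np0: NIC Link is Up, 25000 Mbps full duplex, Flow control: none",
]

flaps = down_intervals(link_events(DMESG))  # [("eno12399np0", 3.0)]
```

For the excerpt above this yields a single 3-second down interval, between the link dropping at 16:15:47 and coming back at 16:15:50 with flow control set to none.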
