A recent restart of some cp hosts showed that HAProxy (more precisely, its ExecStartPre=/usr/local/sbin/update-ocsp-all script) fails to start because the network is unreachable.
Looking at the journal, the NIC link is reported as "Up" about 2 seconds after systemd reports that the network is online. The current haproxy unit has:
After=network.target
Apparently this is not sufficient to handle this kind of condition.
The documentation on this topic suggests using network-online.target instead, but the log timeline suggests that even this might not be sufficient:
Jun 4 13:35:01 cp7010 systemd[1]: Reached target Network is Online.
Jun 4 13:35:01 cp7010 systemd[1]: Starting HAProxy Load Balancer...
Jun 4 13:35:03 cp7010 kernel: [ 25.257411] bnxt_en 0000:4b:00.0 eno12399np0: NIC Link is Up, 25000 Mbps full duplex, Flow control: none
Jun 4 13:35:03 cp7010 kernel: [ 25.257413] bnxt_en 0000:4b:00.0 eno12399np0: FEC autoneg off encoding: Clause 74 BaseR
We should investigate if:
- This issue is reproducible every time the host is rebooted
- This issue is specific to magru or also affects hosts in other DCs
- Changing After=network.target to After=network-online.target fixes this, or a better way to check that the network is actually reachable is needed
- Other services could benefit from this change
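
If a systemd target alone turns out not to be enough, one possible way to check that the network is actually reachable is to retry a TCP connection until it succeeds or a deadline passes. The following is only a minimal sketch of that idea, not what update-ocsp-all does today: the function name wait_for_tcp, its parameters and the default timings are all hypothetical, and the host and port to probe would have to be something the OCSP update really depends on.

```python
import socket
import time


def wait_for_tcp(host: str, port: int, timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to (host, port) succeeds,
    or False if `timeout` seconds pass without a successful connection.

    Hypothetical helper: name, parameters and defaults are assumptions.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Each attempt is bounded by `interval` so a single slow
            # attempt cannot run far past the overall deadline.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Covers unreachable network, refused connections,
            # name resolution failures and per-attempt timeouts.
            pass
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Wait before retrying, without sleeping past the deadline.
        time.sleep(min(interval, remaining))
```

A wrapper could call this before the OCSP update and exit non-zero when it returns False, so that the failure is reported as "network not reachable" rather than as a generic script error.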