Per an IRC conversation with @dcausse, we are getting alerts for multiple systemd services on the graph split hosts (wdqs1022-24). Creating this ticket to:
- Troubleshoot the issue and, hopefully, determine the root cause and/or mitigate it
These hosts become unreachable for 1 or 2 hours almost every morning, somewhere between 7 and 11 UTC.
Some graphs:
systemd failure counts / timing (click on wdqs-test in the graph to filter for just these hosts): https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&from=1700503309983&to=1703092593442&viewPanel=2