
Troubleshoot recurring systemd unit failures and availability issues for wdqs1022-24
Closed, Resolved, Public

Description

Per an IRC conversation with @dcausse, we are getting alerts for multiple systemd services on the graph split hosts (wdqs1022-24). Creating this ticket to:

  • Troubleshoot the issue and, ideally, determine the root cause and/or mitigate it

These hosts become unreachable for 1 or 2 hours almost every morning, somewhere between 07:00 and 11:00 UTC.
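
To check whether alert timestamps really cluster in that morning window, something like the following could be used. It is only a sketch: the window bounds are taken from the sentence above, and the function names are made up for illustration. How the alert timestamps are obtained (Grafana export, Icinga history, etc.) is left open.

```python
from datetime import datetime, timezone

# Window bounds taken from the task description (assumption: end hour is exclusive).
WINDOW_START_HOUR = 7
WINDOW_END_HOUR = 11


def in_outage_window(ts: datetime) -> bool:
    """Return True if a timezone-aware timestamp falls in the 07:00-11:00 UTC window."""
    utc = ts.astimezone(timezone.utc)
    return WINDOW_START_HOUR <= utc.hour < WINDOW_END_HOUR


def share_in_window(timestamps: list[datetime]) -> float:
    """Fraction of timestamps inside the window (0.0 for an empty list)."""
    if not timestamps:
        return 0.0
    hits = sum(1 for ts in timestamps if in_outage_window(ts))
    return hits / len(timestamps)
```

A share close to 1.0 over several weeks of alerts would support the idea of a recurring scheduled trigger rather than random failures.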

Some graphs

systemd failure counts / timing (click on wdqs-test in the graph to filter for just these hosts): https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&from=1700503309983&to=1703092593442&viewPanel=2

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2023-12-06T18:53:50Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 4:00:00 on wdqs1024.eqiad.wmnet with reason: T352878

Mentioned in SAL (#wikimedia-operations) [2023-12-06T18:54:07Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on wdqs1024.eqiad.wmnet with reason: T352878

Host rebooted by bking@cumin2002 with reason: None

Mentioned in SAL (#wikimedia-operations) [2023-12-06T23:03:00Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 34 days, 0:00:00 on wdqs1024.eqiad.wmnet with reason: T352878

Mentioned in SAL (#wikimedia-operations) [2023-12-06T23:03:18Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 34 days, 0:00:00 on wdqs1024.eqiad.wmnet with reason: T352878

Gehel triaged this task as High priority. Dec 7 2023, 9:50 AM
Gehel moved this task from Incoming to Observability on the Data-Platform-SRE board.
dcausse renamed this task from Troubleshoot recurring systemd unit failures for wdqs1022-24 to Troubleshoot recurring systemd unit failures and availability issues for wdqs1022-24. Dec 19 2023, 8:40 AM
dcausse updated the task description.

Related issue mentioned by Volans:

wdqs1024 has a failing unit that complains the service does not exist at all: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service

Some loose theories, each with a proposal for how to test it:

  • Hosts have outdated firmware. Test: update the firmware.
  • NFS connections are affected by some recurring task on the dumps server. Test: remove NFS.

Mentioned in SAL (#wikimedia-operations) [2023-12-20T14:58:50Z] <inflatador> bking@cumin2002 disable/mask wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories on wdqs102[24] T352878
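
For reference, a rough sketch of what a manual disable/mask step could look like. The exact commands that were run are not recorded in this task, so the command list below is an assumption, as is the presence of a matching .timer unit next to the .service. The helper names are hypothetical. It defaults to a dry run that only prints the commands.

```python
import subprocess

# Unit name taken from the SAL entry above; the .service/.timer pairing is an assumption.
UNIT = "wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories"


def mask_commands(unit: str) -> list[list[str]]:
    """Build systemctl commands that stop, disable, and then mask a service and its timer."""
    targets = [f"{unit}.service", f"{unit}.timer"]
    return [
        ["systemctl", "disable", "--now", *targets],
        ["systemctl", "mask", *targets],
    ]


def run(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print the commands, or execute them (root required) when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(mask_commands(UNIT))
```

Note that `systemctl mask` refuses to mask a unit whose unit file is a regular file under /etc/systemd/system, so this may not work as-is if the unit file is deployed there.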

Per an IRC conversation in #wikimedia-sre, this issue has affected a few servers in the past (see T199911 and T265323). As such, we've created a Puppet class that could potentially mitigate it. I'll get a patch started and see if we can fix this.

Change 984620 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: Work around systemd unit failures

https://gerrit.wikimedia.org/r/984620

Host rebooted by bking@cumin2002 with reason: None

Some loose theories with proposals how to test them:

  • Hosts have outdated firmware/update firmware

FYI, I tried updating the firmware. All hosts already had up-to-date firmware, except wdqs1022, which had out-of-date iDRAC firmware. I've updated it to the latest available iDRAC version as of this writing (7.0.0).

Change 984620 merged by Bking:

[operations/puppet@production] wdqs: Work around systemd unit failures

https://gerrit.wikimedia.org/r/984620

Change 984648 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: graph split hosts don't need categories

https://gerrit.wikimedia.org/r/984648

Mentioned in SAL (#wikimedia-operations) [2023-12-20T22:58:04Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 18 days, 0:00:00 on wdqs[1020-1024].eqiad.wmnet with reason: T352878

Mentioned in SAL (#wikimedia-operations) [2023-12-20T22:58:27Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18 days, 0:00:00 on wdqs[1020-1024].eqiad.wmnet with reason: T352878

Mentioned in SAL (#wikimedia-operations) [2023-12-21T14:07:50Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 18 days, 0:00:00 on 13 hosts with reason: T352878

Mentioned in SAL (#wikimedia-operations) [2023-12-21T14:08:15Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18 days, 0:00:00 on 13 hosts with reason: T352878

Mentioned in SAL (#wikimedia-operations) [2023-12-21T14:09:06Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 18 days, 0:00:00 on 10 hosts with reason: T352878

Mentioned in SAL (#wikimedia-operations) [2023-12-21T14:09:39Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18 days, 0:00:00 on 10 hosts with reason: T352878

Change 984648 merged by Bking:

[operations/puppet@production] wdqs: graph split hosts don't need categories

https://gerrit.wikimedia.org/r/984648

I merged the above patch, but I noticed that the unit wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.timer was not removed. I've persistently disabled this service, and we can check this graph in 24 hours to see whether that has cleared all the systemd failures. If so, we should be able to close this ticket.
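
In addition to the graph, the hosts can be checked directly for failed units. A minimal sketch, assuming it runs on the host itself and that parsing the first column of `systemctl list-units --state=failed --plain --no-legend` is good enough. The function names are hypothetical.

```python
import subprocess

# --plain drops the bullet markers and --no-legend drops the header and footer,
# leaving one line per unit: UNIT LOAD ACTIVE SUB DESCRIPTION.
LIST_FAILED = ["systemctl", "list-units", "--state=failed", "--plain", "--no-legend"]


def parse_failed_units(output: str) -> list[str]:
    """Extract the unit names (first column) from list-units output."""
    units = []
    for line in output.splitlines():
        fields = line.split()
        if fields:  # skip blank lines
            units.append(fields[0])
    return units


def failed_units() -> list[str]:
    """Run systemctl on the local host and return the names of failed units."""
    result = subprocess.run(LIST_FAILED, capture_output=True, text=True, check=True)
    return parse_failed_units(result.stdout)


if __name__ == "__main__":
    names = failed_units()
    print("no failed units" if not names else "\n".join(names))
```

An empty list on wdqs1022-24 after 24 hours would match a clean graph.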

bking claimed this task.

I don't see any new unit failures since we deployed the patch. Closing...