Troubleshoot recurring systemd unit failures and availability issues for wdqs1022-24
Closed, Resolved · Public

Assigned To

Authored By

	bking
	Dec 6 2023, 3:43 PM

Description

Per IRC conversation with @dcausse, we are getting alerts for multiple systemd services on the graph split hosts (wdqs1022-24). Creating this ticket to:

Troubleshoot the issue and, ideally, determine the root cause and/or mitigate it

These hosts become unreachable for 1 or 2 hours almost every morning, somewhere between 07:00 and 11:00 UTC.
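
To correlate unit failures with this recurring window, it may help to flag which event timestamps fall inside it. The following is a minimal sketch, not something taken from this ticket: the function name `in_outage_window` and the two constants are hypothetical, and the window is treated as 07:00 (inclusive) to 11:00 (exclusive) UTC.

```python
from datetime import datetime, time, timezone

# Assumed bounds of the recurring outage window described above (UTC).
WINDOW_START = time(7, 0)
WINDOW_END = time(11, 0)


def in_outage_window(ts: datetime) -> bool:
    """Return True if a timezone-aware timestamp falls in the 07:00-11:00 UTC window."""
    if ts.tzinfo is None:
        # Naive datetimes are ambiguous, so refuse them rather than guess a zone.
        raise ValueError("timestamp must be timezone-aware")
    utc_time = ts.astimezone(timezone.utc).time()
    return WINDOW_START <= utc_time < WINDOW_END
```

For example, `in_outage_window(datetime(2023, 12, 19, 8, 40, tzinfo=timezone.utc))` is true, while a timestamp at 18:53 UTC is not.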

Some graphs

systemd failure counts / timing (click on wdqs-test in the graph to filter for just these hosts): https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&from=1700503309983&to=1703092593442&viewPanel=2

Details

	Subject	Repo	Branch	Lines +/-
	wdqs: graph split hosts don't need categories	operations/puppet	production	+3 -1
	wdqs: Work around systemd unit failures	operations/puppet	production	+2 -0

Related Objects

Status	Assigned	Task
Open	None	T335067 Epic: Wikidata Query Service stabilization
Open	None	T337013 [Epic] Splitting the graph in WDQS
Resolved	Gehel	T350464 Expose SPARQL endpoints with full wikidata data set and with split graph to enable experimentation on federation with a split graph
Resolved	bking	T352878 Troubleshoot recurring systemd unit failures and availability issues for wdqs1022-24

Event Timeline

bking created this task. Dec 6 2023, 3:43 PM

Restricted Application added a subscriber: Aklapper. Dec 6 2023, 3:43 PM

Mentioned in SAL (#wikimedia-operations) [2023-12-06T18:53:50Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 4:00:00 on wdqs1024.eqiad.wmnet with reason: T352878

Mentioned in SAL (#wikimedia-operations) [2023-12-06T18:54:07Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on wdqs1024.eqiad.wmnet with reason: T352878

Host rebooted by bking@cumin2002 with reason: None

bking updated the task description. Dec 6 2023, 6:55 PM

Mentioned in SAL (#wikimedia-operations) [2023-12-06T23:03:00Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 34 days, 0:00:00 on wdqs1024.eqiad.wmnet with reason: T352878

Mentioned in SAL (#wikimedia-operations) [2023-12-06T23:03:18Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 34 days, 0:00:00 on wdqs1024.eqiad.wmnet with reason: T352878

Gehel triaged this task as High priority. Dec 7 2023, 9:50 AM

Gehel moved this task from Incoming to Observability on the Data-Platform-SRE board.

dcausse renamed this task from "Troubleshoot recurring systemd unit failures for wdqs1022-24" to "Troubleshoot recurring systemd unit failures and availability issues for wdqs1022-24". Dec 19 2023, 8:40 AM

dcausse updated the task description.

dcausse added a parent task: T350464: Expose SPARQL endpoints with full wikidata data set and with split graph to enable experimentation on federation with a split graph. Dec 19 2023, 8:44 AM

Related issue mentioned by Volans:

wdqs1024 has a unit that is failing and complains that the service is not here at all: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.service
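
A quick way to see which units are currently failed on a host is to parse the output of `systemctl list-units --state=failed --plain --no-legend`. The sketch below is illustrative only and was not run as part of this ticket; the helper names `parse_failed_units` and `failed_units` are hypothetical, and it assumes the unit name is the first whitespace-separated field on each line (a leading status bullet is skipped if one is present).

```python
import subprocess

# Status markers that systemctl may print before the unit name.
_BULLETS = {"●", "*", "×"}


def parse_failed_units(listing: str) -> list[str]:
    """Extract unit names from `systemctl list-units --state=failed` output."""
    units = []
    for line in listing.splitlines():
        fields = line.split()
        if fields and fields[0] in _BULLETS:
            fields = fields[1:]  # drop the leading status bullet
        if fields:
            units.append(fields[0])
    return units


def failed_units() -> list[str]:
    """Ask the local systemd for failed units (requires systemctl on PATH)."""
    result = subprocess.run(
        ["systemctl", "list-units", "--state=failed", "--plain", "--no-legend"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_failed_units(result.stdout)
```

Calling `failed_units()` on an affected host would be expected to include the unit named above if it is still in a failed state.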

Some loose theories, each followed by a proposed test:

Hosts have outdated firmware / update the firmware.

NFS connections are affected by some recurring task on the dumps server / remove NFS.

Puppet code for graph split hosts has bugs / apply the graph split code to another host and see if the issues persist.

Mentioned in SAL (#wikimedia-operations) [2023-12-20T14:58:50Z] <inflatador> bking@cumin2002 disable/mask wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories on wdqs102[24] T352878

Per IRC conversation in #wikimedia-sre, this issue has affected a few servers in the past (see T199911 and T265323). As such, we've created a Puppet class which could potentially mitigate this issue. I'll get a patch started and see if it fixes the problem.

Change 984620 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: Work around systemd unit failures

https://gerrit.wikimedia.org/r/984620

gerritbot added a project: Patch-For-Review. Dec 20 2023, 3:37 PM

Host rebooted by bking@cumin2002 with reason: None

In T352878#9418736, @bking wrote:

Some loose theories, each followed by a proposed test:

Hosts have outdated firmware / update the firmware.

FYI, I tried updating the firmware. All hosts were already up to date except wdqs1022, which had out-of-date iDRAC firmware. I've updated it to the latest iDRAC version available as of this writing (7.0.0).

RKemper updated the task description. Dec 20 2023, 6:03 PM

Change 984620 merged by Bking:

[operations/puppet@production] wdqs: Work around systemd unit failures

https://gerrit.wikimedia.org/r/984620

Maintenance_bot removed a project: Patch-For-Review. Dec 20 2023, 7:30 PM

Change 984648 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: graph split hosts don't need categories

https://gerrit.wikimedia.org/r/984648

gerritbot added a project: Patch-For-Review. Dec 20 2023, 10:44 PM

Mentioned in SAL (#wikimedia-operations) [2023-12-20T22:58:04Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 18 days, 0:00:00 on wdqs[1020-1024].eqiad.wmnet with reason: T352878

Mentioned in SAL (#wikimedia-operations) [2023-12-20T22:58:27Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18 days, 0:00:00 on wdqs[1020-1024].eqiad.wmnet with reason: T352878

Mentioned in SAL (#wikimedia-operations) [2023-12-21T14:07:50Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 18 days, 0:00:00 on 13 hosts with reason: T352878

ops-monitoring-bot mentioned this in T353878: Service implementation for elastic2087-2109. Dec 21 2023, 2:08 PM

Mentioned in SAL (#wikimedia-operations) [2023-12-21T14:08:15Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18 days, 0:00:00 on 13 hosts with reason: T352878

Mentioned in SAL (#wikimedia-operations) [2023-12-21T14:09:06Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 18 days, 0:00:00 on 10 hosts with reason: T352878

Mentioned in SAL (#wikimedia-operations) [2023-12-21T14:09:39Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18 days, 0:00:00 on 10 hosts with reason: T352878

Change 984648 merged by Bking:

[operations/puppet@production] wdqs: graph split hosts don't need categories

https://gerrit.wikimedia.org/r/984648

I merged the above patch, but I noticed that the unit wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.timer was not removed. I've persistently disabled this unit, and we can check this graph in 24 hours to see whether that has cleaned up all the systemd failures. If so, we should be able to close this ticket.
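
The comment above does not record the exact commands used. As a hedged sketch of one way to persistently disable a leftover timer, the snippet below builds a `systemctl disable --now` followed by a `systemctl mask`, matching the "disable/mask" wording in the earlier SAL entry. The final `reset-failed` step on the matching `.service` unit is an assumption added here to clear the stale failed state; the function names and the dry-run default are hypothetical.

```python
import shlex
import subprocess

TIMER = "wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-categories.timer"


def disable_commands(timer: str) -> list[list[str]]:
    """Build the systemctl invocations to stop, disable, and mask a timer."""
    service = timer.removesuffix(".timer") + ".service"
    return [
        ["systemctl", "disable", "--now", timer],  # stop it and remove enablement links
        ["systemctl", "mask", timer],              # prevent it from being started again
        ["systemctl", "reset-failed", service],    # assumption: clear the stale failed state
    ]


def run(commands: list[list[str]], dry_run: bool = True) -> list[str]:
    """Return the rendered commands; execute them only when dry_run is False."""
    rendered = []
    for cmd in commands:
        rendered.append(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
    return rendered
```

With the default `dry_run=True`, `run(disable_commands(TIMER))` only returns the command strings for review; passing `dry_run=False` would need root on the target host.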

Maintenance_bot removed a project: Patch-For-Review. Jan 2 2024, 3:30 PM

dr0ptp4kt subscribed. Jan 2 2024, 5:08 PM

Gehel edited projects, added Data-Platform-SRE (2024.01.01 - 2024.01.21); removed Data-Platform-SRE. Jan 11 2024, 10:26 AM

Gehel moved this task from Backlog to Needs Review on the Data-Platform-SRE (2024.01.01 - 2024.01.21) board.

I don't see any new unit failures since we deployed the patch. Closing...
