Monitor request throughput on etcd/confd hosts to prevent incidents of software requiring config reload too often
Closed, Resolved (Public)

Description

As a followup to T322360, it was discovered that etcd/conf* hosts were under very high load because clients were requesting configuration refreshes more often than they should:

Screenshot_20221103_214327.png (1×1 px, 188 KB)

This was only discovered eventually, because the hosts ran out of disk space due to access log spam. Ideally we would have caught this earlier by checking for abnormal throughput (e.g. surpassing a limit, either globally or per host/service) and alerting on it, so that if it happens again it is detected (almost) immediately.

Event Timeline

I would suggest that the alert should be on a request per second threshold, rather than on high server load. I say this as the high request rate is what should have alerted us; a high server load isn't per se indicative of what is causing it.
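To make the idea of a request-per-second threshold concrete, here is a minimal sketch of the check, not the alert that was eventually deployed. It assumes each host exposes a monotonically increasing request counter that is sampled periodically; the `Sample` type, the function names, the host names, and the threshold value are all hypothetical and only serve as illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One reading of a monotonically increasing request counter (assumed data source)."""
    host: str
    timestamp: float  # seconds
    total_requests: int


def request_rate(older: Sample, newer: Sample) -> float:
    """Average requests per second between two samples of the same host."""
    elapsed = newer.timestamp - older.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in chronological order")
    delta = newer.total_requests - older.total_requests
    if delta < 0:
        # The counter went backwards, e.g. after a service restart:
        # treat the newer reading as counted from zero.
        delta = newer.total_requests
    return delta / elapsed


def hosts_over_threshold(pairs, max_rps):
    """Return {host: rate} for every host whose rate exceeds max_rps."""
    over = {}
    for older, newer in pairs:
        rate = request_rate(older, newer)
        if rate > max_rps:
            over[newer.host] = rate
    return over


# Hypothetical hosts and numbers, for illustration only.
pairs = [
    (Sample("conf-a", 0.0, 1000), Sample("conf-a", 60.0, 7000)),
    (Sample("conf-b", 0.0, 1000), Sample("conf-b", 60.0, 61000)),
]
alerts = hosts_over_threshold(pairs, max_rps=500.0)
```

With these made-up samples, only `conf-b` is flagged, since it averages 1000 requests per second against 100 for `conf-a`. The point of keying on the rate is that it names the cause directly, whereas a generic load metric would still need further investigation.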

Yeah, by load I didn't mean the CPU load reported by the uptime command, but a more abstract concept of load: a metric to be decided by the service owners. I can update the title, which I agree could be misleading.

jcrespo renamed this task from Monitor high load on etcd/conf* hosts to prevent incidents of software requiring config reload too often to Monitor request throughput on etcd/confd hosts to prevent incidents of software requiring config reload too often. Dec 5 2022, 3:06 PM
Joe moved this task from Backlog to Doing on the SRE-Sprint-Week-Sustainability-March2023 board.
Joe added a subscriber: andrea.denisse.

Grabbing this task as part of sprint week.

Joe triaged this task as Medium priority. Mar 21 2023, 3:46 PM

Change 901622 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/alerts@master] etcd: add alert for high traffic volumes

https://gerrit.wikimedia.org/r/901622

Change 901622 merged by jenkins-bot:

[operations/alerts@master] etcd: add alert for high traffic volumes

https://gerrit.wikimedia.org/r/901622

Joe moved this task from Doing to Done on the SRE-Sprint-Week-Sustainability-March2023 board.