Monitor request throughput on etcd/confd hosts to prevent incidents of software requiring config reload too often
Closed, Resolved, Public
Authored By

	jcrespo
	Nov 4 2022, 11:04 AM

Description

As a follow-up to T322360, it was discovered that software was requesting configuration refreshes from the etcd/conf* hosts more often than it should, causing very high load on the hosts:

Screenshot_20221103_214327.png (1×1 px, 188 KB)

This was only discovered eventually, because the hosts ran out of disk space due to access log spam. Ideally we would have caught this earlier by checking for abnormal throughput, e.g. surpassing a limit, either globally or by host/service, and alerting on it, so that if it happens again it is detected (almost) immediately.
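As an illustration of the kind of check described above, here is a minimal Python sketch that compares request rates against a per-host limit and a global limit. It is not the alert that was eventually merged; the limit values, the host names and the function name `find_throughput_violations` are all hypothetical, and real thresholds would have to come from the service owners.

```python
from typing import Dict, List

# Hypothetical limits, chosen only for illustration.
PER_HOST_LIMIT_RPS = 500.0
GLOBAL_LIMIT_RPS = 1200.0


def find_throughput_violations(rates_by_host: Dict[str, float]) -> List[str]:
    """Return alert messages for hosts, or the whole cluster, over their limit.

    rates_by_host maps a host name to its current request rate in req/s.
    """
    alerts = []
    # Per-host check: a single noisy host can be under the global limit.
    for host, rps in sorted(rates_by_host.items()):
        if rps > PER_HOST_LIMIT_RPS:
            alerts.append(f"{host}: {rps:.0f} req/s exceeds per-host limit")
    # Global check: many moderately busy hosts can add up to an anomaly.
    total = sum(rates_by_host.values())
    if total > GLOBAL_LIMIT_RPS:
        alerts.append(f"global: {total:.0f} req/s exceeds global limit")
    return alerts
```

Checking both levels covers the two cases mentioned in the description: one host being hammered, and an overall increase spread across hosts.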

Details

	Subject	Repo	Branch	Lines +/-
	etcd: add alert for high traffic volumes	operations/alerts	master	+34 -0


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		jijiki	T322360 conf* hosts ran out of disk space due to log spam
		Resolved		Joe	T322400 Monitor request throughput on etcd/confd hosts to prevent incidents of software requiring config reload too often

Event Timeline

jcrespo created this task. Nov 4 2022, 11:04 AM

Restricted Application added a subscriber: Aklapper. Nov 4 2022, 11:04 AM

jcrespo mentioned this in T322360: conf* hosts ran out of disk space due to log spam. Nov 4 2022, 11:05 AM

Ladsgroup subscribed. Nov 4 2022, 11:08 AM

jcrespo added a parent task: T322360: conf* hosts ran out of disk space due to log spam. Nov 4 2022, 11:12 AM

andrea.denisse claimed this task. Nov 4 2022, 2:15 PM

I would suggest that the alert should be on a request per second threshold, rather than on high server load. I say this as the high request rate is what should have alerted us; a high server load isn't per se indicative of what is causing it.
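To make the suggestion concrete, here is a rough Python sketch of deriving a requests-per-second figure from two samples of a cumulative request counter and comparing it to a threshold. The function names, the sample values and the threshold are assumptions made for illustration only; the thread does not say which metric or threshold was used.

```python
def requests_per_second(prev_count: int, prev_ts: float,
                        curr_count: int, curr_ts: float) -> float:
    """Average request rate between two samples of a cumulative counter.

    Timestamps are in seconds. If the counter went backwards, it is assumed
    to have been reset to zero (for example after a service restart).
    """
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    if curr_count < prev_count:
        delta = curr_count  # counter reset: count only what came after it
    else:
        delta = curr_count - prev_count
    return delta / elapsed


def should_alert(rate_rps: float, threshold_rps: float) -> bool:
    """Alert on the request rate itself, not on host load."""
    return rate_rps > threshold_rps
```

For example, a counter going from 1000 to 4000 over 60 seconds gives 50 req/s, which `should_alert` would flag against a hypothetical threshold of 40 req/s regardless of how loaded the host happens to be.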

Joe edited projects, added serviceops-radar; removed serviceops. Nov 7 2022, 4:56 PM

JMeybohm added projects: Sustainability (Incident Followup), SRE-OnFire. Dec 1 2022, 11:05 AM

I would suggest that the alert should be on a request per second threshold, rather than on high server load. I say this as the high request rate is what should have alerted us; a high server load isn't per se indicative of what is causing it.

Yeah, by load I didn't mean the CPU load reported by the uptime command, but a more abstract concept of load: a metric to be decided by the service owners. I can update the title, which I agree could be misleading.

jcrespo renamed this task from "Monitor high load on etcd/conf* hosts to prevent incidents of software requiring config reload too often" to "Monitor request throughput on etcd/confd hosts to prevent incidents of software requiring config reload too often". Dec 5 2022, 3:06 PM

Joe added a project: SRE-Sprint-Week-Sustainability-March2023. Mar 20 2023, 3:22 PM

Grabbing this task as part of sprint week.

Joe triaged this task as Medium priority. Mar 21 2023, 3:46 PM

Change 901622 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/alerts@master] etcd: add alert for high traffic volumes

https://gerrit.wikimedia.org/r/901622

gerritbot added a project: Patch-For-Review. Mar 21 2023, 3:46 PM

Change 901622 merged by jenkins-bot:

[operations/alerts@master] etcd: add alert for high traffic volumes

https://gerrit.wikimedia.org/r/901622

Maintenance_bot removed a project: Patch-For-Review. Mar 22 2023, 6:10 AM

Joe closed this task as Resolved. Mar 24 2023, 9:40 AM

Joe moved this task from Doing to Done on the SRE-Sprint-Week-Sustainability-March2023 board.

