alert on too many close-to-saturated appservers / apiservers
Closed, ResolvedPublic
Assigned To

	jijiki

Authored By

	CDanis
	Nov 4 2020, 2:20 AM

Description

Today we had a latency issue that was at least in part due to too many apiservers running close to zero idle workers (it exploded once several of them ran out of idle workers entirely):

Screenshot_20201103_211908.png (454×1 px, 40 KB)

https://w.wiki/jzT

We should have an alert when more than 20% (?) of the servers in the cluster have fewer than 25% (?) of their workers free. (Both numbers to be tuned based on past saturation events.)
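
To make the proposed condition concrete, here is a minimal Python sketch of the logic only. It is not the eventual check_prometheus rule; the function names, host names, and the shape of the input (a mapping from host to its idle-worker ratio) are assumptions for illustration, and the 25% / 20% defaults are the tentative numbers from above.

```python
from typing import Mapping


def saturated_fraction(idle_ratio_by_host: Mapping[str, float],
                       min_idle_ratio: float = 0.25) -> float:
    """Return the fraction of hosts whose idle-worker ratio is below min_idle_ratio.

    idle_ratio_by_host maps a (hypothetical) host name to idle / total workers.
    """
    if not idle_ratio_by_host:
        return 0.0
    saturated = sum(1 for ratio in idle_ratio_by_host.values()
                    if ratio < min_idle_ratio)
    return saturated / len(idle_ratio_by_host)


def should_alert(idle_ratio_by_host: Mapping[str, float],
                 min_idle_ratio: float = 0.25,
                 max_saturated_fraction: float = 0.20) -> bool:
    """Alert when too large a share of the cluster is close to saturation."""
    fraction = saturated_fraction(idle_ratio_by_host, min_idle_ratio)
    return fraction > max_saturated_fraction


# Hypothetical snapshot: two of five hosts have under 25% of workers idle.
snapshot = {"api1": 0.05, "api2": 0.10, "api3": 0.60, "api4": 0.70, "api5": 0.80}
print(should_alert(snapshot))
```

Keeping the two thresholds as separate parameters makes it easy to replay past saturation events against different values while tuning.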

Details

	Subject	Repo	Branch	Lines +/-
	mediawiki::alerts add alert when 20% of servers is saturated	operations/puppet	production	+12 -0

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		CDanis	T252605 Reliable metrics for idle/busy PHP-FPM workers
		Resolved		jijiki	T267176 alert on too many close-to-saturated appservers / apiservers

Event Timeline

CDanis created this task. Nov 4 2020, 2:20 AM

Restricted Application removed a project: Patch-For-Review. Nov 4 2020, 2:20 AM

I agree that is a good idea!

jijiki claimed this task. Nov 4 2020, 5:15 PM

As discussed, here's a start on the query: https://w.wiki/k6F
Both thresholds in there need some tuning, but it's a start.

This should wind up being just a single check_prometheus rule defined alongside the existing check_prometheuses for the appservers: https://gerrit.wikimedia.org/g/operations/puppet/+/970baabd6f8b319c138ec0ed670852851150b6d5/modules/profile/manifests/mediawiki/alerts.pp#34

lmata moved this task from Inbox to Radar on the observability board. Nov 9 2020, 4:16 PM

jijiki moved this task from Inbox 🐅 to Next up 🥌 on the User-jijiki board. Nov 24 2020, 7:51 PM

jijiki moved this task from Next up 🥌 to In Progress 🏋️‍♀️ on the User-jijiki board. Feb 5 2021, 4:44 PM

Change 664319 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] (WIP) mediawiki::alerts add alert when 20% of servers is saturated

https://gerrit.wikimedia.org/r/664319

gerritbot added a project: Patch-For-Review. Feb 15 2021, 6:28 PM

Change 664319 merged by Effie Mouzeli:
[operations/puppet@production] mediawiki::alerts add alert when 20% of servers is saturated

https://gerrit.wikimedia.org/r/664319

jijiki closed this task as Resolved. Feb 24 2021, 9:34 AM

	F32428187: Screenshot_20201103_211908.png
	Nov 4 2020, 2:20 AM
