Create alerts for high resource utilization on the stat servers
Closed, Resolved (Public)

Description

The stat servers frequently become unresponsive (see this Slack thread for a summary of the recent outages). Each time a host becomes unresponsive, the disruption affects multiple developers' workflows.

As of now, we are addressing this reactively: developers ping SREs, who then log in to a very unresponsive box and manually find and kill processes.

Creating this ticket to add alerts for high load, so that the SRE team will at least be aware of the situation and perhaps be able to mitigate it more quickly.
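
For illustration only, here is a rough sketch of the kind of condition a "high load" alert might check: the load average divided by the number of CPUs, compared against a threshold. This is not the rule from the patches linked in the timeline. The function names and the threshold of 2.0 are assumptions made up for this example.

```python
import os


def load_per_cpu(load_avg: float, cpu_count: int) -> float:
    """Normalize a load average by the number of CPUs on the host."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_avg / cpu_count


def is_high_load(load_avg: float, cpu_count: int, threshold: float = 2.0) -> bool:
    """Return True when per-CPU load is strictly above the threshold.

    The default threshold of 2.0 is an assumed value for illustration,
    not a figure taken from this ticket.
    """
    return load_per_cpu(load_avg, cpu_count) > threshold


if __name__ == "__main__":
    # os.getloadavg() is available on Unix-like systems only.
    one_min, five_min, fifteen_min = os.getloadavg()
    cpus = os.cpu_count() or 1
    if is_high_load(five_min, cpus):
        print(f"High load: 5-minute load average {five_min:.2f} on {cpus} CPUs")
    else:
        print(f"Load OK: 5-minute load average {five_min:.2f} on {cpus} CPUs")
```

Normalizing by CPU count lets one threshold apply to hosts of different sizes. Using the 5-minute average rather than the 1-minute one is a guess at how to avoid firing on short spikes.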

Event Timeline

Change #1064436 had a related patch set uploaded (by Bking; author: Bking):

[operations/alerts@master] WIP: Add load alerts for stat hosts

https://gerrit.wikimedia.org/r/1064436

Change #1064436 merged by Bking:

[operations/alerts@master] Add load alerts for stat hosts

https://gerrit.wikimedia.org/r/1064436

Change #1065248 had a related patch set uploaded (by Bking; author: Bking):

[operations/alerts@master] Data-platform: change severity of stat host high load alerts

https://gerrit.wikimedia.org/r/1065248

Change #1065248 merged by jenkins-bot:

[operations/alerts@master] Data-platform: change severity of stat host high load alerts

https://gerrit.wikimedia.org/r/1065248

Change #1066661 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] data-platform: fix deploy tags for stat_host

https://gerrit.wikimedia.org/r/1066661

Change #1066661 merged by jenkins-bot:

[operations/alerts@master] data-platform: fix deploy tags for stat_host

https://gerrit.wikimedia.org/r/1066661

I believe we have satisfied the acceptance criteria (AC) for this ticket, so I'm going to close it out for now...