Setup monitoring notifications for ORES
Closed, Resolved (Public)

Description

Conditions for an alert:

  1. Median precached scoring delay rises above 4 seconds.
  2. Serving 5xx responses other than 503.
  3. Mean rate of overload (503) errors stays above 1 per second for more than a minute.
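
The three conditions above can be sketched as a single check. This is only an illustration, not the deployed monitoring: the function name, its inputs, and the reading of condition 3 (per-second samples covering more than 60 seconds whose mean exceeds 1) are assumptions.

```python
from statistics import median


def alert_conditions(precache_delays_s, status_codes, overload_per_s):
    """Return the numbers (1, 2, 3) of the alert conditions that are met.

    All three inputs are hypothetical:
      precache_delays_s: recent precached scoring delays, in seconds
      status_codes:      recent HTTP status codes served
      overload_per_s:    one count of 503 responses per second of the window
    """
    triggered = []

    # 1. Median precached scoring delay above 4 seconds.
    if precache_delays_s and median(precache_delays_s) > 4.0:
        triggered.append(1)

    # 2. Any 5xx response that is not a 503 (overload).
    if any(500 <= code <= 599 and code != 503 for code in status_codes):
        triggered.append(2)

    # 3. Mean 503 rate above 1 per second over a window longer than a minute
    #    (assumed reading of "for more than a minute").
    if len(overload_per_s) > 60:
        if sum(overload_per_s) / len(overload_per_s) > 1.0:
            triggered.append(3)

    return triggered
```

For example, `alert_conditions([1, 5, 6], [200, 500], [2] * 61)` reports all three conditions, while a 30-second window of 503s alone reports none.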

There are probably more.

Event Timeline

Halfak assigned this task to yuvipanda.
Halfak set the priority of this task to Needs Triage.
Halfak updated the task description.
Halfak moved this task to Parked on the Machine-Learning-Team (Active Tasks) board.
Halfak subscribed.

So while we should have checks for all these things, I think a full outage (aka 2 in that list) is the only thing that we should actively page for (aka send out an SMS!). I'll work on that this week.

Change 256158 had a related patch set uploaded (by Yuvipanda):
ores: Add paging icinga check for home page

https://gerrit.wikimedia.org/r/256158

Change 256158 merged by Dzahn:
ores: Add paging icinga check for home page

https://gerrit.wikimedia.org/r/256158

@yuvipanda I just noticed the "non 503" part above. So far we don't exclude 503 or check for specific status codes. check_http can match on status with -e, but then we have to list the expected codes. I'm not sure we can exclude just one code that would otherwise make the check turn CRIT by default.
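
One way to get "any 5xx except 503" without listing expected codes is a small custom check instead of check_http. The sketch below is hypothetical and not what was deployed: the function names are made up, and treating 503 as WARNING rather than OK is an assumption. The exit codes follow the usual Nagios/Icinga plugin convention.

```python
import urllib.error
import urllib.request

# Conventional Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify_status(code):
    """Map an HTTP status code to a plugin exit code."""
    if code == 503:
        # Overload: assumed here to warn rather than page.
        return WARNING
    if 500 <= code <= 599:
        # Any other 5xx is treated as an outage.
        return CRITICAL
    if 400 <= code <= 499:
        return WARNING
    return OK


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url.

    Connection failures (urllib.error.URLError) are not handled here and
    propagate to the caller.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still available.
        return err.code
```

A wrapper script would call `classify_status(fetch_status(url))` and exit with the result.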

Change 256160 had a related patch set uploaded (by Dzahn):
ores/tools: fix duplicate icinga service name

https://gerrit.wikimedia.org/r/256160

follow-up fixes:

  1. https://gerrit.wikimedia.org/r/#/c/256160/
  2. https://gerrit.wikimedia.org/r/#/c/256161/

fixed puppet run here: < icinga-wm> RECOVERY - puppet last run on labcontrol1001 is OK: OK

now there is an Icinga config error because of the virtual host; it should be gone after the next run on neon, hopefully

nope, the virtual host does not get created and I do not see why. we do the exact same thing elsewhere.

disabled temporarily for now (https://gerrit.wikimedia.org/r/#/c/256165/) to make sure the Icinga config isn't broken until we find out why

Change 256376 had a related patch set uploaded (by Dzahn):
ores: move monitoring to icinga

https://gerrit.wikimedia.org/r/256376

Change 256376 merged by Dzahn:
ores: move monitoring to icinga

https://gerrit.wikimedia.org/r/256376

Dzahn renamed this task from Setup paging for ORES to Setup monitoring notifications for ORES. Dec 2 2015, 2:11 AM

you should get email now if anything happens here, and per our IRC conversation @Halfak will use Google filters to turn the mail into push notifications. I assume we are all done here now, so I slightly renamed the ticket.

feel free to reopen if you think something is missing