Conditions for an alert:
- Median precached scoring delay rises above 4 seconds
- Serving 5xx responses other than 503
- Mean overload (503) errors per second rises above 1 for more than a minute
There are probably more.
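As a rough illustration only, the three conditions above could be evaluated over one observation window like this. The `Window` structure, its field names, and the `alert_reasons` helper are hypothetical and not part of ORES or the icinga checks; the sketch assumes the per-request delays and status codes have already been collected from somewhere.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Sequence


@dataclass
class Window:
    """Hypothetical metrics collected over one observation window."""
    precached_delays_s: Sequence[float]  # scoring delay of each precached request, in seconds
    status_codes: Sequence[int]          # HTTP status code of each response served
    duration_s: float                    # length of the window, in seconds


def alert_reasons(w: Window) -> List[str]:
    """Return the alert conditions from the list above that this window trips."""
    reasons = []
    # Condition 1: median precached scoring delay above 4 seconds.
    if w.precached_delays_s and median(w.precached_delays_s) > 4:
        reasons.append("median precached scoring delay above 4 s")
    # Condition 2: any 5xx response that is not a 503.
    if any(500 <= code <= 599 and code != 503 for code in w.status_codes):
        reasons.append("serving non-503 5xx responses")
    # Condition 3: mean 503 rate above 1 per second over more than a minute.
    overload = sum(1 for code in w.status_codes if code == 503)
    if w.duration_s > 60 and overload / w.duration_s > 1:
        reasons.append("mean 503 rate above 1/s for more than a minute")
    return reasons
```

An empty result means none of the listed conditions fired for that window; which of them should page rather than just notify is a separate decision.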
Subject | Repo | Branch | Lines +/-
---|---|---|---
ores: move monitoring to icinga | operations/puppet | production | +16 -17
ores: Add paging icinga check for home page | operations/puppet | production | +21 -0
So while we should have checks for all these things, I think a full outage (aka 2 in that list) is the only thing that we should actively page for (aka send out an SMS!). I'll work on that this week.
Change 256158 had a related patch set uploaded (by Yuvipanda):
ores: Add paging icinga check for home page
@yuvipanda just noticed the "non 503" part above. so far we don't exclude that or check for specific status codes. check_http can with -e but we have to list the expected code. not sure we can just exclude one that would usually make it turn CRIT by default
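One way to picture the "everything except 503" rule is a small custom classifier rather than a list of expected codes. This is only a sketch: `classify_status` is a hypothetical helper, not something check_http provides, and mapping 503 to WARNING instead of CRITICAL is an assumption (overload is tracked by the separate rate condition). The return values follow the standard Nagios/Icinga plugin exit codes.

```python
# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_status(status: int) -> int:
    """Map an HTTP status code to a plugin exit code (hypothetical helper)."""
    if not 100 <= status <= 599:
        return UNKNOWN    # not a valid HTTP status code
    if status == 503:
        return WARNING    # assumption: overload alone should not go CRIT here
    if status >= 500:
        return CRITICAL   # any other 5xx counts as a real failure
    return OK             # 1xx-4xx are not treated as server failures in this sketch
```

A wrapper script would fetch the page, pass the response code through `classify_status`, and exit with the result.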
Change 256160 had a related patch set uploaded (by Dzahn):
ores/tools: fix duplicate icinga service name
follow-up fixes:
fixed puppet run here: < icinga-wm> RECOVERY - puppet last run on labcontrol1001 is OK: OK
now icinga config error because of the virtual host ..should be gone after next run on neon hopefully
nope.. virtual host does not get created and i do not see why.. we do the exact same thing elsewhere.
disabled temp. for now https://gerrit.wikimedia.org/r/#/c/256165/ to make sure Icinga config isn't broken until we find out why
Change 256376 had a related patch set uploaded (by Dzahn):
ores: move monitoring to icinga
it works now after i moved this into the icinga module.
virtual host:
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=ores.wmflabs.org
service:
you should get email now if anything happens here. and per our IRC conversation @Halfak will use google filters to turn the mail into push notifications. i assume we are all done here now and slightly renamed the ticket.