
Extend icinga check to catch 500 errors like those of the 20170613 incident
Closed, Resolved · Public

Description

https://wikitech.wikimedia.org/wiki/Incident_documentation/20170613-ORES

[15:00:54] <halfak> mutante, is this something that could have been noticed earlier with icinga?  E.g. maybe we should have a check for each node individually?
[15:01:01] * halfak thinks about followup tasks.
[15:07:34] <mutante> halfak: yea, probably. there is 5xx error detection where Icinga asks graphite for the error rate  https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=5xx   
[15:08:27] <mutante> halfak: unless there is an even better way to check when actual users see errors
[15:11:14] <halfak> mutante, cool I'll add those notes to a phab card.  Thanks :) 
[15:11:18] <mutante> the best monitoring would be if it tests something at a high level, behaves like a user
[15:11:23] <mutante> yw
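
As a rough illustration of the "behaves like a user" idea from the log above, here is a minimal sketch of a probe that requests a URL and maps the result to the standard Nagios/Icinga plugin exit codes. The function name `probe`, the `opener` parameter, and any URL passed to it are assumptions made for illustration; this is not the check that was deployed.

```python
import urllib.error
import urllib.request

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def probe(url, opener=urllib.request.urlopen, timeout=10):
    """Fetch url the way a client would and map the outcome to an exit code.

    `opener` is injectable (hypothetical parameter) so the logic can be
    exercised without network access.
    """
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx; the status is in err.code.
        status = err.code
    except OSError as err:
        # URLError, timeouts and socket errors are all OSError subclasses.
        return CRITICAL, f"CRITICAL: request failed: {err}"
    if status >= 500:
        return CRITICAL, f"CRITICAL: HTTP {status}"
    if status >= 400:
        return WARNING, f"WARNING: HTTP {status}"
    return OK, f"OK: HTTP {status}"
```

A wrapper script would call `probe()` against a real endpoint, print the message, and exit with the returned code so Icinga can pick up the state.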

Event Timeline

Halfak created this task. Jun 13 2017, 8:42 PM
Restricted Application added a subscriber: Aklapper. Jun 13 2017, 8:42 PM
Halfak assigned this task to Zppix. Jun 15 2017, 2:40 PM
Restricted Application added a project: User-Zppix. Jun 15 2017, 2:40 PM

Looks like we can set up an icinga check via grafana for the rate of "retries" (likely failed requests). Relatedly the services team has been aiming to do the same for RESTBase. See T162765: Set up grafana alerting for services
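
For illustration only, a threshold check over a rate series could look roughly like the sketch below. It assumes the data arrives in the JSON shape returned by the Graphite render API (a list of `{"target", "datapoints"}` objects with `[value, timestamp]` pairs); the function name, the metric name in the sample, and the thresholds are hypothetical and do not come from the actual Grafana or Icinga configuration.

```python
import json

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_series(render_json, warn, crit):
    """Return (exit_code, message) for a Graphite render-style JSON payload.

    Expected shape: [{"target": ..., "datapoints": [[value, ts], ...]}].
    Null values (gaps in the data) are skipped before averaging.
    """
    series = json.loads(render_json)
    values = [
        value
        for entry in series
        for value, _timestamp in entry["datapoints"]
        if value is not None
    ]
    if not values:
        return UNKNOWN, "UNKNOWN: no datapoints returned"
    mean = sum(values) / len(values)
    if mean >= crit:
        return CRITICAL, f"CRITICAL: mean rate {mean:.2f} >= {crit}"
    if mean >= warn:
        return WARNING, f"WARNING: mean rate {mean:.2f} >= {warn}"
    return OK, f"OK: mean rate {mean:.2f}"


if __name__ == "__main__":
    # Hypothetical sample: metric name and values are made up for illustration.
    sample = json.dumps([
        {"target": "example.retries.rate",
         "datapoints": [[0.0, 1497380400], [None, 1497380460], [12.0, 1497380520]]}
    ])
    code, message = evaluate_series(sample, warn=2.0, crit=5.0)
    print(message)
```

Averaging over the window is only one possible choice; a check could instead alert on the maximum or on the share of datapoints above the threshold, which changes how sensitive it is to brief spikes.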

Zppix added a comment. Jun 15 2017, 6:19 PM

Dzahn says that additional (or new) checks can be set up in Icinga for the changeprop 500s.

Zppix moved this task from Backlog to Wiki-AI on the User-Zppix board. Jun 15 2017, 6:44 PM
Zppix added a comment. Jun 15 2017, 6:58 PM

Looks like we can set up an icinga check via grafana for the rate of "retries" (likely failed requests). Relatedly the services team has been aiming to do the same for RESTBase. See T162765: Set up grafana alerting for services

The question would be: what is the average Grafana check interval (time between checks), how accurate is it, and how long will it take to set up?

Zppix added a subscriber: Dzahn. Jun 15 2017, 8:22 PM

@Dzahn says the Grafana check interval on Icinga's end can potentially be customized; its default is ~5m.

Zppix added a comment. Jun 16 2017, 3:00 AM

I talked with Halfak; he agrees with going with Grafana. However, he wants the Graphite metrics for changeprop, so I submitted a request to be granted access. I will continue investigating as much as possible to find a way to prevent this from occurring at all. I will also check whether this has happened before and was just never noticed. It appears, however, that scb1001 is the only host that will be affected. I wonder if we need to distribute ORES throughout the entire scb cluster and load balance. If that's impossible, then we should find a way to make sure scb1001 isn't the only instance that ORES can depend on to run properly. I will also talk to pdfrenderer's maintainers/services to see what we can do to limit how much ORES and pdfrenderer compete for server resources. If anyone has relevant info on this incident that isn't already known, or has suggestions, please feel free to add it. Thanks!

Zppix triaged this task as Medium priority. Jun 16 2017, 3:02 AM

ORES is already distributed through the SCB cluster with load balancing.

pdfrenderer has already limited resource usage. T167834

ORES will soon be moving to its own dedicated hardware. T165171

Halfak added a parent task: Restricted Task. Jul 6 2017, 4:50 PM
Halfak claimed this task. Jul 7 2017, 8:41 PM
Halfak moved this task from Active to Review on the Scoring-platform-team (Current) board.
Halfak added a subscriber: Zppix.
Halfak added a subscriber: akosiaris.

@akosiaris, see ^ No rush.

Ok maybe a little rush ;)

https://gerrit.wikimedia.org/r/#/c/363890/ was merged on Jul 10th. Is that change enough, or is there more that needs to be done? I guess some testing?

It looks like we just got an Icinga ping for a 500 response spike (very brief overload event) so it is certainly working :) We can tune when the event fires from grafana now.

Halfak closed this task as Resolved. Jul 17 2017, 2:31 PM
Halfak moved this task from Review to Done on the Scoring-platform-team (Current) board.