[15:00:54] <halfak> mutante, is this something that could have been noticed earlier with icinga? E.g. maybe we should have a check for each node individually?
[15:01:01] * halfak thinks about followup tasks.
[15:07:34] <mutante> halfak: yea, probably. there is 5xx error detection where Icinga asks graphite for the error rate https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=5xx
[15:08:27] <mutante> halfak: unless there is an even better way to check when actual users see errors
[15:11:14] <halfak> mutante, cool I'll add those notes to a phab card. Thanks :)
[15:11:18] <mutante> the best monitoring would be if it tests something at a high level, behaves like a user
[15:11:23] <mutante> yw
|Resolved||Halfak||T167830 Extend icinga check to catch 500 errors like those of the 20170613 incident|
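For reference, here is a minimal sketch of what a per-host 5xx error-rate check along the lines discussed above could look like, written in Python against Graphite's render API and using standard Nagios/Icinga exit codes. The Graphite URL, metric path, and thresholds are placeholder assumptions, not the production values; the real check would be wired into the existing Icinga/Puppet setup rather than run as a standalone script.

```python
#!/usr/bin/env python3
"""Sketch of a Nagios/Icinga-style check that asks Graphite for a
per-host 5xx error rate. Metric path, thresholds, and Graphite URL
are illustrative assumptions, not the production configuration."""
import sys
import requests

GRAPHITE_URL = "https://graphite.example.org/render"  # assumption: placeholder host
METRIC = "ores.{host}.response.5xx.rate"              # assumption: hypothetical metric path
WARN, CRIT = 1.0, 5.0                                 # errors/sec, illustrative thresholds


def check_5xx(host):
    # Ask Graphite for the last 10 minutes of the metric as JSON.
    params = {
        "target": METRIC.format(host=host),
        "from": "-10min",
        "format": "json",
    }
    resp = requests.get(GRAPHITE_URL, params=params, timeout=10)
    resp.raise_for_status()
    series = resp.json()
    if not series:
        print(f"UNKNOWN: no data for {host}")
        return 3
    # Average the non-null datapoints of the first returned series.
    points = [v for v, _ts in series[0]["datapoints"] if v is not None]
    if not points:
        print(f"UNKNOWN: only null datapoints for {host}")
        return 3
    rate = sum(points) / len(points)
    if rate >= CRIT:
        print(f"CRITICAL: {host} 5xx rate {rate:.2f}/s >= {CRIT}")
        return 2
    if rate >= WARN:
        print(f"WARNING: {host} 5xx rate {rate:.2f}/s >= {WARN}")
        return 1
    print(f"OK: {host} 5xx rate {rate:.2f}/s")
    return 0


if __name__ == "__main__":
    # e.g. ./check_ores_5xx.py scb1001
    sys.exit(check_5xx(sys.argv[1] if len(sys.argv) > 1 else "scb1001"))
```

Checking each node individually in this way would have surfaced the problem on scb1001 even while the cluster-wide error rate stayed below the existing alert threshold.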
I talked with Halfak and he agrees with going with Grafana; however, he wants the Graphite metrics for changeprop, so I submitted a request to be granted access. I will continue investigating as much as possible to find a way to prevent this from occurring entirely. I will also see if I can find out whether this has happened before and was simply never noticed. It appears, however, that scb1001 is the only host that will be affected; I wonder if we need to distribute ORES throughout the entire scb cluster and load balance it. If that's impossible, then we should find a way to make sure scb1001 isn't the only instance that ORES can depend on to run properly. I will also talk to pdfrender's maintainers/Services to see what we can do to limit interaction between ORES and them in terms of server resources. If anyone has any relevant info on this incident that isn't already known, please feel free to add it, or offer suggestions. Thanks!
This is ready to go.
We'll need to make a change like https://gerrit.wikimedia.org/r/362567 to enable it.