https://wikitech.wikimedia.org/wiki/Incident_documentation/20170613-ORES
[15:00:54] <halfak> mutante, is this something that could have been noticed earlier with icinga? E.g. maybe we should have a check for each node individually?
[15:01:01] * halfak thinks about followup tasks.
[15:07:34] <mutante> halfak: yea, probably. there is 5xx error detection where Icinga asks graphite for the error rate https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=5xx
[15:08:27] <mutante> halfak: unless there is an even better way to check when actual users see errors
[15:11:14] <halfak> mutante, cool I'll add those notes to a phab card. Thanks :)
[15:11:18] <mutante> the best monitoring would be if it tests something at a high level, behaves like a user
[15:11:23] <mutante> yw
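
The per-node idea raised above could look roughly like the sketch below. It is only an illustration, not the check that was actually deployed: the node names, the thresholds, and the function names are all hypothetical. It assumes the 5xx rate for each node arrives as Graphite-style `[value, timestamp]` pairs (null values allowed) and maps each node to a standard Nagios/Icinga plugin exit code, so that one bad node is not hidden by a healthy aggregate.

```python
# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def mean_rate(datapoints):
    """Average the non-null values of Graphite-style [value, timestamp] pairs."""
    values = [value for value, _timestamp in datapoints if value is not None]
    return sum(values) / len(values) if values else None


def check_node(datapoints, warn=1.0, crit=5.0):
    """Map one node's 5xx rate to an exit code.

    The warn/crit thresholds are hypothetical placeholders.
    """
    rate = mean_rate(datapoints)
    if rate is None:
        return UNKNOWN  # no data at all is itself worth flagging
    if rate >= crit:
        return CRITICAL
    if rate >= warn:
        return WARNING
    return OK


def check_all(per_node):
    """Evaluate every node individually instead of one aggregate rate."""
    return {node: check_node(points) for node, points in per_node.items()}


if __name__ == "__main__":
    # Hypothetical sample data: one healthy node, one failing node.
    sample = {
        "node-a": [[0.0, 1497366000], [0.2, 1497366060]],
        "node-b": [[9.0, 1497366000], [None, 1497366060]],
    }
    for node, status in sorted(check_all(sample).items()):
        print(node, status)
```

A check like this still only sees what the error metrics report. mutante's last point, a test that behaves like a user, would need a separate probe that sends a real request and validates the response.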