
Alert on Kask error rate
Closed, Declined (Public)

Description

During the recent sessionstore outage, Kask started emitting a large number of error messages the moment the outage began, but this went unnoticed until other alerts revealed the extent of the outage. Kask is usually very terse as far as logging is concerned, so we should monitor and alert on high rates of error messages.

Event Timeline

This alerting would have been helpful for another recent incident of the same nature.

I also opened T327960, which covers alerting based on the rate of HTTP status 500 responses. It is (currently) the case that every status 500 also emits an error log, so it would pick up on incidents like 2023-01-24_sessionstore_quorum_issues equally well. There are, however, cases where an error is logged that does not correspond to a user-facing event (a 500), and T327960 would not cover those. As you say, Kask's signal-to-noise ratio for error logging is quite good, so alerting on the rate of error messages might be generally useful.
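To make the overlap concrete, here is a rough sketch of the two checks evaluated over the same observation window. Everything in it is an assumption for illustration: the `Window` fields, the function names, and the thresholds are hypothetical and are not taken from Kask or from T327960.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Counts observed over one evaluation window (hypothetical shape)."""
    requests: int     # total HTTP requests served
    status_500: int   # responses with HTTP status 500
    error_logs: int   # error-level log lines emitted


def http_500_alert(w: Window, max_ratio: float = 0.01) -> bool:
    """Fire when the share of 500 responses exceeds max_ratio (assumed threshold)."""
    if w.requests == 0:
        return False
    return w.status_500 / w.requests > max_ratio


def error_log_alert(w: Window, window_seconds: int = 300,
                    max_per_second: float = 1.0) -> bool:
    """Fire when error log lines per second exceed max_per_second (assumed threshold)."""
    return w.error_logs / window_seconds > max_per_second


# Outage-like window: every 500 also produced an error log, so both checks fire.
outage = Window(requests=10_000, status_500=2_000, error_logs=2_000)

# Errors logged without any user-facing 500: only the log-rate check fires.
background = Window(requests=10_000, status_500=0, error_logs=900)
```

Under these assumed numbers, the first window trips both checks, which is the overlap discussed above, while the second is only caught by the log-rate check.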

TL;DR

If we choose one over the other, I propose that it be T327960. If we choose to do both, we should consider whether we want both to page (given the overlap, paging on both could be confusing and/or noisy).

Thoughts?

If the result of any errors in Kask is guaranteed to manifest as a 500 but not the other way around, I agree with monitoring only the status code.

akosiaris added a subscriber: akosiaris.

Adding serviceops, removing SRE to triage this towards the more specific SRE team.

If the result of any errors in Kask is guaranteed to manifest as a 500 but not the other way around, I agree with monitoring only the status code.

And this was completed in T327960; my 2¢ would be to close this issue (as wontfix?).

T327960 alerts on 500s, so this has already been implemented.