During the recent sessionstore outage, Kask began emitting a large number of error messages the moment the outage started, but this went unnoticed until other alerts revealed the extent of the outage. Kask is usually very terse as far as logging is concerned, so we should monitor and alert on high rates of error messages.
Event Timeline
This alerting would have been helpful for another recent incident of the same nature.
I opened T327960 as well; it covers alerting based on the rate of HTTP status 500 responses. It is (currently) the case that every status 500 will also emit an error log, so it would pick up on incidents like 2023-01-24_sessionstore_quorum_issues equally well. There are, however, cases where an error is logged that does not correspond to a user-facing event (a 500), and T327960 would not cover those. As you say, Kask's signal-to-noise ratio for error logging is quite good, so alerting on the rate of error messages might be generally useful.
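To make the overlap concrete, here is a minimal sketch of the two checks side by side. Everything in it is hypothetical: the `RateAlert` class, the `evaluate` helper, the `(timestamp, level, status)` event shape, and the threshold and window values are illustrative assumptions, not Kask's actual log format or our real alerting configuration. In practice this logic would most likely live in the monitoring stack rather than in code like this.

```python
from collections import deque


class RateAlert:
    """Fires when more than `threshold` events fall within `window_s` seconds."""

    def __init__(self, threshold, window_s):
        self.threshold = threshold
        self.window_s = window_s
        self._times = deque()

    def observe(self, ts):
        """Record an event at time `ts` and report whether the alert fires."""
        self._times.append(ts)
        cutoff = ts - self.window_s
        # Drop events that have aged out of the sliding window.
        while self._times and self._times[0] <= cutoff:
            self._times.popleft()
        return len(self._times) > self.threshold


def evaluate(events, threshold=3, window_s=60):
    """Return (log_alert_fired, http_500_alert_fired) for a stream of events.

    Each event is a hypothetical (timestamp, level, status) tuple, where
    status is None for errors that have no user-facing response.
    """
    log_alert = RateAlert(threshold, window_s)
    http_alert = RateAlert(threshold, window_s)
    log_fired = http_fired = False
    for ts, level, status in events:
        if level == "error":
            log_fired |= log_alert.observe(ts)
        if status == 500:
            http_fired |= http_alert.observe(ts)
    return log_fired, http_fired


# Errors that each surface as a 500: both checks fire (the noisy overlap).
print(evaluate([(t, "error", 500) for t in range(5)]))
# Errors with no user-facing 500: only the log-rate check fires.
print(evaluate([(t, "error", None) for t in range(5)]))
```

The first case is the double-alert situation; the second is the gap that a status-500 check alone would leave.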
TL;DR
If we choose one over the other, I propose that be T327960. If we choose to do both, we should consider whether we want them both to page (given the overlap, it seems to have the potential to be confusing and/or noisy).
Thoughts?
If the result of any errors in Kask is guaranteed to manifest as a 500 but not the other way around, I agree with monitoring only the status code.