As a developer and operator of WDQS, I would like to receive a warning-level alert when the ratio of failed queries increases relative to the baseline.
While this metric is not entirely part of our SLIs (the SLO considers 403 and 419 acceptable responses), it would be useful to be aware of changes in traffic patterns to proactively monitor the services.
Note: the current implementation of "failed queries" in Grafana includes both 4xx and 5xx responses. These are different failure scenarios, so we should consider reporting and alerting on them separately. In particular, we are interested in tracking timeouts.
AC
- we have a refined definition of "failed query", aligned with the SLO.
- we have defined a quantifiable baseline for the expected ratio of "failed queries".
- we have an alert (warning) for an increased ratio of failed queries (and/or timeouts).
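
To make the criteria above concrete, here is a rough sketch of how the refined classification and the baseline comparison could fit together. It is only an illustration under assumptions: the `Response` type, the `timed_out` flag, the category names, and the `tolerance` multiplier are all hypothetical and not taken from the existing Grafana dashboards or alerting setup. The only input taken from this task is that 403 and 419 count as acceptable.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple

# Statuses the SLO treats as acceptable, as stated in this task.
ACCEPTABLE_STATUSES = frozenset({403, 419})

# Categories reported separately (names are hypothetical).
FAILURE_CATEGORIES = ("client_error", "server_error", "timeout")


class Response(NamedTuple):
    """Hypothetical record of one query response."""
    status: int
    timed_out: bool = False  # assumption: timeouts are flagged by the source data


def classify(resp: Response) -> str:
    """Map a response to one category; timeouts take precedence over status."""
    if resp.timed_out:
        return "timeout"
    if resp.status in ACCEPTABLE_STATUSES:
        return "acceptable"
    if 400 <= resp.status < 500:
        return "client_error"
    if 500 <= resp.status < 600:
        return "server_error"
    return "ok"


def failure_ratios(responses: Iterable[Response]) -> Dict[str, float]:
    """Return the share of all responses that falls into each failure category."""
    counts = Counter(classify(r) for r in responses)
    total = sum(counts.values())
    if total == 0:
        return {category: 0.0 for category in FAILURE_CATEGORIES}
    return {category: counts[category] / total for category in FAILURE_CATEGORIES}


def should_warn(ratio: float, baseline: float, tolerance: float = 2.0) -> bool:
    """Warn when the observed ratio exceeds the baseline by the given factor.

    The multiplier is a placeholder; the real threshold depends on the
    baseline defined as part of this task.
    """
    return ratio > baseline * tolerance
```

Keeping the three failure categories separate means each one can get its own baseline and its own alert, which matches the note about 4xx, 5xx, and timeouts being different failure scenarios.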