In T390706: Create SLO dashboard for tone (peacock) check model, we defined an SLO for the tone check model, along with its corresponding dashboards, using Pyrra.
However, as noted in {T403729#11200918}, we cannot use this setup to report on the specific SLIs for a fixed calendar window.
In the context of the running A/B test for Tone Check, we have defined some leading indicators in T394463: [A/B Test] Report on Tone Check leading indicators, which include the two SLO metrics: service availability and latency. We would like to be able to report on these metrics by October 1st.
To overcome the above limitation, we would like to explore extracting the following metrics for a fixed calendar window by writing queries in Thanos on top of the already published Prometheus metrics:
- The percentage of all requests receiving a non-error response (non-5xx).
- The percentage of all successful requests (2xx) that complete within 1000 milliseconds (1 second), measured at the server side.
The above metrics should be calculated using Istio metrics, as the latency metrics reported by KServe do not take into account any additional overhead that might be added.
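As a starting point, the two SLIs above could be expressed as PromQL ratios over a single range covering the whole window. The sketch below only builds the query strings. It assumes the standard Istio metric names (`istio_requests_total` and the `istio_request_duration_milliseconds` histogram); the label selector, the `tone-check` service name, and the exact format of the `le="1000"` bucket label are assumptions that need to be checked against what is actually published in Thanos.

```python
def availability_query(selector: str, window: str) -> str:
    """Percentage of all requests with a non-5xx response over `window`."""
    good = (
        f'sum(increase(istio_requests_total'
        f'{{{selector}, response_code!~"5.."}}[{window}]))'
    )
    total = f'sum(increase(istio_requests_total{{{selector}}}[{window}]))'
    return f"100 * {good} / {total}"


def latency_query(selector: str, window: str, threshold_ms: str = "1000") -> str:
    """Percentage of 2xx requests completing within `threshold_ms` over `window`.

    Assumes a histogram bucket with le == threshold_ms exists; the label
    value may be formatted differently (e.g. "1000.0") in practice.
    """
    fast = (
        f'sum(increase(istio_request_duration_milliseconds_bucket'
        f'{{{selector}, response_code=~"2..", le="{threshold_ms}"}}[{window}]))'
    )
    total = (
        f'sum(increase(istio_request_duration_milliseconds_count'
        f'{{{selector}, response_code=~"2.."}}[{window}]))'
    )
    return f"100 * {fast} / {total}"


if __name__ == "__main__":
    # Hypothetical selector; the real label names and values may differ.
    sel = 'destination_service_name="tone-check"'
    print(availability_query(sel, "30d"))
    print(latency_query(sel, "30d"))
```

Depending on how the Istio metrics are scraped, the selector may also need to pin a single `reporter` value so that each request is counted once.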
The queries should be reusable so that we can report on specific time periods when needed. If we switch to another tool (T404171: Evaluate Sloth as a possible replacement for Pyrra) that can cover this need, then this will be an ad hoc request that we will not use further. Otherwise, we can evaluate the option of turning these queries into a Grafana dashboard that we can use for multiple services.
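One possible way to make the queries reusable across reporting periods is to keep the window as a placeholder and derive it from explicit start and end timestamps. The sketch below turns a query template into the parameters of a Prometheus-compatible instant query (`query` and `time`, as accepted by `/api/v1/query`): evaluating at the window end with a range equal to the window length pins the result to a fixed calendar window. The `$window` placeholder, the helper name, and the example dates are illustrative assumptions, not existing tooling.

```python
from datetime import datetime, timezone


def instant_query_params(template: str, start: datetime, end: datetime) -> dict:
    """Build instant-query parameters covering exactly [start, end].

    `template` is a PromQL string containing the hypothetical "$window"
    placeholder wherever the range duration should go.
    """
    seconds = int((end - start).total_seconds())
    if seconds <= 0:
        raise ValueError("end must be after start")
    return {
        # Range duration equal to the window length, in seconds.
        "query": template.replace("$window", f"{seconds}s"),
        # Evaluate at the end of the window (Unix timestamp).
        "time": str(int(end.timestamp())),
    }


if __name__ == "__main__":
    # Illustrative dates only; the actual reporting window is not fixed here.
    start = datetime(2025, 9, 1, tzinfo=timezone.utc)
    end = datetime(2025, 10, 1, tzinfo=timezone.utc)
    template = "sum(increase(istio_requests_total[$window]))"
    print(instant_query_params(template, start, end))
```

The same template could later back a Grafana panel by swapping the placeholder for the dashboard's own range variable.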