When we have the data, we should use it in our alerts.
Description
| Status | Assigned | Task |
|---|---|---|
| Open | None | T228380 Tech debt: sunsetting of Graphite |
| Open | None | T205870 Fully migrate producers off statsd |
| Open | None | T319329 Expand navigation timing metrics to include user experience metrics and modernise navigation timing |
| Resolved | Krinkle | T175087 Create a navtiming processor for Prometheus |
| Open | None | T321398 Move performance metrics from Graphite to Prometheus |
| Open | None | T325282 Update navtiming Grafana alerts to use metrics from Prometheus |
Event Timeline
I think I need your help here @Krinkle. I've been looking at some metrics and having a hard time knowing exactly how we should move on. Your trick with max_over_time doesn't work on histograms, I think?
Timings for the X percentile seem to vary too much: the gap is too large. I guess that's because the percentiles are estimates, so we are a bit far off compared to the metrics we use in Graphite.
Looking at the % of users that are faster than X looks better. Here's an example of the difference from one week back for users that have a first contentful paint faster than 1s. A negative percentage means we had more users within the limit last week than this week.
During that peak we had an increase in TTFB that affected FCP. Maybe we can smooth the data more to make it easier. Do you have any ideas/tricks we can use, @Krinkle?
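To illustrate why percentiles read from a histogram are only estimates, here is a minimal Python sketch of linear interpolation within a bucket, which is my understanding of what Prometheus's histogram_quantile does. The bucket bounds and counts are invented for illustration and are not the real navtiming buckets.

```python
import math

def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, ending with the +Inf bucket. Values are assumed to be spread
    evenly inside each bucket, so the result is only as precise as the
    bucket layout allows.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the open-ended bucket: the best we
                # can report is the highest finite bound.
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bound.
            share = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * share
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical FCP buckets in seconds (cumulative counts).
example_buckets = [(0.5, 200), (1.0, 600), (2.0, 900), (math.inf, 1000)]
p75 = estimate_quantile(0.75, example_buckets)
p95 = estimate_quantile(0.95, example_buckets)
```

With these made-up buckets, any real p75 between 1s and 2s is reported as a point on a straight line between the two bounds, and anything beyond the last finite bucket is clamped to 2s. The share of events under a fixed bucket bound, by contrast, is an exact count.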
I believe the basis for the above screenshot is the "Fast Contentful Paint under 1s" graph on https://grafana.wikimedia.org/d/pKbpxs54A/navigation-timing-in-prometheus.
I've been looking at some metrics and having a hard time knowing exactly how we should move on. Your trick with max_over_time doesn't work on histograms, I think?
I think you're right that max_over_time doesn't work for this. I don't think we need it in this case. We can do a diff by subtracting one metric from the other, like so:
sum(increase({le="0.1"}[24h])) / sum(increase({le="+Inf"}[24h])) # % of events under 1s in a 24h window
-
sum(increase({le="0.1"}[24h] offset 7d)) / sum(increase({le="+Inf"}[24h] offset 7d)) # previous week
To make good use of Prometheus's benefits, I've increased the percentage window to 24h, since this is safe to do in Prometheus (without it becoming a fuzzy unweighted average like in Graphite). I've also reduced the resolution to one data point per hour, as we don't need a per-minute data point for every relative 24h window. One per hour suffices.
It seems to swing by about ± 0.5% under normal circumstances. Not too bad. Could be tweaked further :)
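As a rough worked example of the arithmetic behind that query, the Python sketch below computes the share of events under a threshold for two 24h windows and the week-over-week difference. All counts and the alert threshold are hypothetical; only the ± 0.5% normal swing comes from the observation above.

```python
def share_under(under_count, total_count):
    """Share of events at or below the threshold bucket in one window."""
    return under_count / total_count

def week_over_week_diff(current, previous):
    """Difference in share between this week's and last week's window.

    Each argument is an (under_count, total_count) pair, standing in for the
    24h increase of the threshold bucket and of the +Inf bucket.
    A negative result means more users were within the limit last week.
    """
    return share_under(*current) - share_under(*previous)

# Hypothetical 24h counts: (events under the threshold, all events).
this_week = (59_000, 100_000)
last_week = (61_200, 100_000)

diff = week_over_week_diff(this_week, last_week)

# Assumed alert threshold, chosen well above the ~0.5% normal swing.
ALERT_THRESHOLD = 0.02
should_alert = diff < -ALERT_THRESHOLD
```

Here the share drops from 61.2% to 59.0%, a difference of about -2.2 percentage points, which would trip the assumed threshold. Because both windows are ratios of raw counts, widening the window to 24h keeps the result weighted by traffic rather than averaging per-minute percentages.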
I can try that tomorrow; it seems to work OK for all traffic. Switching to individual countries/continents other than the US, we get a higher difference (4-10%), but starting with the overall traffic is a good thing.
Added one alert for First Contentful Paint in https://grafana.wikimedia.org/d/000000326/navigation-timing-alerts and then a couple of graphs for FCP and LCP for all traffic plus India specifically. We can add so many alerts now that we need to decide which ones we should actually implement.

