
Update navtiming Grafana alerts to use metrics from Prometheus
Open, Needs Triage, Public

Description

When we have the data, we should use it in our alerts.

Event Timeline

I think I need your help here @Krinkle - I've been looking at some metrics and having a hard time knowing exactly how we should move on. Your trick with max_over_time doesn't work on histograms, I think?

Timings for the X percentile seem to vary too much: the gap is too large. I guess that's because percentiles computed from histograms are estimates, so we are a bit far off compared to the metrics we use in Graphite.
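To illustrate why a percentile read from a bucketed histogram can drift from the exact value, here is a minimal sketch of quantile estimation by linear interpolation inside a bucket. It is a simplification in the spirit of what Prometheus does for classic histograms, not its actual implementation; the bucket bounds and sample values are made up for the example.

```python
from bisect import bisect_left


def estimate_quantile(q, upper_bounds, cumulative_counts):
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets.

    upper_bounds: sorted finite bucket upper bounds (the `le` values), in seconds.
    cumulative_counts: observations <= each bound, plus a final entry for +Inf.
    """
    rank = q * cumulative_counts[-1]
    # First bucket whose cumulative count reaches the rank.
    i = bisect_left(cumulative_counts, rank)
    if i >= len(upper_bounds):
        # Rank falls in the +Inf bucket: fall back to the highest finite bound.
        return upper_bounds[-1]
    lower = upper_bounds[i - 1] if i > 0 else 0.0
    below = cumulative_counts[i - 1] if i > 0 else 0
    in_bucket = cumulative_counts[i] - below
    if in_bucket == 0:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper_bounds[i] - lower) * (rank - below) / in_bucket


# Hypothetical FCP samples in seconds: 40 fast page views, 60 just under 1s.
samples = [0.3] * 40 + [0.95] * 60
bounds = [0.5, 1.0, 2.0]  # assumed bucket layout
counts = [sum(s <= b for s in samples) for b in bounds] + [len(samples)]

q = 0.75
estimate = estimate_quantile(q, bounds, counts)
exact = sorted(samples)[int(q * len(samples)) - 1]
print(f"estimated p75 = {estimate:.3f}s, exact p75 = {exact:.3f}s")
```

With these made-up numbers the estimate lands around 0.79s while the exact p75 is 0.95s, because the histogram only knows that 60 samples fell somewhere between 0.5s and 1s. A count of users under a bucket boundary does not have this problem, since the boundary itself is exact.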

Looking at the % of users that are faster than X looks better. Here's an example of the difference compared with one week back, for users that have a first contentful paint faster than 1s. A negative percentage means we had more users within the limit last week than this week.

Screenshot 2023-03-17 at 16.16.43.png (2,648×1,186 px, 520 KB)

During that peak we had an increase in TTFB that affected FCP. Maybe we can smooth out that data more to make it easier. Do you have any ideas/tricks we can use, @Krinkle?

I believe the basis for the above screenshot is the "Fast Contentful Paint under 1s" graph on https://grafana.wikimedia.org/d/pKbpxs54A/navigation-timing-in-prometheus.

> I've been looking at some metrics and having a hard time knowing exactly how we should move on. Your trick with max_over_time doesn't work on histograms, I think?

I think you're right that max_over_time doesn't work for this. In this case I don't think we need it. We can do a diff by subtracting one metric from the other, like so:

sum(increase({le="0.1"}[24h])) / sum(increase({le="+Inf"}[24h])) # % of events under 1s in a 24h window
-
sum(increase({le="0.1"}[24h] offset 7d)) / sum(increase({le="+Inf"}[24h] offset 7d)) # previous week
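The arithmetic behind that expression can be sketched in a few lines of Python. The function names and the counts are made up for illustration; each pair stands for the 24h increase of the threshold bucket and of the `+Inf` bucket.

```python
def fraction_under(bucket_increase, total_increase):
    """Share of events at or below the bucket's upper bound in the window."""
    return bucket_increase / total_increase


def week_over_week_diff(this_week, last_week):
    """Each argument is (increase of threshold bucket, increase of +Inf bucket)."""
    return fraction_under(*this_week) - fraction_under(*last_week)


# Hypothetical 24h counts: 72% of events under the threshold now, 75% a week ago.
diff = week_over_week_diff((7_200, 10_000), (7_500, 10_000))
print(f"{diff:+.1%}")
```

A negative result means a larger share of users was within the limit last week than this week, which matches how the earlier screenshot is read.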

Screenshot 2023-04-07 at 00.26.20.png (2,718×1,212 px, 400 KB)

To make good use of Prometheus benefits, I've increased the percentage window to 24h since this is safe to do in Prometheus (without it becoming a fuzzy unweighted average like in Graphite). I've also decreased the interval to 1h as we don't need a per-minute data point here for every relative 24h window. One per hour suffices.
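A small sketch of the difference between the two ways of widening the window, as I read the point above: dividing summed counts weights every event equally, while averaging per-interval ratios weights every interval equally regardless of traffic. The hourly numbers are made up.

```python
# Hypothetical per-hour (events under threshold, total events).
hourly = [(900, 1000), (50, 100)]  # one busy hour, one quiet hour

# Ratio of summed counts: what dividing two increase() results over a long window gives.
ratio_of_sums = sum(under for under, _ in hourly) / sum(total for _, total in hourly)

# Unweighted mean of per-hour ratios: the quiet hour counts as much as the busy one.
mean_of_ratios = sum(under / total for under, total in hourly) / len(hourly)

print(f"ratio of sums = {ratio_of_sums:.3f}, mean of ratios = {mean_of_ratios:.3f}")
```

Here the quiet hour drags the unweighted mean down to 0.70 even though about 86% of all events were under the threshold.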

It seems to swing by about ± 0.5% under normal circumstances. Not too bad. Could be tweaked further :)

I can try that tomorrow; it seems to work OK for all traffic. Switching to other countries/continents (except the US) we get a higher difference (4-10%), but starting with the overall traffic is a good thing.

Added one alert for First Contentful Paint in https://grafana.wikimedia.org/d/000000326/navigation-timing-alerts and then a couple of graphs for FCP and LCP for all traffic + specific to India. We can add so many alerts now that we need to decide which ones we should actually implement.

Peter removed Peter as the assignee of this task. Aug 12 2024, 1:25 PM
fgiunchedi renamed this task from Update Grafana alerts to use metrics from Prometheus to Update navtiming Grafana alerts to use metrics from Prometheus. Jan 29 2025, 3:45 PM