
Update navtiming dashboards to use Prometheus metrics
Open, Medium, Public

Description

Let's convert the performance dashboards that use Graphite to use Prometheus instead.

Event Timeline

Today we have an issue affecting TTFB for users, and I just verified that we can see the same thing in our Prometheus metrics. Looking at TTFB p75 and comparing with one and two weeks back:

Screenshot 2023-03-14 at 09.05.40.png (862×2 px, 428 KB)

And also in the dashboard where we calculate how many users get a faster experience than X:

Screenshot 2023-03-14 at 09.06.55.png (716×2 px, 407 KB)

Screenshot 2023-03-14 at 09.06.45.png (1×2 px, 754 KB)

We can see here that across the board we don't reach the same level as one week ago.

Krinkle renamed this task from Update dashboards to use Prometheus metrics to Update navtiming dashboards to use Prometheus metrics. Mar 10 2024, 2:46 AM
Krinkle closed this task as Resolved.
Krinkle assigned this task to Peter.
Krinkle triaged this task as Medium priority.
Krinkle updated Other Assignee, added: Krinkle.
Krinkle added a project: NavigationTiming.
Krinkle subscribed.
Peter removed Peter as the assignee of this task. Mar 12 2025, 2:35 PM

I don't think all dashboards need to be migrated, and there's no team that is responsible for that data right now. I added T389321 for removing them. We will miss out on some data (it's not a 1:1 match with the data we have in Prometheus), but I guess that is OK for now.

For the Prometheus dashboards I think there's some work that needs to be done. Where we use "estimations" of p75 for metrics, we need to make sure our alerts use percentage of users. I think the ones we have today do, but that needs to be verified. The other thing is that the graphs using Prometheus are slow. Right now most graphs do not work if you go back in time more than 30 days; they just time out.
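To illustrate the difference between an estimated p75 and a percentage of users, here is a minimal Python sketch. It assumes the metric is stored as a cumulative histogram with fixed bucket bounds. The bucket bounds, counts, and function names are made up for illustration and are not taken from the navtiming data or code. The share of users at or below a bucket bound can be read straight from the counts, while p75 has to be interpolated inside a bucket (similar in spirit to what `histogram_quantile` does), so it is only an estimate.

```python
# Hypothetical cumulative histogram: (upper bound in ms, cumulative count).
# The numbers are invented for illustration, not real navtiming data.
BUCKETS = [(100, 50), (200, 200), (400, 600), (800, 900), (float("inf"), 1000)]


def share_at_or_below(buckets, threshold_ms):
    """Exact share of samples at or below a threshold that is a bucket bound."""
    total = buckets[-1][1]
    for upper, cumulative in buckets:
        if upper == threshold_ms:
            return cumulative / total
    raise ValueError("threshold must match a bucket bound")


def estimate_quantile(buckets, q):
    """Estimate a quantile by linear interpolation inside the matching bucket."""
    total = buckets[-1][1]
    rank = q * total
    prev_upper, prev_count = 0.0, 0
    for upper, cumulative in buckets:
        if cumulative >= rank:
            # In the open-ended bucket there is no upper bound to interpolate to.
            if upper == float("inf"):
                return prev_upper
            # Guard against an empty bucket to avoid dividing by zero.
            if cumulative == prev_count:
                return upper
            width = upper - prev_upper
            return prev_upper + width * (rank - prev_count) / (cumulative - prev_count)
        prev_upper, prev_count = upper, cumulative
    return prev_upper


if __name__ == "__main__":
    print(share_at_or_below(BUCKETS, 400))  # share of samples at or below 400 ms
    print(estimate_quantile(BUCKETS, 0.75))  # interpolated p75 in ms
```

With these made-up counts the share at or below 400 ms is exact, while the p75 value depends on the assumption that samples are spread evenly within the 400 to 800 ms bucket. That is one reason an alert on the share of users under a threshold can be more robust than an alert on the estimated percentile.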

Change #1133152 had a related patch set uploaded (by Tiziano Fogli; author: Tiziano Fogli):

[operations/alerts@master] perf/navtiming: migrate alerts from grafana to alertmanager

https://gerrit.wikimedia.org/r/1133152

Change #1133152 merged by jenkins-bot:

[operations/alerts@master] perf/navtiming: migrate alerts from grafana to alertmanager

https://gerrit.wikimedia.org/r/1133152

Change #1135326 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[operations/alerts@master] perf/navtiming: migrate alerts from grafana to alertmanager

https://gerrit.wikimedia.org/r/1135326

Change #1135393 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[operations/alerts@master] perf/navtiming: Add LoadEventEnd alert to alertmanager

https://gerrit.wikimedia.org/r/1135393

Change #1135408 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[operations/alerts@master] perf/navtiming: Add CPU long task alert to alertmanager

https://gerrit.wikimedia.org/r/1135408

Change #1135326 merged by jenkins-bot:

[operations/alerts@master] perf/navtiming: Add FCP alert to alertmanager

https://gerrit.wikimedia.org/r/1135326

Change #1135393 merged by jenkins-bot:

[operations/alerts@master] perf/navtiming: Add LoadEventEnd alert to alertmanager

https://gerrit.wikimedia.org/r/1135393

Change #1135408 merged by jenkins-bot:

[operations/alerts@master] perf/navtiming: Add CPU long task alert to alertmanager

https://gerrit.wikimedia.org/r/1135408