
Move performance metrics from Graphite to Prometheus
Open, Needs Triage · Public

Description

TODO: Start with naming metrics and labels.

Event Timeline

I think a good first step is:

  1. Add First Contentful Paint as a metric with the same labels we have in Graphite, so that we can build the same kind of dashboards and verify that we can create an alert on p75.
  2. Let Timo/Aaron verify that it looks OK, then move on to the navigation timing metrics.
  3. Start changing some of the main dashboards (keep multiple dashboards so we can verify the metrics).
  4. Go through the rest of the metrics and add them.
  5. Add new metrics like Largest Contentful Paint, CPU long tasks, etc.
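
To make the p75 alert in step 1 concrete: Prometheus histograms store cumulative bucket counts, and a percentile is estimated from them by interpolating inside the bucket that contains the target rank. Below is a minimal sketch of that estimate in plain Python, similar in spirit to PromQL's histogram_quantile. The bucket boundaries and counts are made up for illustration and are not the ones used by navtiming.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs, where the
    last bound is normally +Inf. Values are assumed to start at 0 and to be
    spread evenly inside each bucket (linear interpolation).
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if math.isinf(bound) or in_bucket == 0:
                # Cannot interpolate: fall back to the lower edge of the bucket.
                return prev_bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical FCP buckets in milliseconds (illustrative numbers only).
fcp_buckets = [(500, 20), (1000, 60), (2000, 90), (math.inf, 100)]
p75 = estimate_quantile(0.75, fcp_buckets)
```

With these sample numbers the 75th sample falls halfway into the 1000-2000 ms bucket, so the estimate is 1500 ms. The accuracy of an alert on p75 therefore depends on how fine the buckets are around the alert threshold.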

Change 848293 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/navtiming@master] Send First Contentful Paint (FCP) metrics to Prometheus.

https://gerrit.wikimedia.org/r/848293

Adding some feedback from yesterday's meeting:

  • Go through the countries we have today; maybe we can increase the number of countries?
  • Add a couple more labels:

I got the feeling people were worried about cardinality. I can make a doc later on where it's calculated.
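
As a starting point for that calculation, here is a rough sketch of an upper bound on the number of time series one histogram can produce: the product of the distinct values per label, times the series each label combination exposes (one per bucket, plus _sum and _count). The label names and value counts are assumptions for illustration, not the real navtiming configuration.

```python
from math import prod


def max_series(label_value_counts, bucket_count):
    """Upper bound on time series for a single Prometheus histogram.

    label_value_counts maps label name -> number of distinct values.
    bucket_count is the number of buckets including the +Inf bucket.
    """
    combinations = prod(label_value_counts.values())
    # Each combination exposes one series per bucket, plus _sum and _count.
    return combinations * (bucket_count + 2)


# Hypothetical label set: 2 platforms, authenticated or not, 20 countries.
labels = {"platform": 2, "authenticated": 2, "country": 20}
total = max_series(labels, bucket_count=10)
```

With these sample numbers that is 80 label combinations and 960 series for one metric. The real number is usually lower, since not every combination occurs in practice, but every extra label multiplies the bound.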

Did we talk about namespaces? I forgot to write that down. Maybe we could add a namespace label, keep a couple of namespaces as their own values, and group the rest as ... others? That way we can separate article views, which would be nice I think.
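
A minimal sketch of that namespace idea: keep a small allow-list of namespace IDs as their own label values and collapse everything else into "others", so the label stays bounded. The IDs are the standard MediaWiki ones (0 is the main/article namespace, 1 is Talk, -1 is Special), but which ones to keep and the label names are assumptions here.

```python
# Hypothetical selection of namespaces that get their own label value.
KEPT_NAMESPACES = {
    0: "main",      # article views
    1: "talk",
    -1: "special",
}


def namespace_label(namespace_id):
    """Map a MediaWiki namespace ID to a bounded Prometheus label value."""
    # Unknown or missing IDs fall into a single catch-all value.
    return KEPT_NAMESPACES.get(namespace_id, "others")


labels_seen = [namespace_label(ns) for ns in (0, 1, 14, None)]
```

With this mapping the namespace label can take at most four values no matter how many namespaces a wiki defines.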

Change 849043 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/navtiming@master] Make action and namespace available to metrics.

https://gerrit.wikimedia.org/r/849043

Change 848293 merged by jenkins-bot:

[performance/navtiming@master] Send First Contentful Paint (FCP) metrics to Prometheus

https://gerrit.wikimedia.org/r/848293

Change 860078 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[schemas/event/secondary@master] painttiming: Add missing action and namespace.

https://gerrit.wikimedia.org/r/860078

Change 860127 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[mediawiki/extensions/NavigationTiming@master] Collect skin, namespaceId and action for paint timings.

https://gerrit.wikimedia.org/r/860127

Change 860078 merged by jenkins-bot:

[schemas/event/secondary@master] painttiming: Add missing action and namespace

https://gerrit.wikimedia.org/r/860078

Change 860127 merged by jenkins-bot:

[mediawiki/extensions/NavigationTiming@master] Collect skin, namespaceId and action for paint timings.

https://gerrit.wikimedia.org/r/860127

Change 867573 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/navtiming@master] Add navigation timing metrics to Prometheus.

https://gerrit.wikimedia.org/r/867573

Change 849043 merged by jenkins-bot:

[performance/navtiming@master] Add action, namespace, group, skin to fcp, lcp and cls.

https://gerrit.wikimedia.org/r/849043

Change 868658 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/navtiming@master] Remove deprecated first contentful paint.

https://gerrit.wikimedia.org/r/868658

Change 868658 merged by jenkins-bot:

[performance/navtiming@master] Remove deprecated first contentful paint.

https://gerrit.wikimedia.org/r/868658

Change 867573 merged by jenkins-bot:

[performance/navtiming@master] Add navigation timing metrics to Prometheus

https://gerrit.wikimedia.org/r/867573

Change 870079 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/navtiming@master] Add TTFB and loadEventEnd to Prometheus.

https://gerrit.wikimedia.org/r/870079

Change 870079 merged by jenkins-bot:

[performance/navtiming@master] Add TTFB and loadEventEnd to Prometheus.

https://gerrit.wikimedia.org/r/870079

Change 875462 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/navtiming@master] Disable loadEventEnd and ttfb.

https://gerrit.wikimedia.org/r/875462

Change 875462 merged by jenkins-bot:

[performance/navtiming@master] Disable loadEventEnd and ttfb.

https://gerrit.wikimedia.org/r/875462

Change 875887 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[operations/puppet@production] prometheus: recording rules for webperf metrics.

https://gerrit.wikimedia.org/r/875887

Change 875887 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: recording rules for CPU benchmark.

https://gerrit.wikimedia.org/r/875887

So I want to go back to the use cases for the data. I see at least two:

  1. We want to mimic the metrics that we use for alerts in Graphite. Today we alert on the main buckets for all traffic. We need to split them by desktop/mobile and by authenticated or not. I've also used those metrics split by browser (type?) to see if a metric is browser dependent.
  2. Moving to Prometheus, it would be great to also see metrics by country, so we can better see performance at the country level. That would help us show how we are doing and compare our performance in the US with other countries.

Are there other use cases?

Change 881411 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[operations/puppet@production] prometheus: recording rules for CPU benchmark without labels

https://gerrit.wikimedia.org/r/881411

Change 881411 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: recording rules for CPU benchmark without labels

https://gerrit.wikimedia.org/r/881411

Change 881632 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[operations/puppet@production] prometheus: remove recording rule for CPU benchmark.

https://gerrit.wikimedia.org/r/881632

Change 881636 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/navtiming@master] Enable domInteractive and onLoad.

https://gerrit.wikimedia.org/r/881636

Change 881636 merged by jenkins-bot:

[performance/navtiming@master] Enable domInteractive and onLoad for Prometheus.

https://gerrit.wikimedia.org/r/881636

Change 883839 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/navtiming@master] Enable mediaWikiLoadEnd, tcp, dns and redirect metrics Prometheus.

https://gerrit.wikimedia.org/r/883839

Change 884293 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/navtiming@master] Remove user agent and battery and add platform for CPU benchmark.

https://gerrit.wikimedia.org/r/884293

Change 884293 merged by jenkins-bot:

[performance/navtiming@master] Remove user agent and battery and add platform for CPU benchmark.

https://gerrit.wikimedia.org/r/884293

Change 881632 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: remove recording rule for CPU benchmark.

https://gerrit.wikimedia.org/r/881632

Change 883839 merged by jenkins-bot:

[performance/navtiming@master] Enable mediaWikiLoadEnd, tcp, dns and redirect metrics Prometheus.

https://gerrit.wikimedia.org/r/883839

Change 886060 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/navtiming@master] Add Prometheus counters for Network type and effective type.

https://gerrit.wikimedia.org/r/886060

Change 886802 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/navtiming@master] Split navigation timing errors into below zero and wrong order.

https://gerrit.wikimedia.org/r/886802

Change 886802 merged by jenkins-bot:

[performance/navtiming@master] Split navigation timing errors into below zero and wrong order.

https://gerrit.wikimedia.org/r/886802

Change 886060 merged by jenkins-bot:

[performance/navtiming@master] Add Prometheus counters for Network type and effective type.

https://gerrit.wikimedia.org/r/886060