
Create a navtiming processor for Prometheus
Closed, ResolvedPublic

Description

We currently aggregate our Navigation Timing data from real-time EventLogging beacons in two ways:

  • Via webperf/navtiming: EventLogging subscriber on our server writing to Statsd/Graphite (also subject to whisper aggregation).
  • Via Coal: EventLogging subscriber directly on the Graphite server writing directly to disk as a custom backend that is not aggregated by Statsd, and not aggregated by Whisper.

Whereas webperf/navtiming also generates per-minute percentiles (via Statsd), the Coal logger produces only medians.

Statsd/Graphite has lots of features and is pretty scalable, but at the cost of lossy aggregation.

Coal, on the other hand, performs essentially no aggregation beyond its own 5-minute moving median.

Coal was created by @ori in 2015 specifically for Navigation Timing. Since then, SRE has deployed Prometheus (https://prometheus.io/), which (unlike Graphite) has support for storing time series data without aggregation and reliable percentiles.

Benefits:

  1. Simplify our software stack (by not having coal and coal-web hosted on the Graphite machine, per T158837).
  2. Open up exciting features in Grafana that are only available to non-aggregated backends, such as Histogram, Heatmap and more.

Related Objects

Event Timeline

Krinkle renamed this task from Consider replacing Coal with use of Prometheus to Create a navtiming processor for Prometheus.May 10 2018, 12:48 PM

We've refactored coal away from ZMQ to Kafka, and made it runnable separately from Graphite. The process has also been migrated from the graphite hosts to the webperf hosts. Details about that at T159354.

This task (about Coal/Prometheus) was meant to be an alternative to T159354, but we've decided to keep Coal around for now, and instead make the Prometheus processor for Navigation Timing its own thing so that we can run them side-by-side.

I've updated this task to be about setting up that new processor (rather than converting Coal into it). This also means it is no longer part of the consolidation effort, but rather its own new thing that we'll host where it makes sense (probably webperf#1).

Change 534771 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[performance/navtiming@master] Expose performance survey as Prometheus metrics

https://gerrit.wikimedia.org/r/534771

Moving to Inbox for re-triage. @fgiunchedi asked about navtiming prometheus integration.

Krinkle lowered the priority of this task from High to Low.

Next step: Figure out how we can stage and test this on Beta Cluster (and Labs Grafana).

Change 534771 merged by jenkins-bot:
[performance/navtiming@master] Expose handlers counters as Prometheus metrics

https://gerrit.wikimedia.org/r/534771

Change 572141 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[operations/puppet@production] Scrape webperf Prometheus metrics

https://gerrit.wikimedia.org/r/572141

Change 572141 merged by Filippo Giunchedi:
[operations/puppet@production] Scrape webperf Prometheus metrics

https://gerrit.wikimedia.org/r/572141

We're on: webperf metrics are being collected in Prometheus now! Thanks to everyone involved (@Gilles @dpifke @Krinkle). There's of course follow-up work to do, but at least now we should be able to compare metrics with Coal.

Change 682886 had a related patch set uploaded (by Gilles; author: Gilles):

[performance/navtiming@master] Collect Paint Timing with Prometheus

https://gerrit.wikimedia.org/r/682886

Change 682886 merged by jenkins-bot:

[performance/navtiming@master] Collect Paint Timing with Prometheus

https://gerrit.wikimedia.org/r/682886

Change 683484 had a related patch set uploaded (by Gilles; author: Gilles):

[performance/navtiming@master] Remove “site” field from FID and PaintTiming

https://gerrit.wikimedia.org/r/683484

Change 683484 merged by jenkins-bot:

[performance/navtiming@master] Remove “site” field from FID and PaintTiming

https://gerrit.wikimedia.org/r/683484

During the graphite failover today (T247963) I noticed navtiming is still sending statsd and didn't refresh its DNS (hence I had to restart navtiming). Do we still need the statsd sending at this stage, or can it be deprecated?

New metrics are going solely to Prometheus, and some older ones are being sent to Prometheus in addition to statsd, but a lot of our dashboards still depend on statsd/Graphite.

I don't think we have a particular timeframe for statsd to go away completely, as other services such as WebPageTest still use it exclusively.

> New metrics are going solely to Prometheus, and some older ones are being sent to Prometheus in addition to statsd, but a lot of our dashboards still depend on statsd/Graphite.

Good to know there's dual statsd/prometheus for every metric! Not urgent, but statsd/graphite (hosted by us, see below) will go away eventually; the biggest blocker at the moment is MediaWiki.

> I don't think we have a particular timeframe for statsd to go away completely, as other services such as WebPageTest still use it exclusively.

That's ok, WPT graphite isn't hosted by us (and it is a separate datasource in grafana).

Aklapper added a subscriber: dpifke.

Removing task assignee due to inactivity, as this open task has been assigned for more than two years. See the email sent to the task assignee on February 06th 2022 (and T295729).

Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome.

If this task has been resolved in the meantime, or should not be worked on ("declined"), please update its task status via "Add Action… 🡒 Change Status".

Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator.

I'll start on this when I finish the BitBar setup. My first step will be to go through the things we have today, and then sync on which labels we should use before I move on with the implementation.

(Unless I'm mistaken, there are pre-processing steps outside of Prometheus, right? If not, please ignore ^)

@Ottomata yes the pre-processing steps are done in https://gerrit.wikimedia.org/r/plugins/gitiles/performance/navtiming/+/refs/heads/master/navtiming/__init__.py

I was thinking, as a first step, to continue to use that and just move the data from Graphite to Prometheus; that way we can get rid of Graphite once we've verified that everything is the way we want it in Prometheus.

+1, easier than writing new code!

I guess that would just mean reforming the output metrics and running an HTTP Prometheus endpoint in navtiming to serve the metrics? Or will you push to the Prometheus Pushgateway?

BTW, we've talked in the past about standardizing the way we represent metrics and dimensions (AKA labels in prometheus) in Event Platform, but never totally settled because we haven't used this. Maybe this would be a schema convention with some custom annotations on fields in the jsonschema, e.g.

browser_family:
  type: string

first_paint:
  type: number    

annotations:
  metrics: [first_paint]
  dimensions: [browser_family]

Or, perhaps it would just be ingestion job specific configuration, where you specify the stream name and which fields are dimensions(labels) and metric(s) and metric name(s). However we get this information, we could then automate the ingestion from a stream in Kafka into Prometheus (and also Druid, and other systems with a similar data model).
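A sketch of what such per-stream ingestion configuration could look like in Python. All names here (the stream name, config keys, and field names) are invented for illustration; this is not an existing Event Platform convention:

```python
# Hypothetical per-stream config declaring which event fields become
# Prometheus labels (dimensions) and which become metric observations.
STREAM_CONFIG = {
    "navigation-timing": {
        "dimensions": ["browser_family", "country"],
        "metrics": ["first_paint", "response_start"],
    },
}

def event_to_observations(stream, event):
    """Turn one JSON event into (metric_name, labels, value) tuples,
    driven entirely by the stream's declared config."""
    cfg = STREAM_CONFIG[stream]
    labels = {d: event[d] for d in cfg["dimensions"]}
    # Only emit metrics actually present in this event.
    return [(m, labels, event[m]) for m in cfg["metrics"] if m in event]

obs = event_to_observations(
    "navigation-timing",
    {"browser_family": "Firefox", "country": "JP", "first_paint": 812},
)
```

With this shape, the same generic job could feed Prometheus, Druid, or any other system with a metrics-plus-dimensions data model, as suggested above.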

I'd love to replace statsv with a prometheus version that does this. This is also very relevant to Metrics Platform work.

The navtiming processor (really, should be called "webperf" processor, since it processes all webperf events not just navtiming) already runs a Prometheus endpoint.

The prometheus endpoint on the webperf server currently exposes service health metrics, and we converted the perfsurvey metrics already. This task is to convert the navtiming metrics as well. So in addition to pushing a message to statsd/graphite, it would also expose them via the Prometheus endpoint that webperf's navtiming.py already serves.
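As a rough illustration of what "exposing via the Prometheus endpoint" amounts to: a /metrics handler serving counters in the Prometheus text exposition format. The real processor uses the prometheus_client library; the hand-rolled version and metric names below are a simplified stand-in:

```python
from collections import Counter
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory metric state: (metric name, sorted label pairs) -> count.
counts = Counter()

def observe(metric, labels):
    key = (metric, tuple(sorted(labels.items())))
    counts[key] += 1

def render():
    """Render all counters in the Prometheus text exposition format."""
    lines = []
    for (metric, labels), value in sorted(counts.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{metric}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = render().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

# To actually serve scrapes (port is an arbitrary example):
# HTTPServer(("", 9230), MetricsHandler).serve_forever()
```

Prometheus then scrapes this endpoint on its own schedule, which is the pull model referred to above (as opposed to pushing to a Pushgateway).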

The reason this task is more elaborate than the other metrics is that navtiming is a very important dataset to us, and it has a lot of fields. We'd like to take this opportunity to rethink the fragmentation we have, what we want long-term, and what the cardinality cost of our ideal fragmentation would be. This way we have continuity over the data, instead of breaking it again a few months after the initial conversion. We'll keep populating the Graphite data alongside it in the current fragmentation for at least a year, so that we have continuity there as well for reporting purposes.

Right now we have in Grafana (Graphite-backed; search for "navig"):

  • Navigation Timing (global, all public wikis, all users/skins/platforms/browsers/countries)
  • Navigation Timing by platform (logged-in or not + mobile/canonical domain)
  • Navigation Timing by browser (selected 10 with sufficient per-minute samples).
  • Navigation Timing by continent
  • Navigation Timing by country (selected 10 countries with sufficient per-minute samples)

Navigation Timing does not currently use Statsv.

> Navigation Timing does not currently use Statsv.

Sorry, ya, should have said "I'd ALSO like to replace statsv"

> already runs a Prometheus endpoint.

Oh, ok cool!

> The reason this task is more elaborate than the other metrics is that navtiming is a very important dataset to us, and it has a lot of fields. We'd like to take this opportunity to rethink the fragmentation we have, what we want long-term, and what the cardinality cost of our ideal fragmentation would be.

Q: If you are going to be processing this data into a new form with different dimensions anyway, would it be worth re-instrumenting navigation timing, with a new schema in which we can standardize metrics and dimension conventions? Or perhaps that's just too much work and NavigationTiming is good as it is. Just curious.

> Q: If you are going to be processing this data into a new form with different dimensions anyway, would it be worth re-instrumenting navigation timing, with a new schema […]

This task is concerned about the backend processing only, but we'll likely do right before or right after a cleanup of the frontend as well. This is tracked as T295684#7865481, in which we're looking at consolidating the various EL schemas into one or two, remove some of them, and indeed might as well adopt the latest EventLogging JS method while at it. Afaik we already use the schema repo instead of legacy Meta-Wiki schemas, but we do still use the eventLog.logEvent() method instead of eventLog.submit().

Krinkle raised the priority of this task from Low to High.Nov 10 2022, 7:39 AM

My initial theoretical estimate for navtiming on Prometheus, using the exact same fragments we have in Graphite today but combined into a single dataset (allowing any slice of the data to be aggregated in a statistically accurate and useful way), comes to a cardinality of ~45M time series for our ~18 metrics.

Metrics (the exact list may evolve over time; this suffices for cardinality estimation):

navtiming_responseStart
navtiming_domInteractive
navtiming_domComplete
navtiming_loadEventStart
navtiming_loadEventEnd
navtiming_delta_unload
navtiming_delta_dns
navtiming_delta_redirect
navtiming_delta_tcp
navtiming_delta_ssl
navtiming_delta_request
navtiming_delta_response
navtiming_delta_processing
navtiming_delta_onLoad
navtiming_delta_gaps
painttiming_firstPaint
painttiming_firstContentfulPaint
mw_mediaWikiLoadEnd

Fragments in Graphite:

  • 1 navtiming: overall (global)
  • 4 navtiming by Platform: 2x auth status (anon, loggedin), 2x platform (desktop, mobile)
  • 15 navtiming by Browser: 11x browser family (top 10 + other)
  • 6 navtiming by Continent: 6x
  • 11 navtiming by Country: 11x (top 10 + other)
  • 1 savetiming: overall (global)
  • 3 savetiming by wiki group: 3x train deployment groups (group0-2)
  • navtiming_oversample.*.*: (all of the above)

Effectively 40 time series x 2 for oversampling = 80, not counting the ~10 Extended properties that Statsd adds in its Graphite output as this is internal and we'll have something similar in Prometheus for buckets.

Proposed tags in Prometheus:

Where in Graphite it was conventional (and necessary) to manually split the metric for each desired tag, this has the drawback of only allowing data to be analysed by one fragment at a time. E.g. we could plot latencies for logged-in devices, or Firefox clients, or clients in Japan, or pageviews from a certain MW deployment version, but not any combination thereof, such as "logged-in Vector clients" or "Firefox on the next MW version".

We could have overcome this limitation by emulating the Prometheus model within Graphite, combining (and thus multiplying) the fragments. However, this wouldn't have yielded a useful result in practice, as we operate with a sample rate that for most of the combinations would produce zero samples in any given minute. And with Statsd's per-minute percentiles as saved in Graphite, aggregation isn't meaningful here (blogpost). With the histogram buckets that Prometheus encourages by default, data can be safely aggregated in a statistically sound and meaningful way (e.g. simply widen the range from 1 minute to an hour or a day, with no loss in accuracy).
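A toy demonstration of that point, with made-up latency samples and bucket boundaries: summing per-minute bucket counts yields exactly the counts for the combined window, whereas averaging per-minute medians does not recover the true median.

```python
import statistics

minute1 = [120, 150, 900]            # latencies in ms, invented samples
minute2 = [100, 110, 115, 130]
buckets = [100, 200, 500, 1000]      # upper bounds, invented for illustration

def to_buckets(samples):
    """Count samples into the first bucket whose upper bound they fit."""
    counts = [0] * len(buckets)
    for s in samples:
        for i, ub in enumerate(buckets):
            if s <= ub:
                counts[i] += 1
                break
    return counts

# Histogram aggregation is exact: summing per-window bucket counts
# equals bucketing the combined window directly.
summed = [a + b for a, b in zip(to_buckets(minute1), to_buckets(minute2))]
combined = to_buckets(minute1 + minute2)

# By contrast, averaging the two per-minute medians is NOT the
# median of the combined window.
avg_of_medians = (statistics.median(minute1) + statistics.median(minute2)) / 2
true_median = statistics.median(minute1 + minute2)
```

Here `summed == combined` holds exactly, while `avg_of_medians` (131.25) differs from the true combined median (120); this is the "aggregating percentiles" pitfall the blog post above refers to.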

  • 2 mw_auth: same as before (anon, logged-in)
  • 5 mw_skin: replacement for the old "platform" field (mobile/desktop). Skin is more accurate because the skin preference can be overridden by individual accounts, and it would allow us to compare RUM data for alternate and upcoming skins such as Vector 22 vs Legacy Vector and Timeless.
  • 15 browser_family: idem
  • 11 country: idem. I've not added "continent" separately here as it does not add cardinality given 1:1 between country and continent.
  • 3 mw_deployment_group: idem
  • 2 is_oversample: idem

Or 9900 per metric when multiplied, and *18 metrics: ~178,200.
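The multiplication can be double-checked directly (label value counts taken from the list above):

```python
# Per-label value counts from the proposed Prometheus tags above.
label_values = {
    "mw_auth": 2,
    "mw_skin": 5,
    "browser_family": 15,
    "country": 11,
    "mw_deployment_group": 3,
    "is_oversample": 2,
}

# Worst-case cardinality per metric is the product of label value counts.
per_metric = 1
for n in label_values.values():
    per_metric *= n

total = per_metric * 18  # ~18 metrics listed above
print(per_metric, total)
```

This is the worst case, assuming every label combination actually occurs; in practice some combinations never receive samples, so the realised series count would be lower.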

Change 859566 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: move webperf jobs to 'ext' instance

https://gerrit.wikimedia.org/r/859566

Change 859566 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: move webperf jobs to 'ext' instance

https://gerrit.wikimedia.org/r/859566

Following up from a chat between @fgiunchedi @Krinkle and @Peter, I have moved the webperf metrics scraping to the ext Prometheus instance and confirmed metrics are being scraped again. In practice this should be entirely transparent (especially because there isn't much usage of the prometheus webperf metrics in grafana yet).

Peter removed Peter as the assignee of this task.May 22 2023, 6:33 PM