
Create a navtiming processor for Prometheus
Open, Low, Public

Description

We currently aggregate our Navigation Timing data from real-time EventLogging beacons in two ways:

  • Via webperf/navtiming: EventLogging subscriber on our server writing to Statsd/Graphite (and thus subject to Whisper aggregation).
  • Via Coal: EventLogging subscriber running directly on the Graphite server, writing to disk as a custom backend that is aggregated neither by Statsd nor by Whisper.

Whereas webperf/navtiming generates per-minute percentiles (via Statsd), the Coal logger only produces medians.

Statsd/Graphite has lots of features and is pretty scalable, but achieves this at the cost of lossy aggregation.

Coal, on the other hand, performs essentially no aggregation other than its own 5-minute moving median; no further aggregation occurs after that.

Coal was created by @ori in 2015 specifically for Navigation Timing. Since then, SRE has deployed Prometheus (https://prometheus.io/), which (unlike Graphite) supports storing time series data without aggregation and computing reliable percentiles.

Benefits:

  1. Simplify our software stack (by not having coal and coal-web hosted on the Graphite machine, per T158837).
  2. Open up exciting features in Grafana that are only available to non-aggregated backends, such as Histogram, Heatmap and more.

Event Timeline

Krinkle created this task. Sep 5 2017, 10:25 PM

Thanks @Krinkle for starting this!

Implementation-wise, it should be easy to add the Prometheus Python client to the existing daemons and open up ports from which metrics will be scraped by Prometheus, while keeping the existing statsd/coal metrics during the transition.
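As a rough sketch of what that could look like with the Python client (the metric name and port here are hypothetical placeholders, not a decided design):

```
from prometheus_client import Counter, start_http_server

# Hypothetical counter, exposed alongside the existing statsd/coal output.
EVENTS_PROCESSED = Counter(
    'navtiming_events_total',
    'NavigationTiming events processed',
)

def handle_event(event):
    EVENTS_PROCESSED.inc()
    # ... existing statsd/coal handling stays unchanged during the transition ...

if __name__ == '__main__':
    # Expose /metrics on a port that Prometheus can scrape.
    start_http_server(9230)
    # ... existing EventLogging consumer loop ...
```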

Retention-wise, we're running several Prometheus instances with different retention policies. The global instance aggregates data from all other instances and currently has one year of retention, though I believe we can reasonably extend that even further, since it is essentially bound by how much disk space and memory is available.

Peter added a subscriber: Peter. Sep 6 2017, 3:43 PM
Peter awarded a token. Sep 27 2017, 7:50 PM

@fgiunchedi I'm looking into Prometheus and have got a couple of questions.

The end result I'd like to aim for is to have our ~10 metrics from Navigation Timing available, through Prometheus, in Grafana with the following capabilities:

  1. Produce a single value representing the median, p75, or p95 over a given period of time (e.g. p95 of last month).
  2. Produce a series of values representing the median, p75, or p95 per given interval (e.g. per 15min, per hour, or per day).
  3. Produce a histogram or heatmap that considers all original values.
  4. The previous 3 use cases, but filtered by labels with limited pre-defined values such as browser family and country code.

We currently attempt the first two use cases via Graphite, but this falls short because the raw data is lost, so we end up averaging per-minute percentiles (which is horrible). Even if we changed the interval for NavTiming in Graphite from per-minute to per-second, that would make the data not statistically meaningful. We could improve the current situation by manually aggregating into hand-crafted buckets, but it seems Prometheus would be a better option.

However, I'm somewhat unclear on how we should use Prometheus.

As a general guideline, try to keep the cardinality of your metrics below 10, and for metrics that exceed that, aim to limit them to a handful across your whole system. The vast majority of your metrics should have no labels.

Our metrics currently have five labels:

  • platform (2 values: mobile, desktop)
  • auth (2 values: anonymous, authenticated)
  • browser (one of < 12 handpicked browser names, or "Other")
  • continent (one of 6 values, or null)
  • country (one of 8 values, or null)

Would that be fine?
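For concreteness, a minimal sketch of how one such metric with those five labels could be declared using the Python client (the metric name and observed value are hypothetical):

```
from prometheus_client import Histogram

LOAD_EVENT_END = Histogram(
    'navtiming_loadeventend_seconds',
    'Time to loadEventEnd, in seconds',
    ['platform', 'auth', 'browser', 'continent', 'country'],
)

# Every combination of label values becomes its own time series.
LOAD_EVENT_END.labels(
    platform='desktop', auth='anonymous', browser='Chrome',
    continent='Europe', country='DE',
).observe(1.23)
```

With the value counts listed above, that is at most on the order of 2 × 2 × 13 × 7 × 9 ≈ 3,300 label combinations per metric (and each combination is multiplied again by the bucket count if a Histogram is used).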

Metric type

It sounds like Summaries are similar to what we do now – calculate the percentiles per-minute in StatsD (before sending to Graphite), at which point each percentile becomes a single aggregated value, not appropriate for further aggregation.

Histograms seem attractive given the query flexibility after the fact. Although the "not aggregatable" warning on that page only applies to Summaries, it sounds to me like it should apply to Histograms as well, given that a Histogram will not accurately tell you the median over 3 days in a way that considers all original values. It can accurately say in which bucket a value falls (which is an improvement over our current Statsd/Graphite approach), but its inability to produce an actual number makes it hard to use for our use case, which is to catch small regressions in individual features and products, not just whether we're still meeting an SLA or not.

Given Prometheus' approach of not losing original input data, I'm leaning towards using a Gauge. Assuming the "we don't aggregate" principle applies there, that would mean all original data points are preserved and could be queried/aggregated upon request. Although naturally I would imagine that cannot be very good for query performance. Right?

However, I'm somewhat unclear on how we should use Prometheus.

As a general guideline, try to keep the cardinality of your metrics below 10, and for metrics that exceed that, aim to limit them to a handful across your whole system. The vast majority of your metrics should have no labels.

Our metrics currently have five labels:

  • platform (2 values: mobile, desktop)
  • auth (2 values: anonymous, authenticated)
  • browser (one of < 12 handpicked browser names, or "Other")
  • continent (one of 6 values, or null)
  • country (one of 8 values, or null)

Would that be fine?

Yes, that looks like it would be fine. What metrics are you tracking? Do they all share the labels listed above?

I'm asking because in some cases it might even make sense to move one of the low-cardinality labels into the metric name itself, e.g. if there are metrics that make sense only for anonymous users or only for authenticated users.
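A small illustration of that suggestion, with hypothetical metric names:

```
from prometheus_client import Histogram

# Instead of one metric with an `auth` label, a low-cardinality label can be
# folded into the metric name when the populations are usually queried apart.
ANON_LOAD = Histogram(
    'navtiming_anon_loadeventend_seconds',
    'Time to loadEventEnd for anonymous users, in seconds',
    ['platform', 'browser', 'continent', 'country'],
)
AUTH_LOAD = Histogram(
    'navtiming_auth_loadeventend_seconds',
    'Time to loadEventEnd for authenticated users, in seconds',
    ['platform', 'browser', 'continent', 'country'],
)
```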

Metric type

It sounds like Summaries are similar to what we do now – calculate the percentiles per-minute in StatsD (before sending to Graphite), at which point each percentile becomes a single aggregated value, not appropriate for further aggregation.

Correct, Summaries are like what we're currently doing with statsd.

Histograms seem attractive given the query flexibility after the fact. Although the "not aggregatable" warning on that page only applies to Summaries, it sounds to me like it should apply to Histograms as well, given that a Histogram will not accurately tell you the median over 3 days in a way that considers all original values. It can accurately say in which bucket a value falls (which is an improvement over our current Statsd/Graphite approach), but its inability to produce an actual number makes it hard to use for our use case, which is to catch small regressions in individual features and products, not just whether we're still meeting an SLA or not.

I believe "not aggregatable" in general refers to whether the aggregated results are still statistically meaningful. Note that aggregation here means the same metric being pulled from more than one "target" (as Prometheus calls it: a machine, a service, etc.) and then further processed by aggregation functions in Prometheus. In the navtiming case there is currently only one target from which the metrics are pulled, namely the machine into which all measurements are funneled. Meaning you could also use Summaries, but that would break down in the case of multiple machines processing navtiming data.

WRT catching small regressions, you can "zoom in" the histogram buckets close to the latency target you're interested in and gain additional precision that way. In particular I'm referring to something like the example case discussed here: https://prometheus.io/docs/practices/histograms/#errors-of-quantile-estimation.
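For example, a sketch with explicit bucket boundaries clustered around a hypothetical 2-second target (both the metric name and the boundaries are illustrative only):

```
from prometheus_client import Histogram

RESPONSE_START = Histogram(
    'navtiming_responsestart_seconds',
    'Time to responseStart, in seconds',
    buckets=[0.25, 0.5, 1.0, 1.5, 1.75, 1.9, 2.0, 2.1, 2.25, 2.5, 3.0, 5.0, 10.0],
)

# Quantile estimates from histogram_quantile() are interpolated within a
# bucket, so narrower buckets near the target keep the error there small.
RESPONSE_START.observe(1.87)
```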

Given Prometheus' approach of not losing original input data, I'm leaning towards using a Gauge. Assuming the "we don't aggregate" principle applies there, that would mean all original data points are preserved and could be queried/aggregated upon request. Although naturally I would imagine that cannot be very good for query performance. Right?

That's correct; in general Prometheus isn't designed to store all observations (e.g. like a billing system) and query for exact results. Unfortunately the terminology gets confusing fast here (in the navtiming case the "input data" are also "datapoints", etc.), but it isn't feasible to store each individual navtiming observation sent by clients as a distinct datapoint in Prometheus. Derived metrics computed from all of the input data are of course fine to store, as discussed above.

You could do both though, i.e. have aggregated/derived metrics from the input data as Prometheus metrics and store all received observations e.g. in hdfs to be queried via hive when exact results are needed.


As another example of a possible pipeline involving storing and querying all input data, we're looking at storing netflow data in kafka and then into druid/hdfs for querying (T181036).

Change 427664 had a related patch set uploaded (by Imarlier; owner: Imarlier):
[performance/coal@master] coal: convert to using graphite instead of writing to whisper

https://gerrit.wikimedia.org/r/427664

Change 427664 merged by jenkins-bot:
[performance/coal@master] coal: convert to using graphite instead of writing to whisper

https://gerrit.wikimedia.org/r/427664

Krinkle renamed this task from "Consider replacing Coal with use of Prometheus" to "Create a navtiming processor for Prometheus". May 10 2018, 12:48 PM

We've refactored coal away from ZMQ to Kafka, and made it runnable separately from Graphite. The process has also been migrated from the graphite hosts to the webperf hosts. Details about that at T159354.

This task (about Coal/Prometheus) was meant to be an alternative to T159354, but we've decided to keep Coal around for now, and instead have the Prometheus processor for Navigation Timing be its own thing so that we can run them side by side.

I've updated this task to be about setting up that new processor (rather than converting Coal to become it). This also means it is no longer part of the consolidation effort, but rather its own new thing that we'll host where it makes sense (probably webperf#1).

Change 534771 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[performance/navtiming@master] Expose performance survey as Prometheus metrics

https://gerrit.wikimedia.org/r/534771

Krinkle moved this task from Backlog: Future Goals to Inbox on the Performance-Team board. Edited Oct 21 2019, 8:07 PM

Moving to Inbox for re-triage. @fgiunchedi asked about navtiming prometheus integration.

Krinkle claimed this task. Oct 23 2019, 7:22 PM
Krinkle lowered the priority of this task from High to Low.

Next step: Figure out how we can stage and test this on Beta Cluster (and Labs Grafana).

Krinkle moved this task from Inbox to Doing on the Performance-Team board. Oct 23 2019, 7:23 PM
Krinkle removed Krinkle as the assignee of this task. Dec 13 2019, 2:55 PM
Krinkle removed a project: Patch-For-Review.
Krinkle moved this task from Doing to Backlog: Future Goals on the Performance-Team board.
Gilles assigned this task to dpifke. Jan 7 2020, 11:36 AM

Change 534771 merged by jenkins-bot:
[performance/navtiming@master] Expose handlers counters as Prometheus metrics

https://gerrit.wikimedia.org/r/534771

Change 572141 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[operations/puppet@production] Scrape webperf Prometheus metrics

https://gerrit.wikimedia.org/r/572141

Change 572141 merged by Filippo Giunchedi:
[operations/puppet@production] Scrape webperf Prometheus metrics

https://gerrit.wikimedia.org/r/572141

We're on: webperf metrics are being collected in Prometheus now! Thanks to everyone involved, @Gilles @dpifke @Krinkle. There's of course follow-up work to do, but at least now we should be able to compare metrics with coal.