statsd-exporter by default uses [.005, .01, .025, .05, .1, .25, .5, 1, 2.5 ] second buckets for histograms. If we want more, we'll have to configure them.
What are the expected buckets for MediaWiki histograms?
statsd-exporter by default uses [.005, .01, .025, .05, .1, .25, .5, 1, 2.5 ] second buckets for histograms. If we want more, we'll have to configure them.
What are the expected buckets for MediaWiki histograms?
Subject | Repo | Branch | Lines +/- | |
---|---|---|---|---|
profile::mediawiki::common: set default histogram buckets | operations/puppet | production | +4 -1 |
Status | Subtype | Assigned | Task | ||
---|---|---|---|---|---|
Open | None | T228380 Tech debt: sunsetting of Graphite | |||
Open | None | T205870 Fully migrate producers off statsd | |||
Open | None | T315403 Framework for running experiments on a subset of the app server fleet | |||
Resolved | colewhite | T240685 MediaWiki Prometheus support | |||
Resolved | herron | T344751 Decide on default histogram buckets for MediaWiki timers |
It looks like current metrics in graphite are p50, p75, p95, and p99.
Probably this set is the bare minimum. Any other percentiles worth gathering?
Those numbers are for summary quantiles, not histograms buckets. Summaries aren't aggregatable and in almost all cases regarding multi-instance (e.g. multiple servers, multiple pods) metrics you don't want to deal with them. Almost all efforts to run aggregation queries over them will result in wrong results.
Histograms, which is what we want to deal, per the title of the ticket, have the following default buckets per https://github.com/prometheus/statsd_exporter#global-defaults. buckets: [.005, .01, .025, .05, .1, .25, .5, 1, 2.5 ]. unit is seconds. They are almost definitely not enough. We have pages that take in excess of 15-20 seconds to parse and I am willing to get even more. I 'll defer to @Krinkle for a more thorough investigation if they feel like it, but I 'd want at least another 2 buckets at 30s and 60s. I also doubt the 5ms bucket will be useful for MediaWiki, but I could be wrong.
Change 954114 had a related patch set uploaded (by Herron; author: Herron):
[operations/puppet@production] profile::mediawiki::common: set default histogram buckets
Uploaded the above to get the ball rolling on a patch. As a starting point it is essentially borrowing the values used for benthos mw accesslog [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 15, 20, 30, 60]
Due to limitations on per-minute percentiles in Graphite, timing values are relatively rare in MediaWiki today. The vast majority of the metrics are counters.
The handful of timing metrics we have (and actively make use of) vary a lot in their range. I'm not sure a single set can be of much use. The size and range of timing measures vary a lot throughout the platform, from job measures in the range between whole seconds and hours, to WANCache callbacks that are measures between 0.1 and 100 milliseconds.
If I understand correctly, Prometheus client represent histograms as counters on the wire, so it's mostly a concept that exists outside Prometheus (local to stats-producing clients, and to consumer frontends; but not in-between). As such, is there precedent for users of statsd-exporter to produce their own histograms? For example, how strange would it be for MW-Stats to expand a timing metric into multiple {le=…} values similar to how a Prometheus client would? Then the ranges could be defined by MediaWiki callers instead.
I don't object to the proposed default, as long as there's a way for MediaWiki code to specify its own.
As written it seems like it wouldn't be a default, but rather the only way, with overrides being configured within WMF production wiring, as opposed to provided by the software. That imho violates the separation of concerns and wouldn't scale for other MW users to know about and keep in sync across core and hundreds of extension repos, and across major version upgrades.
The range and sizes of buckets in the histogram can be defined per metric (actually group of metrics, e.g. via a regex). We already use this a lot, e.g. in various services in WikiKube, where each service configures statsd-exporter per metric they want. It is not possible to do this on demand, as in allow the producer to change it, on the fly, without shipping a config change.
So the value discussed is a default, we 'll just need the list of metrics and their suggested buckets to configure statsd-exporter for metrics that wouldn't be fitting that default.
with overrides being configured within WMF production wiring, as opposed to provided by the software. That imho violates the separation of concerns and wouldn't scale for other MW users to know about and keep in sync across core and hundreds of extension repos, and across major version upgrades.
That's a good point btw. Sorry for not replying to this previously. We might want to document this fitting to WMF production very clearly.
This is indeed a wide range.
If I understand correctly, Prometheus client represent histograms as counters on the wire, so it's mostly a concept that exists outside Prometheus (local to stats-producing clients, and to consumer frontends; but not in-between).
Correct. Metric typing is handled internally by the Prometheus client. In this case, a metric is known as a histogram or summary to the Prometheus metric instances within statsd-exporter. Any other handling of histograms is handled at query time (and these are still a experimental features afaik).
As such, is there precedent for users of statsd-exporter to produce their own histograms? For example, how strange would it be for MW-Stats to expand a timing metric into multiple {le=…} values similar to how a Prometheus client would? Then the ranges could be defined by MediaWiki callers instead.
This might work if StatsLib would do it and forego letting statsd-exporter handle it.
I don't object to the proposed default, as long as there's a way for MediaWiki code to specify its own.
In order to accomplish this, StatsLib would have to handle it. One of the reasons we embarked on this path was to avoid having to make regular statsd-exporter config changes or operate a heavily-customized config.
Interestingly, if StatsLib creates executeTiming_seconds_bucket as a counter and executeTiming_seconds as a timer and sends them to the exporter, this renders statsd-exporter inoperable until a restart is commanded. It appears we have to be careful not to step on the metric names as statsd-exporter would generate them for summaries and histograms.
I propose we implement a mixed approach.
The pros as I see them:
The cons as I see them:
Any concerns with this approach?
If not, any objection to using [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 15, 20, 30, 60] as the statsd-exporter set of buckets @herron proposed above?
@colewhite Does the above imply that these buckets would actually be used intentionally? Do we anticipate non-zero metrics emitted by MediaWiki from the new statsd library, that use type "timing" (rather than histogram) and which then rely on getting mapped to a histogram by WMF server provisoning of its statsd exporter?
I can't think of a metric where we'd want to rely on that. If we agree on that, and if dropping or failing on such such metrics is not practical (i.e. would require custom code in statsd-exporter), then I suppose any kind of default bucket is fine. We wouldn't rely on it for any visuals or alerts anyway. It might help with (late) discovery of such mistake, however, if the defaults are more obviously "bad" (e.g. only [0.001] or some such), so as to avoid actually starting to rely on specific WMF configuration in MediaWiki core metrics. This not only avoids a culture in which metrics are first shipped for WMF and later (i.e. never) fixed "for third parties". It also avoids a situation in which it would be a complex cross-cutting breaking change to remove a bucket for you. Once a bucket exists, it generally starts to be relied upon very quickly in dashboards and/or alerts. They are easy to add, but hard to remove.
Timing metrics are not very common in MW code, and placing these at local edges in the code would ease their maintenance burden and discoverability/change-ability. If we think it becomes a burden for instrumentation, we could after a while add a handful of presets in the MW wiring for common Histogram ranges exposed via a constant. Something to keep in the back pocket perhaps?
Per meeting with @Krinkle today:
@herron the outcome of this meeting indicates this is the statsd exporter configuration we want to launch with:
--- defaults: timer_type: histogram mappings: []
Thanks SGTM overall, I'll propose just one amendment
Since these 2 buckets were requested specifically I'd be inclined to either append them to the default set, or document the rationale for omitting them. [.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10, 30, 60, (+Inf)] seems fair IMO
I'd prefer the latter, especially since it looks like we'll be defining buckets in a few places. It'd also avoid needing to research what/where the defaults are, and offer some light insulation from defaults changing for some reason.
Also perhaps a sub/related task would be worthwhile to continue the conversation about histogram implementation details in MW, meanwhile we could wrap up this "decide on default buckets" task using the prom defaults plus 30s, 60s?
I have no problem keeping the 30s and 60s buckets. I was under the impression from the meeting that @Krinkle preferred to omit them?
I'd prefer the latter, especially since it looks like we'll be defining buckets in a few places. It'd also avoid needing to research what/where the defaults are, and offer some light insulation from defaults changing for some reason.
I'm ok with this.
Also perhaps a sub/related task would be worthwhile to continue the conversation about histogram implementation details in MW, meanwhile we could wrap up this "decide on default buckets" task using the prom defaults plus 30s, 60s?
This is T348796: MediaWiki: Define new metric type - Histogram
Perfect, I've updated https://gerrit.wikimedia.org/r/c/operations/puppet/+/954114 to reflect this and I think with a +1 from @Krinkle we'll be good to go.
Change 954114 merged by Herron:
[operations/puppet@production] profile::mediawiki::common: set default histogram buckets