Discrepancy between Graphite & Prometheus editResponseTime counts
Open, High, Public

Description

I saw that T354905: migrate MediaWiki.timing.editResponseTime to statslib had been resolved for some time now, so I looked at converting the "Successful wiki edits" panels on the Grafana front page & www.wikimediastatus.net to use the version from Prometheus.

The original Graphite metric used was MediaWiki.timing.editResponseTime.sample_rate.

As best I can tell this ought to correspond to a sum(rate(mediawiki_WikimediaEvents_editResponseTime_seconds_count[5m])) query against Thanos.
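To make the intended reading of that query concrete, here is a minimal sketch of what summing per-series counter rates amounts to. It is a simplification: Prometheus' rate() also handles counter resets and extrapolates to the window boundaries, which this sketch ignores. The function names and the sample values are hypothetical and are not taken from production data.

```python
from typing import Dict, List, Tuple

# (unix timestamp in seconds, counter value); ascending by timestamp
Sample = Tuple[float, float]


def simple_rate(samples: List[Sample]) -> float:
    """Per-second increase between the first and last sample of a window.

    Simplified stand-in for rate(): no counter-reset handling and no
    extrapolation to the window boundaries.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must be in ascending time order")
    return (v1 - v0) / (t1 - t0)


def summed_rate(series: Dict[str, List[Sample]]) -> float:
    """Sum of per-series rates, analogous to sum(rate(..._count[5m]))."""
    return sum(simple_rate(s) for s in series.values())


if __name__ == "__main__":
    # Hypothetical series keyed by an illustrative instance name.
    example = {
        "instance-a": [(0.0, 0.0), (300.0, 600.0)],
        "instance-b": [(0.0, 10.0), (300.0, 310.0)],
    }
    print(summed_rate(example))
```

Since a _count series increases by one per observation, the summed rate is the number of observations per second across every series that Prometheus actually scrapes; a series that is never ingested contributes nothing to the total.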

However, when I compare the results, the Prometheus metric is approximately half the expected value:
https://grafana.wikimedia.org/goto/sBKcZCBIg

Am I misunderstanding something or is there something wrong?

Event Timeline

Thanks for the report!

I'd hypothesize this is because Prometheus stats ingestion is not yet enabled on k8s hosts. The per-pod deployment strategy is convenient, but we've been concerned about turning it on in light of T359640: mediawiki_resourceloader_build_seconds_bucket big metric on Prometheus ops.

We're coordinating with ServiceOps to redesign the exporter deployment on k8s. How we do this should also take into account: T359497: StatsD Exporter: gracefully handle metric signature changes.

Indeed, I agree that the root cause is the one @colewhite pointed out. Since (as far as I'm aware) we don't have an ETA for tweaking the statsd-exporter deployment on wikikube as described in T359640, I think we should go back to the Graphite/statsd metric for edits so that the numbers are accurate.

Now that T365265 is nearing completion, this may be worth another look, @CDanis?

I took another look today. Summing by kubernetes_namespace yields results for mw-api-ext that are quite similar to Graphite's; there are also ~3-4 edits/s from mw-web: https://grafana.wikimedia.org/goto/DLw3vxvNg . I'm not sure that's enough to explain the discrepancy. At any rate (hah!), unless we are double-counting edits on the Prometheus side (which could also explain it), I'm tempted to trust its values more at this point. Something else we could do is count edits via other means and/or proxy metrics (changeprop, for example, or the Kafka topic(s)).
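For anyone repeating the per-namespace comparison, a minimal sketch follows. It assumes the metric name from the task description and the kubernetes_namespace label mentioned above; the helper names are hypothetical, and the rates in the usage example are placeholders rather than measured values.

```python
from typing import Dict

METRIC = "mediawiki_WikimediaEvents_editResponseTime_seconds_count"


def per_namespace_query(metric: str = METRIC, window: str = "5m") -> str:
    """Build the PromQL that breaks the edit rate down by namespace."""
    return f"sum by (kubernetes_namespace) (rate({metric}[{window}]))"


def compare(prom_by_namespace: Dict[str, float], graphite_total: float) -> Dict[str, float]:
    """Compare the summed Prometheus rate against a Graphite rate."""
    if graphite_total <= 0:
        raise ValueError("graphite_total must be positive")
    prom_total = sum(prom_by_namespace.values())
    return {
        "prometheus_total": prom_total,
        "graphite_total": graphite_total,
        "ratio": prom_total / graphite_total,
    }


if __name__ == "__main__":
    print(per_namespace_query())
    # Placeholder rates in edits/s, for illustration only.
    print(compare({"namespace-a": 2.0, "namespace-b": 1.0}, graphite_total=6.0))
```

A ratio near 1.0 would suggest the two sources agree, a ratio well below 1.0 would point to series missing on the Prometheus side, and a ratio well above 1.0 would be consistent with double-counting.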