
Add Prometheus support to statsd.js via mw.track()
Closed, Resolved · Public

Description

mw.track() is a frontend JS feature that collects metrics produced in browsers and forwards them to Graphite.

This task is complete when there is a migration strategy in place to migrate metrics produced by mw.track() into Prometheus.

T354907#9480808

Related Objects

Event Timeline


Thank you for the proposal; it seems we have some options forward. To move us towards a decision, I’d like to raise a couple of points for consideration:

  • Is there merit in trying both approaches in parallel and allowing metric owners to choose which mechanism and data store best suits them?
  • If the path forward must be one or the other, how do we reach a consensus? Moreover, who is responsible/accountable for deciding on the approach?

I don’t intend to rush the overall discovery process or the discussion; however, I wanted to check with you to see if we have enough to work toward a decision.

allowing metric owners to choose which mechanism and data store best suits them?

FWIW, with the event platform consolidation proposal, metrics owners do not have to choose. All of these metrics will go both to Prometheus and to the Data Lake automatically.

Is there merit in trying both approaches in parallel

I'm not sure it would be worth the effort to implement both. I'd hope we can reach a decision without implementing both!

If the path forward must be one or the other, how do we reach a consensus? Moreover, who is responsible/accountable for deciding on the approach?

Goooood question! This is maybe what we should meet about. There is quite a bit of overlap between Observability and Data Platform Engineering. We both do metric and log collection, and dashboarding. Product teams have to navigate between multiple systems to find the ones that work best for them. I'd like it if we could work together to unify the backend data formats and transports, so that the data can be consumed into and used in whichever consumer interface/serving layers (Prometheus+Grafana, Hive+Superset/Turnilo/Jupyter, etc.) make the most sense for our use cases.

In this case, if I were to propose some owners:

  • Data Products/Metrics Platform and/or a MediaWiki team (which one?) would own the MW JS implementation.
  • Data Engineering + SRE Traffic would own /beacon/v2/event -> eventgate-analytics-external routing
  • Data Engineering and/or SRE Observability would own kafka -> prometheus consumer processing (AKA statsv.py)

I think to decide this, we'd ideally need a person (a PM?) from each of those teams.

FWIW, with the event platform consolidation proposal, metrics owners do not have to choose. All of these metrics will go both to Prometheus and to the Data Lake automatically.

This is very compelling.

I'm not sure it would be worth the effort to implement both. I'd hope we can reach a decision without implementing both!

+1

Goooood question! This is maybe what we should meet about. There is quite a bit of overlap between Observability and Data Platform Engineering. We both do metric and log collection, and dashboarding. Product teams have to navigate between multiple systems to find the ones that work best for them. I'd like it if we could work together to unify the backend data formats and transports, so that the data can be consumed into and used in whichever consumer interface/serving layers (Prometheus+Grafana, Hive+Superset/Turnilo/Jupyter, etc.) make the most sense for our use cases.

I completely agree that collaboration is key to moving this forward; we're eager to work on this together.

In this case, if I were to propose some owners:

  • Data Products/Metrics Platform and/or a MediaWiki team (which one?) would own the MW JS implementation.
  • Data Engineering + SRE Traffic would own /beacon/v2/event -> eventgate-analytics-external routing
  • Data Engineering and/or SRE Observability would own kafka -> prometheus consumer processing (AKA statsv.py)

I think to decide this, we'd ideally need a person (a PM?) from each of those teams.

I'll work on facilitating this conversation. Thanks @Ottomata!

Change #1092256 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] ResourceLoader: Support multiple mw.track() data arguments

https://gerrit.wikimedia.org/r/1092256

@Krinkle: Do you need a review for that patch and the patch(es) in the relation chain?

@phuedx I'd prefer to use this opportunity to build knowledge/familiarity within my team for the code we maintain.

Change #1092312 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/extensions/WikimediaEvents@master] stats: Add unit tests

https://gerrit.wikimedia.org/r/1092312

Change #1092910 had a related patch set uploaded (by Krinkle; author: Krinkle):

[performance/statsv@master] Remove unused statsvr.py

https://gerrit.wikimedia.org/r/1092910

Change #1092312 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] stats: Add unit tests

https://gerrit.wikimedia.org/r/1092312

Change #1092256 merged by jenkins-bot:

[mediawiki/core@master] ResourceLoader: Support multiple mw.track() data arguments

https://gerrit.wikimedia.org/r/1092256

Change #1092910 merged by jenkins-bot:

[performance/statsv@master] Remove unused statsvr.py

https://gerrit.wikimedia.org/r/1092910

In the Growth team, we added the first sub-tasks about migrating to statslib to our next sprint. We do have quite a number of metrics that we track from JavaScript in the browser. From glancing at the comments and seeing some merged patches, I struggle to infer how ready this is.

Can we start migrating the mw.track( ... ) calls now, and if so how? Or is this still blocked, and we should just push this work to some time maybe next quarter?

Some examples that we do have in code:

mw.track(
	'timing.growthExperiments.suggestedEdits.taskEditorReady.' + this.taskType +
	( OO.ui.isMobile() ? '.mobile' : '.desktop' ),
	window.performance.now() - this.startTime
);
mw.track(
	// Using 'timing' for transfer size sounds conceptually wrong, but we want
	// the various features that statsd timing gives us (see
	// https://github.com/statsd/statsd/blob/master/docs/metric_types.md)
	'timing.growthExperiments.specialHomepage.navigationTransferSize',
	navigationEntries[ 0 ].transferSize
);
mw.track( 'counter.growthExperiments.StructuredTask.noSuggestionsDialog.' + taskType );

@Michael The replacement is not yet ready. We aim to have this ready by end of quarter. You can plan to add (or replace) Prometheus-bound statsd.js calls starting in January.

Change #1098517 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/extensions/WikimediaEvents@master] Make use of built-in packageFiles `config` feature

https://gerrit.wikimedia.org/r/1098517

Change #1098521 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/extensions/WikimediaEvents@master] statsd: Create a `stats` mw.track topic for dogstatsd/Prometheus

https://gerrit.wikimedia.org/r/1098521

@Michael The replacement is not yet ready. We aim to have this ready by end of quarter. You can plan to add (or replace) Prometheus-bound statsd.js calls starting in January.

Thank you for the quick response! Then we will postpone that work till the new year.

Krinkle triaged this task as High priority. Nov 27 2024, 2:59 PM

Change #1099720 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/puppet@production] webperf: set statsv.py --statsd to statsd.eqiad.wmnet

https://gerrit.wikimedia.org/r/1099720

Change #1099796 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] webperf: disable statsd-exporter relaying flag

https://gerrit.wikimedia.org/r/1099796

Change #1099821 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] webperf: set statsd exporter timer type to histogram

https://gerrit.wikimedia.org/r/1099821

Change #1099822 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] prometheus: restart statsd-exporter on config change

https://gerrit.wikimedia.org/r/1099822

Change #1098517 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] Make use of built-in packageFiles `config` feature

https://gerrit.wikimedia.org/r/1098517

Change #1098521 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] statsd: Create a `stats` mw.track topic for dogstatsd/Prometheus

https://gerrit.wikimedia.org/r/1098521

Change #1099822 merged by Cwhite:

[operations/puppet@production] prometheus: restart statsd-exporter on config change

https://gerrit.wikimedia.org/r/1099822

Change #1099720 merged by Cwhite:

[operations/puppet@production] webperf: set statsv.py --statsd to statsd.eqiad.wmnet

https://gerrit.wikimedia.org/r/1099720

Change #1099796 merged by Cwhite:

[operations/puppet@production] webperf: disable statsd-exporter relaying flag

https://gerrit.wikimedia.org/r/1099796

Change #1099821 merged by Cwhite:

[operations/puppet@production] webperf: set statsd exporter timer type to histogram

https://gerrit.wikimedia.org/r/1099821

Sgs moved this task from Inbox to Tracking on the Growth-Team board.

Change #1098570 had a related patch set uploaded (by Krinkle; author: Krinkle):

[performance/statsv@master] statsv: Add dogstatsd/Prometheus support

https://gerrit.wikimedia.org/r/1098570

Change #1105367 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/mediawiki-config@master] Enable $wgWMEStatsBeaconUri

https://gerrit.wikimedia.org/r/1105367

Change #1105372 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/puppet@production] webperf: Enable --dogstatsd on statsv.py

https://gerrit.wikimedia.org/r/1105372

Change #1105423 had a related patch set uploaded (by Cwhite; author: Cwhite):

[performance/statsv@master] update kafka-python version to match webperf hosts

https://gerrit.wikimedia.org/r/1105423

Change #1098570 merged by jenkins-bot:

[performance/statsv@master] statsv: Add dogstatsd/Prometheus support

https://gerrit.wikimedia.org/r/1098570

Change #1105454 had a related patch set uploaded (by Cwhite; author: Cwhite):

[performance/statsv@master] Further limit regex to only counter type flag

https://gerrit.wikimedia.org/r/1105454

Change #1105423 merged by jenkins-bot:

[performance/statsv@master] update kafka-python version to match webperf hosts

https://gerrit.wikimedia.org/r/1105423

Oh, it's POST? With a POST body? Or is the data just urlencoded query params?

If we can use POST body for the data, then my proposed new pipeline would be: […]

[…] This would allow the data to be useful in multiple places, including the ability to join with any other data in the Data Lake.

[…] for Product Analytics it would be really helpful if the data was available in data lake and could be accessed/reported with Superset (which we know how to use).

For various reasons, [instrumentation] ended up being a Grafana dashboard for the analyst to use (since we have no expertise on the team with that platform)

There are four reasons I opted to complete this task by maintaining the simpler stack we have today, rather than integrating EventGate.

  1. Compatibility: Support HTTP-GET. This is currently missing in EventGate, which would mean HTML- and CSS-based instrumentations become impossible. Most prominently this affects research on JavaScript reliability (how often JavaScript fails to execute in modern browsers), which is especially important today, when the industry seems to grossly underestimate this problem. This came up in a discussion about the Graphs extension as well, where we can trivially measure this with a temporary <img>, <noscript>, and one line of JavaScript in an inline <script> tag, possibly in Parser/Parsoid HTML for a given extension tag (GOV.UK example, MediaWiki Engineering Principles). I don't mind shifting the line between "easy" and "possible" further toward "possible" for some use cases, but unless there is a way for an HTML element to make a POST request by itself, I don't see another way. The current way is <img src="/beacon/statsv?mediawiki_something_total:1|c">. What would be the equivalent of this?
  2. Punt new use cases (A: join prometheus with data lake, B: analyst unfamiliar with Grafana): I discussed this with @JTweed-WMF and we agreed this is worth exploring, but we prefer to discuss this outside the critical path of a project that is already several months behind schedule. We're open to revisiting this with a wider context, looking at the tools and options available, not now when under pressure from what is largely an orthogonal task.
  3. Short critical path for operational telemetry. It's one thing to extract data from Prometheus into a data lake. It's quite another to divert Tier 1 telemetry on the way to Prometheus into a data lake instead, and have it only become observable to alerting after it is consumed from there. We don't do this for any service, not even for MediaWiki PHP in production. I suspect SRE might not accept this for other production services either (e.g. place EventBus in-between an application and its Prometheus in-take). I've outlined the perceived regression in complexity and autonomy below. It seems quite expensive in principle as well for each numerical increase to be an entire "event". This would also increase the "dollars/CO2 cost per increment", noting that these are ultimately mere foo++ counters over a certain Prometheus interval.
  4. Deliver by end of year. As outlined below, the stack we have today is simple and collaboratively maintained between MW Eng and SRE O11y. All we need is two simple changes and we're done (one in statsd.js, one in statsv.py), which is doable by end of year. Involving EventGate at this time would not change the need for those two changes but instead require additional changes and introduce (to us) unknowns and potential delays (e.g. would we set up a new /beacon/intake-analytics endpoint on wiki domains by end of year? And add support for events in query strings?). It also commits to new use cases that I'd rather address with more care and patience at another time.
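Point 1 above hinges on the query-string format of the /beacon/statsv GET request. As a minimal sketch (the helper name is hypothetical), this is how a counter beacon URL of the form /beacon/statsv?mediawiki_something_total:1|c can be composed, such that it works equally from JavaScript or as the src of a temporary <img>:

```python
from urllib.parse import quote


def statsv_beacon_url(name, value=1, kind="c"):
    """Build a /beacon/statsv GET URL carrying one statsd-style metric.

    The metric travels entirely in the query string, so no request body
    (and therefore no JavaScript) is required to emit it.
    """
    # Keep ":" and "|" literal; they are the statsd field separators.
    return "/beacon/statsv?" + quote(f"{name}:{value}|{kind}", safe=":|")
```

For example, statsv_beacon_url("mediawiki_something_total") yields /beacon/statsv?mediawiki_something_total:1|c, matching the <img> example above.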

Critical path comparison

Dependency graph: Today
  • Overview: Browser → navigator.sendBeacon() → HAProxy → Varnish → Kafka → statsv.py
  • Browser: en.wikipedia.org/wiki/Main_Page
    • MediaWiki Example feature: call mw.track()
    • MediaWiki core ResourceLoader: mw.track function
    • MediaWiki statsd.js: call navigator.sendBeacon
    • The above 3 components are all in Gerrit, with code that is understandable, debuggable, patchable, and deployable by both MediaWiki Engineering and SRE Observability team members.
    • Web request: /beacon
  • Traffic: LVS, HAProxy
  • Traffic: Varnish
    • 🏦 Data is saved to Kafka from here.
    • Debugging: sudo varnishlog, accessible by both SRE O11y and several MW devs.
  • SRE O11y: statsv.py
    • Debugging: kafkacat, journalctl; accessible
  • SRE O11y: statsd/Graphite
  • SRE O11y: statsd_exporter/Prometheus
Dependency graph: Current draft
  • Overview: Browser → navigator.sendBeacon() → HAProxy → Varnish → Kafka → statsv.py
  • Browser: en.wikipedia.org/wiki/Main_Page
    • MediaWiki Example feature: call mw.track()
    • MediaWiki core ResourceLoader: mw.track function
    • MediaWiki statsd.js: call navigator.sendBeacon. ✴️ Change: Update format from statsd to dogstatsd.
    • The above 3 components are all in Gerrit, with code that is understandable, debuggable, patchable, and deployable by both MediaWiki Engineering and SRE Observability team members.
    • Web request: /beacon
  • Traffic: LVS, HAProxy
  • Traffic: Varnish
    • 🏦 Data is saved to Kafka from here.
  • SRE O11y: statsv.py ✴️ Change: Send dogstatsd lines to Prometheus.
  • SRE O11y: statsd/Graphite
  • SRE O11y: statsd_exporter/Prometheus
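
The ✴️ format change in the draft above (statsd to dogstatsd) amounts to optionally appending |#key:value tags to the familiar name:value|type line. A hedged sketch (function names are illustrative, not part of statsd.js):

```python
def statsd_line(name, value, kind):
    """Legacy statsd wire format, as sent toward Graphite today."""
    return f"{name}:{value}|{kind}"


def dogstatsd_line(name, value, kind, tags=None):
    """Dogstatsd wire format: the same line, plus optional |#k:v,k:v
    tags that statsd_exporter can translate into Prometheus labels."""
    line = f"{name}:{value}|{kind}"
    if tags:
        line += "|#" + ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return line
```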
Dependency graph: Alternative
  • Overview: Browser → ❇️ Add: metrics-platform JS → navigator.sendBeacon() → HAProxy → Varnish → ❇️ Add: EventGate → Kafka → statsv.py
  • Browser: en.wikipedia.org/wiki/Main_Page
    • MediaWiki Example feature: call mw.track()
    • MediaWiki core ResourceLoader: mw.track function
    • MediaWiki statsd.js: call mw.eventLog.submit
      • ✴️ Change: Update format from statsd to dogstatsd.
      • ✴️ Change: Wrap dogstatsd lines in a to-be-created JSON schema for WMF EventGate.
    • ❇️ Add: MediaWiki EventLogging: mw.eventLog.submit function. — Involves several layers of indirection.
    • ❇️ Add: mediawiki/libs/metrics-platform. — This library is maintained on GitLab, with access limited to DataEng. Changes require 2-3 levels of releases propagation, and vendor pull-throughs.
    • The above components are all in Gerrit, and fully debuggable/patchable/deployable by MediaWiki Engineering and SRE Observability.
    • Web request: https://intake-analytics.wikimedia.org/v1/events
    • ❇️ Add: DNS intake-analytics.wikimedia.org
    • ❇️ Add: Connection to intake-analytics.wikimedia.org
  • Traffic: LVS, HAProxy
  • Traffic: Varnish
  • ❇️ Add: Traffic: ATS
  • ❇️ Add: Kubernetes load balancer
  • ❇️ Add: schema.wikimedia.org service
    • ❇️ TODO: Define and agree on a dogstatsd schema for WMF use. Own and maintain this versioned schema long-term.
    • This service and the schemas-event-primary repo reside on GitLab, with limited merge access.
  • ❇️ Add: EventGate Node.js service
    • This service is maintained on GitLab, with access limited to DataEng.
    • This service is configured and deployed via operations/deployment-charts.
    • 🏦 Data is saved to Kafka from here.
  • SRE O11y: statsv.py ✴️ Change: Send dogstatsd lines to Prometheus.
  • SRE O11y: statsd/Graphite
  • SRE O11y: statsd_exporter/Prometheus

You can see that from my perspective, there is no consolidation. No part is removed, replaced, or reduced in complexity for us. It would however add numerous layers ahead of "🏦 Data is saved to Kafka from here", and reduce our ability to debug and patch key points. It also means we would involve an additional intermediary format (a schema'ed EventGate event), with a schema that may be versioned and change over time, rather than having one public, stable format (the dogstatsd spec) used end to end that won't be broken or versioned.

If the use case to join with data lake has a high priority, I suggest pulling data from Prometheus rather than intercepting it on the way to Prometheus. That way you'd not be limited to the one narrow case of MediaWiki JS stats, and it wouldn't require each instrumentation to be modified. You'd have the option to join with any Prometheus data, including e.g. from Node.js services, MediaWiki PHP, and much much more.
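
Pulling from Prometheus could go through its standard HTTP API. As a sketch, building a request URL for the documented /api/v1/query_range endpoint (the base host is a placeholder; whether this fits Data Lake ingestion is exactly the open question):

```python
from urllib.parse import urlencode


def query_range_url(base, promql, start, end, step="5m"):
    """Build a Prometheus HTTP API range-query URL.

    /api/v1/query_range with query/start/end/step parameters is the
    standard Prometheus endpoint; `base` is a placeholder host.
    """
    params = {"query": promql, "start": start, "end": end, "step": step}
    return f"{base}/api/v1/query_range?" + urlencode(params)
```

A consumer built on this would see any Prometheus data, not just MediaWiki JS stats.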

Conclusion: We have not precluded EventGate integration in any way. As shown above, the changes we've made are needed regardless.

For awareness, see also T359178#9640223 re: statsv in the context of varnishkafka deprecation/removal.

Pros: Removes varnishkafka dependency

Thank you. It looks like this applies to Fundraising /beacon/impression as well. I've subscribed myself to T370668: New software: haproxykafka, to remind us next quarter to look at haproxykafka. That would remove a layer in our stack (Varnish), and move the point where we save the data one step earlier in the stack, both of which I certainly welcome!

Dependency graph: Next step?
  • Overview: Browser → navigator.sendBeacon() → HAProxy → Kafka → statsv.py
  • Browser […] → Web request: /beacon
  • Traffic: LVS, HAProxy
    • 🏦 Data is saved to Kafka from here.
  • Traffic: Varnish — Remove
    • Data is no longer saved to Kafka from here.
  • SRE O11y: statsv.py
  • SRE O11y: statsd/Graphite — Remove
  • SRE O11y: statsd_exporter/Prometheus

Thank you so much Timo for the details!

Punt new use cases (A: join prometheus with data lake, B: analyst unfamiliar with Grafana): I discussed this with @JTweed-WMF and we agreed this is worth exploring, but we prefer to discuss this outside the critical path of a project that is already several months behind schedule. We're open to revisiting this with a wider context, looking at the tools and options available, not now when under pressure from what is largely an orthogonal task.

This reasoning makes sense, and is the perspective I was looking for.

Change #1105372 merged by Cwhite:

[operations/puppet@production] webperf: Enable --dogstatsd on statsv.py

https://gerrit.wikimedia.org/r/1105372

Now on to some technical comments / clarifications :)

Compatibility: Support HTTP-GET.

If this is a requirement, then this reasoning makes sense.

When we embarked on the original Modern Event Platform program that resulted in replacing the legacy EventLogging GET with a POST to EventGate, some analysis was done around the need to track metrics from browsers without JavaScript (if needed, I can try to do some phab/google doc archaeology to find the analysis), and it was determined to not be relevant. The number of requests coming from JavaScript-less clients was small enough.

Most prominently this affects research on JavaScript reliability (how often JavaScript fails to execute in modern browsers)

Ah ha, indeed. I doubt this kind of analysis was considered in the product metrics use case. Okay.

We could support GET in EventGate if we had to, via good ol' URL encoded JSON strings.
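
For what that could look like, a minimal sketch (the endpoint path here is illustrative; EventGate normally receives POSTed JSON):

```python
import json
from urllib.parse import quote


def get_style_event_url(event):
    """URL-encode a JSON event so it fits in a GET query string,
    e.g. as the src of an <img> beacon."""
    return "/beacon/v2/event?" + quote(json.dumps(event, separators=(",", ":")))
```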


It's quite another to divert Tier 1 telemetry on the way to Prometheus into a data lake instead, and have it only become observable to alerting after it is consumed from there. We don't do this for any service, including not for MediaWiki

The proposal is not to go from Data Lake -> Prometheus. It is to go from Kafka -> Prometheus, just like statsv is doing now. MW logging also goes Kafka -> logstash elasticsearch. EventGate is used for SRE Network Error Logging, for which there is T304373: Also intake Network Error Logging events into the Analytics Data Lake.

I've got some corrections to the "Dependency Graphs." I'll try to make them here.

Dependency graph: Today
Overview: Browser → navigator.sendBeacon() → HAProxy → Varnish → Varnishkafka → Kafka → statsv.py

Varnishkafka was missing in your current pipeline. The Alternative: EventGate proposal is to replace Varnishkafka with EventGate.

Dependency graph: Alternative: EventGate
Overview: Browser → navigator.sendBeacon() → HAProxy → Varnish → ❇️ EventGate (replaces Varnishkafka) → Kafka → statsv.py

  • Browser: en.wikipedia.org/wiki/Main_Page
    • MediaWiki Example feature: call mw.track()
    • MediaWiki core ResourceLoader: mw.track function
    • MediaWiki statsd.js: call navigator.sendBeacon
      • ✴️ Change: Update format from statsd to JSON with a to-be-created opentelemetry JSON event schema for WMF Event Platform
    • ✅ The above components are all in GitLab/Gerrit, and fully debuggable/patchable/deployable by MediaWiki Engineering and SRE Observability.
    • Web request: <same-site>beacon/v2/event (supported by T263049 Avoid extra HTTPS connections for most Event Platform beacons)
  • Traffic: LVS, HAProxy
  • Traffic: Varnish
  • ❇️ Add: Traffic: ATS
  • ❇️ Add: Kubernetes load balancer
  • ❇️ TODO: Define and agree on a dogstatsd schema for WMF use. Own and maintain this versioned schema long-term.
  • ❇️ Add: EventGate Node.js service
    • ❇️ Add: schema.wikimedia.org service or schema repository (this is a dependency of EventGate)
    • ❇️ Add: meta.wikimedia.org/w/api.php?action=streamconfigs (this is a dependency of EventGate)
    • This service is maintained on GitLab.
    • This service is configured and deployed via operations/deployment-charts.
    • 🏦 Data is saved to Kafka from here.
  • SRE O11y: statsv.py ✴️ Change: Send dogstatsd lines to Prometheus.
  • SRE O11y: statsd/Graphite
  • SRE O11y: statsd_exporter/Prometheus

Notable changes:

  • EventGate replaces varnishkafka
  • There is no need to include eventlogging or metrics platform as dependencies.
  • I'm not sure why GitLab was a negative, and I disagree that the code there is any less patchable or deployable by MW engineering or SRE.
  • The proposal includes T263049: Avoid extra HTTPS connections for most Event Platform beacons, so there is no intake-analytics.wikimedia.org connection involved.

You can see that from my perspective, there is no consolidation. No part is removed, replaced, or reduced in complexity for us.

Varnishkafka is replaced.

Indeed there are the usual service request routing layers (ATS, k8s). I was not considering those, as those are usually transparent to the requests. But agreed, if you need to have this telemetry data avoid those layers, then they do add extra dependencies.

I think the key term in the quoted sentence is 'for us'. By 'us' here, I take it to mean MW engineering + SRE O11y.
If you changed 'us' to Wikimedia engineering, we could look at this differently.
Complexity is reduced because an already-existing system is reused.
Engineers are able to do more things with the same data, rather than implement workarounds or multiple data pipelines. It enables re-use of product metric data.

It also means we would involve an additional intermediary format (schema'ed EventGate event), with a schema that may be versioned and change over time. Rather than have 1 public stable format (dogstatsd spec) throughout from start to finish that won't be broken or versioned.

I'm not sure how this is a downside. Evolving the schema is only done if needed, and if it is tracking opentelemetry / prometheus it will only need to change if they change. Schema changes must be backwards compatible anyway.

If the use case to join with data lake has a high priority, I suggest pulling data from Prometheus rather than intercepting it on the way to Prometheus.

I'm not sure if pulling from Prometheus is the right way (more custom data systems integration is expensive), however, there is probably another way to do this.

The varnishkafka-produced, dogstatsd-formatted data will be in kafka before it gets to statsv.py, which means that it can be used by other kafka consumers.

Alternative: dogstatsd kafka topic event transformer

  • A: via statsv.py
    • statsv.py ✴️ Change: transform dogstatsd to opentelemetry JSON events and send to EventGate or Kafka
  • B: via a new kafka consumer
    • new kafka consumer ✴️ Change: transform dogstatsd to opentelemetry JSON events and send to EventGate or Kafka
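
The "transform dogstatsd to JSON events" step in either variant could be sketched as follows; the output field names are invented for illustration, as a real deployment would follow the to-be-created event schema:

```python
import re

# Matches name:value|type with optional |#k:v,k:v dogstatsd tags.
DOGSTATSD_RE = re.compile(
    r"^(?P<name>[\w.]+):(?P<value>[\d.]+)\|(?P<kind>\w+)(?:\|#(?P<tags>.+))?$"
)


def to_event(line):
    """Transform one dogstatsd line into a JSON-ready event dict."""
    m = DOGSTATSD_RE.match(line)
    if m is None:
        raise ValueError(f"not a dogstatsd line: {line!r}")
    tags = m.group("tags")
    labels = dict(t.split(":", 1) for t in tags.split(",")) if tags else {}
    return {
        "name": m.group("name"),
        "value": float(m.group("value")),
        "type": m.group("kind"),
        "labels": labels,
    }
```

A consumer would apply this per Kafka message and produce the resulting JSON to EventGate or directly to another topic.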

Change #1105454 merged by jenkins-bot:

[performance/statsv@master] Further limit regex to only counter type flag

https://gerrit.wikimedia.org/r/1105454

Change #1105367 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable $wgWMEStatsBeaconUri

https://gerrit.wikimedia.org/r/1105367

Mentioned in SAL (#wikimedia-operations) [2024-12-19T21:31:30Z] <tgr@deploy2002> Started scap sync-world: Backport for [[gerrit:1105735|Make AuthManagerAutoConfig configuration key more distinctive (T369180)]], [[gerrit:1105739|SUL3: Disable more auth providers in the local leg of SUL3 login (T369180)]], [[gerrit:1105742|[noop] Update private/readme.php (T369180)]], [[gerrit:1105367|Enable $wgWMEStatsBeaconUri (T355837)]]

Mentioned in SAL (#wikimedia-operations) [2024-12-19T21:37:35Z] <tgr@deploy2002> krinkle, tgr: Backport for [[gerrit:1105735|Make AuthManagerAutoConfig configuration key more distinctive (T369180)]], [[gerrit:1105739|SUL3: Disable more auth providers in the local leg of SUL3 login (T369180)]], [[gerrit:1105742|[noop] Update private/readme.php (T369180)]], [[gerrit:1105367|Enable $wgWMEStatsBeaconUri (T355837)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-12-19T21:53:04Z] <tgr@deploy2002> Finished scap sync-world: Backport for [[gerrit:1105735|Make AuthManagerAutoConfig configuration key more distinctive (T369180)]], [[gerrit:1105739|SUL3: Disable more auth providers in the local leg of SUL3 login (T369180)]], [[gerrit:1105742|[noop] Update private/readme.php (T369180)]], [[gerrit:1105367|Enable $wgWMEStatsBeaconUri (T355837)]] (duration: 21m 34s)

A few updates:

A few updates:

Thank you so much for the work here and the documentation!

A few questions:

// Graphite
mw.track( 'timing.MediaWiki.foo_bar', 1234.56 ); // milliseconds
mw.track( 'counter.MediaWiki.foo_quux' ); // defaults to increment=1
mw.track( 'counter.MediaWiki.foo_quux', 1 );

// Prometheus, optional key-value labels
mw.track( 'stats.mediawiki_foo_bar_total' ); // defaults to increment=1
mw.track( 'stats.mediawiki_foo_bar_total', 1, { something: 'quux' } );

It explicitly says to track milliseconds for Graphite. But if I understand the upstream best-practices correctly, then we _should_ use seconds with Prometheus, correct?

  • The statslib PHP interface has StatsFactory::withComponent and when using that with the argument 'GrowthExperiments', then I can access the corresponding metric in Grafana via mediawiki_GrowthExperiments_link_recommendation_dangling_entries. Should that pattern be continued like that with mw.track?
  • https://www.mediawiki.org/wiki/Gadget_kitchen:_recording_metrics#Prometheus has the bullet-point * The "gadget name" part must not include, because the first part after <code>mediawiki_gadget_</code> is used to group metrics belonging to the same gadget. — I think that is missing a word or a clause, but I don't understand it well enough to fix it myself.

[…]

Thank you so much for the work here and the documentation!

[…] https://wikitech.wikimedia.org/wiki/Performance.wikimedia.org/Runbook#How_it_works […] explicitly says to track milliseconds for Graphite. But if I understand the upstream best-practices correctly, then we _should_ use seconds with Prometheus, correct?

The short answer is: Yes, name it _seconds but remember to pass milliseconds as value, same as we did before the migration. This applies to MediaWiki PHP and MediaWiki JS equally.

The longer answer is:

  • Graphite is being replaced with Prometheus. MediaWiki PHP and JS emit StatsD-formatted messages, same as before.
  • At WMF, we ingested these into statsite which translated them to Graphite data points once a minute (ref wikitech:Statsd, wikitech:Graphite#Extended_properties).
  • For the "new" style stats, WMF ingests these into prometheus/statsd_exporter, which translates them to Prometheus.
  • For the "new" style stats, we follow Prometheus naming conventions instead of Graphite naming conventions, which mostly boils down to: no dots in names, include unit suffix in names (_total, _seconds), and permit optional use of dogstatsd "tags" on statistics that are translatable to Prometheus labels.

The trade-off here (between consistency, transparent naming, and developer familiarity/compatibility), for both MediaWiki PHP and MediaWiki JS, is to have timers named as mediawiki_foo_bar_seconds, but with values that are milliseconds, since that's what the statsd protocol requires. This is convenient in JavaScript, where performance.now() uses milliseconds, and our existing stats in MW extension JS all use milliseconds today. Once consumed by prometheus/statsd_exporter, these are converted to seconds, and that's how they'll come out in Grafana.
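
The convention above can be sketched as follows (function names are illustrative, mimicking what prometheus/statsd_exporter does with timer values):

```python
def emit_timing_ms(name, milliseconds):
    """What MediaWiki JS/PHP sends: a statsd 'ms' timer value, with the
    name already carrying the Prometheus-style _seconds suffix."""
    return f"{name}:{milliseconds}|ms"


def exporter_observe(statsd_line):
    """Sketch of the exporter side: the millisecond timer value is
    converted to seconds before Prometheus/Grafana sees it."""
    name, rest = statsd_line.split(":", 1)
    value_ms, kind = rest.split("|", 1)
    assert kind == "ms" and name.endswith("_seconds")
    return name, float(value_ms) / 1000.0
```

So a performance.now() delta of 1234.56 ms, tracked under mediawiki_foo_bar_seconds, comes out of Grafana as roughly 1.23 seconds.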

  • The statslib PHP interface has StatsFactory::withComponent […]. Should that pattern be continued like that with mw.track?

I think in practice, the larger the number of stats a feature has, the more likely it is that they might be better off modelled as labels on a single counter. This means you naturally repeat the full name less often. For example, mediawiki_MyExtension_foo_click_total and mediawiki_MyExtension_bar_click_total could instead become mediawiki_MyExtension_click_total{button="foo"} and {button="bar"}. The rules of thumb I tend to use are:

  1. Is there a visualisation where I would want to put them all in a single graph? If yes, then using a label would mean you could do things like sum by (button) and automatically discover/include them all.
  2. Do you need a dropdown menu in your Grafana dashboard?
  3. Can a single interaction need to emit two of these? If yes, it might be better kept separate, since you generally don't want to increment the same thing twice in a row with different labels, because you expect sum(mymetric) to produce a meaningful number.
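
The labelled-counter shape can be illustrated with a toy aggregation, standing in for PromQL's sum and sum by (button) (metric and helper names are hypothetical):

```python
from collections import Counter

# One labelled counter, instead of mediawiki_MyExtension_foo_click_total
# and mediawiki_MyExtension_bar_click_total as separate metric names.
clicks = Counter()


def track_click(button):
    # Equivalent in spirit to:
    # mw.track( 'stats.mediawiki_MyExtension_click_total', 1, { button: button } )
    clicks[button] += 1


track_click("foo")
track_click("foo")
track_click("bar")

# sum(mymetric): one meaningful total across all label values.
total = sum(clicks.values())
# sum by (button): per-label breakdown, with values discovered automatically.
by_button = dict(clicks)
```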

The Statsd.js interface is exposed through mw.track, and designed with message passing as the stable interface. This leaves no place for (dependencies on) runtime abstractions like "withComponent". However, if you have a particularly large number of different stat names in your frontend code, and if you personally choose brevity in the local names over the greppability and transparency of the stat names themselves, then you may want to place an additional 1-line wrapper function in your code to shorten these accordingly. This doesn't add dependencies, since these would be local to your own codebase. I personally would not, but this is a reasonable choice you can make within your team/code.

function increment( shortName ) {
  mw.track( `mediawiki_GrowthExperiments_${shortName}_total` );
}

Thanks. Fixed!

[…]

Thank you so much for the work here and the documentation!

[…] https://wikitech.wikimedia.org/wiki/Performance.wikimedia.org/Runbook#How_it_works […] explicitly says to track milliseconds for Graphite. But if I understand the upstream best-practices correctly, then we _should_ use seconds with Prometheus, correct?

The short answer is: Yes, name it _seconds but remember to pass milliseconds as value, same as we did before the migration. This applies to MediaWiki PHP and MediaWiki JS equally.

The longer answer is:

  • Graphite is being replaced with Prometheus. MediaWiki PHP and JS emit StatsD-formatted messages, same as before.
  • At WMF, we ingested these into statsite, which translated them to Graphite data points once a minute (ref wikitech:Statsd, wikitech:Graphite#Extended_properties).
  • For the "new" style stats, WMF ingests these into prometheus/statsd_exporter, which translates them to Prometheus.
  • For the "new" style stats, we follow Prometheus naming conventions instead of Graphite naming conventions, which mostly boils down to: no dots in names, include unit suffix in names (_total, _seconds), and permit optional use of dogstatsd "tags" on statistics that are translatable to Prometheus labels.

The trade-off here (between consistency, transparent naming, and developer familiarity/compatibility), for both MediaWiki PHP and MediaWiki JS, is to have timers named as mediawiki_foo_bar_seconds, but with values that are milliseconds, since that's what the statsd protocol requires. This is convenient in JavaScript, where performance.now() uses milliseconds, and our existing stats in MW extension JS all use milliseconds today. Once consumed by prometheus/statsd_exporter, these are converted to seconds, and that's how they'll come out in Grafana.
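To make that conversion concrete, here is a simplified sketch of what prometheus/statsd_exporter does with a StatsD timer line (the metric name is illustrative, and the real exporter does far more parsing and mapping): the wire value arrives in milliseconds, and the Prometheus observation is recorded in seconds.

```javascript
// Simplified sketch (hypothetical names): parse a StatsD "name:value|ms"
// timer line and convert its millisecond value to seconds, which is the
// unit the resulting Prometheus metric (named *_seconds) reports.
function timerLineToSeconds( line ) {
	const [ name, rest ] = line.split( ':' );
	const [ value, type ] = rest.split( '|' );
	if ( type !== 'ms' ) {
		throw new Error( 'not a StatsD timer line' );
	}
	return { name: name, seconds: Number( value ) / 1000 };
}

timerLineToSeconds( 'mediawiki_MyExtension_load_seconds:250|ms' );
// → { name: 'mediawiki_MyExtension_load_seconds', seconds: 0.25 }
```

So a 250 ms `performance.now()` delta sent as-is on a `_seconds`-named timer comes out as 0.25 in Grafana.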

Wait what? That needs to be a huge warning message in all related documentation places, because that is extraordinarily counter-intuitive and not at all obvious!
If I search for microtime( true ) in extensions, I see them tracking a mix of seconds and milliseconds all over the place, so I'm not the only one.

Like

$this->statsd->timing( 'timing.createaccountAbuseFilter', microtime( true ) - $startTime );

or

// Report pure job execution timing
$jobDuration = microtime( true ) - $startTime;
self::stats()->timing(
	"jobexecutor.{$job->getType()}.exec",
	$jobDuration
);

or in GrowthExperiments where that was also the only way timings were recorded pre-migration, we now have:

https://codesearch.wmcloud.org/search/?q=observe%5C%28+mic&files=.*%5C.php%24&excludeFiles=&repos=

In the extensions I have checked out locally, I also found:

$real = microtime( true ) - $stime;

// Stay backward compatible with the dashboard feeding on
// this data. NOTE: $real is in second with microsecond-level
// precision. This is reconciled on the grafana dashboard.
$this->statsdDataFactory->timing( 'centralauth.session.read', $real );

$this->statsFactory->withComponent( 'CentralAuth' )
	->getTiming( 'session_read_seconds' )
	->observe( $real * 1000 );

...

Could you elaborate on what difference this makes and why it would actually be worth doing? After all, I need to manually pick the unit in Grafana anyway.
Also, the value I see in Grafana for the two places I just checked (one recording seconds and one recording milliseconds) is the value recorded in PHP, and not that value/1000. Now I'm confused: what does "these are converted to seconds, and that's how they'll come out in Grafana" mean exactly?

  • The statslib PHP interface has StatsFactory::withComponent […]. Should that pattern be continued like that with mw.track?

I think in practice, the larger the number of stats a feature has, the more likely it is that they might be better off modelled as labels on a single counter. This means you naturally repeat the full name less often. For example, mediawiki_MyExtension_foo_click_total and mediawiki_MyExtension_bar_click_total could instead become mediawiki_MyExtension_click_total{button="foo"} and {button="bar"}. The rules of thumb I tend to use are:

1) Is there a visualisation where I would want to put them all in a single graph? If yes, then using a label would mean you could do things like sum by (button) and automatically discover/include them all.
2) Do you need a dropdown menu in your Grafana dashboard?
3) Could a single interaction need to emit two of these? If yes, it might be better kept separate, since you generally don't want to increment the same thing twice in a row with different labels: you expect sum(mymetric) to produce a meaningful number.

Thank you, those are useful heuristics to keep in mind💡

[…]
If I search for microtime( true ) in extensions, I see them tracking a mix of seconds and milliseconds all over the place, so I'm not the only one.

Let's take this discussion elsewhere since this isn't new or related to this task.
I'll respond once but please start a talk page thread or Wikitech-l thread to continue, if you like.

This is an issue with a few legacy Statsd->timing() calls. The new MediaWiki Statslib in PHP was developed in 2023, T240685: MediaWiki Prometheus support. See also https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1056249 where we discussed unit confusion, which was addressed in MediaWiki Statslib with explicitly named methods (observeSeconds, observeNanoseconds), clear method docs, and a start-stop utility that gets the measurement right automatically. Feel free to message me directly with examples where the new Statslib is used incorrectly with timers.

In Graphite, there are no units. Everything is just numbers stored under a string name. In core and most extensions, legacy Statsd->timing() calls have consistently used milliseconds, but there are indeed some "sloppy" callers out there. The impact of this was relatively low, given that legacy statsd-graphite doesn't interpret the numbers (beyond blind averaging etc.). In Prometheus, there is a convention for units, and Prometheus requires interpretation of timing values in order to create histogram buckets. These histogram buckets have absolute thresholds (like 0.01s, 0.1s, 0.5s, etc.), so sending a bad unit will actually have significant consequences when using Prometheus.
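A small sketch of why the unit matters so much here (the bucket thresholds below are illustrative, not WMF's actual configuration): with buckets defined in seconds, a 250 ms timing recorded correctly as 0.25 lands in a useful bucket, while the same timing mistakenly recorded with a value of 250 overflows every bucket.

```javascript
// Illustrative histogram buckets, in seconds. Each observation is
// counted in the first bucket whose upper bound ("le") it fits under;
// anything larger falls through to the implicit +Inf bucket.
const buckets = [ 0.01, 0.1, 0.5, 1, 5, 60 ];

function bucketFor( seconds ) {
	const b = buckets.find( ( le ) => seconds <= le );
	return b === undefined ? '+Inf' : b;
}

bucketFor( 0.25 ); // 0.5    — correct: 250 ms observed as 0.25 s
bucketFor( 250 );  // '+Inf' — wrong unit: overflows every bucket
```

Once all the "wrong unit" observations pile up in +Inf, percentile queries over the histogram become meaningless, and that cannot be fixed after the fact in Grafana.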

Wait what? That needs to be a huge warning message in all related documentation places, because that is extraordinarily counter-intuitive and not at all obvious!
If I search for microtime( true ) in extensions, I see them tracking a mix of seconds and milliseconds all over the place, so I'm not the only one.

Accidental and intentional misuse of timers is a long-standing problem. We attempted to help guide it all in a better direction in PHP StatsLib by introducing helper functions that did the math under the hood.

There are some patches proposing we enforce (by adding the suffix) or require (by warning when it is missing) the correct unit suffix in T364240, so that devs get the correct unit in the Prometheus name (seconds) and need only be aware of supplying the unit (seconds, milliseconds, nanoseconds, etc.) to the correct function. This work has stalled for now.

mw.track(), on the other hand, cannot help in this regard. It has to track many things and not just metrics that go to Prometheus.
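That last point follows from mw.track's design as a generic topic/value message bus. The sketch below is a deliberately simplified model (the real implementation also queues events fired before a subscriber attaches, among other things): subscribers match on a topic prefix, so a stats consumer is just one subscriber among many, and mw.track itself cannot know or enforce what unit a value carries.

```javascript
// Simplified model of mw.track's message passing (not MediaWiki's actual
// code). Subscribers register a topic prefix; track() fans each message
// out to every subscriber whose prefix matches.
const subscribers = [];

function trackSubscribe( prefix, fn ) {
	subscribers.push( [ prefix, fn ] );
}

function track( topic, data ) {
	subscribers.forEach( ( [ prefix, fn ] ) => {
		if ( topic.indexOf( prefix ) === 0 ) {
			fn( topic, data );
		}
	} );
}

// Usage: a hypothetical stats consumer only ever sees "stats." topics.
const seen = [];
trackSubscribe( 'stats.', ( topic, value ) => seen.push( [ topic, value ] ) );
track( 'stats.mediawiki_example_total', 1 );
track( 'event.something-else', { foo: 1 } ); // ignored by the stats consumer
```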

[…]
If I search for microtime( true ) in extensions, I see them tracking a mix of seconds and milliseconds all over the place, so I'm not the only one.

Let's take this discussion elsewhere since this isn't new or related to this task.
I'll respond once but please start a talk page thread or Wikitech-l thread to continue, if you like.

Agreed. I'll start a topic about improving the documentation on https://wikitech.wikimedia.org/w/index.php?title=Talk:Performance.wikimedia.org/Runbook in a moment.

examples where the new Statslib is used incorrectly with timers

All of GrowthExperiments. But I'm fixing those in T383208: StatsLib timings MUST be recorded as milliseconds.

Wait what? That needs to be a huge warning message in all related documentation places, because that is extraordinarily counter-intuitive and not at all obvious!
If I search for microtime( true ) in extensions, I see them tracking a mix of seconds and milliseconds all over the place, so I'm not the only one.

Accidental and intentional misuse of timers is a long-standing problem. We attempted to help guide it all in a better direction in PHP StatsLib by introducing helper functions that did the math under the hood.

There are some patches proposing we enforce (by adding the suffix) or require (by warning when it is missing) the correct unit suffix in T364240, so that devs get the correct unit in the Prometheus name (seconds) and need only be aware of supplying the unit (seconds, milliseconds, nanoseconds, etc.) to the correct function. This work has stalled for now.

Thanks for pointing me to that task. I've created a change that implements the suggestion from T359248#9994136 to instead have observeMilliseconds() as opposed to observe(), which would have prevented the issues that resulted in T383208. Happy to continue that conversation in T364240.

Ottomata added a parent task: Restricted Task · Jan 10 2025, 7:03 PM