mw.track() is a frontend JS feature that collects metrics produced in browsers and forwards them to Graphite.
This task is complete when there is a migration strategy in place to migrate metrics produced by mw.track() into Prometheus.
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | | None | T343020 Converting MediaWiki Metrics to StatsLib |
| | | | Restricted Task |
| Resolved | | Krinkle | T355837 Add Prometheus support to statsd.js via mw.track() |
| Declined | | None | T180105 Set up a statsv-like endpoint for Prometheus |
| Invalid | | None | T193445 Update Media dashboard in Grafana to use Prometheus metrics |
| Resolved | | colewhite | T271567 statsv seems to be down |
| Resolved | | DAlangi_WMF | T359168 Knowledge sharing MWP - SRE O11Y mw.track() |
Relevant for our proposal: T263049: Avoid extra HTTPS connections for most Event Platform beacons
Thank you for the proposal; it seems we have some options forward. To move us towards a decision, I’d like to raise a couple of points for consideration:
I don’t intend to rush the overall discovery process or the discussion; however, I wanted to check with you to see if we have enough to work toward a decision.
> allowing metric owners to choose which mechanism and data store best suits them?
FWIW, with the event platform consolidation proposal, metrics owners do not have to choose. All of these metrics will go both to Prometheus and to the Data Lake automatically.
> Is there merit in trying both approaches in parallel
I'm not sure it would be worth the effort to implement both. I'd hope we can reach a decision without implementing both!
> If the path forward must be one or the other, how do we reach a consensus? Moreover, who is responsible/accountable for deciding on the approach?
Goooood question! This is maybe what we should meet about. There is quite a bit of overlap between Observability and Data Platform Engineering. We both do metric and log collection, and dashboarding. Product teams have to navigate between multiple systems to find the one that works best for them. I'd like it if we could work together to unify the backend data formats and transports, so that the data can be consumed in whichever interface/serving layers (Prometheus+Grafana, Hive+Superset/Turnilo/Jupyter, etc.) make the most sense for our use cases.
In this case, if I were to propose some owners:
- Data Products/Metrics Platform and/or a MediaWiki team (which one?) would own the MW JS implementation.
- Data Engineering + SRE Traffic would own /beacon/v2/event -> eventgate-analytics-external routing
- Data Engineering and/or SRE Observability would own kafka -> prometheus consumer processing (AKA statsv.py)
I think to decide this, we'd ideally need a person (a PM?) from each of those teams.
This is very compelling.
> I'm not sure it would be worth the effort to implement both. I'd hope we can reach a decision without implementing both!
+1
> Goooood question! This is maybe what we should meet about. There is quite a bit of overlap between Observability and Data Platform Engineering. We both do metric and log collection, and dashboarding. Product teams have to navigate between multiple systems to find the one that works best for them. I'd like it if we could work together to unify the backend data formats and transports, so that the data can be consumed in whichever interface/serving layers (Prometheus+Grafana, Hive+Superset/Turnilo/Jupyter, etc.) make the most sense for our use cases.
I completely agree that collaboration is key to moving this forward; we're eager to work on this together
> In this case, if I were to propose some owners:
> - Data Products/Metrics Platform and/or a MediaWiki team (which one?) would own the MW JS implementation.
> - Data Engineering + SRE Traffic would own /beacon/v2/event -> eventgate-analytics-external routing
> - Data Engineering and/or SRE Observability would own kafka -> prometheus consumer processing (AKA statsv.py)
> I think to decide this, we'd ideally need a person (a PM?) from each of those teams.
I'll work on facilitating this conversation. Thanks @Ottomata!
Change #1092256 had a related patch set uploaded (by Krinkle; author: Krinkle):
[mediawiki/core@master] ResourceLoader: Support multiple mw.track() data arguments
@Krinkle: Do you need a review for that patch and the patch(es) in the relation chain?
@phuedx I'd prefer to use this opportunity to build knowledge/familiarity within my team for the code we maintain.
Change #1092312 had a related patch set uploaded (by Krinkle; author: Krinkle):
[mediawiki/extensions/WikimediaEvents@master] stats: Add unit tests
Change #1092910 had a related patch set uploaded (by Krinkle; author: Krinkle):
[performance/statsv@master] Remove unused statsvr.py
Change #1092312 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] stats: Add unit tests
Change #1092256 merged by jenkins-bot:
[mediawiki/core@master] ResourceLoader: Support multiple mw.track() data arguments
Change #1092910 merged by jenkins-bot:
[performance/statsv@master] Remove unused statsvr.py
In the Growth team, we added the first sub-tasks about migrating to statslib to our next sprint. We do have quite a number of metrics that we track from javascript in the browser. From glancing at the comments and seeing some merged tasks, I struggle to infer how ready this is.
Can we start migrating the mw.track( ... ) calls now, and if so how? Or is this still blocked, and we should just push this work to some time maybe next quarter?
Some examples that we do have in code:
mw.track( 'timing.growthExperiments.suggestedEdits.taskEditorReady.' + this.taskType + ( OO.ui.isMobile() ? '.mobile' : '.desktop' ), window.performance.now() - this.startTime );
mw.track(
	// Using 'timing' for transfer size sounds conceptually wrong, but we want
	// the various features that statsd timing gives us (see
	// https://github.com/statsd/statsd/blob/master/docs/metric_types.md)
	'timing.growthExperiments.specialHomepage.navigationTransferSize',
	navigationEntries[ 0 ].transferSize
);
mw.track( 'counter.growthExperiments.StructuredTask.noSuggestionsDialog.' + taskType );
@Michael The replacement is not yet ready. We aim to have this ready by end of quarter. You can plan to add (or replace) Prometheus-bound statsd.js calls starting in January.
Change #1098517 had a related patch set uploaded (by Krinkle; author: Krinkle):
[mediawiki/extensions/WikimediaEvents@master] Make use of built-in packageFiles `config` feature
Change #1098521 had a related patch set uploaded (by Krinkle; author: Krinkle):
[mediawiki/extensions/WikimediaEvents@master] statsd: Create a `stats` mw.track topic for dogstatsd/Prometheus
Thank you for the quick response! Then we will postpone that work till the new year.
Change #1099720 had a related patch set uploaded (by Krinkle; author: Krinkle):
[operations/puppet@production] webperf: set statsv.py --statsd to statsd.eqiad.wmnet
Change #1099796 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] webperf: disable statsd-exporter relaying flag
Change #1099821 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] webperf: set statsd exporter timer type to histogram
Change #1099822 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] prometheus: restart statsd-exporter on config change
Change #1098517 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Make use of built-in packageFiles `config` feature
Change #1098521 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] statsd: Create a `stats` mw.track topic for dogstatsd/Prometheus
Change #1099822 merged by Cwhite:
[operations/puppet@production] prometheus: restart statsd-exporter on config change
Change #1099720 merged by Cwhite:
[operations/puppet@production] webperf: set statsv.py --statsd to statsd.eqiad.wmnet
Change #1099796 merged by Cwhite:
[operations/puppet@production] webperf: disable statsd-exporter relaying flag
Change #1099821 merged by Cwhite:
[operations/puppet@production] webperf: set statsd exporter timer type to histogram
Change #1098570 had a related patch set uploaded (by Krinkle; author: Krinkle):
[performance/statsv@master] statsv: Add dogstatsd/Prometheus support
Change #1105367 had a related patch set uploaded (by Krinkle; author: Krinkle):
[operations/mediawiki-config@master] Enable $wgWMEStatsBeaconUri
Change #1105372 had a related patch set uploaded (by Krinkle; author: Krinkle):
[operations/puppet@production] webperf: Enable --dogstatsd on statsv.py
Change #1105423 had a related patch set uploaded (by Cwhite; author: Cwhite):
[performance/statsv@master] update kafka-python version to match webperf hosts
Change #1098570 merged by jenkins-bot:
[performance/statsv@master] statsv: Add dogstatsd/Prometheus support
Change #1105454 had a related patch set uploaded (by Cwhite; author: Cwhite):
[performance/statsv@master] Further limit regex to only counter type flag
Change #1105423 merged by jenkins-bot:
[performance/statsv@master] update kafka-python version to match webperf hosts
There are four reasons I opted to complete this task by maintaining the simpler stack we have today, rather than integrating EventGate.
You can see that from my perspective, there is no consolidation. No part is removed, replaced, or reduced in complexity for us. It would however add numerous layers ahead of "🏦 Data is saved to Kafka from here", and reduce our ability to debug and patch key points. It also means we would involve an additional intermediary format (a schema'ed EventGate event), with a schema that may be versioned and change over time, rather than one public stable format (the dogstatsd spec) used throughout, start to finish, that won't be broken or versioned.
If the use case to join with data lake has a high priority, I suggest pulling data from Prometheus rather than intercepting it on the way to Prometheus. That way you'd not be limited to the one narrow case of MediaWiki JS stats, and it wouldn't require each instrumentation to be modified. You'd have the option to join with any Prometheus data, including e.g. from Node.js services, MediaWiki PHP, and much much more.
Conclusion: We have not precluded EventGate integration in any way. As shown above, the changes we've made are needed regardless.
Thank you. It looks like this applies to Fundraising /beacon/impression as well. I've subscribed myself to T370668: New software: haproxykafka, to remind us next quarter to look at haproxykafka. That would remove a layer in our stack (Varnish), and move the point where we save the data one step earlier in the stack, both of which I certainly welcome!
Thank you so much Timo for the details!
Punt new use cases (A: join prometheus with data lake, B: analyst unfamiliar with Grafana): I discussed this with @JTweed-WMF and we agreed this is worth exploring, but we prefer to discuss this outside the critical path of a project that is already several months behind schedule. We're open to revisiting this with a wider context, looking at the tools and options available, not now when under pressure from what is largely an orthogonal task.
This reasoning makes sense, and is the perspective I was looking for.
Change #1105372 merged by Cwhite:
[operations/puppet@production] webperf: Enable --dogstatsd on statsv.py
Now on to some technical comments / clarifications :)
> Compatibility: Support HTTP-GET.
If this is a requirement, then this reasoning makes sense.
When we embarked on the original Modern Event Platform program that resulted in replacing the legacy EventLogging GET with a POST to EventGate, some analysis was done around the need to track metrics from browsers without JavaScript (if needed, I can try to do some phab/google doc archaeology to find the analysis), and it was determined not to be relevant: the number of requests coming from JavaScript-less clients was small enough.
> Most prominently this affects research on JavaScript reliability (how often JavaScript fails to execute in modern browsers)
Ah ha, indeed. I doubt this kind of analysis was considered in the product metrics use case. Okay.
We could support GET in EventGate if we had to, via good ol' URL encoded JSON strings.
> It's quite another to divert Tier 1 telemetry on the way to Prometheus into a data lake instead, and have it only become observable to alerting after it is consumed from there. We don't do this for any service, including not for MediaWiki
The proposal is not to go from Data Lake -> Prometheus. It is to go from Kafka -> Prometheus, just like statsv is doing now. MW logging also goes Kafka -> logstash elasticsearch. EventGate is used for SRE Network Error Logging, for which there is T304373: Also intake Network Error Logging events into the Analytics Data Lake.
I've got some corrections to the "Dependency Graphs". I'll try to make them here.
**Dependency graph: Today**
Overview: Browser → navigator.sendBeacon() → HAProxy → Varnish → Varnishkafka → Kafka → statsv.py
Varnishkafka was missing from your current pipeline. The "Alternative: EventGate" proposal is to replace Varnishkafka with EventGate.
Dependency graph: Alternative: EventGate
Overview: Browser → navigator.sendBeacon() → HAProxy → Varnish → ❇️ EventGate (replaces Varnishkafka) → Kafka → statsv.py
- Browser: en.wikipedia.org/wiki/Main_Page
- MediaWiki Example feature: call mw.track()
- MediaWiki core ResourceLoader: mw.track function
- MediaWiki statsd.js: call navigator.sendBeacon
- ✴️ Change: Update format from statsd to JSON, with a to-be-created OpenTelemetry JSON event schema for WMF Event Platform
- ✅ The above components are all in GitLab/Gerrit, and fully debuggable/patchable/deployable by MediaWiki Engineering and SRE Observability.
- Web request: <same-site>beacon/v2/event (supported by T263049 Avoid extra HTTPS connections for most Event Platform beacons)
- Traffic: LVS, HAProxy
- Traffic: Varnish
- ❇️ Add: Traffic: ATS
- ❇️ Add: Kubernetes load balancer
- ❇️ TODO: Define and agree on a dogstatsd schema for WMF use. Own and maintain this versioned schema long-term.
- ❇️ Add: EventGate Node.js service
- ❇️ Add: schema.wikimedia.org service or schema repository (this is a dependency of EventGate)
- ❇️ Add: meta.wikimedia.org/w/api.php?action=streamconfigs (this is a dependency of EventGate)
  - This service is maintained on GitLab.
  - This service is configured and deployed via operations/deployment-charts.
- 🏦 Data is saved to Kafka from here.
- SRE O11y: statsv.py ✴️ Change: Send dogstatsd lines to Prometheus.
- SRE O11y: statsd/Graphite
- SRE O11y: statsd_exporter/Prometheus
Notable changes:
> You can see that from my perspective, there is no consolidation. No part is removed, replaced, or reduced in complexity for us.
Varnishkafka is replaced.
Indeed there are the usual service request routing layers (ATS, k8s). I was not considering those, as those are usually transparent to the requests. But agreed, if you need to have this telemetry data avoid those layers, then they do add extra dependencies.
I think the key term in the quoted sentence is 'for us'. By 'us' here, I take it to mean MW engineering + SRE O11y.
If you changed 'us' to Wikimedia engineering, we could look at this differently.
Complexity is reduced because an already-existing system is reused.
Engineers are able to do more things with the same data, rather than implement workarounds or multiple data pipelines. It enables re-use of product metric data.
> It also means we would involve an additional intermediary format (schema'ed EventGate event), with a schema that may be versioned and change over time. Rather than have 1 public stable format (dogstatsd spec) throughout from start to finish that won't be broken or versioned.
I'm not sure how this is a downside. Evolving the schema is only done if needed, and if it tracks OpenTelemetry / Prometheus, it will only need to change if they change. Schema changes must be backwards compatible anyway.
> If the use case to join with data lake has a high priority, I suggest pulling data from Prometheus rather than intercepting it on the way to Prometheus.
I'm not sure if pulling from Prometheus is the right way (more custom data systems integration is expensive), however, there is probably another way to do this.
The Varnishkafka-produced, dogstatsd-formatted data will be in Kafka before it gets to statsv.py, which means it can be used by other Kafka consumers.
Alternative: dogstatsd kafka topic event transformer
- statsv.py ✴️ Change: transform dogstatsd to OpenTelemetry JSON events and send to EventGate or Kafka
- new kafka consumer: ✴️ Change: transform dogstatsd to OpenTelemetry JSON events and send to EventGate or Kafka
Change #1105454 merged by jenkins-bot:
[performance/statsv@master] Further limit regex to only counter type flag
Change #1105367 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable $wgWMEStatsBeaconUri
Mentioned in SAL (#wikimedia-operations) [2024-12-19T21:31:30Z] <tgr@deploy2002> Started scap sync-world: Backport for [[gerrit:1105735|Make AuthManagerAutoConfig configuration key more distinctive (T369180)]], [[gerrit:1105739|SUL3: Disable more auth providers in the local leg of SUL3 login (T369180)]], [[gerrit:1105742|[noop] Update private/readme.php (T369180)]], [[gerrit:1105367|Enable $wgWMEStatsBeaconUri (T355837)]]
Mentioned in SAL (#wikimedia-operations) [2024-12-19T21:37:35Z] <tgr@deploy2002> krinkle, tgr: Backport for [[gerrit:1105735|Make AuthManagerAutoConfig configuration key more distinctive (T369180)]], [[gerrit:1105739|SUL3: Disable more auth providers in the local leg of SUL3 login (T369180)]], [[gerrit:1105742|[noop] Update private/readme.php (T369180)]], [[gerrit:1105367|Enable $wgWMEStatsBeaconUri (T355837)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
Mentioned in SAL (#wikimedia-operations) [2024-12-19T21:53:04Z] <tgr@deploy2002> Finished scap sync-world: Backport for [[gerrit:1105735|Make AuthManagerAutoConfig configuration key more distinctive (T369180)]], [[gerrit:1105739|SUL3: Disable more auth providers in the local leg of SUL3 login (T369180)]], [[gerrit:1105742|[noop] Update private/readme.php (T369180)]], [[gerrit:1105367|Enable $wgWMEStatsBeaconUri (T355837)]] (duration: 21m 34s)
A few updates:
Example of first migrant: T382549: (mw.track) Migrate MediaWiki.edit_recovery.* to Prometheus.
Thank you so much for the work here and the documentation!
A few questions:
// Graphite
mw.track( 'timing.MediaWiki.foo_bar', 1234.56 ); // milliseconds
mw.track( 'counter.MediaWiki.foo_quux' ); // defaults to increment=1
mw.track( 'counter.MediaWiki.foo_quux', 1 );

// Prometheus, optional key-value labels
mw.track( 'stats.mediawiki_foo_bar_total' ); // defaults to increment=1
mw.track( 'stats.mediawiki_foo_bar_total', 1, { something: 'quux' } );
It explicitly says to track milliseconds for Graphite. But if I understand the upstream best-practices correctly, then we _should_ use seconds with Prometheus, correct?
The short answer is: Yes, name it _seconds but remember to pass milliseconds as value, same as we did before the migration. This applies to MediaWiki PHP and MediaWiki JS equally.
The longer answer is:
The trade-off here (between consistency, transparent naming, and developer familiarity/compatibility), for both MediaWiki PHP and MediaWiki JS, is to have timers named as mediawiki_foo_bar_seconds, but with values that are milliseconds, since that's what the statsd protocol requires. This is convenient in JavaScript, where performance.now() uses milliseconds, and our existing stats in MW extension JS all use milliseconds today. Once consumed by prometheus/statsd_exporter, these are converted to seconds, and that's how they'll come out in Grafana.
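A minimal sketch of that convention in extension JavaScript. The metric name is hypothetical, and the `mw` object is stubbed here only so the snippet runs standalone; in MediaWiki code, ResourceLoader provides the real mw.track().

```javascript
// Stub for the real mw.track() provided by ResourceLoader; in MediaWiki
// code you would not define this yourself.
const tracked = [];
const mw = { track: ( topic, value ) => tracked.push( { topic, value } ) };

// Hypothetical timer; in real code startTime would be performance.now()
// captured when the feature starts, and `now` the value when it is ready.
const startTime = 0;
const now = 1234.56;

// Name the timer *_seconds, but pass the value in milliseconds: per the
// answer above, statsd_exporter converts it to seconds before Prometheus.
mw.track( 'stats.mediawiki_example_ready_seconds', now - startTime );
```

The value arriving at statsd_exporter is thus still in milliseconds, matching what the statsd protocol expects, while the name already carries the unit it will have in Grafana.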
> The statslib PHP interface has StatsFactory::withComponent […]. Should that pattern be continued like that with mw.track?
I think in practice, the larger the number of stats a feature has, the more likely it is that they might be better off modelled as labels on a single counter. This means you naturally repeat the full name less often. For example, mediawiki_MyExtension_foo_click_total and mediawiki_MyExtension_bar_click_total could instead become mediawiki_MyExtension_click_total{button="foo"} and {button="bar"}. The rules of thumb I tend to use:
1. Is there a visualisation where I would want to put them all in a single graph? If yes, using a label means you can do things like sum by (button) and automatically discover/include them all.
2. Do you need a dropdown menu in your Grafana dashboard?
3. Can a single interaction emit two of these? If yes, they might be better kept separate, since you generally don't want to increment the same metric twice in a row with different labels; you expect sum(mymetric) to produce a meaningful number.
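The label-based shape described above might look like this (extension and label names are hypothetical; `mw` is stubbed so the snippet runs standalone):

```javascript
// Stub for the real mw.track(); in MediaWiki code this is provided
// by ResourceLoader.
const tracked = [];
const mw = {
	track: ( topic, value, labels ) => tracked.push( { topic, value, labels } )
};

// Instead of mediawiki_MyExtension_foo_click_total and
// mediawiki_MyExtension_bar_click_total, one counter with a label:
mw.track( 'stats.mediawiki_MyExtension_click_total', 1, { button: 'foo' } );
mw.track( 'stats.mediawiki_MyExtension_click_total', 1, { button: 'bar' } );

// In Grafana, this enables queries such as:
//   sum by (button) (rate(mediawiki_MyExtension_click_total[5m]))
```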
The Statsd.js interface is exposed through mw.track, and designed with message passing as the stable interface. This leaves no place for (dependencies on) runtime abstractions like "withComponent". However, if you have a particularly large number of different stat names in your frontend code, and if you personally choose brevity in the local names over the greppability and transparency of the stat names themselves, then you may want to place an additional one-line wrapper function in your code to shorten these accordingly. This doesn't add dependencies, since it would be local to your own codebase. I personally would not, but this is a reasonable choice you can make within your team/code.
function increment( shortName ) {
	mw.track( `mediawiki_GrowthExperiments_${ shortName }_total` );
}
> https://www.mediawiki.org/wiki/Gadget_kitchen:_recording_metrics#Prometheus has the bullet-point * The "gadget name" part must not include, — I think that is missing a word or a clause […]
Thanks. Fixed!
Wait what? That needs to be a huge warning message in all related documentation places, because that is extraordinarily counter-intuitive and not at all obvious!
If I search for microtime( true ) in extensions, I see them mixing tracking seconds and milliseconds all over the place, so I'm not the only one.
Like
$this->statsd->timing( 'timing.createaccountAbuseFilter', microtime( true ) - $startTime );
or
// Report pure job execution timing
$jobDuration = microtime( true ) - $startTime;
self::stats()->timing( "jobexecutor.{$job->getType()}.exec", $jobDuration );
or in GrowthExperiments where that was also the only way timings were recorded pre-migration, we now have:
https://codesearch.wmcloud.org/search/?q=observe%5C%28+mic&files=.*%5C.php%24&excludeFiles=&repos=
In the extensions I have checked out locally, I also found:
$real = microtime( true ) - $stime;
// Stay backward compatible with the dashboard feeding on
// this data. NOTE: $real is in second with microsecond-level
// precision. This is reconciled on the grafana dashboard.
$this->statsdDataFactory->timing( 'centralauth.session.read', $real );
$this->statsFactory->withComponent( 'CentralAuth' )
	->getTiming( 'session_read_seconds' )
	->observe( $real * 1000 );
...
Could you elaborate what difference this makes and why it would actually be worth doing? After all, I need to manually pick the unit in Grafana anyway.
Also, the value I see in Grafana for the two places I just checked (one recording seconds and one recording milliseconds), is the value recorded in PHP, and not that value/1000. Now I'm confused: what does "these are converted to seconds, and that's how they'll come out in Grafana." mean exactly?
Thank you, those are useful heuristics to keep in mind💡
Let's take this discussion elsewhere since this isn't new or related to this task.
I'll respond once but please start a talk page thread or Wikitech-l thread to continue, if you like.
This is an issue with a few legacy Statsd->timing() calls. The new MediaWiki Statslib in PHP was developed in 2023, T240685: MediaWiki Prometheus support. See also https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1056249 where we discussed unit confusion, which was addressed in MediaWiki Statslib with explicitly named methods (observeSeconds, observeNanoseconds), clear method docs, as well as a start-stop utility that measures it right automatically. Feel free to message me directly with examples where the new Statslib is used incorrectly with timers.
In Graphite, there are no units. Everything is a number under a string name. In core and most extensions, legacy Statsd->timing() calls have consistently used milliseconds. But there are indeed some "sloppy" callers out there. The impact of this was relatively low, given that legacy statsd-graphite doesn't interpret the numbers (beyond blind averaging etc.). In Prometheus, there is a convention for units, and Prometheus requires interpretation of timing values in order to create histogram buckets. These histogram buckets have absolute thresholds (like 0.01s, 0.1s, 0.5s, etc.), so sending a bad unit will have significant consequences when using Prometheus.
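To make the consequence concrete, here is a sketch of cumulative histogram bucketing of the kind statsd_exporter performs (the thresholds below are illustrative, not WMF's actual configuration). The same 250ms timing lands in very different buckets depending on the unit it arrives in:

```javascript
// Illustrative histogram thresholds, in seconds.
const buckets = [ 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5 ];

// Return the smallest bucket threshold a value falls under, as a
// cumulative Prometheus histogram would; Infinity stands for +Inf.
function bucketFor( seconds ) {
	const le = buckets.find( ( threshold ) => seconds <= threshold );
	return le !== undefined ? le : Infinity;
}

bucketFor( 0.25 ); // 0.5: a 250ms timing, correctly recorded as 0.25s
bucketFor( 250 );  // Infinity: the same timing mis-sent as raw milliseconds
```

Every mis-unit observation piles into the +Inf bucket, so latency percentiles computed from the histogram become meaningless rather than merely mislabeled.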
Accidental and intentional misuse of timers is a long-standing problem. We attempted to help guide it all in a better direction in PHP StatsLib by introducing helper functions that did the math under the hood.
There are some patches proposing we enforce (by adding the suffix) or require (by warning on it missing) the correct unit suffix in T364240 so that devs get the correct unit in the Prometheus name (seconds), and need only be aware of supplying the unit (seconds, milliseconds, nanoseconds, etc) to the correct function. This work has stalled for now.
mw.track(), on the other hand, cannot help in this regard. It has to track many things and not just metrics that go to Prometheus.
Agreed. I'll start a topic about improving the documentation on https://wikitech.wikimedia.org/w/index.php?title=Talk:Performance.wikimedia.org/Runbook in a moment.
> examples where the new Statslib is used incorrectly with timers
All of GrowthExperiments. But I'm fixing those in T383208: StatsLib timings MUST be recorded as milliseconds.
Thanks for pointing me to that task. I've created a change that implements the suggestion from T359248#9994136 to instead have observeMilliseconds() as opposed to observe(), which would have prevented the issues that resulted in T383208. Happy to continue that conversation in T364240.