Provide an analysis of the queries to help decide whether the incrementally built dataset (mediawiki_revision_history) can be used to compute the metrics more frequently.
Description
| Status | Assigned | Task |
|---|---|---|
| Open | Ahoelzl | T418032 Weekly delivery cadence of core contributor metrics |
| Resolved | JAllemandou | T420434 Analyze SQL queries generating metrics |
Event Timeline
@JAllemandou after you look at this, let's talk about Unique Contributors per page for Attribution API. I think the considerations for that will be similar to this.
After some time reading and processing the queries used to generate the metrics requested on a weekly cadence, here are some findings:
- The metrics are defined monthly. So the job is not only to run them more frequently, but also to change their time granularity. This is important because a lot of the metrics are non-additive.
- All metrics involve a "historical" view of the data in one way or another (historical here means the state of the context at the time of the event: was that user a bot? Was that page in a content namespace?).
- I found what I would call bugs in the metric definitions, but I wish to discuss this with the people owning those definitions, as they could well be features!
- Graph of metrics dependencies: https://excalidraw.com/#json=EoUhZXIRFAAYN8mXir52K,p2yP66tUIKdxSeVi6q-Thw
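To make the "historical" point above concrete, here is a minimal Python sketch of an as-of lookup: given a log of state changes, it returns the state that was in effect at the time of an event. The user name, the dates, and the `BOT_FLAG_HISTORY` structure are all hypothetical and only illustrate the idea; they do not reflect how any existing dataset stores this information.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical change log: for each user, a date-sorted list of
# (effective_date, is_bot) state changes. Invented data for illustration.
BOT_FLAG_HISTORY = {
    "ExampleUser": [(date(2020, 1, 1), False), (date(2024, 3, 10), True)],
}


def was_bot_at(user: str, event_date: date) -> bool:
    """Return the bot flag that was in effect for `user` on `event_date`."""
    history = BOT_FLAG_HISTORY.get(user, [])
    effective_dates = [d for d, _ in history]
    # Index of the last state change that happened on or before the event.
    idx = bisect_right(effective_dates, event_date) - 1
    # Assumption: before the first recorded state, the user is not a bot.
    return history[idx][1] if idx >= 0 else False
```

An edit made by `ExampleUser` on 2024-03-09 would be counted as a human edit in this sketch, even though the same user is flagged as a bot today.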
My thoughts on the bigger picture plan after this analysis:
- I think we should define new metrics with a daily time granularity, not weekly. This is independent of whether we wish to release daily, weekly or even monthly.
- I still think that running Mediawiki-History more frequently is not a viable solution.
- Given the "historical" aspect needed, I think it'll be easier to compute the metrics directly from events instead of using mediawiki_revision_history, as this dataset, while incremental, doesn't carry historical state on its rows. We could add this historical data to the table, but that would be a big change.
- Knowing that events are imperfect, I think we should go lambda with events+MWH: use events for current-month data, and adjust (overwrite) the values using MWH when it's released.
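The lambda idea above could be sketched roughly as follows: metric values computed from events are used provisionally, and any value also available from MWH overwrites the event-derived one. The function name, the `(wiki, day)` key shape, and the source labels are assumptions made for this illustration, not an existing job or schema.

```python
from typing import Dict, Tuple

# Hypothetical key: (wiki, ISO day). The real grain would depend on the metric.
MetricKey = Tuple[str, str]


def merge_lambda(
    event_values: Dict[MetricKey, int],
    mwh_values: Dict[MetricKey, int],
) -> Dict[MetricKey, Tuple[int, str]]:
    """Merge provisional event-based values with MWH values.

    MWH wins wherever both sources have a value for the same key.
    Each result carries the name of the source it came from.
    """
    merged = {key: (value, "events") for key, value in event_values.items()}
    # Overwrite provisional values once the MWH snapshot covers them.
    merged.update({key: (value, "mwh") for key, value in mwh_values.items()})
    return merged
```

Keeping the source label next to each value makes it visible which figures are still provisional and which have been adjusted.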
Comments and discussion welcome!
I wonder if the edit_per_editor_per_page_daily table we created last fall for Global Editor Metrics would help with the editor focused metrics. It is backfilled from MWH and loaded daily from mediawiki.page_change.v1 events.
Non-additivity just means we'll add metrics (monthly plus weekly) and keep them separate and educate people not to aggregate them.
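As a small illustration of why such metrics should not be aggregated, the Python sketch below counts unique editors per day and over the whole period using invented rows. The `(day, editor)` row shape is an assumption for the example and is not the schema of edit_per_editor_per_page_daily.

```python
from collections import defaultdict
from datetime import date

# Invented sample rows: (day, editor). Editor "A" is active on two days.
EDITS = [
    (date(2025, 1, 6), "A"),
    (date(2025, 1, 6), "B"),
    (date(2025, 1, 7), "A"),
    (date(2025, 1, 8), "C"),
]


def daily_unique_editors(rows):
    """Return {day: number of distinct editors active that day}."""
    editors_by_day = defaultdict(set)
    for day, editor in rows:
        editors_by_day[day].add(editor)
    return {day: len(editors) for day, editors in editors_by_day.items()}


def period_unique_editors(rows, start, end):
    """Return the number of distinct editors active between start and end (inclusive)."""
    return len({editor for day, editor in rows if start <= day <= end})


daily = daily_unique_editors(EDITS)
summed = sum(daily.values())
period = period_unique_editors(EDITS, date(2025, 1, 6), date(2025, 1, 8))
print(summed, period)  # the two numbers differ because "A" is counted twice in the sum
```

Summing the daily counts overstates the period figure whenever an editor is active on more than one day, which is why the period-level metric has to be computed from the underlying rows.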
Curious about the bugs. And I wouldn't jump to solutioning yet.
The key question we need to address: are these metrics resilient to switching to an eventually consistent source dataset?
Hi @JAllemandou, the metrics have been developed in accordance with the new Contributor measurement strategy. You can see the definitions in the following links:
- Core and health metrics
- Indicator metrics

Please feel free to reach out if you have any questions.
Note: we don't use the constructive_editors metric in the dashboard. I haven't explicitly removed the model from the repo yet. I can do it when we refactor sometime in the near future.
Also, a heads up, we are expecting changes to the definitions of some of these metrics for the new FY.
> Hi @JAllemandou, the metrics have been developed in accordance with the new Contributor measurement strategy. You can see the definitions in the following links
Very useful doc, thanks a lot! I'll reach out for sure, we have things to talk about.
I've written a plan for Incremental-Mediawiki-History here: https://docs.google.com/document/d/1QZNCZhsBCxEKwogI8S1GFtELTPa0t9DYUBFoc3jI-oo/edit?tab=t.0
Calling this done.