
Analyze SQL queries generating metrics
Closed, Resolved (Public)

Description

Provide an analysis of the queries to help decide whether the incrementally built dataset (mediawiki_revision_history) can be used to compute the metrics more frequently.

Event Timeline

@JAllemandou after you look at this, let's talk about Unique Contributors per page for Attribution API. I think the considerations for that will be similar to this.

After spending some time reading and working through the queries used to generate the metrics requested at a weekly cadence, here are some findings:

  • The metrics are defined monthly. So the job is not only to run them more frequently, but to change their time granularity to something different. This is important because a lot of the metrics are non-additive.
  • All metrics involve a "historical" view of the data in one way or another (historical here means the state of the context at the time of the event: was that user a bot? Was that page in a content namespace?).
  • I found what I would call bugs in the metrics definitions, but I wish to discuss this with the people owning those definitions, as they could well be features!
  • Graph of metrics dependencies: https://excalidraw.com/#json=EoUhZXIRFAAYN8mXir52K,p2yP66tUIKdxSeVi6q-Thw
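
To illustrate the non-additivity point above, here is a minimal Python sketch. The edit events and editor names are made up for illustration, and "distinct editors" stands in for any distinct-count style metric; it is not one of the actual metric queries.

```python
from datetime import date

# Hypothetical edit events as (day, editor) pairs, made up for illustration.
edits = [
    (date(2024, 1, 1), "alice"),
    (date(2024, 1, 1), "bob"),
    (date(2024, 1, 2), "alice"),
    (date(2024, 1, 3), "carol"),
]


def daily_distinct_editors(events):
    """Count distinct editors separately for each day."""
    by_day = {}
    for day, editor in events:
        by_day.setdefault(day, set()).add(editor)
    return {day: len(editors) for day, editors in by_day.items()}


def distinct_editors(events):
    """Count distinct editors over the whole period."""
    return len({editor for _, editor in events})


daily = daily_distinct_editors(edits)
sum_of_daily = sum(daily.values())      # alice is counted on two different days
period_total = distinct_editors(edits)  # alice is counted once

print(sum_of_daily, period_total)
```

Summing the daily values overcounts any editor active on more than one day, which is why a metric like this has to be recomputed at each granularity rather than rolled up from a finer one.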

My thoughts on the bigger picture plan after this analysis:

  • I think we should define new metrics with daily time frequency, not weekly. This doesn't change whether we wish to release daily, weekly or even monthly.
  • I still think that running Mediawiki-History more frequently is not a viable solution.
  • Given the "historical" aspect needed, I think it'll be easier to compute the metrics directly from events instead of using mediawiki_revision_history, as this dataset, while incremental, doesn't carry historical state on its rows. We could add this historical data to the table, but that would be a big change. Knowing that events are imperfect, I think we should go lambda with events+MWH: use events for current-month data, and adjust (overwrite) the values using MWH when it's released.
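
A minimal sketch of the events+MWH lambda idea described above, in Python. The dictionary layout, metric names and values are hypothetical; the only point is that event-based values serve the period MWH has not covered yet, and are overwritten once an MWH-based value exists.

```python
def merge_lambda(event_values, mwh_values):
    """Serve event-based values, overwritten by MWH values where available."""
    merged = dict(event_values)  # start from the event-based values
    merged.update(mwh_values)    # MWH wins for every key it covers
    return merged


# Hypothetical metric values keyed by (month, metric name).
event_values = {
    ("2024-01", "edits"): 1040,  # computed from events
    ("2024-02", "edits"): 980,   # current month, events only
}
mwh_values = {
    ("2024-01", "edits"): 1052,  # MWH snapshot released for January
}

served = merge_lambda(event_values, mwh_values)
print(served)
```

With this shape, consumers always read a single merged view, and the correction from MWH is a plain overwrite rather than a separate reconciliation step.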

Comments and discussion welcome!

I wonder if the edit_per_editor_per_page_daily table we created last fall for Global Editor Metrics would help with the editor-focused metrics. It is backfilled from MWH and loaded daily from mediawiki.page_change.v1 events.

Non-additivity just means we'll add metrics (monthly plus weekly) and keep them separate and educate people not to aggregate them.
Curious about the bugs. And I wouldn't jump to solutioning yet.
The key question we need to address: are these metrics resilient to switching to an eventual consistency source data set?

Hi @JAllemandou, the metrics have been developed in accordance with the new Contributor measurement strategy. You can see the definitions in the following links:
Core and health metrics
Indicator metrics
Please feel free to reach out if you have any questions.

Note: we don't use the constructive_editors metric in the dashboard. I haven't explicitly removed the model from the repo yet. I can do it when we refactor sometime in the near future.
Also, a heads up, we are expecting changes to the definitions of some of these metrics for the new FY.

Very useful doc, thanks a lot! I'll reach out for sure, we have things to talk about :