
Provide an easy way for MediaWiki to fetch aggregate data from the data lake
Open, Needs Triage, Public

Description

There is a wide range of use cases which look like this:

  • Run a slow query on the wiki's database periodically
  • Store the results somewhere
  • Reuse the results in the wiki's user interface

(Some existing examples are the user impact module in GrowthExperiments and query pages, but I expect there would be a lot more if this kind of thing were easier to do.)
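
To make the pattern concrete, here is a minimal sketch of it in Python; the table, query, and paths are hypothetical placeholders for illustration, not anything that exists today:

```
# Schematic sketch of the pattern above: a periodic job runs a slow aggregate
# query, caches the result, and the UI reads only the cached copy.
# Hypothetical tables and paths; Python used for brevity.
import sqlite3
import time

def run_periodic_report(wiki_db_path: str, cache_db_path: str) -> None:
    """The scheduled part: run the slow aggregate query and store the results."""
    wiki_db = sqlite3.connect(wiki_db_path)
    cache_db = sqlite3.connect(cache_db_path)
    # Placeholder for the actual slow query, e.g. per-user edit counts.
    rows = wiki_db.execute(
        "SELECT rev_actor, COUNT(*) FROM revision GROUP BY rev_actor"
    ).fetchall()
    cache_db.execute(
        "CREATE TABLE IF NOT EXISTS report (actor INTEGER, edits INTEGER, computed_at REAL)"
    )
    cache_db.execute("DELETE FROM report")
    cache_db.executemany(
        "INSERT INTO report VALUES (?, ?, ?)",
        [(actor, edits, time.time()) for actor, edits in rows],
    )
    cache_db.commit()

def read_report(cache_db_path: str, actor: int):
    """The UI part: read the precomputed results instead of running the slow query."""
    cache_db = sqlite3.connect(cache_db_path)
    return cache_db.execute(
        "SELECT edits, computed_at FROM report WHERE actor = ?", (actor,)
    ).fetchone()
```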

Currently the Wikimedia infrastructure doesn't make this kind of thing easy. Normally it's done with a data lake. We have a data lake, but it has several limitations:

  • Wiki DB data is imported once a month; for most things that are displayed on a wiki interface, you'd want, at a minimum, daily updates.
  • Scheduling queries requires writing nontrivial code in an environment that's unfamiliar to most MediaWiki developers (Hadoop, Spark, Airflow, etc.); see the sketch after this list for what that looks like.
  • A new service needs to be set up for every report, or a new API endpoint has to be fit into some existing service.
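
To give a feel for the second point: scheduling even one recurring query today means writing something like the Airflow DAG below. This is only a hedged sketch; the DAG id, schedule, and callable are invented, and a real job would also need the actual Spark/Presto submission plus deployment into the Airflow instance.

```
# Hypothetical Airflow DAG: roughly the boilerplate needed just to schedule
# one aggregate query in the data lake. All names here are invented.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def compute_user_impact():
    # In practice this would submit a Spark/Presto job against the lake's copy
    # of the wiki DB and write the output to a results table.
    pass

with DAG(
    dag_id="user_impact_daily",        # invented report name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                 # how often to recompute
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
    catchup=False,
):
    PythonOperator(
        task_id="compute_user_impact",
        python_callable=compute_user_impact,
    )
```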

Ideally, we'd instead have a system where

  • Data from the wiki DB is synced to some data lake daily or more frequently (maybe even in real time).
  • There is an easy, declarative way to provide an SQL query and some information on when it should be run, how long the results should be kept, etc.
  • Similarly, there is an easy, declarative way to expose the results of the query via some API that's shared by all such reports (a rough sketch of what this could look like follows this list).
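
None of this exists today, but to make the idea concrete, a report definition could be as small as the sketch below. Everything in it is invented for illustration: the ReportDefinition class, its fields, and the endpoint path are not real APIs.

```
# Purely illustrative: what a declarative report definition might look like.
# The class, field names, and endpoint are invented; none of this exists.
from dataclasses import dataclass

@dataclass
class ReportDefinition:
    name: str            # identifies the report and its API endpoint
    sql: str             # the aggregate query, run against the lake's copy of the wiki DB
    schedule: str        # how often to recompute the results
    retention_days: int  # how long to keep old result sets

user_impact = ReportDefinition(
    name="growthexperiments-user-impact",
    sql="""
        SELECT rev_actor AS actor, COUNT(*) AS edit_count
        FROM revision
        WHERE rev_timestamp >= DATE_SUB(CURRENT_DATE, 60)
        GROUP BY rev_actor
    """,
    schedule="daily",
    retention_days=30,
)

# A shared API layer would then serve the latest results at something like
#   GET /reports/v1/growthexperiments-user-impact?actor=12345
# with no per-report service or endpoint code.
```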

Use cases

Related Objects

Event Timeline

This is my attempt at a big-picture task on a topic I don't know much about, so apologies if I'm reinventing the wheel or missing why such a system would be difficult. It's mostly based on our experience with T316699: [EPIC] Growth: Positive reinforcement - Iteration 2, where we ended up building a "miniature data lake" in MediaWiki, and it was not a fun experience. (Although it's entirely possible there was a better way and we just didn't know about it - we should have talked to Data Engineering beforehand but didn't think of it.)

Love it; I think you describe the problem and situation well. The dumps work that we're doing right now, like this, is basically working toward a more frequent sync. The overall event platform tasks about this are:

  • T291120: MediaWiki Event Carried State Transfer - Problem Statement
  • T120242: Eventually Consistent MediaWiki State Change Events (as a subset)

We spawned the dumps work as one instance of those thoughts. And I'm doing the QueryPages work you linked to as something that can kind of be done for some queries without the more frequent data. Hopefully getting something out pushes us more towards the big picture. More on the dumps work: T330296: Dumps 2.0 Phase I: Proof of concept for MediaWiki XML content dump via Event Platform, Iceberg and Spark

BTW I gave a state of the data platform talk at the WMF Data Strategy convening in November 2024.

There's lots of detail, but if you scroll to the end where it talks about risks, this exact problem is described.

Now what do we do about it?

See https://wikitech.wikimedia.org/wiki/MediaWiki_External_Data_Problem for more use cases.

Thanks for filing; added this to our essential work catalog.

Another possible use case:

Ideally we can eventually show aggregate stats for Mentors. This has always been part of the long-term plans for Mentorship: Mentorship impact. A community member just suggested some statistics that would be useful for Mentors in this discussion.

I think the link above should have been: https://wikitech.wikimedia.org/wiki/MediaWiki_Externalized_Data_Problem

Another couple of use cases, based on some "fun" I have had with the new Iceberg revision datasets: collecting metrics on the state of Wikimedia Commons depicts statements and on instance types on Wikidata, and feeding that data back into MediaWiki to display static pie charts and line charts showing growth in those areas over time.
This would, for example, allow wiki projects to see the impact they are having when adding depicts statements en masse.
It would also allow Wikidata to monitor the content split, particularly with bibliographic items.
One could argue some of this is possible via the query service; however, the data lake is likely a better fit.
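
To sketch what the Commons depicts metric could look like on the lake side (the table and column names below are invented placeholders, not the actual Iceberg schema):

```
# Rough sketch of the Commons depicts metric over a hypothetical Iceberg table
# of structured-data statements; table and column names are invented.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("commons_depicts_metrics").getOrCreate()

depicts_per_month = (
    spark.read.table("structured_data.commons_statements")   # hypothetical table
    .where(F.col("property_id") == "P180")                    # P180 = "depicts"
    .groupBy(F.date_format("statement_added_at", "yyyy-MM").alias("month"))
    .agg(F.count("*").alias("depicts_statements"))
    .orderBy("month")
)

# This small time series is what MediaWiki would fetch back to render the
# static growth charts on-wiki.
depicts_per_month.show()
```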

Nice, @Addshore! Could you summarize that in one or two brief items and add them to the task description? Thank you!

Ottomata renamed this task from Provide an easy way for MediaWiki to fetch aggregate statistics from the data lake to Provide an easy way for MediaWiki to fetch aggregate data from the data lake. Aug 26 2025, 3:55 PM