
Provide an easy way for MediaWiki to fetch aggregate data from the data lake
Open, Needs Triage, Public

Description

There is a wide range of use cases which look like this:

  • Run a slow query on the wiki's database periodically
  • Store the results somewhere
  • Reuse the results in the wiki's user interface

(Some existing examples are the user impact module in GrowthExperiments and query pages, but I expect there would be a lot more if this kind of thing were easier to do.)
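
To make the pattern concrete, here is a minimal sketch of it in Python; the table, query, and paths are hypothetical placeholders for illustration, not anything that exists today:

```
# Schematic sketch of the pattern above: a periodic job runs a slow aggregate
# query, caches the result, and the UI reads only the cached copy.
# Hypothetical tables and paths; Python used for brevity.
import sqlite3
import time

def run_periodic_report(wiki_db_path: str, cache_db_path: str) -> None:
    """The scheduled part: run the slow aggregate query and store the results."""
    wiki_db = sqlite3.connect(wiki_db_path)
    cache_db = sqlite3.connect(cache_db_path)
    # Placeholder for the actual slow query, e.g. per-user edit counts.
    rows = wiki_db.execute(
        "SELECT rev_actor, COUNT(*) FROM revision GROUP BY rev_actor"
    ).fetchall()
    cache_db.execute(
        "CREATE TABLE IF NOT EXISTS report (actor INTEGER, edits INTEGER, computed_at REAL)"
    )
    cache_db.execute("DELETE FROM report")
    cache_db.executemany(
        "INSERT INTO report VALUES (?, ?, ?)",
        [(actor, edits, time.time()) for actor, edits in rows],
    )
    cache_db.commit()

def read_report(cache_db_path: str, actor: int):
    """The UI part: read the precomputed results instead of running the slow query."""
    cache_db = sqlite3.connect(cache_db_path)
    return cache_db.execute(
        "SELECT edits, computed_at FROM report WHERE actor = ?", (actor,)
    ).fetchone()
```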

Currently the Wikimedia infrastructure doesn't make this kind of thing easy. Normally it's done with a data lake. We have a data lake, but it has several limitations:

  • Wiki DB data is imported once a month; for most things that are displayed on a wiki interface, you'd want, at a minimum, daily updates.
  • Scheduling queries requires writing nontrivial code in an environment that's unfamiliar to most MediaWiki developers (Hadoop, Spark, Airflow, etc.); see the sketch after this list for what that looks like.
  • A new service needs to be set up for every report, or a new API endpoint has to be fit into some existing service.
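
To give a feel for the second point: scheduling even one recurring query today means writing something like the Airflow DAG below. This is only a hedged sketch; the DAG id, schedule, and callable are invented, and a real job would also need the actual Spark/Presto submission plus deployment into the Airflow instance.

```
# Hypothetical Airflow DAG: roughly the boilerplate needed just to schedule
# one aggregate query in the data lake. All names here are invented.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def compute_user_impact():
    # In practice this would submit a Spark/Presto job against the lake's copy
    # of the wiki DB and write the output to a results table.
    pass

with DAG(
    dag_id="user_impact_daily",        # invented report name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                 # how often to recompute
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
    catchup=False,
):
    PythonOperator(
        task_id="compute_user_impact",
        python_callable=compute_user_impact,
    )
```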

Ideally, we'd instead have a system where

  • Data from the wiki DB is synced to some data lake daily or more frequently (maybe even in real time).
  • There is an easy, declarative way to provide an SQL query and some information on when it should be run, how long the results should be kept, etc.
  • Similarly, there is an easy, declarative way to expose the results of the query via some API that's shared by all such reports (a rough sketch of what this could look like follows this list).
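
None of this exists today, but to make the idea concrete, a report definition could be as small as the sketch below. Everything in it is invented for illustration: the ReportDefinition class, its fields, and the endpoint path are not real APIs.

```
# Purely illustrative: what a declarative report definition might look like.
# The class, field names, and endpoint are invented; none of this exists.
from dataclasses import dataclass

@dataclass
class ReportDefinition:
    name: str            # identifies the report and its API endpoint
    sql: str             # the aggregate query, run against the lake's copy of the wiki DB
    schedule: str        # how often to recompute the results
    retention_days: int  # how long to keep old result sets

user_impact = ReportDefinition(
    name="growthexperiments-user-impact",
    sql="""
        SELECT rev_actor AS actor, COUNT(*) AS edit_count
        FROM revision
        WHERE rev_timestamp >= DATE_SUB(CURRENT_DATE, 60)
        GROUP BY rev_actor
    """,
    schedule="daily",
    retention_days=30,
)

# A shared API layer would then serve the latest results at something like
#   GET /reports/v1/growthexperiments-user-impact?actor=12345
# with no per-report service or endpoint code.
```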

Use cases

Related Objects

Event Timeline

This is my attempt at a big-picture task on a topic I don't know much about, so apologies if I'm reinventing the wheel or missing why such a system would be difficult. It's mostly based on our experience with T316699: [EPIC] Growth: Positive reinforcement - Iteration 2, where we ended up building a "miniature data lake" in MediaWiki, and it was not a fun experience. (Although it's entirely possible there was a better way and we just didn't know about it - we should have talked to Data Engineering beforehand but didn't think of it.)

Love it; I think you describe the problem and situation well. The dumps work that we're doing right now, like this, is basically working toward a more frequent sync. The overall event platform tasks about this are:

  • T291120: MediaWiki Event Carried State Transfer - Problem Statement
  • T120242: Eventually Consistent MediaWiki State Change Events (as a subset)

We spawned the dumps work as one instance of those thoughts. And I'm doing the QueryPages work you linked to as something that can kind of be done for some queries without the more frequent data. Hopefully getting something out pushes us more towards the big picture. More on the dumps work: T330296: Dumps 2.0 Phase I: Proof of concept for MediaWiki XML content dump via Event Platform, Iceberg and Spark

BTW I gave a state of the data platform talk at the WMF Data Strategy convening in November 2024.

There's lots of detail, but if you scroll to the end where it talks about risks, this exact problem is described.

Now what do we do about it?

See https://wikitech.wikimedia.org/wiki/MediaWiki_External_Data_Problem for more use cases.

Thanks for filing; added this to our essential work catalog.

Another possible use case:

Ideally we can eventually show aggregate stats for Mentors. This has always been part of the long-term plans for Mentorship: Mentorship impact. A community member just suggested some statistics that would be useful for Mentors in this discussion.

I think the link above should have been: https://wikitech.wikimedia.org/wiki/MediaWiki_Externalized_Data_Problem

Another couple of use cases, based on some "fun" I have had with the new Iceberg revision datasets: collecting metrics on the state of Wikimedia Commons depicts statements and on instance types on Wikidata, and feeding that data back into MediaWiki to display static pie charts and line charts showing growth in those areas over time.
This would, for example, allow wiki projects to see the impact they are having when adding depicts statements en masse.
It would also allow Wikidata to monitor the content split, particularly with bibliographic items.
One could argue some of this is possible via the query service; however, the data lake is likely a better fit.
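
To sketch what the Commons depicts metric could look like on the lake side (the table and column names below are invented placeholders, not the actual Iceberg schema):

```
# Rough sketch of the Commons depicts metric over a hypothetical Iceberg table
# of structured-data statements; table and column names are invented.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("commons_depicts_metrics").getOrCreate()

depicts_per_month = (
    spark.read.table("structured_data.commons_statements")   # hypothetical table
    .where(F.col("property_id") == "P180")                    # P180 = "depicts"
    .groupBy(F.date_format("statement_added_at", "yyyy-MM").alias("month"))
    .agg(F.count("*").alias("depicts_statements"))
    .orderBy("month")
)

# This small time series is what MediaWiki would fetch back to render the
# static growth charts on-wiki.
depicts_per_month.show()
```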

Nice, @Addshore! Could you summarize that in one or two brief items and add them to the task description? Thank you!

Ottomata renamed this task from Provide an easy way for MediaWiki to fetch aggregate statistics from the data lake to Provide an easy way for MediaWiki to fetch aggregate data from the data lake. Aug 26 2025, 3:55 PM