
Provide an easy way for MediaWiki to fetch aggregate statistics from the data lake
Open, Needs Triage, Public

Description

There is a wide range of use cases which look like this:

  • Run a slow query on the wiki's database periodically
  • Store the results somewhere
  • Reuse the results in the wiki's user interface

(Some existing examples are the user impact module in GrowthExperiments and query pages. But I expect there would be a lot more if this kind of thing were easier to do.) A minimal sketch of the pattern is below.
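
To make the pattern concrete, here is a small, self-contained Python sketch of the three steps above (run a slow query periodically, store the result, reuse it cheaply). It uses in-memory sqlite3 databases as stand-ins for the wiki DB and the result store, and a made-up "edits per user" report; the real thing would run against a wiki replica and keep results somewhere shared.

```
# Toy version of the pattern: periodic slow query -> stored result -> cheap reads.
# The sqlite3 in-memory databases stand in for the wiki DB and the result store.
import sqlite3
import time

WIKI_DB = sqlite3.connect(":memory:")       # stand-in for the wiki's (replica) database
RESULT_STORE = sqlite3.connect(":memory:")  # stand-in for wherever results are kept

WIKI_DB.execute("CREATE TABLE revision (rev_user TEXT, rev_timestamp REAL)")
RESULT_STORE.execute(
    "CREATE TABLE report_cache (report TEXT PRIMARY KEY, generated REAL, payload TEXT)"
)

def run_report():
    """The 'slow query' step; would be run from a periodic job (cron / maintenance script)."""
    rows = WIKI_DB.execute(
        "SELECT rev_user, COUNT(*) FROM revision GROUP BY rev_user"
    ).fetchall()
    RESULT_STORE.execute(
        "REPLACE INTO report_cache (report, generated, payload) VALUES (?, ?, ?)",
        ("edits_per_user", time.time(), repr(rows)),
    )
    RESULT_STORE.commit()

def get_report():
    """The cheap read path that the wiki UI (or an API endpoint) would call."""
    return RESULT_STORE.execute(
        "SELECT generated, payload FROM report_cache WHERE report = 'edits_per_user'"
    ).fetchone()

if __name__ == "__main__":
    WIKI_DB.execute("INSERT INTO revision VALUES ('Alice', 1), ('Alice', 2), ('Bob', 3)")
    run_report()
    print(get_report())  # e.g. (timestamp, "[('Alice', 2), ('Bob', 1)]")
```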

Currently, the Wikimedia infrastructure doesn't make this kind of thing easy. Normally it would be done with a data lake. We do have a data lake, but it has several limitations:

  • Wiki DB data is imported once a month; for most things that are displayed on a wiki interface, you'd want, at a minimum, daily updates.
  • Scheduling queries requires writing nontrivial code in an environment that's unfamiliar to most MediaWiki developers (Hadoop, Spark, Airflow etc).
  • A new service needs to be set up for every report, or a new API endpoint has to be fit into some existing service.

Ideally, we'd instead have a system where

  • Data from the wiki DB is synced to some data lake daily or more frequently (maybe even in real time).
  • There is an easy, declarative way to provide an SQL query and some information on when it should be run, how long the results should be kept, etc. (a sketch of what this could look like is below).
  • Similarly, there is an easy, declarative way to expose the results of the query via some API that's shared by all such reports.
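
As an illustration of the "declarative" part, a report could be little more than a definition that the platform picks up, schedules, and exposes. The sketch below is purely hypothetical: the ReportDefinition class, its fields, the SQL dialect and schema, and the endpoint path are all invented for illustration, not an existing API.

```
# Hypothetical declarative report definition (Python dataclass used for illustration).
from dataclasses import dataclass

@dataclass
class ReportDefinition:
    name: str            # unique report id
    sql: str             # query to run against the synced wiki data
    schedule: str        # how often to refresh the results
    retention_days: int  # how long result snapshots are kept
    api_path: str        # where the shared API serves the latest result

EDITS_PER_USER = ReportDefinition(
    name="edits_per_user_30d",
    sql="""
        SELECT rev_user, COUNT(*) AS edits
        FROM revision
        WHERE rev_timestamp > NOW() - INTERVAL 30 DAY
        GROUP BY rev_user
    """,
    schedule="daily",
    retention_days=90,
    api_path="/reports/v0/edits_per_user_30d",
)
```

With something like this, the MediaWiki developer would only write the definition; scheduling, storage, retention and the shared API endpoint would be handled by the platform.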

Event Timeline

This is my attempt at a big-picture task on a topic I don't know much about, so apologies if I'm reinventing the wheel or missing why such a system would be difficult. It's mostly based on our experience with T316699: [EPIC] Growth: Positive reinforcement - Iteration 2, where we ended up building a "miniature data lake" in MediaWiki, and it was not a fun experience. (Although it's entirely possible there was a better way and we just didn't know about it - we should have talked to Data Engineering beforehand but didn't think of it.)

Love it, I think you describe the problem and situation well. The dumps work that we're doing right now, like this, is basically working towards a more frequent sync. The overall Event Platform tasks about this are:

  • T291120: MediaWiki Event Carried State Transfer - Problem Statement
  • T120242: Eventually Consistent MediaWiki State Change Events (as a subset of the above)

We spawned the dumps work as one instance of those thoughts. And I'm doing the QueryPages work you linked to as something that can be done for some queries even without the more frequent data. Hopefully getting something out pushes us further towards the big picture. More on the dumps work: T330296: Dumps 2.0 Phase I: Proof of concept for MediaWiki XML content dump via Event Platform, Iceberg and Spark