There is a wide range of use cases which look like this:
- Run a slow query on the wiki's database periodically
- Store the results somewhere
- Reuse the results in the wiki's user interface
(Some existing examples: the user impact module in GrowthExperiments, and query pages. But I expect there would be many more if this kind of thing were easier to do; the sketch below makes the pattern concrete.)
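To illustrate, here is a minimal sketch of what such a job tends to look like when hand-rolled today. The query is loosely based on the MediaWiki revision/actor tables, and the `db` and `cache` helpers are hypothetical stand-ins, not real APIs:

```python
# Illustrative only: a hand-rolled version of the pattern. The query and
# the `db`/`cache` helper objects are hypothetical stand-ins.
TOP_EDITORS_QUERY = """
    SELECT actor_name, COUNT(*) AS edits
    FROM revision
    JOIN actor ON rev_actor = actor_id
    WHERE rev_timestamp >= :cutoff
    GROUP BY actor_name
    ORDER BY edits DESC
    LIMIT 100
"""

def refresh_top_editors(db, cache):
    # 1. Run the slow query periodically (e.g. from a cron job).
    rows = db.execute(TOP_EDITORS_QUERY, {"cutoff": "20240101000000"})
    # 2. Store the results somewhere.
    cache.set("top-editors", list(rows))
    # 3. The wiki's UI later reads "top-editors" from the store.
```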
Currently the Wikimedia infrastructure doesn't make this kind of thing easy. Normally it would be done with a data lake. We do have a data lake, but it has several limitations:
- Wiki DB data is imported only once a month; for most things displayed in a wiki's user interface, you'd want daily updates at a minimum.
- Scheduling queries requires writing nontrivial code in an environment that's unfamiliar to most MediaWiki developers (Hadoop, Spark, Airflow etc.); see the sketch after this list.
- A new service needs to be set up for every report, or a new API endpoint has to be fit into some existing service.
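For contrast, this is roughly what scheduling even a trivial daily report involves today. This is a minimal Airflow 2.x-style sketch; the DAG name and the body of `run_report()` are hypothetical placeholders:

```python
# A minimal Airflow-style sketch of scheduling one daily report; the DAG
# name and the body of run_report() are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_report():
    # In practice this would submit a Spark/Hive job against the data
    # lake and write the results to a queryable store.
    pass

with DAG(
    dag_id="daily_user_impact_report",  # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval=timedelta(days=1),
    catchup=False,
) as dag:
    PythonOperator(task_id="run_report", python_callable=run_report)
```

Even this skeleton assumes familiarity with DAGs, operators and the deployment setup around them, which is exactly the barrier for most MediaWiki developers.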
Ideally, we'd instead have a system where:
- Data from the wiki DB is synced to some data lake daily or more frequently (maybe even in real time).
- There is an easy, declarative way to provide an SQL query and some information about when it should be run, how long the results should be kept, etc.
- Similarly, there is an easy, declarative way to expose the results of the query via some API that's shared by all such reports (a hypothetical sketch of such a definition follows the list).
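To make the idea concrete, here is a purely hypothetical sketch of what such a declarative definition could look like. Every key name, the report name and the API path are invented for illustration; nothing like this exists yet:

```python
# Purely hypothetical: what a declarative report definition could look
# like. None of these keys, nor any registration mechanism, exist today.
USER_IMPACT_REPORT = {
    "name": "user-impact",
    # The SQL to run against the synced copy of the wiki DB.
    "query": """
        SELECT rev_actor, COUNT(*) AS edit_count
        FROM revision
        WHERE rev_timestamp >= :cutoff
        GROUP BY rev_actor
    """,
    "schedule": "daily",        # how often to re-run the query
    "retention_days": 30,       # how long to keep old result sets
    # Results become available via a shared endpoint, e.g.:
    # GET /api/reports/user-impact?format=json
    "api_name": "user-impact",
}
```

The point is that a developer would only write the SQL and a handful of metadata; syncing, scheduling, storage, expiry and the API endpoint would all be handled by shared infrastructure.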