There is a wide range of use cases which look like this:
- Run a slow query on the wiki's database periodically
- Store the results somewhere
- Reuse the results in the wiki's user interface
(Existing examples include the user impact module in GrowthExperiments and query pages. I expect there would be many more if this kind of thing were easier to do.)
Currently the Wikimedia infrastructure doesn't make this kind of thing easy. The usual tool for it is a data lake. We have one, but it has several limitations:
- Wiki DB data is imported once a month; for most things that are displayed on a wiki interface, you'd want, at a minimum, daily updates.
- Scheduling queries requires writing nontrivial code in an environment that's unfamiliar to most MediaWiki developers (Hadoop, Spark, Airflow, etc.).
- A new service needs to be set up for every report, or a new API endpoint has to be fit into some existing service.
Ideally, we'd instead have a system where:
- Data from the wiki DB is synced to some data lake daily or more frequently (maybe even in real time).
- There is an easy, declarative way to provide an SQL query and some information on when it should be run, how long the results should be kept, etc.
- Similarly, there is an easy, declarative way to expose the results of the query via some API that's shared by all such reports.
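To make the idea of a declarative report definition more concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `ReportDefinition` class, its field names, the `/reports/v1/` path prefix and the example query are assumptions made for illustration, not an existing Wikimedia API or an agreed design. It only shows the kind of information a definition would need to carry (the query, how often to run it, how long to keep results) and how a shared scheduler and API could derive their behavior from it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class ReportDefinition:
    """Hypothetical declarative description of one periodic report."""

    name: str                    # identifier, reused in the shared API path
    sql: str                     # the slow query to run against the synced data
    run_every: timedelta         # how often the query should be re-run
    keep_results_for: timedelta  # how long a stored result set is kept

    def is_due(self, last_run: Optional[datetime], now: datetime) -> bool:
        """True if the query has never run or the run interval has elapsed."""
        return last_run is None or now - last_run >= self.run_every

    def is_expired(self, computed_at: datetime, now: datetime) -> bool:
        """True if a stored result is older than the retention period."""
        return now - computed_at > self.keep_results_for

    def api_path(self) -> str:
        """Path under an assumed shared endpoint that serves every report."""
        return f"/reports/v1/{self.name}"


# Illustrative definition: daily per-actor edit counts, kept for one week.
edits_per_user = ReportDefinition(
    name="edits-per-user",
    sql="SELECT rev_actor, COUNT(*) AS edits FROM revision GROUP BY rev_actor",
    run_every=timedelta(days=1),
    keep_results_for=timedelta(days=7),
)
```

In a design like this, a developer would only write the definition; the scheduler, result storage and API endpoint would be generic and shared by all reports. The same information could equally live in a YAML or JSON file rather than code.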
Use cases
- T388455: [Spike] Full-year editing stats for Year in Review
- T379119: [Spike] Fetch Topics for Articles in History on iOS app
- T345865: Impact module: Add "Reference added" count
- T378035: [EPIC] Collaborative contributions MVP
- WE1.2 (FY25/26): Increase in the number of collaborations, specifically: "Set up the basic infrastructure to track collaborative contributions, so we can provide innovative new ways to recognize and reward contributions in the future"
- Mentorship: showing collaborative mentee impact
- Track the evolution of Wikibase property usage and display that usage in the projects themselves (per property) Example1 Example2 (including the makeup / split of Wikidata itself)