(this is a follow up to the April 2026 #wikimedia-cloud IRC discussion )
Looking at T215858: Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema I believe we will never be able to create a new "grand schema".
Basically I believe that
We figure out who needs what, make good schemas,
can never be done properly and in finite time.
But we should be able to have an environment where requests like T414199: Add datetime versions of timestamp fields to Wikireplica databases could be implemented gradually and step by step.
Therefore what I would propose is to have a possibility to create custom views which can be based on the replica database. Basically some writable database that has easy access to the replica data. Some views could be materialized if needed.
This is explicitly for normalized data (unlike T204950: Public Edit Data Lake: Mediawiki history snapshots available in SQL data store to cloud (labs) users) although some inspirations can be drawn from the data lake efforts definitely.
Data stewarding?
Managing those views could be a challenge, for example it would be great that we de-duplicate certain views into one concept, so we might need some data stewards working on that. Even then, I believe a framework that allows the creativity of tool authors to grow is better than the current state of "this is technically not possible right now".
We might want to try to replicate some data into our own non-WMF infra and experiment with this and we might report our progress here if desired.
Example use cases
Some use cases right off my table, in no particular order
- https://quarry.wmcloud.org/query/102749 - we are right now starting to add more CTE's to this query. Some of those could become reusable views.
- We might try to integrate another report from this Java code - in the end, there is no reason why this and other bots could not re-use a view that holds our "business definition" of the data we need (in this case some kind of activity streak).
- I have also another need to work much closer with the geographical coordinate data from the Wikidata coordinate location property (P626) which would probably require that those data are replicated into PostgreSQL and PostGIS, ideally integrated with some OpenStreetMap data for further analytics.
- If I get this into PostgreSQL due to the above (PostGIS), I might as well use built in notifications instead of Kafka.
- Some tools not covered by Abandoned tool policy such as outreachdashboard and petscan could contribute to the writable schema as well as long as their public data are concerned (via replicating to the new DB or something).
(not all of the above need to fit into this task of course)
If we find more and more tools to agree on certain common views, this could be a major step forward.
I wouldn't focus exclusively on online analytics processing here if we can, just opening up potential analytical use cases.