The Wikidata Platform Superset dashboard is currently putting a lot of strain on Presto and consistently times out.
To reduce that pressure and improve read performance, we need to create an Airflow data pipeline that materializes all dashboard queries in batch (deriving them from the raw log table), stores the results in physical tables (with an optimized storage format), and lets us update our charting so that each chart query becomes a `SELECT * FROM <TABLE>`.
This was discussed internally [here](https://wikimedia.slack.com/archives/CSV483812/p1770978143693779).
As part of this task, the engineers in WDP need to be added to the analytics-wikidata-users group, so that they can create and work with the materialized tables in HDFS.
Needs clarification:
- Should we use [Iceberg](https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Iceberg) for exposing this data?
- Should we consider [dbt](https://wikimedia.slack.com/archives/CSV483812/p1771497550433019?thread_ts=1770978143.693779&cid=CSV483812) for managing queries?
AC
- [ ] A new namespace for Wikidata Platform is created in the Hive metastore.
- [ ] An Airflow DAG is provided that materializes the queries in Parquet format.
- [ ] Datasets are pruned according to Wikimedia's data retention policies.
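
To make the materialization step concrete, here is a minimal sketch of the SQL a batch task might issue per dashboard query. It assumes Hive / Spark SQL `CREATE TABLE ... STORED AS PARQUET AS SELECT` syntax. The class, the function names, and the example table name are hypothetical and only illustrate the idea. An actual DAG would wrap these statements in whatever operator the team settles on, and the Iceberg and dbt questions above could change the shape entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaterializedQuery:
    """One dashboard query to materialize (hypothetical structure)."""

    table: str   # target table, e.g. "<namespace>.<table>" (placeholder name)
    select: str  # the expensive SELECT the chart currently runs against the raw logs


def materialize_sql(query: MaterializedQuery) -> list[str]:
    """Return the statements a batch task would run to rebuild the table.

    Drop-and-recreate is a simplification: it is not atomic, so readers may
    briefly see a missing table while the job runs.
    """
    return [
        f"DROP TABLE IF EXISTS {query.table}",
        f"CREATE TABLE {query.table} STORED AS PARQUET AS {query.select}",
    ]


def chart_sql(query: MaterializedQuery) -> str:
    """Return the cheap query the chart would use after materialization."""
    return f"SELECT * FROM {query.table}"
```

With this split, the heavy aggregation runs once per batch, and each chart only scans a small Parquet table.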