The Wikidata Platform Superset dashboard is currently putting significant strain on Presto and consistently times out.
To reduce this load and improve read performance, we need to create an Airflow data pipeline that materializes all queries in batch (deriving them from the raw log table) and stores the results in physical tables (in an optimized storage format), then update our charting so each chart becomes a simple `SELECT * FROM <table>`.
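As a rough illustration of the materialization step, a batch task could render a `CREATE TABLE ... STORED AS PARQUET` statement for each dashboard query and run it against the metastore (e.g. via Spark SQL from an Airflow task). This is a minimal sketch only; the `wikidata_platform` namespace, table name, and source query below are placeholder assumptions, not the actual dashboard queries.

```python
# Hypothetical sketch: render the CTAS statement an Airflow task could submit
# (e.g. through Spark SQL) to materialize one dashboard query as a Parquet table.
# Namespace, table, and query text are assumptions for illustration.

def materialize_ctas(target_table: str, source_query: str) -> str:
    """Build a CREATE TABLE AS statement that stores results in Parquet."""
    return (
        f"CREATE TABLE IF NOT EXISTS {target_table} "
        f"STORED AS PARQUET AS {source_query}"
    )

sql = materialize_ctas(
    "wikidata_platform.edits_daily",
    "SELECT dt, COUNT(*) AS edits FROM event.raw_log GROUP BY dt",
)
```

With the table materialized, the corresponding Superset chart no longer runs the expensive aggregation and can read the precomputed table directly.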
This was discussed internally here.
As part of this task, the engineers in WDP need to be added to the analytics-wikidata-users group, so that they can create and work with the materialized tables in HDFS.
Needs clarification:
AC
- A new namespace for Wikidata Platform is created in the Hive metastore.
- An Airflow DAG is provided that materializes the queries in Parquet format.
- Datasets are pruned according to Wikimedia's data retention policies.
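The pruning criterion in the last item can be sketched as a helper that selects daily partitions older than a retention window; a cleanup task would then drop those partitions. The 90-day window and `YYYY-MM-DD` partition naming are assumptions here, not the actual retention policy.

```python
from datetime import date, timedelta

# Hypothetical sketch: select daily partitions that fall outside an assumed
# 90-day retention window. Partition names are assumed to be ISO dates.

def partitions_to_drop(existing: list[str], today: date, keep_days: int = 90) -> list[str]:
    """Return the partitions older than the retention cutoff, oldest first."""
    cutoff = today - timedelta(days=keep_days)
    return sorted(p for p in existing if date.fromisoformat(p) < cutoff)

# Example: with today = 2023-06-15 the cutoff is 2023-03-17,
# so only the January partition is selected for dropping.
stale = partitions_to_drop(["2023-01-01", "2023-06-01"], date(2023, 6, 15))
# → ["2023-01-01"]
```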
Access request: add lerickson and trueg to analytics-wikidata-users
- [] User has signed the L3 Acknowledgement of Wikimedia Server Access Responsibilities document.
- [] User has a valid NDA on file with WMF Legal. (All WMF staff/contractor hires are covered by an NDA; other users can be validated via the NDA tracking sheet.)
- [] User has provided the following: wikitech username, email address, and full reasoning for access (including what commands and/or tasks they expect to perform).
- [] User has provided a public SSH key. This SSH key pair should only be used for WMF cluster access and not shared with any other service (this includes WMCS access; no shared keys).
- [] Access request (or expansion) has sign-off from the WMF sponsor/manager (sponsor for volunteers, manager for WMF staff).
- [] Access request (or expansion) has sign-off from the group approver indicated by the approval field in data.yaml.