This task is to document the current infrastructure and assets of WMDE Analytics. The task text will be updated as the discussion below progresses.
Tables
- presto_analytics_hive/goransm/wdcm_clients_wb_entity_usage
- This table cannot be queried by others because it is in goransm's private repo
- General schema is:
  - eu_row_id: BIGINT
  - eu_entity_id: VARCHAR
  - eu_aspect: VARCHAR
  - eu_page_id: BIGINT
  - wiki_db: VARCHAR
- A similar table exists at presto_analytics_iceberg/goransm/wdcm_clients_wb_entity_usage
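For anyone writing code against this table without being able to query it, the schema listed above can be mirrored as a typed record. This is only a sketch: the class name `EntityUsageRow`, the mapping of BIGINT to `int` and VARCHAR to `str`, and the sample values are assumptions made for illustration, not something taken from the table itself.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EntityUsageRow:
    """One row of wdcm_clients_wb_entity_usage, per the schema listed above."""
    eu_row_id: int      # BIGINT
    eu_entity_id: str   # VARCHAR
    eu_aspect: str      # VARCHAR
    eu_page_id: int     # BIGINT
    wiki_db: str        # VARCHAR


# Hypothetical values for illustration only; not read from the real table.
example = EntityUsageRow(
    eu_row_id=1,
    eu_entity_id="Q42",
    eu_aspect="S",
    eu_page_id=12345,
    wiki_db="enwiki",
)

# Column names in schema order, e.g. for building a SELECT list.
column_names = [f.name for f in fields(EntityUsageRow)]
```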
Data dumps
- https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/Wikidata/
- https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/Wiktionary/
- https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/qurator/
- https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/wdcm/
Servers
The following are found on Cloud VPS:
- wikidata-analytics-1
- wiktionary-cognate-1
Code repos
CRON jobs
The following comes from https://phabricator.wikimedia.org/T334951#8980911 (see also WMDE Analytics Clients Schedule):
- WDCM_Sqoop_Clients runs on stat1004 weekly. It runs Sqoop, not Spark.
- 2021_WMDE_Mitmachen_Bereich_2021_Campaign runs on stat1007 daily. It runs Hive, not Spark.
- WD_PageviewsPerType runs on stat1007 daily but has been failing since February 17th. It runs a Spark job.
- WD_UsageCoverage runs on stat1008 daily
- WD_languagesLandscape runs on stat1008 monthly (30th of the month)
- Wiktionary_CognateDashboard runs on stat1008 daily
- WDCM_EngineBiases runs on stat1008 weekly
- Qurator_CuriousFacts runs on stat1008 monthly (10th of the month)
- WMDE_BannerImpressions runs on stat1008 hourly
- NewEditors_comprehensive_report runs on stat1008 daily
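To make the list above easier to audit per host, it could be kept as data and grouped by machine. This is a minimal sketch: the job names, hosts, and frequencies are copied from the list above, while the structure `CRON_JOBS` and the helper `jobs_by_host` are hypothetical names introduced here.

```python
from collections import defaultdict

# (job name, host, frequency), copied from the list above.
CRON_JOBS = [
    ("WDCM_Sqoop_Clients", "stat1004", "weekly"),
    ("2021_WMDE_Mitmachen_Bereich_2021_Campaign", "stat1007", "daily"),
    ("WD_PageviewsPerType", "stat1007", "daily"),
    ("WD_UsageCoverage", "stat1008", "daily"),
    ("WD_languagesLandscape", "stat1008", "monthly"),
    ("Wiktionary_CognateDashboard", "stat1008", "daily"),
    ("WDCM_EngineBiases", "stat1008", "weekly"),
    ("Qurator_CuriousFacts", "stat1008", "monthly"),
    ("WMDE_BannerImpressions", "stat1008", "hourly"),
    ("NewEditors_comprehensive_report", "stat1008", "daily"),
]


def jobs_by_host(jobs):
    """Group job names by the host they run on, preserving list order."""
    grouped = defaultdict(list)
    for name, host, _frequency in jobs:
        grouped[host].append(name)
    return dict(grouped)


if __name__ == "__main__":
    for host, names in sorted(jobs_by_host(CRON_JOBS).items()):
        print(f"{host}: {len(names)} job(s)")
        for name in names:
            print(f"  - {name}")
```

Grouping this way shows at a glance which jobs would be affected if a given stat host were decommissioned or migrated.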