Investigate prior WMDE analytics tables / assets
Closed, Resolved (Public)

Description

This task is to document the current infrastructure and assets of WMDE Analytics. The task text will be updated as the discussion below progresses.

Tables

  • presto_analytics_hive/goransm/wdcm_clients_wb_entity_usage
    • This table cannot be queried, as it is in goransm's private repo
    • General schema is:
      • eu_row_id: BIGINT
      • eu_entity_id: VARCHAR
      • eu_aspect: VARCHAR
      • eu_page_id: BIGINT
      • wiki_db: VARCHAR
  • A similar table exists at presto_analytics_iceberg/goransm/wdcm_clients_wb_entity_usage
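
Since the original table cannot be queried, a stand-in with the same shape may be useful for testing downstream queries. The following is a minimal sketch, assuming the column names and types listed above are complete; the target name `scratch.wb_entity_usage_copy` and the helper `create_table_sql` are hypothetical and only serve as illustration:

```python
# Column names and types as listed in the general schema above.
WB_ENTITY_USAGE_SCHEMA = {
    "eu_row_id": "BIGINT",
    "eu_entity_id": "VARCHAR",
    "eu_aspect": "VARCHAR",
    "eu_page_id": "BIGINT",
    "wiki_db": "VARCHAR",
}


def create_table_sql(table_name, schema):
    """Build a plain CREATE TABLE statement from a {column: type} mapping."""
    columns = ",\n".join(f"    {name} {sql_type}" for name, sql_type in schema.items())
    return f"CREATE TABLE {table_name} (\n{columns}\n)"


# Hypothetical target name; not an existing table.
ddl = create_table_sql("scratch.wb_entity_usage_copy", WB_ENTITY_USAGE_SCHEMA)
print(ddl)
```

The statement is only generated as text here; whether and where it could be run depends on the permissions of the account being used.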

Data dumps

Servers

The following are found on Cloud VPS:

  • wikidata-analytics-1
  • wiktionary-cognate-1

Code repos

  • WikidataAnalytics (includes Wikidata Concepts Monitor)
  • WiktionaryCognateDashboard

CRON jobs

The following is via https://phabricator.wikimedia.org/T334951#8980911 (see also WMDE Analytics Clients Schedule):

  • WDCM_Sqoop_Clients runs on stat1004 weekly - It doesn't run spark (but Sqoop)
  • 2021_WMDE_Mitmachen_Bereich_2021_Campaign runs on stat1007 daily - It doesn't run spark (but Hive)
  • WD_PageviewsPerType runs on stat1007 daily but has been failing since February 17th - It runs a spark job
  • WD_UsageCoverage runs on stat1008 daily
  • WD_languagesLandscape runs on stat1008 monthly (30th of the month)
  • Wiktionary_CognateDashboard runs on stat1008 daily
  • WDCM_EngineBiases runs on stat1008 weekly
  • Qurator_CuriousFacts runs on stat1008 monthly (10th of the month)
  • WMDE_BannerImpressions runs on stat1008 hourly
  • NewEditors_comprehensive_report runs on stat1008 daily
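
To make the inventory above easier to slice during prioritization, it can be written down as data. This is a rough sketch only: the job names, hosts and frequencies are copied from the list above, while the helper names `jobs_per_host` and `jobs_with_schedule` are made up for illustration.

```python
from collections import Counter

# (job name, host, frequency) as listed in the task description above.
CRON_JOBS = [
    ("WDCM_Sqoop_Clients", "stat1004", "weekly"),
    ("2021_WMDE_Mitmachen_Bereich_2021_Campaign", "stat1007", "daily"),
    ("WD_PageviewsPerType", "stat1007", "daily"),
    ("WD_UsageCoverage", "stat1008", "daily"),
    ("WD_languagesLandscape", "stat1008", "monthly"),
    ("Wiktionary_CognateDashboard", "stat1008", "daily"),
    ("WDCM_EngineBiases", "stat1008", "weekly"),
    ("Qurator_CuriousFacts", "stat1008", "monthly"),
    ("WMDE_BannerImpressions", "stat1008", "hourly"),
    ("NewEditors_comprehensive_report", "stat1008", "daily"),
]


def jobs_per_host(jobs):
    """Count how many jobs are scheduled on each host."""
    return dict(Counter(host for _, host, _ in jobs))


def jobs_with_schedule(jobs, schedule):
    """Return the names of all jobs that run at the given frequency."""
    return [name for name, _, freq in jobs if freq == schedule]


print(jobs_per_host(CRON_JOBS))
print(jobs_with_schedule(CRON_JOBS, "monthly"))
```

Grouping by host shows which stat machine carries most of the load, which may help when deciding which jobs to migrate or stop first.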

Similar issues

Event Timeline

AndrewTavis_WMDE updated the task description.
AndrewTavis_WMDE added a subscriber: JAllemandou.

Hi @AndrewTavis_WMDE,
I've done some investigation, and here is what I have: Goran has 11 CRON jobs running from various hosts on our system (1 on stat1004, 2 on stat1007, 7 on stat1008).

  • WDCM_Sqoop_Clients runs on stat1004 weekly - It doesn't run spark (but Sqoop)
  • 2021_WMDE_Mitmachen_Bereich_2021_Campaign runs on stat1007 daily - It doesn't run spark (but Hive)
  • WD_PageviewsPerType runs on stat1007 daily but has been failing since February 17th - It runs a spark job
  • WD_UsageCoverage runs on stat1008 daily - It runs a spark job
  • WD_languagesLandscape runs on stat1008 monthly (30th of the month) - It runs a spark job
  • Wiktionary_CognateDashboard runs on stat1008 daily - It doesn't run spark
  • WDCM_EngineBiases runs on stat1008 weekly - It runs a spark job
  • Qurator_CuriousFacts runs on stat1008 monthly (10th of the month) - It runs a spark job
  • WMDE_BannerImpressions runs on stat1008 hourly - It doesn't run spark (but Hive)
  • NewEditors_comprehensive_report runs on stat1008 daily - It runs a spark job

We need to meet and talk about your usage of the data generated by those scripts, and see what you wish us to try to make work versus stop.
I'm booking some time on your calendar next Monday :)

Cross posting from the issue mentioned above :)

AndrewTavis_WMDE updated the task description.

@Manuel, I moved this to "Needs product input", as I think we have mapped out everything that we could (within reason). Let me know how you'd like to prioritize things from here :)