Set up pipelines to load CX extension tables into Data Lake, at wmf_product
Closed, Resolved | Public

Description

The following tables need to be loaded for the metrics currently planned in T366044:

  • cx_translators
  • cx_translations
  • cx_corpora

The idea is to have Spark jobs (similar to sqoop) that simply fetch each source table and load it as-is into the corresponding destination table, with any necessary aggregations handled by separate pipelines.
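
As a rough illustration of the fetch-and-load step, here is a minimal PySpark-style sketch. Only the three table names and the `wmf_product` database come from this task. The JDBC read, the connection parameters (`jdbc_url`, `user`, `password`), the helper names, the overwrite write mode, and the choice to reuse the source table names at the destination are all assumptions for illustration, not the actual job in airflow-dags.

```python
from dataclasses import dataclass

# Table names are taken from the task description.
CX_TABLES = ["cx_translators", "cx_translations", "cx_corpora"]


@dataclass(frozen=True)
class LoadSpec:
    """One source table and the Data Lake table it is copied to."""
    source_table: str
    destination_table: str


def build_load_specs(tables=CX_TABLES, destination_db="wmf_product"):
    # Assumption: destination tables keep the source table names.
    return [LoadSpec(t, f"{destination_db}.{t}") for t in tables]


def load_table(spark, spec, jdbc_url, user, password):
    """Fetch one table over JDBC and write it unchanged (no aggregation).

    `spark` is an existing SparkSession; the connection details are
    hypothetical parameters supplied by the caller.
    """
    df = (
        spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", spec.source_table)
        .option("user", user)
        .option("password", password)
        .load()
    )
    # Assumption: each run replaces the previous full snapshot.
    df.write.mode("overwrite").saveAsTable(spec.destination_table)


def load_all(spark, jdbc_url, user, password):
    """Run the plain copy for every CX table; aggregations live elsewhere."""
    for spec in build_load_specs():
        load_table(spark, spec, jdbc_url, user, password)
```

Keeping the copy step free of transformations means the aggregation pipelines can be changed or re-run without touching the source databases again.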

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2025-05-14T11:17:05Z] <kcvelaga@deploy1003> Started deploy [airflow-dags/analytics_product@22aa307]: T393561

Mentioned in SAL (#wikimedia-operations) [2025-05-14T11:17:59Z] <kcvelaga@deploy1003> Finished deploy [airflow-dags/analytics_product@22aa307]: T393561 (duration: 01m 10s)

KCVelaga_WMF changed the task status from Open to In Progress. May 19 2025, 9:02 PM
KCVelaga_WMF triaged this task as Medium priority.
KCVelaga_WMF moved this task from Incoming to Priority on the LPL Analytics board.
KCVelaga_WMF moved this task from Priority to In progress on the LPL Analytics board.
KCVelaga_WMF changed the status of subtask T393560: Data pipeline to load cx_corpora to Data Lake, at wmf_product from Open to In Progress.