Create an ETL job to aggregate Content Translation metrics for dashboarding
Open, Medium, Public

Description

The Language team is interested in tracking a number of advanced metrics for Content Translation, such as the number of translators, the new translator retention rate (T226170, T194641), and overall deletion rates of translations (T286636).

However, most of these metrics cannot be reliably calculated over a large time range (such as a year) within Superset's 180-second query timeout. In addition, the new Content Translation data stream (T231316) will be even larger, making it harder still to compute metrics such as the translation completion rate within the timeout.

In the absence of architectural improvements to Superset, such as asynchronous queries, the only way to dashboard these metrics in Superset is to create an ETL job that periodically calculates them and saves the results to the Data Lake.
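For illustration, a minimal sketch of what such a job could look like as a PySpark script. The source table (event.content_translation_event), its fields (dt, user_id), and the output table (cx_aggregates.monthly_translators) are hypothetical placeholders; the real schemas would need to be settled as part of this task.

```
# Sketch of a periodic aggregation job. All table and column names
# below are hypothetical placeholders, not actual Data Lake schemas.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = (
    SparkSession.builder
    .appName("cx_monthly_metrics")
    .enableHiveSupport()
    .getOrCreate()
)

events = spark.table("event.content_translation_event")

# Aggregate per calendar month so dashboard charts read a tiny table
# instead of scanning the full event stream within the query timeout.
monthly = (
    events
    .withColumn("month", F.date_format("dt", "yyyy-MM"))
    .groupBy("month")
    .agg(
        F.countDistinct("user_id").alias("translators"),
        F.count("*").alias("translation_events"),
    )
)

# Overwrite the output on each run so the job is idempotent.
monthly.write.mode("overwrite").saveAsTable("cx_aggregates.monthly_translators")
```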

The current tool for scheduling such jobs is Oozie, but work is planned to replace it with Airflow (T271429).
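If the job were written against Airflow instead, the scheduling layer could be as small as the sketch below. The DAG id, start date, and script path are placeholders, and the operator import path assumes Airflow 2.

```
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical DAG that runs the aggregation script once a month.
with DAG(
    dag_id="cx_metrics_aggregation",
    start_date=datetime(2021, 7, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    aggregate = BashOperator(
        task_id="aggregate_cx_metrics",
        bash_command="spark-submit /path/to/cx_monthly_metrics.py",
    )
```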

Task Requirements:

  • Create the ETL job
  • Update the Unified Experience Dashboard, including the translator data, to use this aggregate dataset so that all of its charts run more efficiently.

Event Timeline

ldelench_wmf moved this task from Triage to Current Quarter on the Product-Analytics board.

Update: An ETL job is also being discussed as an option for making some of the CX production table replicas available via Hive. This would make it possible to query that data from within Superset and get more frequent updates; currently, the Content Translation metrics dashboard relies on edit_hourly, which is updated only monthly.
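As a rough sketch of that option, Spark's JDBC reader could snapshot a replica table into Hive on a schedule. The connection URL, credentials, database, and table names below are all placeholders, and a production version might well use Sqoop or another managed ingestion path instead.

```
# Illustrative full-refresh sync of a production table replica into
# Hive. Connection details and table names are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cx_table_sync")
    .enableHiveSupport()
    .getOrCreate()
)

cx_translations = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://analytics-replica.example:3306/wikishared")
    .option("dbtable", "cx_translations")
    .option("user", "research")      # placeholder credentials
    .option("password", "***")
    .load()
)

# Overwrite the Hive copy on each run (simple snapshot, not incremental).
cx_translations.write.mode("overwrite").saveAsTable("cx_raw.cx_translations")
```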

Recommended next steps:

  • Get clarity on the specific metrics we'd like to have available within the private Superset option.
  • Set up a new data pipeline using Airflow to make the prioritized metrics available to query within Superset.