
Create an ETL job to aggregate Content Translation metrics for dashboarding
Open, Medium, Public


The Language team is interested in tracking a number of advanced metrics for Content Translation, such as the number of translators, the new translator retention rate (T226170, T194641), and overall deletion rates of translations (T286636).

However, most of these metrics cannot be reliably calculated over a large time range (such as a year) within Superset's 1-minute query timeout. In addition, the new Content Translation data stream (T231316) will be even larger, making it harder still to compute metrics such as the translation completion rate within that timeout.

In the absence of architectural improvements to Superset, such as asynchronous queries, the only way to dashboard these metrics in Superset is to create an ETL job that periodically calculates them and saves the results to the Data Lake.

The current tool for this is Oozie, but work is planned to replace Oozie with Airflow (T271429).
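
To illustrate the kind of pre-aggregation such a job would do, here is a minimal sketch in plain Python. It is not the actual job: the row fields (`translator`, `published`, `deleted`), the function name `aggregate_monthly`, and the monthly granularity are all assumptions made for the example, and a real job would read from and write to Data Lake tables rather than in-memory lists.

```python
from collections import defaultdict
from datetime import date


def aggregate_monthly(translations):
    """Collapse per-translation rows into one summary row per calendar month.

    Each input row is a dict with hypothetical keys:
    'translator' (str), 'published' (date), 'deleted' (bool).
    """
    translators = defaultdict(set)  # month -> distinct translators
    published = defaultdict(int)    # month -> number of translations
    deleted = defaultdict(int)      # month -> translations later deleted

    for row in translations:
        month = row["published"].strftime("%Y-%m")
        translators[month].add(row["translator"])
        published[month] += 1
        if row["deleted"]:
            deleted[month] += 1

    # One small row per month: cheap for a dashboard to query.
    return [
        {
            "month": month,
            "translators": len(translators[month]),
            "translations": published[month],
            "deletion_rate": deleted[month] / published[month],
        }
        for month in sorted(published)
    ]


# Made-up sample rows, for illustration only.
sample = [
    {"translator": "A", "published": date(2021, 1, 5), "deleted": False},
    {"translator": "B", "published": date(2021, 1, 9), "deleted": True},
    {"translator": "A", "published": date(2021, 2, 2), "deleted": False},
]
summary = aggregate_monthly(sample)
```

The dashboard would then query the small summary table instead of scanning the raw data, which is what keeps each query within the timeout.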