Create an ETL job to aggregate Content Translation metrics for dashboarding
Open, Medium, Public

Description

The Language team is interested in tracking a number of advanced metrics for Content Translation, such as the number of translators, the new translator retention rate (T226170, T194641), and overall deletion rates of translations (T286636).

However, most of these metrics cannot be reliably calculated over a large time range (such as a year) within Superset's 180-second query timeout. In addition, the new Content Translation data stream (T231316) will be even larger, making it harder still to compute metrics such as the translation completion rate within the timeout.

In the absence of architectural improvements to Superset, such as asynchronous queries, the only way to dashboard these metrics in Superset is to create an ETL job that periodically calculates them and saves the results to the Data Lake.
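For illustration, a minimal sketch of what such a job could look like as a PySpark script. The source table (event.content_translation_event), its fields (dt, user_id), and the output table (cx_aggregates.monthly_translators) are hypothetical placeholders; the real schemas would need to be settled as part of this task.

```
# Sketch of a periodic aggregation job. All table and column names
# below are hypothetical placeholders, not actual Data Lake schemas.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = (
    SparkSession.builder
    .appName("cx_monthly_metrics")
    .enableHiveSupport()
    .getOrCreate()
)

events = spark.table("event.content_translation_event")

# Aggregate per calendar month so dashboard charts read a tiny table
# instead of scanning the full event stream within the query timeout.
monthly = (
    events
    .withColumn("month", F.date_format("dt", "yyyy-MM"))
    .groupBy("month")
    .agg(
        F.countDistinct("user_id").alias("translators"),
        F.count("*").alias("translation_events"),
    )
)

# Overwrite the output on each run so the job is idempotent.
monthly.write.mode("overwrite").saveAsTable("cx_aggregates.monthly_translators")
```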

The current tool for scheduling such jobs is Oozie, but work is planned to replace it with Airflow (T271429).
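If the job were written against Airflow instead, the scheduling layer could be as small as the sketch below. The DAG id, start date, and script path are placeholders, and the operator import path assumes Airflow 2.

```
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical DAG that runs the aggregation script once a month.
with DAG(
    dag_id="cx_metrics_aggregation",
    start_date=datetime(2021, 7, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    aggregate = BashOperator(
        task_id="aggregate_cx_metrics",
        bash_command="spark-submit /path/to/cx_monthly_metrics.py",
    )
```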

Task Requirements:

  • Create the ETL job
  • Update the Unified Experience Dashboard, including the translator data, to use this aggregate dataset so that all of its charts run more efficiently.

Event Timeline

ldelench_wmf moved this task from Triage to Current Quarter on the Product-Analytics board.

Update: An ETL job is also being discussed as an option for making some of the CX production table replicas available via Hive. This would make it possible to query that data from within Superset and get more frequent updates; currently, the Content Translation metrics dashboard relies on edit_hourly, which is updated only monthly.
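As a rough sketch of that option, Spark's JDBC reader could snapshot a replica table into Hive on a schedule. The connection URL, credentials, database, and table names below are all placeholders, and a production version might well use Sqoop or another managed ingestion path instead.

```
# Illustrative full-refresh sync of a production table replica into
# Hive. Connection details and table names are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cx_table_sync")
    .enableHiveSupport()
    .getOrCreate()
)

cx_translations = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://analytics-replica.example:3306/wikishared")
    .option("dbtable", "cx_translations")
    .option("user", "research")      # placeholder credentials
    .option("password", "***")
    .load()
)

# Overwrite the Hive copy on each run (simple snapshot, not incremental).
cx_translations.write.mode("overwrite").saveAsTable("cx_raw.cx_translations")
```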

Recommended next steps:

  • Get clarity on the specific metrics we'd like to have available within the private Superset option.
  • Set up a new data pipeline using Airflow to make the prioritized metrics available to query within Superset.