The goal of this task is to create a single unified dashboard to monitor metrics of interest for Content translation (CX) workflows. The final dashboard should meet the following requirements:
- A single unified dashboard going forward for all metrics of interest related to CX.
- Public: accessible to both the Language team and the community
- Anything that shouldn't be public can be monitored privately, but most of the data is public.
- The data should be updated in near real time (ideally refreshed every hour)
Background
Currently, the Language team uses multiple dashboards and reports to monitor CX usage and to share it with the community. They are:
- Content translation key metrics (private dashboard)
- Special:ContentTranslationStats on each wiki (public dashboard)
- CX deletion stats (public report; quarterly)
- Machine translation service usage analysis (public report; ad-hoc)
- CX abuse filter events (private dashboard)
- CX user funnel metrics (public report; ad-hoc)
This creates a very fragmented view of CX usage, an increased maintenance burden, and issues like T325790: Special:ContentTranslationStats is slow and getting crowded.
The first version of the dashboard should at least unify the metrics from the CX key metrics dashboard, Special:ContentTranslationStats, and the deletion stats.
Suggested solution
- To use the public instance of Superset, maintained by the Wikimedia Cloud Services team, which is accessible to anyone with a Wikimedia SUL account.
- To be able to use this, we are dependent on T348407: Allow Quarry to query ToolsDB public databases (as of 24 May 2024, the task is in progress).
- This is because, even if the CX tables are added to the Wiki Replicas (T196020), each wiki_db will have its own CX tables, and, similar to Quarry, Superset's SQL Lab and charts can only access one database at a time.
- If Superset can access ToolsDB, we can maintain a single database of CX tables (with combined data from all wikis) that Superset can query; see the first sketch after this list.
- Related (not a dependency): T336522: Public viewing of superset
- Regarding the data pipeline:
- For the core of the pipeline, Airflow can be used for orchestration; see the DAG sketch after this list.
- Key data sources are: mediawiki_history, edit_hourly & the MariaDB CX tables
- Required MariaDB CX tables such as cx_translations, cx_translators & cx_corpora can be sqoop-ed into the Data Lake (example: T341725)
- Publish the data to https://analytics.wikimedia.org/ and sync it to ToolsDB
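
To make the single-database constraint concrete, here is a minimal sketch of querying such a combined ToolsDB database from Python. The host follows the standard Toolforge convention, but the database name (s12345__cx_p) and the wiki_db column are hypothetical placeholders, not the final schema; the translation_status filter follows the CX extension's cx_translations table.

```python
# A minimal sketch, assuming a combined CX database on ToolsDB. The database
# name (s12345__cx_p) and the wiki_db column are hypothetical assumptions.
import os.path

import pymysql

conn = pymysql.connect(
    host="tools.db.svc.wikimedia.cloud",                        # ToolsDB host
    db="s12345__cx_p",                                          # hypothetical combined database
    read_default_file=os.path.expanduser("~/replica.my.cnf"),   # Toolforge credentials file
)

with conn.cursor() as cur:
    # One query answers a cross-wiki question, instead of one query per wiki
    # replica; this is what the one-database-at-a-time limitation rules out.
    cur.execute(
        """
        SELECT wiki_db, COUNT(*) AS published_translations
        FROM cx_translations
        WHERE translation_status = 'published'
        GROUP BY wiki_db
        ORDER BY published_translations DESC
        """
    )
    for wiki_db, published in cur.fetchall():
        print(wiki_db, published)
```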
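
And here is a minimal sketch of how the Airflow orchestration could be wired up, assuming Airflow 2.x. The DAG id, schedule, and shell commands are hypothetical placeholders for the real sqoop, aggregation, and publication jobs; only the overall shape (load, aggregate, publish) comes from the plan above.

```python
# A minimal sketch of the pipeline orchestration, assuming Airflow 2.x.
# All ids and commands are placeholders, not the team's actual jobs.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="cx_unified_dashboard",     # hypothetical DAG id
    start_date=datetime(2024, 6, 1),
    schedule_interval="@daily",        # hourly updates would use "@hourly"
    catchup=False,
) as dag:
    # Load the required CX extension tables from MariaDB into the Data Lake
    # (see T341725 for an example of such a job).
    sqoop_cx_tables = BashOperator(
        task_id="sqoop_cx_tables",
        bash_command="echo 'sqoop import ...'",  # placeholder command
    )

    # Aggregate per-wiki rows into the combined metrics needed by the dashboard.
    aggregate_metrics = BashOperator(
        task_id="aggregate_cx_metrics",
        bash_command="echo 'spark-sql -f aggregate_cx_metrics.sql'",  # placeholder
    )

    # Publish the aggregates under analytics.wikimedia.org.
    publish_datasets = BashOperator(
        task_id="publish_datasets",
        bash_command="echo 'hdfs dfs -copyToLocal ...'",  # placeholder
    )

    sqoop_cx_tables >> aggregate_metrics >> publish_datasets
```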
Steps involved
(sub-tasks to be created as required)
Step | Team(s) involved | Task(s) | Notes |
---|---|---|---|
Identify all the metrics (+ dependent data sources) to be tracked with v1 of the dashboard | Product Analytics, LPL | T366044 | |
^ list down all available metrics for future consideration | Product Analytics | T366044 | |
Create a basic sketch of the dashboard design (placement of numbers, charts, tabs, etc.) | Product Analytics, LPL | | The CX key metrics dashboard can be used as a reference. |
 | Product Analytics, Data Engineering | T366867, T366868, T366869 | |
Job scripts & Airflow DAGs to load the required CX extension tables to the Data Lake | Product Analytics | | cx_translations, cx_translators, cx_corpora |
Queries to calculate the metrics required for v1 of the dashboard | Product Analytics | | |
Airflow DAGs to aggregate the metrics required for v1 of the dashboard | Product Analytics, Data Engineering | | Includes publication to analytics/published/datasets |
Toolforge jobs to load the data to ToolsDB | Product Analytics | T287306 | See the loader sketch below the table. |
Privacy review of the data to be published (per the data publication guidelines) | Privacy (L3SC request) | | |
Write queries for the charts, from the aggregated data | Product Analytics | | |
Create the required charts | Product Analytics | | |
Development and publication of the dashboard | Product Analytics | | |
Communication to communities | LPL | | |
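
As a companion to the "Toolforge jobs to load the data to ToolsDB" step, here is a minimal sketch of such a loader, assuming a published TSV on analytics.wikimedia.org and a pre-created ToolsDB table. The dataset URL, database name, and table name are hypothetical placeholders; the real job would follow the data publication guidelines reviewed above.

```python
# A minimal sketch of a Toolforge job that syncs a published dataset into
# ToolsDB. The dataset URL, database name, and table name are hypothetical
# placeholders assumed for illustration.
import csv
import io
import os.path
import urllib.request

import pymysql

DATASET_URL = (
    "https://analytics.wikimedia.org/published/datasets/"
    "cx_metrics/cx_daily_metrics.tsv"  # hypothetical published dataset
)

# Fetch the published TSV and split it into a header and data rows.
with urllib.request.urlopen(DATASET_URL) as resp:
    reader = csv.reader(io.TextIOWrapper(resp, encoding="utf-8"), delimiter="\t")
    header, *data = list(reader)

conn = pymysql.connect(
    host="tools.db.svc.wikimedia.cloud",                        # ToolsDB host
    db="s12345__cx_p",                                          # hypothetical database
    read_default_file=os.path.expanduser("~/replica.my.cnf"),   # Toolforge credentials
)

with conn:
    with conn.cursor() as cur:
        # Replace the table contents so Superset always reads the latest data.
        cur.execute("TRUNCATE TABLE cx_daily_metrics")  # hypothetical table
        placeholders = ", ".join(["%s"] * len(header))
        cur.executemany(
            f"INSERT INTO cx_daily_metrics VALUES ({placeholders})",
            data,
        )
    conn.commit()
```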