The goal of this task is to create a single unified dashboard for monitoring the metrics of interest for [[ https://www.mediawiki.org/wiki/Content_translation | Content translation ]] (CX) workflows. The final dashboard should meet the following requirements:
* A single unified dashboard going forward for all metrics of interest related to CX.
* Public: accessible to both the Language team and the community
* Anything that shouldn't be public can be monitored privately, but most of the data is public.
* The data should be updated in near real time (ideally every hour)
**Background**
Currently, the Language team uses multiple dashboards and reports to monitor CX usage and to share it with the community. They are:
* [[ https://superset.wikimedia.org/superset/dashboard/119/ | Content translation key metrics ]] (private dashboard)
* Special:ContentTranslationStats on each wiki (public dashboard)
* [[ https://te.wikipedia.org/wiki/%E0%B0%AA%E0%B1%8D%E0%B0%B0%E0%B0%A4%E0%B1%8D%E0%B0%AF%E0%B1%87%E0%B0%95:ContentTranslationStats | tewiki example]]
* [[ https://www.mediawiki.org/wiki/Content_translation/Deletion_statistics_comparison | CX deletion stats ]] (public report; quarterly)
* [[ https://nbviewer.org/github/wikimedia-research/machine-translation-service-analysis-2022/tree/main/ | Machine translation service usage analysis]] (public report; ad-hoc)
* [[ https://superset.wikimedia.org/superset/dashboard/cx-abuse-filter/ | CX abuse filter events ]] (private dashboard)
* [[ https://kcvelaga.quarto.pub/cx-mobile-entry-points-funnel-analysis-v1-jan-2024/ | CX user funnel metrics ]] (public report; ad-hoc)
This creates a very fragmented view of CX usage, increases the maintenance burden, and leads to issues like {T325790}.
The first version of the dashboard should at least unify the metrics from the CX key metrics dashboard, Special:ContentTranslationStats and the deletion stats.
**Suggested solution**
* To use the [[https://superset.wmcloud.org/login/ | public instance of Superset ]], maintained by the Wikimedia Cloud Services Team, which is accessible by anyone with a Wikimedia SUL account.
* To be able to use this, we are dependent on {T348407} (as of 24 May 2024, the task is in progress!)
* This is because, even if CX tables are added to Wiki replicas (T196020), each wiki_db will have its own CX tables and, as with Quarry, Superset's SQLLab and charts can only access one database at a time.
* If Superset can access ToolsDB, we can have a single database for CX tables (with combined data from all wikis) that the dashboard can query.
* Related (not a dependency): {T336522}
* Regarding the data pipeline:
* For the core of the pipeline, [[ https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Airflow | Airflow ]] can be used to orchestrate.
* Key data sources are: [[ https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/MediaWiki_history | mediawiki_history ]], [[ https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Edit_hourly | edit_hourly ]] & MariaDB CX tables
* Required MariaDB CX tables such as [[ https://www.mediawiki.org/wiki/Extension:ContentTranslation/cx_translations_table | cx_translations ]], [[ https://www.mediawiki.org/wiki/Extension:ContentTranslation/cx_translators_table | cx_translators ]] & [[ https://www.mediawiki.org/wiki/Extension:ContentTranslation/cx_corpora_table | cx_corpora]] can be sqoop-ed into Data Lake (example: T341725)
* Publish the data to https://analytics.wikimedia.org/ and sync to ToolsDB
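The single-database idea above can be sketched as follows. This is only an illustration under stated assumptions, not the planned pipeline: it uses an in-memory SQLite database as a stand-in for ToolsDB, and the table name `cx_translations_combined`, its simplified columns, and the sample rows are all hypothetical. The point it shows is that adding a `wiki_db` column lets one database serve cross-wiki queries, which per-wiki tables cannot do when a tool can only access one database at a time.

```python
import sqlite3

# Hypothetical sample rows standing in for each wiki's own CX table.
# Columns: (translation_id, source_language, target_language, status).
# The real cx_translations table has more columns; these are simplified.
PER_WIKI_ROWS = {
    "tewiki": [(1, "en", "te", "published"), (2, "en", "te", "draft")],
    "hiwiki": [(1, "en", "hi", "published")],
}


def build_combined_table(conn, per_wiki_rows):
    """Load rows from every wiki into one table, keyed by (wiki_db, translation_id)."""
    conn.execute(
        "CREATE TABLE cx_translations_combined ("
        " wiki_db TEXT NOT NULL,"
        " translation_id INTEGER NOT NULL,"
        " source_language TEXT,"
        " target_language TEXT,"
        " status TEXT,"
        " PRIMARY KEY (wiki_db, translation_id))"
    )
    for wiki_db, rows in per_wiki_rows.items():
        conn.executemany(
            "INSERT INTO cx_translations_combined VALUES (?, ?, ?, ?, ?)",
            [(wiki_db, *row) for row in rows],
        )
    conn.commit()


def published_per_wiki(conn):
    """A cross-wiki chart query that needs only the single combined table."""
    cur = conn.execute(
        "SELECT wiki_db, COUNT(*) FROM cx_translations_combined"
        " WHERE status = 'published' GROUP BY wiki_db ORDER BY wiki_db"
    )
    return dict(cur.fetchall())


conn = sqlite3.connect(":memory:")  # stand-in for a ToolsDB database
build_combined_table(conn, PER_WIKI_ROWS)
print(published_per_wiki(conn))
```

The composite key matters because translation IDs are assumed here to be unique only within a wiki, so `wiki_db` is needed to tell rows apart once they are combined.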
**Steps involved**
//(sub-tasks to be created as required)//
| Step | Team(s) involved | Task(s) | Notes |
| ---- | ---- | ---- | ---- |
| Identify all the metrics (+dependent data sources) to be tracked with v1 of the dashboard | Product Analytics, Language | T366044 | |
| Create a basic sketch of dashboard design (placement of numbers, charts, tabs etc.) | Product Analytics, Language | | [[ https://superset.wikimedia.org/superset/dashboard/119/ | CX key metrics dashboard ]] can be used as reference. |
| Sqoop necessary CX tables to Data Lake | Product Analytics, Data Engineering | T366867, T366868, T366869 | |
| Create an Airflow ETL pipeline to calculate the identified metrics | Product Analytics | T287306 | |
| Privacy review of the data to be published (per [[ https://foundation.wikimedia.org/wiki/Legal:Data_publication_guidelines | data publication guidelines ]]) | Privacy (L3SC request) | | |
| Write queries for charts, from the aggregated data | Product Analytics | | |
| Create required charts | Product Analytics | | |
| Development and publication of the dashboard | Product Analytics | | |
| Communication to communities | Language | | |