Improvements to mediawiki_geoeditors_monthly dimensions
Open, Needs Triage, Public

Description

Background

The mediawiki_geoeditors_monthly data cube in Turnilo/Superset currently has the following dimensions:

  • Activity Level
  • Country Code (2-letter ISO)
  • Users Are Anonymous
  • Wiki Db

Request

Over the past year I've observed that many staff are very interested in this dataset, but its limited set of dimensions makes it difficult for them to explore the data easily and use it to its full potential.

Additions

I'd like to request that the following (derivable) dimensions be added:

  • Country Name (canonical_data.name in Data Lake)
  • Continent (canonical_data.maxmind_continent in Data Lake)
  • Economic Region (canonical_data.economic_region in Data Lake)
  • Wiki Db Group (canonical_data.database_group in Data Lake)
  • Wiki Language Name (canonical_data.language_name in Data Lake)
  • Wiki Language Code (canonical_data.language_code in Data Lake)
  • Wiki English Name (canonical_data.english_name in Data Lake)
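Since every requested dimension is derivable from the two existing keys (Country Code and Wiki Db), the enrichment amounts to two lookups per row. Below is a minimal sketch in plain Python, not the actual ingestion job. The lookup dictionaries stand in for the canonical_data tables in the Data Lake; their split into a country lookup and a wiki lookup, the output field names, and the sample values are all assumptions made for illustration.

```python
# Hypothetical stand-ins for the canonical_data lookups in the Data Lake.
# Column names follow the task description; the sample values are made up.
COUNTRIES = {
    "FR": {
        "name": "France",
        "maxmind_continent": "Europe",
        "economic_region": "Global North",
    },
}

WIKIS = {
    "frwiki": {
        "database_group": "wikipedia",
        "language_name": "French",
        "language_code": "fr",
        "english_name": "French Wikipedia",
    },
}


def enrich(row):
    """Return a copy of a geoeditors row with the derived dimensions added.

    Unknown country codes or wiki databases yield None for the derived
    fields instead of raising, so no input row is dropped.
    """
    country = COUNTRIES.get(row.get("country_code"), {})
    wiki = WIKIS.get(row.get("wiki_db"), {})
    return {
        **row,
        "country_name": country.get("name"),
        "continent": country.get("maxmind_continent"),
        "economic_region": country.get("economic_region"),
        "wiki_db_group": wiki.get("database_group"),
        "wiki_language_name": wiki.get("language_name"),
        "wiki_language_code": wiki.get("language_code"),
        "wiki_english_name": wiki.get("english_name"),
    }


if __name__ == "__main__":
    print(enrich({"wiki_db": "frwiki", "country_code": "FR"}))
```

In a real pipeline the same logic would more likely be a left join against the lookup tables, so that rows with unmatched keys are kept with empty derived dimensions.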

Modifications

Furthermore, the "Users Are Anonymous" dimension currently takes the values "0" and "1", which makes for awkward visualization:

Turnilo: Screen Shot 2022-02-18 at 11.05.26 AM.png (506×483 px, 20 KB)
Superset: Screen Shot 2022-02-18 at 11.11.33 AM.png (451×181 px, 18 KB)

I'd like to propose replacing these values with the labels "Anonymous" and "Logged-in".
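The relabeling could be sketched as a small mapping applied at ingestion or query time. This assumes that "1" means anonymous and "0" means logged-in, as the dimension name suggests; the task does not state this explicitly. The function name and the "Unknown" fallback are hypothetical.

```python
# Assumed mapping: 1 = anonymous, 0 = logged-in (inferred from the
# dimension name "Users Are Anonymous", not confirmed by the task).
ANONYMITY_LABELS = {0: "Logged-in", 1: "Anonymous"}


def anonymity_label(value):
    """Map a raw users_are_anonymous value (0/1, "0"/"1") to a label.

    Anything that is not 0 or 1 falls back to "Unknown" (a hypothetical
    choice) rather than raising.
    """
    try:
        return ANONYMITY_LABELS[int(value)]
    except (KeyError, TypeError, ValueError):
        return "Unknown"


if __name__ == "__main__":
    for raw in (0, 1, "1", None):
        print(repr(raw), "->", anonymity_label(raw))
```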

Event Timeline

I wonder whether the requested change should be done for the data in Druid only, or if it would be valuable to change the geoeditors tables on the cluster (see https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Geoeditors).
Depending on the above, we should plan on doing this either when migrating the geoeditors jobs to Airflow (some of which is already done) or when migrating the Druid loading jobs to Airflow.

Hi @JAllemandou! I'm moving this task back to the DE Workboard since it depends on the Airflow migration (of either the Druid loading jobs or the geoeditors jobs). Once that has taken place, we'll be able to pick this one up again in Kanban.

I wonder whether the requested change should be done for the data in Druid only, or if it would be valuable to change the geoeditors tables on the cluster.

If those are separate (that is, if the Druid version is not just an ingestion of the geoeditors tables in Hive), I'd say Druid is the higher priority. But yes, the changes would be valuable across the board.