
Metrics for SDoC: translations
Closed, Resolved | Public | 8 Estimated Story Points

Description

Translations

  • How many files/descriptions are in multiple languages?
    • Might need to use the description field
  • We need to make things more accessible for other languages (grant requirement)
    • Where are we at with languages?
    • How many files are in language X?
    • How many have multiple languages in them?
    • How many Western industrialized languages?
  • Looking for a benchmark to judge later growth against
    • How many search queries happen in which languages?
      • e.g. 8% of searches were done in Bengali, but only 0.5% of file descriptions were in Bengali
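
The search-versus-description benchmark in the last bullet can be expressed as a simple ratio. The sketch below is only an illustration: the function name `coverage_gap` is hypothetical, and the 8% and 0.5% figures are the made-up example from the task description, not measured data.

```python
def coverage_gap(search_share_pct, description_share_pct):
    """Ratio of a language's share of searches to its share of file descriptions.

    Both arguments are percentages (e.g. 8.0 for 8%). A value above 1 means
    the language is searched for more often than it is described.
    """
    if description_share_pct == 0:
        # No descriptions at all in this language: the gap is unbounded.
        return float("inf")
    return search_share_pct / description_share_pct


# Hypothetical figures from the task description (Bengali example).
bengali_gap = coverage_gap(8.0, 0.5)
print(f"Bengali is searched {bengali_gap:g}x more often than it is described")
```

Recording this ratio per language at the start would give a baseline against which later growth in description coverage can be compared.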

Event Timeline

mpopov set the point value for this task to 8.

We parsed the wikitext of all files in the Commons XML data dumps of November 20, 2017, and extracted the language templates in them (e.g. {{en}}, {{LangSwitch}}). Of the 43,268,565 files in total, 14,848,551 (34.32%) have no language templates and 23,780,247 (54.96%) use exactly 1 language.
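
As a rough illustration of this kind of template extraction (not the code used for the analysis, which is in the linked repository), here is a minimal sketch. It assumes a small hand-picked set of language codes and a crude regular-expression approach. The names `LANGUAGE_CODES`, `languages_in_wikitext` and `files_by_language_count` are hypothetical, and a real run would need the full list of Commons language templates and a proper wikitext parser.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative subset of language codes.
LANGUAGE_CODES = {"en", "de", "fr", "es", "bn", "ru", "zh", "ja"}

# Captures the template name in "{{name|...}}" or "{{name}}".
TEMPLATE_NAME = re.compile(r"\{\{\s*([^|{}]+?)\s*(?:\||\}\})")

# Captures named parameters such as "|en=", used inside {{LangSwitch}}.
NAMED_PARAM = re.compile(r"\|\s*([a-z-]+)\s*=")


def languages_in_wikitext(wikitext):
    """Return the set of language codes found via language templates."""
    found = set()
    saw_langswitch = False
    for match in TEMPLATE_NAME.finditer(wikitext):
        name = match.group(1).lower()
        if name in LANGUAGE_CODES:
            found.add(name)
        elif name == "langswitch":
            saw_langswitch = True
    if saw_langswitch:
        # Crude: treats any "|xx=" parameter on the page as a LangSwitch
        # language, without checking which template it belongs to.
        for match in NAMED_PARAM.finditer(wikitext):
            if match.group(1) in LANGUAGE_CODES:
                found.add(match.group(1))
    return found


def files_by_language_count(pages):
    """Map 'number of languages' to 'number of pages with that many'."""
    counts = Counter()
    for wikitext in pages:
        counts[len(languages_in_wikitext(wikitext))] += 1
    return counts


sample_pages = [
    "{{en|1=A cat}}{{de|1=Eine Katze}}",
    "no templates here",
    "{{LangSwitch|en=Cat|fr=Chat}}",
    "{{Information|description={{bn|1=...}}}}",
]
distribution = files_by_language_count(sample_pages)
```

On the four sample pages, `distribution` records one page with no language template, one with a single language, and two with two languages.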

files_by_n_languages.png (600×1 px, 48 KB)

40.1% of all files have English templates, 9.38% use German, and 6.2% have descriptions in languages outside the top 20.

top20_languages_nfiles.png (600×1 px, 94 KB)

For the files without a language template, we used the langdetect package to detect their languages. We could not detect any language in 556,684 files (1.29% of all 43,268,565 files), and we detected exactly 1 language in 7,577,789 files (17.51% of all files).
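
A minimal sketch of how langdetect can be applied to description text follows. It is an assumption about usage, not the analysis code itself. The helper names and the `min_prob` threshold are hypothetical, and the detector is passed in as a parameter so that the counting logic can be exercised without the package installed.

```python
from collections import Counter


def detect_languages(text, min_prob=0.2):
    """Return the language codes langdetect reports for `text`.

    Assumption: a language counts as "detected" when its probability is at
    least `min_prob`. This threshold is illustrative only.
    """
    # Imported lazily so the rest of this sketch works without langdetect.
    from langdetect import DetectorFactory, detect_langs
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # make the results reproducible across runs
    try:
        return {guess.lang for guess in detect_langs(text) if guess.prob >= min_prob}
    except LangDetectException:
        # Raised when the text has no usable features (empty, digits only, ...).
        return set()


def count_detected_languages(texts, detector=detect_languages):
    """Map 'number of detected languages' to 'number of texts'."""
    counts = Counter()
    for text in texts:
        counts[len(detector(text))] += 1
    return counts
```

Calling `count_detected_languages(descriptions)` on a list of description strings gives the same kind of distribution as the figures reported here: the key 0 holds the files where no language could be detected, the key 1 the files with a single detected language, and so on.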

files_by_n_detected_languages.png (600×1 px, 53 KB)

We detected English in 30.25% of all 43,268,565 files and German in 3.93%.

top20_detected_languages_nfiles.png (600×1 px, 100 KB)

Results and analysis codebase: https://github.com/wikimedia-research/SDoC-Initial-Metrics/tree/master/T177358-1