
Identify languages where each MT service is most popular
Closed, Resolved (Public)

Description

Content Translation supports multiple MT services. When multiple options are available for a language, users can switch to a different service even if one is provided by default. Having more visibility into which service is used most for each language would be helpful for different purposes, such as adjusting the defaults.

This ticket proposes analyzing which service is the most popular for each language. Conversely, for a given MT service, we want to know which languages it is helping to support the most.

The specific process and format will be defined as part of this task. Since it may require the analysis of parallel corpora data, some of the work done for T290906 and T299769 may be reused.

MT availability and language support

Result: Published report

Event Timeline

Pginer-WMF moved this task from Backlog to Priority Backlog on the Language-analytics board.

@KartikMistry for this task it would be useful to know which MT service is the default for a given language pair. Can you confirm the best way to find that out?

My understanding is that, in general, the default service is the first in alphabetical order among those available, with the "defaults" configuration file capturing the cases where a different service should be used instead. I vaguely recall there may be a way to use the cxserver API to get the list of services, which could be used to figure out the default. Is that the case?
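
The rule described above could be sketched roughly as follows. This is only an illustration of the stated understanding, not cxserver's actual implementation: the function name `default_mt_service`, the shape of the `available` and `overrides` mappings, and the sample language pairs are all hypothetical.

```python
from typing import Mapping, Optional, Sequence, Tuple

LanguagePair = Tuple[str, str]  # (source, target), e.g. ("en", "ig")


def default_mt_service(
    pair: LanguagePair,
    available: Mapping[LanguagePair, Sequence[str]],
    overrides: Mapping[LanguagePair, str],
) -> Optional[str]:
    """Pick the default MT service for a language pair.

    Assumed rule: an explicit override wins when that service is
    available for the pair; otherwise the alphabetically first
    available service is used.
    """
    services = available.get(pair, [])
    if not services:
        return None  # no MT service is available for this pair
    override = overrides.get(pair)
    if override in services:
        return override
    # Case-insensitive alphabetical order (an assumption of this sketch).
    return sorted(services, key=str.lower)[0]


# Hypothetical sample data, for illustration only.
available = {("en", "ig"): ["Google", "Flores"], ("en", "is"): ["Google", "Flores"]}
overrides = {("en", "is"): "Google"}
print(default_mt_service(("en", "ig"), available, overrides))
print(default_mt_service(("en", "is"), available, overrides))
```

Comparing the default computed this way with the most-used service per pair would show where users tend to move away from the default.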

All available language pairs with MT, including the default MT service, can be retrieved from: https://cxserver.wikimedia.org/v2?doc#/Service%20information/get_v1_list__tool_ (Note: Clicking 'Try it out' and then 'Execute' returns the result, which can be downloaded as a JSON file if needed.)

MNeisler subscribed.

@Pginer-WMF
Here is the current report on machine translation service usage.

Data
Data reviewed in the analysis was limited to published translations that were started within the last few months (1 February 2022 to 19 May 2022). This was done to provide a snapshot of how all available MT services are currently used.

Machine translation services: Comparison of usage report.
The report provides results on the following metrics for each available machine translation service:

  • Percent of translations published by each machine translation service:
    • Overall across all languages
    • Daily usage trends
    • By Language Pair (Source - Target)
    • By Target Language
  • Percent of each machine translation service's output that was modified by users
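
As a rough illustration of the "By Language Pair" metric listed above, the share of published translations per MT service could be computed along these lines. This is a minimal sketch, not the code behind the report: the function name `mt_share_by_pair`, the `(source, target, service)` record shape, and the sample rows are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, str]  # (source language, target language, MT service)


def mt_share_by_pair(
    translations: Iterable[Record],
) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Return the percent of published translations per MT service for each language pair."""
    counts: Dict[Tuple[str, str], Counter] = defaultdict(Counter)
    for source, target, service in translations:
        counts[(source, target)][service] += 1

    shares: Dict[Tuple[str, str], Dict[str, float]] = {}
    for pair, by_service in counts.items():
        total = sum(by_service.values())
        shares[pair] = {svc: 100.0 * n / total for svc, n in by_service.items()}
    return shares


# Hypothetical sample rows, for illustration only.
sample = [
    ("en", "ig", "Google"),
    ("en", "ig", "Google"),
    ("en", "ig", "Google"),
    ("en", "ig", "Flores"),
]
print(mt_share_by_pair(sample))
```

Grouping on the target language alone instead of the pair would give the "By Target Language" breakdown in the same way.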

Machine translation service usage by language pair results:
Due to the large number of language pairs, I decided that the data would be most easily viewable and sortable in a Google spreadsheet.
Here is a link to the MT Usage by Language Pair Spreadsheet, which can be used to determine the number and percent of publications created by each available MT service for each language pair.
Directions:
To use the spreadsheet, select the source and target language, and it will provide a breakdown of the published translations by MT service for the selected language pair.

Please let me know if you have any questions or any additional metrics that would be useful to investigate further. Thanks!

Codebase

Thanks @MNeisler for helping to get an overview of the usage of MT services. This is really great!

I had only one immediate question, related to "Percent machine translation content is modified". There, Flores and Opus have a high percentage of slightly modified translations compared to the rest. However, the rest of the services support a broader set of languages, so I wonder how the comparison would look for a common set of languages. For example, T299769#7899739 mentioned that 89% of published Igbo translations were modified less than 10%. Are the numbers similar or different for the Igbo translations based on Google?

Moving back to Doing to address the suggested revisions to the report.

@Pginer-WMF

I've updated the report with your suggested revisions, including additional analysis in the "Percent machine translation content is modified" section to determine whether the difference in modification rates is primarily due to the wiki or to the machine translation service. Per your suggestion, I started by looking at only Flores-supported languages.

Based on a breakdown by target language, it appears that the differences in modification rates are driven more by the target language than by the machine translation service. For example, Igbo has a high percentage of slightly modified (less than 10%) translations for both Flores and Google, and the majority of Chinese translations are modified between 10 and 50% across all available machine translation services.

Additionally, how much each machine translation service's output is typically modified can vary across target languages. For the other target languages where Flores is available, Flores translations are modified between 10 and 50%.
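
For reference, a breakdown like the one described above could be tallied along these lines. This is a sketch under stated assumptions rather than the report's actual code: the "less than 10%" and "10 to 50%" buckets come from the discussion, while the third bucket (over 50%), the exact boundary handling, the function names, and the sample records are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, float]  # (target language, MT service, percent modified)


def modification_bucket(percent_modified: float) -> str:
    """Map a percent-modified value to a bucket (boundary handling is assumed)."""
    if percent_modified < 10:
        return "<10%"
    if percent_modified <= 50:
        return "10-50%"
    return ">50%"


def bucket_shares(records: Iterable[Record]) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Percent of translations in each bucket, per (target language, MT service)."""
    counts: Dict[Tuple[str, str], Counter] = {}
    for target, service, pct in records:
        counts.setdefault((target, service), Counter())[modification_bucket(pct)] += 1
    return {
        key: {bucket: 100.0 * n / sum(c.values()) for bucket, n in c.items()}
        for key, c in counts.items()
    }


# Hypothetical sample records, for illustration only.
sample = [("ig", "Flores", 4.0), ("ig", "Flores", 8.0), ("ig", "Google", 6.0), ("zh", "Google", 30.0)]
print(bucket_shares(sample))
```

Keying on both target language and service makes it possible to compare services within one language, which is the comparison asked about for Igbo.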

Note: Some of these trends (especially for English to Icelandic where there were only 16 Flores and 26 Google translations in the reviewed timeframe) will be easier to confirm once we have more data as more Flores translations are published.

flores_mt_modification_bytarget.png (2×4 px, 275 KB)

Thanks @MNeisler. This is really useful. It seems that there are both types of differences: different languages have different levels of MT editing, and for a given language there is in some cases also a difference across services. Something to keep an eye on over a longer period, once more translations are available.