Measure usage and impact of Flores
Closed, ResolvedPublic

Description

Once Flores is deployed (T298584) we want to get data about its usage and impact on translations (in terms of quantity and quality).

Some aspects we may want to measure:

  • How often Flores is used in the different languages where it is available (possibly compared with other services available for those languages).
  • How many translations are published using Flores (also as a percentage of the total for the language).
  • What the deletion rate is for the articles created using Flores.
  • How much the initial translation is modified by users when using Flores.

The above is just an initial proposal; we can adjust the exact measurements based on technical feasibility or if better alternatives come up.

Data source: cx_corpora table and cx_translations table

Event Timeline

Pginer-WMF triaged this task as Medium priority.Jan 21 2022, 2:28 PM
Pginer-WMF moved this task from Backlog to Priority Backlog on the Language-analytics board.
Pginer-WMF moved this task from Needs Triage to MT on the ContentTranslation board.

@Pginer-WMF
Here are the initial results looking at the usage of Flores where the new mt service engine is currently deployed (languages supported). Please see the summary below and let me know if you have any questions. Here is the full report for additional details on how this data was collected.

Data Source: Data comes from the cx_corpora and cx_translations tables. I compared the use of Flores with the other MT services available for each target language, from the deployment of Flores on 1 February 2022 through 4 April 2022.

How often is Flores used in the different languages where it is available compared with other services available on those languages?

flores_usage_compare_pct.png (2×4 px, 235 KB)

Observations:

  • As shown in the chart above, Flores has been used for a small percentage of the translations published since deployment for Igbo (2%) and Chinese (2%), and for a slightly higher percentage for Zulu (13%). Google is still the primary machine translation service used for these languages.
  • Flores has been used to translate 31% of published translations for the Icelandic target language since deployment.
  • Flores is used for 100% of published translations at Occitan (note: there has been only 1 published CX translation at Occitan since the deployment of Flores) and for the majority of translations (85%*) at Luganda, which is expected because it was enabled by default for these languages.

*13% of translations at Luganda were identified as having no MT service used (cxc_origin = 'scratch').
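As a rough illustration of how the per-language percentages above can be derived, here is a minimal sketch that tallies the MT service recorded for each published translation and converts the counts to shares. The input shape (pairs of target language and service name) and the function name `service_share` are assumptions made for this example, not the queries used in the report, and the sample rows are made up.

```python
from collections import Counter, defaultdict


def service_share(rows):
    """Return {language: {service: percent}} from (language, service) pairs."""
    counts = defaultdict(Counter)
    for language, service in rows:
        counts[language][service] += 1

    shares = {}
    for language, by_service in counts.items():
        total = sum(by_service.values())
        shares[language] = {
            service: round(100 * n / total, 2)
            for service, n in by_service.items()
        }
    return shares


# Made-up rows for illustration only; these are not the task's data.
# "scratch" stands for translations with no MT service (cxc_origin = 'scratch').
sample = [
    ("lg", "Flores"), ("lg", "Flores"), ("lg", "Flores"), ("lg", "scratch"),
    ("is", "Google"), ("is", "Flores"),
]

shares = service_share(sample)
print(shares["lg"])
print(shares["is"])
```

Counting "scratch" as its own category keeps translations without MT in the denominator, which matches how the Luganda footnote reports them.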

Daily usage of Flores

flores_usage_overtime.png (2×4 px, 229 KB)

Since the deployment of the service, there has been a gradual increase in the average daily number of translations that used Flores at all target languages where available. Note: There was a significant spike in translations on 19 March 2022, where there were 32 translations on 1 day at Luganda.

How many translations are published using Flores (also as a percentage of the total for the language)?

Number and percent of translations published using Flores since deployment of the new MT Service:

| Target language | Number of CX translations | Percent of all CX translations for the language |
| --- | --- | --- |
| Igbo | 33 | 2.23% |
| Icelandic | 15 | 30.61% |
| Luganda | 60 | 84.51% |
| Occitan | 1 | 100% |
| Chinese | 43 | 2.28% |
| Zulu | 2 | 13.33% |

How much is the initial translation modified by users when using Flores?

Across all target languages where available:

| Percent MT modified | Number of translations | Percent of translations |
| --- | --- | --- |
| less than 10% | 77 | 46.39% |
| between 11 and 50% | 70 | 42.17% |
| over 50% | 19 | 11.45% |

Most translations published using Flores (about 89%) were modified by 50% or less across all target languages where it is available, split roughly evenly between those modified less than 10% and those modified between 11 and 50%. The remaining 11% of Flores translations were modified over 50%.
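The percentages in the table can be rechecked from the counts alone. The short sketch below does that arithmetic; the counts are the ones reported above, while the helper name `percent_shares` is just an illustrative choice.

```python
def percent_shares(counts):
    """Convert {label: count} into {label: percent of total}, rounded to 2 places."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


# Counts taken from the modification table above.
counts = {
    "less than 10%": 77,
    "between 11 and 50%": 70,
    "over 50%": 19,
}

shares = percent_shares(counts)

# Combined share of translations in the two lower buckets.
lower_two = round(
    100 * (counts["less than 10%"] + counts["between 11 and 50%"])
    / sum(counts.values()),
    2,
)

print(shares)
print(lower_two)
```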

By target language:

flores_modification_bytarget (1).png (2×4 px, 301 KB)

With the exception of Icelandic and Occitan (which only had 1 published translation), the largest number of published translations at each target language were modified less than 10% by users.

Deletion rate of CX articles at target languages where Flores is available

Since Flores was deployed, articles created using content translation have only been deleted at the Icelandic and Chinese target languages. These deletion ratios are below the deletion ratio for articles created without using content translation during this time period. (Note: This is not currently isolated to articles created using Flores. I ran into complications isolating the deletion data to just those articles, which will need to be investigated further if needed.)

Thanks Megan, this is great!

Just one observation:

Daily usage of Flores

flores_usage_overtime.png (2×4 px, 239 KB)

The image used in the "daily usage" section is the same as the one in the previous section, so I guess the image meant to illustrate this got lost on its way to Phabricator.

@Pginer-WMF The daily usage of Flores chart has been updated with the correct one in my comment above. Thanks for letting me know!

Per discussions with @Pginer-WMF, I am to further investigate how (if possible) to identify the deletion ratios for articles specifically created using Flores.

@santhosh
I'm trying to find a way to link data in the cx_translations and cx_corpora tables to mediawiki_history. This is needed to determine the percentage of articles created with Flores that are deleted.

Do you know the best way to do this or if this is possible? I looked at joining the target_revision_id to revision_id in mediawiki_history but found that many (about 25%) of the cx translations are missing target_revision_id or did not have a matching revision_id in mediawiki_history. Please let me know if you have any suggestions or ideas to further investigate.

Thank you!

If target_revision_id is null, then the article is a draft that has not been published. The translations and corpora tables include both published and unpublished translations. We automatically purge unpublished (or more accurately: *never* published) translations after they have been unchanged for more than a year. Translations may stay unpublished because they are still a work in progress, because the original author has forgotten about them without deleting them, or because an error prevented publishing.

If the corresponding revision_id is not found in the target wiki, most likely the published translation has been deleted after publishing.

Comparing with https://en.wikipedia.org/wiki/Special:ContentTranslationStats, 25% does seem realistic when looking at the ratios of published translations, drafts in progress, and deleted translations.
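The interpretation described above can be sketched as a small classification step. This is only an illustration under assumptions: the function name, the plain-Python inputs, and the sample IDs are hypothetical, and a missing revision is labelled "likely deleted" because the comment says that is the most likely explanation, not the only one.

```python
from collections import Counter


def classify_translation(target_revision_id, known_revision_ids):
    """Classify one cx_translations row by its target_revision_id.

    known_revision_ids is the set of revision IDs found for the target wiki
    (for example, from mediawiki_history).
    """
    if target_revision_id is None:
        return "draft"           # never published
    if target_revision_id not in known_revision_ids:
        return "likely deleted"  # published, but revision no longer found
    return "published"


# Hypothetical sample data for illustration only.
known_revision_ids = {101, 102, 103}
target_revision_ids = [101, None, 102, 999, None, 103]

tally = Counter(
    classify_translation(rev_id, known_revision_ids)
    for rev_id in target_revision_ids
)
print(tally)
```

A join on revision IDs that drops the "draft" and "likely deleted" rows would explain a gap like the 25% of unmatched translations mentioned in the question.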

Thanks @Nikerabbit. That context helps clarify the data I'm seeing. I did confirm that all published translations have a target_revision_id logged in the cx_translations table and a corresponding revision ID in mediawiki_history if they were created prior to April 2022 (at the time of this analysis, the '2022-04' mediawiki_history snapshot was not yet available). See the Flores deletion rate update below:

What is the deletion rate for the articles created using Flores?
There were a total of 73 articles created using Flores by the end of March 2022 across all target languages where deployed. None of these articles were deleted during this timeframe.

Note: The analysis is limited to data available at the time of this analysis in the '2022-03' mediawiki_history snapshot. As a result, articles created or deleted after March 2022 are not included. This analysis can be rerun once the next snapshot is available to include more published articles and a longer article review period, which should provide a more accurate analysis of the Flores deletion rate.
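To make the snapshot limitation concrete, here is a minimal sketch of a deletion-rate calculation that only counts articles created on or before a snapshot cutoff. The record layout (`created`, `deleted`), the function name, and the individual dates are assumptions for illustration; only the totals (73 articles, none deleted, data through March 2022) come from the analysis above.

```python
from datetime import date


def deletion_rate(articles, cutoff):
    """Share of articles created on or before `cutoff` that were deleted.

    Returns None when no article falls inside the window.
    """
    in_scope = [a for a in articles if a["created"] <= cutoff]
    if not in_scope:
        return None
    deleted = sum(1 for a in in_scope if a["deleted"])
    return deleted / len(in_scope)


# 73 Flores articles inside the '2022-03' window, none deleted (dates are made up),
# plus one hypothetical April article that the snapshot cannot cover yet.
articles = [{"created": date(2022, 3, 1), "deleted": False} for _ in range(73)]
articles.append({"created": date(2022, 4, 10), "deleted": True})

rate = deletion_rate(articles, cutoff=date(2022, 3, 31))
print(rate)
```

Rerunning the same calculation with a later cutoff would pick up the April article, which is why the rate may change once the next snapshot is available.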

I've also updated the report to add details on current Flores usage for both draft and published translations, revised a few visualizations to improve clarity, and updated the data to reflect usage through 29 April 2022. See the summary of updates below and the full report for details:

How often is Flores used in the different languages where it is available compared with other services available on those languages?

Published Translations

flores_usage_compare_pct_pub.png (2×4 px, 249 KB)

Draft Translations

flores_usage_compare_pct_draft.png (2×4 px, 264 KB)

Observations:

  • While Flores has not been used for any published translations for Zulu, 35% of its draft translations have used Flores.
  • We still see a higher usage of Flores for the Icelandic target language compared to other target languages (except for Occitan and Luganda where it is default). Since deployment through 29 April 2022, Flores was used to translate 37% of published translations for the Icelandic target language.
  • When looking at draft (in-progress) translations, we see a higher percentage of Flores usage for all the target languages except Icelandic, which has a higher percentage of published translations (37%) made with Flores than draft translations (27%).

Number of daily translations created using Flores (updated through 29 April 2022)

flores_usage_byday.png (2×4 px, 226 KB)

  • The daily number of translations published using Flores has continued to increase.
  • On 22 April 2022, the number of translations increased from 17 to 32 translations a day with a peak of 59 published translations on 24 April 2022. A breakdown by target language confirms that this late April increase is primarily due to usage for the Igbo target language where it was deployed as default in T305125. We have not yet seen any increases in published translations at Zulu or Icelandic where it was also deployed as default.

How many translations are published using Flores (also as percentage of the total for the language)?

Percent of translations created with Flores at each target language (1 Feb 2022 through 29 April 2022)

Screen Shot 2022-05-02 at 10.08.02 PM.png (690×1 px, 107 KB)

How much is the initial translation modified by users when using Flores?

Across all target languages where available:

| Percent MT modified | Number of translations | Percent of translations |
| --- | --- | --- |
| less than 10% | 227 | 71.84% |
| between 11 and 50% | 72 | 22.78% |
| over 50% | 17 | 5.38% |

flores_modification_bytarget (1).png (2×4 px, 263 KB)

  • A breakdown by target languages shows that Igbo Wikipedia is actually the only target language where the majority of published translations are modified less than 10% (89% of published Igbo translations). For the other target languages, the majority of translations are modified between 11 and 50%.

cc @Pginer-WMF