
Correlation between article length, number of translations within a time period, experience of users, and deletion rate.
Closed, Resolved · Public

Description

@Pginer-WMF: Does this question relate to any of the hypothesis work your team is doing? If so, can you please share with hypothesis?

This is connected with the work the Language team is doing around MinT. The current hypothesis is "Scaling Open Translation service will increase page interactions from underserved communities" and it is part of the Key Result "WE2.2 Interested readers will discover and browse more content".

As machine translation is exposed to more users with an option to contribute, we want to better understand the factors that affect the quality of the content created. Based on the experience of Content Translation, we have identified, from community reports and anecdotal evidence, different potential issues that data can help to clarify.

What team/program is this request for?
Language Team.

What are you requesting?

We want to better understand which common traits are present in low-quality translations where machine translation is used.
We want to analyze common factors that have been associated with low-quality translations:

  • Translations over a short period of time. This is commonly associated with campaigns/contests where some users may be incentivized to create a large number of articles without enough emphasis on quality.
  • User expertise level. Communities have requested that access to Content Translation or machine translation be limited to experienced users. This comes with the assumption that problematic translations are mainly produced by less experienced users, an assumption we want to check and put in perspective.
  • Length of the content. How long the translation is (in itself or relative to the original article) is another factor that may signal low-quality translations.

For measuring translation quality, we have used article deletions as a proxy, but additional signals can be considered too.
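As an illustration of the kind of analysis we have in mind, here is a minimal sketch of how these factors and the deletion proxy could be derived from a per-translation dataset (all column names here are hypothetical, not the actual CX schema):

```python
# Sketch: deriving the three candidate factors and the deletion proxy
# from a hypothetical per-translation dataset.
import pandas as pd

df = pd.DataFrame({
    "user": ["a", "a", "b", "c", "c", "c"],
    "published_at": pd.to_datetime([
        "2024-05-01", "2024-05-03", "2024-05-02",
        "2024-05-01", "2024-05-04", "2024-05-10",
    ]),
    "user_edit_count": [15, 17, 900, 40, 42, 45],   # proxy for experience
    "source_bytes": [4000, 1000, 6000, 3500, 2600, 1200],
    "target_bytes": [1200, 800, 5400, 300, 2500, 900],
    "deleted": [1, 0, 0, 1, 0, 1],                  # deletion as quality proxy
})

# Factor 1: translation bursts, i.e. how many articles the same user
# published within a rolling 15-day window (campaign/contest signal).
df = df.sort_values(["user", "published_at"])
df["recent_count"] = (
    df.groupby("user")
      .rolling("15D", on="published_at")["deleted"]
      .count()
      .to_numpy()
)

# Factor 2: user experience buckets based on edit count.
df["experience"] = pd.cut(
    df["user_edit_count"],
    bins=[0, 100, 1000, float("inf")],
    labels=["new", "mid", "experienced"],
)

# Factor 3: length of the translation relative to the source article.
df["length_ratio"] = df["target_bytes"] / df["source_bytes"]

# Deletion rate (the quality proxy) per experience bucket.
print(df.groupby("experience", observed=True)["deleted"].mean())
```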

What is the problem you're trying to solve?
Understanding better when machine translation is misused for content creation helps us adjust the prevention mechanisms to encourage good use of it.

What decision will you make or action will you take with the deliverable?

We plan to improve the translation limits system (T251887) and this analysis can be useful to (a) identify how to adjust the limits, and (b) set a baseline to identify improvements produced by the new limits.

In addition, as MinT is exposed to Wikipedia readers, they are offered options to enter the editing path (contributing an improved translation). This means that translation activity will be exposed to a broader, less experienced audience, which may require additional guidance. Knowing the factors that affect translation quality will be useful for defining the best approach to guide, encourage, or discourage newcomers to translate in a given context.

Event Timeline

@Pginer-WMF please add any additional questions that you are interested in as part of this analysis.

KCVelaga_WMF moved this task from Triage to Current Quarter on the Product-Analytics board.
KCVelaga_WMF moved this task from Backlog to Priority Backlog on the Language-analytics board.

@Pginer-WMF: Does this question relate to any of the hypothesis work your team is doing? If so, can you please share with hypothesis?

Also, can you please provide answers to the questions listed at https://www.mediawiki.org/wiki/Product_Analytics#Teams_that_we_currently_support?

I'm providing more details in the description above, but feel free to share any other question.

@Pginer-WMF the draft of the analysis report is ready for review at https://kcvelaga.quarto.pub/cx-deletion-rate-variables-2024/

I will be adding a few things later in the week, such as an appendix and references, but there shouldn't be any major changes to the main content.

This is excellent. Thanks @KCVelaga!
The study surfaces some factors that we expected to have an impact on deletions (with the report providing a better understanding of how much they do), as well as some counter-intuitive results that will help us think of new ideas.

For some of the more surprising results, it may be interesting to consider whether we can learn a bit more without much analytics effort. Sharing some ideas below:

Translating to smaller wikis makes translations less likely to be deleted

The report mentions the following about the causes:

Target Wikipedia Rank Bin: [...] While there might be various explanations for this, a likely one is that the smaller Wikipedias have very few active editors and there isn’t enough patrolling activity that keeps up with the rate at which newer articles are created, and in addition the baseline expectation of quality might vary, which influences the deletion outcome.

It would be great to take a closer look at those hypotheses. Looking at some specific data points, the above hypotheses do not seem to apply.
For example, based on the current year's data, Bangla has a 1.3% deletion rate for translations (source), which is lower than the 6.8% for English (source). One could think that there are many unreviewed translations in Bangla Wikipedia. However, when looking at the deletion rate for articles not created with Content Translation, Bangla has a 21% deletion rate (source), which is higher than the 6.4% for English (source). The deletion rate on Bangla not being lower than English's seems to suggest that there may not be a significant backlog of unreviewed articles, or lower baseline quality expectations, that could justify the difference.

Compared to articles started from scratch on a given wiki, translations are expected to receive the same level of scrutiny as, or more than, regular non-translated new articles. A reviewer looking for new articles will get both translations and non-translations, and there is also the possibility of using the translation-specific tag to focus on just those.

This will be useful to check across the board and capture in the report, since community members often point to the possibility of a large number of unreviewed articles as a reason to consider the deletion rates unreliable.
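A possible shape for that across-the-board check, assuming a hypothetical per-article dataset with a Content Translation creation flag (none of these names come from the report):

```python
# Sketch: compare deletion rates of CX-created vs. non-CX articles per
# wiki, using a small synthetic stand-in for the real data.
import pandas as pd

articles = pd.DataFrame({
    "wiki":    ["bnwiki"] * 4 + ["enwiki"] * 4,
    "is_cx":   [True, True, False, False, True, True, False, False],
    "deleted": [0, 0, 1, 0, 0, 1, 0, 0],
})

rates = (
    articles.groupby(["wiki", "is_cx"])["deleted"]
            .mean()
            .unstack("is_cx")
            .rename(columns={True: "cx_deletion_rate",
                             False: "non_cx_deletion_rate"})
)
# If cx_deletion_rate is consistently below non_cx_deletion_rate on a
# wiki, low CX deletion rates there are unlikely to be explained by a
# general lack of patrolling.
print(rates)
```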

Source articles conforming to standard quality result in higher deletions

The report points out that:

The deletion rate is higher for translations from source articles which meet the standard quality criteria. Previously, we observed that with the source articles’ size, the articles that were deleted had a higher average size of source articles, and as size is one of the criteria for standard quality, it may be contributing here as well.

It would be interesting to check different page-size buckets for articles meeting the other quality criteria. That is, could we expect that shorter articles that meet the rest of the criteria (references, images, etc.) result in translations less likely to be deleted?
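A minimal sketch of that bucketing check, with synthetic data and assumed column names:

```python
# Sketch: deletion rate by source-size bucket, restricted to sources
# that meet the non-size quality criteria (references, images, etc.).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "source_bytes": rng.integers(500, 50_000, size=1_000),
    "meets_other_criteria": rng.random(1_000) < 0.4,  # refs, images, ... (size excluded)
    "deleted": rng.random(1_000) < 0.07,
})

subset = df[df["meets_other_criteria"]]
subset = subset.assign(
    size_bucket=pd.qcut(subset["source_bytes"], q=4,
                        labels=["q1_short", "q2", "q3", "q4_long"])
)
# If the deletion rate climbs with the size bucket even when the other
# criteria are held fixed, size (complexity) rather than "standard
# quality" per se is the likelier driver.
print(subset.groupby("size_bucket", observed=True)["deleted"].mean())
```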

More unmodified machine translation results in lower deletions

The report describes the following counter-intuitive result:

The current understanding is that a higher proportion of machine-translated content in the final publication might result in a bad-quality translation (and thereby a higher chance of deletion). Contrary to that, on average, for articles that were not deleted, the proportion of machine translation was higher and the proportion of human modification was lower. This indicates that human modification might not necessarily increase the translation quality. This holds true across all the user experience levels (edit buckets).

I was wondering whether machine translation quality differences across wikis can have an impact. That is, a low-quality MT output with a high percentage of modifications may still not be good enough, while a good-quality MT starting point may require fewer modifications. Maybe we can check if the analysis still holds true when applied to individual languages (where translation quality is expected to be more consistent).
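One possible shape for that per-language check (synthetic data and illustrative column names; statsmodels chosen here just as one convenient tool):

```python
# Sketch: re-run the MT-percentage vs. deletion comparison within each
# target language, where MT quality is roughly constant.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "target_lang": rng.choice(["fi", "bn", "id"], size=3_000),
    "mt_pct": rng.uniform(0, 100, size=3_000),
})
df["deleted"] = (rng.random(3_000) < 0.08).astype(int)

for lang, group in df.groupby("target_lang"):
    model = smf.logit("deleted ~ mt_pct", data=group).fit(disp=False)
    print(lang, round(model.params["mt_pct"], 4))
# Clusters of positive vs. negative mt_pct coefficients across languages
# would suggest the counter-intuitive result is MT-quality dependent.
```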

Human and MT percentages aren't two sides of the same coin?

The study considers human and MT percentages. Intuitively, I'd expect those to be complementary (100% MT = 0% human). However, in the sections about variable relevance they show some differences. For example, the MT percentage has a −4.36% effect on deletions and the human percentage a 7.38% effect, and they have different importance for newcomers.
I was wondering if this is just due to data fluctuations and limited sample sizes, or if there may be other related factors (e.g., lack of MT resulting in a 100% human percentage).
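A quick way to check the complementarity directly, assuming a hypothetical export with both percentage columns (synthetic values here):

```python
# Sketch: test whether human and MT percentages actually sum to ~100, or
# whether a residual (missing MT, rounding, unmodified non-MT text) remains.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
mt_pct = rng.uniform(0, 100, size=1_000)
# Synthetic stand-in: mostly complementary, with a small residual.
human_pct = 100 - mt_pct - rng.uniform(0, 0.5, size=1_000)

gap = 100 - (mt_pct + human_pct)
print(pd.Series(gap).describe())   # size of the residual
print((np.abs(gap) > 1).mean())    # share of rows off by more than 1pp
```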

Also, for future reference, I wanted to capture in the ticket some of the quick takeaways (feel free to correct me if I'm misinterpreting the report):

  • Selecting longer source articles to translate makes the translation more likely to be deleted.
  • Publishing a longer translation as a result makes it less likely to be deleted.
  • Publishing a translation with a higher percentage of human modifications makes it more likely to be deleted.
  • Spending more time creating a translation makes it less likely to be deleted (for some user groups).
  • Creating many articles during a short period of time (15 days) makes the translations more likely to be deleted.
  • Selecting a source article that meets the standard quality criteria makes the translations more likely to be deleted.
  • Publishing a translation whose content meets the standard quality criteria makes the translation less likely to be deleted.
  • Having no machine translation available makes the translations more likely to be deleted.
  • Making a translation as the user's first edit makes the translation more likely to be deleted.
  • Making a translation on mobile makes the translations more likely to be deleted.
  • Publishing a translation into a smaller wiki makes it less likely to be deleted.
  • Selecting a source article from a larger wiki makes the translations more likely to be deleted.

Another way to look at the above: if we were providing advice to translators on how to have a successful translation, it could be (in bold, those points based on data that seem counter-intuitive):

  • Don't select a long article to translate (or one that meets the standard quality criteria), but spend time on the translation and write a longer translation.
  • Write a translation that meets the standard quality criteria.
  • Don't create many translations in a short period of time.
  • Translate into languages where Machine Translation is available.
  • Don't make a translation as your first edit.
  • Publish translations in smaller wikis.
  • Don't modify the initial machine translation significantly.
  • Don't translate on mobile.
  • Don't select an article to translate from a larger wiki.

@Pginer-WMF Thanks for reviewing the report and sharing your thoughts + takeaways. I am glad that it was useful.

Translating to smaller wikis makes translations less likely to be deleted: It would be great to take a closer look at those hypotheses. Looking at some specific data points, the above hypotheses do not seem to apply.

Yes, definitely. Thanks for pointing out the Bangla vs. English example. From the analysis, while translating into smaller wikis results in a lower likelihood of deletion, the relationship is not linear; it only holds for wikis outside the top 20. Referring to this graph, the proportion of deleted articles is high until the top 20 Wikipedias, and then it starts to decline. According to wiki-comparison, bnwiki is around rank 30. The statement is generalized across the rank bins, which may not hold true when looking at individual wikis. As you mentioned, it will be worth looking at the deletion rate of non-CX articles. A follow-up analysis can study this hypothesis in combination with the following: the deletion rate of non-CX articles, the number of active editors, and the level of patrolling activity. In addition, we can also make the rank binning more granular.
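For the more granular binning, something as simple as this could work (the cut points are arbitrary, and the data frame and columns are assumptions):

```python
# Sketch: finer bins of target-wiki size ranks, to see where the decline
# in deletion rate actually starts (e.g. around the top 20).
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "target_wiki_rank": rng.integers(1, 300, size=5_000),  # size rank of target wiki
    "deleted": rng.random(5_000) < 0.05,
})

bins = [0, 10, 20, 30, 50, 100, 200, 300]
df["rank_bin"] = pd.cut(df["target_wiki_rank"], bins=bins)
print(df.groupby("rank_bin", observed=True)["deleted"].mean())
```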

Source articles conforming to standard quality result in higher deletions: It would be interesting to check different page-size buckets for articles meeting the other quality criteria. That is, could we expect that shorter articles that meet the rest of the criteria (references, images, etc.) result in translations less likely to be deleted?

Yes, it will be worth extracting the features behind the standard quality criteria (page length, images, references, etc.) and seeing how they impact the deletion outcome. Currently, the hypothesis is that if the source articles meet standard quality, they are likely to be more complex (more sections, links, references, etc.), which may not be an easy entry point for translation. The hypothesis we can try to check is whether the complexity of the source article increases the probability of the translation being deleted. That shouldn't necessarily stop us from suggesting such articles; rather, we should look at the quality of the translation and see how constructive feedback can be given to users.
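A sketch of how the unbundled criteria could be tested, with assumed feature names (the actual definitions would come from the standard quality criteria) and synthetic data:

```python
# Sketch: estimate the separate effects of individual quality criteria
# on the deletion outcome with a simple logistic model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 2_000
df = pd.DataFrame({
    "source_bytes": rng.integers(500, 50_000, size=n),
    "source_refs": rng.integers(0, 60, size=n),
    "source_images": rng.integers(0, 10, size=n),
    "source_links": rng.integers(0, 200, size=n),
})
df["deleted"] = (rng.random(n) < 0.07).astype(int)

model = smf.logit(
    "deleted ~ source_bytes + source_refs + source_images + source_links",
    data=df,
).fit(disp=False)
print(model.params)
# A large positive coefficient on source_bytes alongside small
# coefficients on the other criteria would support the complexity
# explanation over "standard quality" per se.
```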

More unmodified machine translation results in lower deletions: I was wondering whether machine translation quality differences across wiki can have an impact.

Yes, the quality of machine translation for a language pair will impact this, which isn't accounted for in the analysis. As we discussed during our last sync, currently there is no source except scores such as BLEU to understand machine translation quality for a language pair, and even that has its shortcomings (for example Finnish, as you mentioned). A way we could understand this further is by taking a random sample of language pairs from various groups (large, medium, and small wikis), qualitatively assessing their translation quality, and augmenting the current dataset with that information, which can help us further understand what's happening with the MT / human modification percentages.
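For example, a stratified sample could be drawn like this (the pair list and size-group labels are made up for illustration):

```python
# Sketch: a stratified random sample of language pairs for qualitative
# MT-quality assessment, a few pairs per wiki-size group.
import pandas as pd

pairs = pd.DataFrame({
    "source_lang": ["en", "en", "fr", "en", "ru", "en", "es", "en"],
    "target_lang": ["fi", "bn", "ht", "id", "kk", "sw", "qu", "is"],
    "size_group":  ["medium", "medium", "small", "medium",
                    "small", "small", "small", "medium"],
})

sample = pairs.groupby("size_group").sample(n=2, random_state=42)
print(sample)
```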

Human and MT percentages aren't two sides of the same coin?: The study considers human and MT percentages. Intuitively, I'd expect those to be complementary (100% MT = 0% human). However, in the sections about variable relevance they show some differences. For example, the MT percentage has a −4.36% effect on deletions and the human percentage a 7.38% effect.

I may be misreading your question, so please correct me if I am wrong. If you are referring to this table, −4.36% is the percentage change in the probability of deletion when the machine translation percentage is increased by 15 percentage points (given an initial value of 30%). Similarly, when the human modification percentage increases by 10 percentage points from an initial value of 10%, the probability of deletion increases by 7.38%.
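To make the mechanics concrete, here is a toy version of that calculation with made-up logistic coefficients (only the interpretation, not the numbers, matches the report):

```python
# Worked sketch: the table's values compare a logistic model's predicted
# deletion probability at a baseline predictor value vs. an increased one.
import numpy as np

def p_deleted(mt_pct, intercept=-1.8, beta=-0.02):  # illustrative coefficients
    return 1 / (1 + np.exp(-(intercept + beta * mt_pct)))

baseline = p_deleted(30)        # initial MT percentage of 30%
increased = p_deleted(30 + 15)  # after a 15-percentage-point increase
print(f"change in deletion probability: {100 * (increased - baseline):+.2f} pp")
```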

Human and MT percentages are complementary to each other, i.e. ideally they should sum to 100% for a given translation. But during the analysis I have observed that sometimes around 0.1-0.5% isn't accounted for.


From the takeaways and advice to translators,

Selecting longer source articles to translate makes the translation more likely to be deleted. Selecting a source article that meets the standard quality criteria makes the translations more likely to be deleted.
Don't select a long article to translate (or one that meets the standard quality criteria), but spend time on the translation and write a longer translation.

I would be cautious about this. Yes, the report suggests this from the available data, but it may not be worth making a decision unless we investigate further. Article length is part of the standard quality criteria, and the increased probability of deletion may be associated with the complexity of the article. When translators take on a more complex article and the translation doesn't reach the same level, the result is an increased deletion probability. Especially for less experienced users, taking up large/complex articles will lead to an increased probability of deletion.

Publish translations in smaller wikis.

This goes beyond the quality of the translation itself. As we discussed, one of the reasons could be variation in quality expectations, but that shouldn't necessarily mean we should advise against publishing to larger wikis.

The rest looks good to me.


I can think of the following additions whenever a follow-up analysis is conducted (feel free to suggest more that you think would be helpful):

  • How do the individual criteria within the standard quality criteria (both at source and target) impact the deletion rate (length, intra-wiki links, references, etc.)?
    • This can also be helpful in providing more specific advice to translators.
  • How does the machine translation quality influence the MT / human percentages, and thereby the deletion outcome?
    • For example: if the MT quality is very good, will too much human modification lead to an increased deletion probability?
  • Augmenting the data with active editor counts, patrolling metrics (such as recent changes and new page patrol), and non-CX deletion percentages, to further understand why the deletion proportion is low on smaller wikis.
    • In addition, the size rank bins can be made more granular.


  • How does the machine translation quality influence the MT / human percentages, and thereby the deletion outcome?
    • For example: if the MT quality is very good, will too much human modification lead to an increased deletion probability?

Thanks @KCVelaga_WMF.
From the possible follow-ups, the one I'd consider starting with would be the one about MT, since it is related to the most counter-intuitive result (editing the initial MT more resulting in higher deletions). The approach proposed makes sense. However, I was wondering whether the step to "qualitatively assess the translation quality" may be a bit complex. Would it make sense to check the numbers for the articles translated into a given language as a separate group of results? For example, if we look at all the translations into Finnish, those are done with the same MT quality level: the MT quality level available for Finnish (whichever it is, even if we do not label it as high or low). Doing the same for other languages, one at a time, would help to identify whether results repeat evenly across the board, or whether there are some clusters (i.e., some languages where modifying the initial MT leads to more deletions and others where it does not).

At that point we can also decide on evaluating MT quality for languages in those groups, but it won't be a blocker to learning whether the counter-intuitive result happens across all languages or not. Maybe it is better to perform the evaluation upfront; I'm just sharing an alternative approach for consideration.

Thank you for this analysis, KC!

I recommend starting a "follow-up research proposal" Google doc and collecting questions / further investigation points there. Then, when that follow-up research is needed, we would create a Phab/Asana task to track that work.