
Automatic evaluation of the limit system algorithm
Open, Medium, Public

Description

The current system of limits in Content Translation includes relevant considerations to encourage users to edit the initial automatic translations, but it also has important limitations. For example, any translation with fewer than 50 paragraphs can be published even if it consists of 95% unmodified MT.
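
To illustrate the kind of loophole described (this is a hypothetical reconstruction, not the actual Content Translation code), a rule that only blocks publishing once an absolute number of paragraphs are almost entirely unmodified MT never triggers for short translations:

```python
def hypothetical_current_check(paragraph_mt_fractions: list[float]) -> bool:
    """Hypothetical reconstruction of the kind of rule described above,
    NOT the actual Content Translation implementation: publishing is only
    blocked when at least 50 paragraphs are almost entirely unmodified MT,
    so a short translation never reaches the limit."""
    heavily_mt = sum(1 for fraction in paragraph_mt_fractions if fraction >= 0.95)
    return heavily_mt < 50  # True means publishing is allowed


# A 10-paragraph translation that is 95% raw MT can still be published:
assert hypothetical_current_check([0.95] * 10)
```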

When proposing improvements to the system or alternative approaches for how modifications are calculated (T251893) or how the limits are applied (T245840), it is hard to anticipate how well those will work for most translations in different languages. This ticket proposes creating an automatic process for evaluating a limit system against a collection of translation samples.

The idea is to have two sets of translation samples: one with good translations that are expected to be allowed to publish, and another with low-quality translations that the tool is expected to prevent from being published.

The process would apply a given limit system algorithm to all of those samples and report how many samples in each category were allowed, warned about, or prevented from publishing.
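
A minimal sketch of what such a harness could look like (the sample fields, the `Verdict` values and the `LimitAlgorithm` signature are assumptions for illustration, not an existing API):

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable


class Verdict(Enum):
    ALLOW = "allow"
    WARN = "warn"
    PREVENT = "prevent"


@dataclass
class TranslationSample:
    """One translation, labelled by a human as good or bad."""
    source_language: str
    target_language: str
    source_text: str
    mt_text: str          # the initial machine translation offered to the editor
    published_text: str   # the content the editor tried to publish
    expected_good: bool   # human judgement: should publishing be allowed?


# A "limit algorithm" is anything that maps a sample to a verdict.
LimitAlgorithm = Callable[[TranslationSample], Verdict]


def evaluate(algorithm: LimitAlgorithm, samples: Iterable[TranslationSample]) -> dict:
    """Apply the algorithm to every sample and report, per expected category,
    how many samples were allowed, warned or prevented from publishing."""
    results = {"good": Counter(), "bad": Counter()}
    for sample in samples:
        category = "good" if sample.expected_good else "bad"
        results[category][algorithm(sample).value] += 1
    return results
```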

The translation samples could use the same format as the parallel corpora, making it possible to extract samples from them (after human confirmation that they are of high or low quality), or they could be created as test cases (by editing the MT intensively or lightly). Some of these samples could also be obtained as a byproduct of the upcoming research on translation quality (T288012) or from user reports (T305814).
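
If the samples roughly follow the structure of the published parallel corpora (source / mt / target content per record), loading them into the sketch above could look like this. The field names reflect my assumption about the dump layout and may need adjusting:

```python
import json


def load_samples(path: str, expected_good: bool) -> list[TranslationSample]:
    """Read corpus-style records and attach the human quality label.

    The field names mirror my understanding of the parallel corpora dumps
    (sourceLanguage, targetLanguage, source/mt/target sections); adjust
    them if the actual layout differs.
    """
    with open(path, encoding="utf-8") as corpus_file:
        records = json.load(corpus_file)
    return [
        TranslationSample(
            source_language=record["sourceLanguage"],
            target_language=record["targetLanguage"],
            source_text=record["source"]["content"],
            mt_text=(record.get("mt") or {}).get("content", ""),
            published_text=record["target"]["content"],
            expected_good=expected_good,
        )
        for record in records
    ]
```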

Event Timeline

Pginer-WMF created this task.
Pginer-WMF updated the task description.

To better understand what is proposed, could you please clarify the following?

  • Are these algorithms evaluated offline with previously collected combinations of original content, translated content, and MT? Or are we doing some kind of A/B testing where all algorithms are deployed in production and we then measure what happens?
  • Is this evaluation independent of the translator and the translator's history of contributions and deletions?
  • Are we evaluating the algorithms on whether they match human expectations?
  • What are the prerequisites for this evaluation? A collection of good translations and bad translations? If so, what defines good and bad for a language? Is this based on human judgement? If so, is it a single person's judgement?

To better understand what is proposed, could you please clarify the following?

Good questions. I'm sharing my answers below based on my initial ideas (but I'm happy to consider better alternatives).

  • Are these algorithms evaluated offline with previously collected combinations of original content, translated content, and MT? Or are we doing some kind of A/B testing where all algorithms are deployed in production and we then measure what happens?

I was thinking of offline evaluation: running the algorithm over a previously collected set of samples.
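
In terms of the sketch from the task description, an offline run could look like this (the file paths are placeholders, and the baseline algorithm exists only to show the flow):

```python
def never_block(sample: TranslationSample) -> Verdict:
    """Placeholder baseline; real candidates (current rules, proposed rules)
    would be plugged in here instead."""
    return Verdict.ALLOW


good_samples = load_samples("samples/good.json", expected_good=True)   # placeholder path
bad_samples = load_samples("samples/bad.json", expected_good=False)    # placeholder path

print(evaluate(never_block, good_samples + bad_samples))
```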

  • Is this evaluation independent of the translator and the translator's history of contributions and deletions?

Mainly, yes. I think it makes sense to treat the data as independent of who created it. For example, given a particular example translation, it would be great if the evaluation always produced the same result.
Since some algorithms consider additional details (e.g., whether the user had previous deletions), we could include those as parameters of the test. For example, we can run an algorithm over all samples twice, once simulating a regular user and once simulating a user with previous deletions. However, I would not consider taking data from real users, since that makes the evaluation less predictable.
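
One way to express that in the earlier sketch: pass a simulated user context as a parameter and run the same samples once per context (the `UserContext` shape is an assumption for illustration):

```python
from dataclasses import dataclass


@dataclass
class UserContext:
    """Simulated editor attributes that an algorithm may take into account."""
    has_previous_deletions: bool = False


def evaluate_with_contexts(algorithm, samples, contexts):
    """Run the same labelled samples once per simulated user context,
    keeping the results reproducible and independent of real user data."""
    return {
        name: evaluate(lambda sample, ctx=ctx: algorithm(sample, ctx), samples)
        for name, ctx in contexts.items()
    }


# For example, compare a regular user against one with previous deletions:
# evaluate_with_contexts(my_algorithm, good_samples + bad_samples, {
#     "regular user": UserContext(),
#     "previous deletions": UserContext(has_previous_deletions=True),
# })
```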

  • Are we evaluating the algorithms on whether they match human expectations?

Yes. We try to make the human expectation more specific with a set of examples that illustrate it. If a Chinese Wikipedia editor points to a bad translation that our system is not flagging, we can incorporate it into the list of samples. Then, for any improvement we make to the system to better deal with that case, we can be more confident that it does not create problems for the cases represented by the other existing samples.
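
This effectively turns community reports into regression cases. A small sketch on top of the earlier harness (function and variable names are illustrative):

```python
def check_reported_case(algorithm, good_samples, reported_bad_sample):
    """A reported bad translation must not be allowed to publish, and the fix
    must not start blocking samples that humans already consider good."""
    assert algorithm(reported_bad_sample) != Verdict.ALLOW, \
        "the reported bad translation still slips through"

    tallies = evaluate(algorithm, good_samples)
    assert tallies["good"][Verdict.ALLOW.value] == len(good_samples), \
        "an existing good sample is no longer allowed"
```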

  • What are the prerequisites for this evaluation? A collection of good translations and bad translations? If so, what defines good and bad for a language? Is this based on human judgement? If so, is it a single person's judgement?

Yes, we need a collection of good and bad translations. What is good and bad is defined by human judgement. In the same way that we tried to capture this judgement in the limit rules (e.g., a 100% unedited translation is likely bad), we will be capturing it in a series of examples. We can start with some samples and increase coverage as we collaborate with communities to improve how the limits work for them. The more languages covered and people involved, the better.
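
To track that coverage as samples are added, a simple per-language tally over the labelled collection could be enough (building on the earlier sample sketch):

```python
from collections import Counter


def coverage_by_language(samples) -> Counter:
    """Count labelled samples per language pair and expected category,
    to see where more community-provided examples are still needed."""
    return Counter(
        (sample.source_language, sample.target_language,
         "good" if sample.expected_good else "bad")
        for sample in samples
    )
```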