
Check if it is possible to import machine translated strings to translatewiki.net
Stalled, Lowest priority, Public

Description

Context

As we are going to scale our tools to more wikis, we are blocked in the translation process. Unlike other tools, the Growth tools aren't visible to experienced users (they target newcomers), hence they don't really get translated.

The only way to get translations (with no guarantee of reaching 100% coverage) is to start an active community discussion. This can't scale to all wikis, since we don't have the workforce to encourage every wiki individually, and not all communities reply to these calls.

We think it is better for newcomers to have an interface in their language, even if it is provided by a machine translation service, than to have it in English.

Using machine translation doesn't mean that we skip the announcement to communities about a forthcoming deployment. They will be informed about the deployment weeks before it happens, with an opportunity to work on genuine translations. We expect this potential use of machine translation to encourage communities to work on translations, or at least to fix them quickly. Machine translation is a backup process, used only if no translations are made ahead of time.

Task

This task is to check the feasibility of the following:

  1. extract all strings needing translation (skip the ones already translated)
  2. find a way to translate them en masse
    • check if existing translations from translatewiki, used in the same context, can be used instead of machine translation
  3. import them to translatewiki.net, with a tag "to be checked" (if one exists)

Check how ContentTranslation handles wikitext, and whether there is an API for it.

Ways to pursue this task
On-the-fly machine translation

Instead of importing messages to translatewiki.net, we might explore the possibility of translating messages on the fly. I briefly looked into the MediaWiki ways of doing that. I did not look into possible sources of machine translations, because I think that's out of scope for this task. The MessageCache core service is responsible for getting the message text. It has a function called getMessageFromFallbackChain, which implements language fallbacks. It should be easy to add a hook into it to allow extensions to define fallback messages on the fly, which would make on-the-fly machine translations easy (provided we actually have a translation service with an API we can use). However, it would probably also require a caching layer over the machine translations, which is probably more trouble than this is worth.
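
A rough sketch of what such a hook handler could look like, assuming a new hook (here called MessageCacheFallbackFinal purely for illustration, it does not exist) that fires when the fallback chain yields no translation, plus a hypothetical GrowthMachineTranslation service wrapping whatever machine translation API we would use:

```php
<?php
// Sketch only: the hook name and the GrowthMachineTranslation service are
// hypothetical, and the caching layer mentioned above is omitted.
class MachineTranslationHooks {
	/**
	 * Provide a machine translation when the fallback chain found nothing usable.
	 *
	 * @param string $key Message key
	 * @param string $langCode Requested language code
	 * @param string|false &$message Message text found so far, or false if none
	 */
	public static function onMessageCacheFallbackFinal( $key, $langCode, &$message ) {
		if ( $message !== false ) {
			// A human translation exists somewhere in the fallback chain; keep it.
			return;
		}
		$translated = GrowthMachineTranslation::getInstance()
			->translateMessage( $key, $langCode );
		if ( $translated !== null ) {
			$message = $translated;
		}
	}
}
```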

Pros:

  • It won't pollute the TWN repository
  • It will allow us to disable this feature when a certain user preference/query parameter is present, allowing users to turn off machine translations

Cons:

  • Expensive in engineering time.
Separate message group

We could add another message group (or groups) to MessagesDirs, which would not go through TranslateWiki at all, and which would be loaded at the end, after all TWN-populated message groups. That would make them act as a fallback to all human-populated (TWN-populated) groups.
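
One possible shape for this, sketched against extension.json (the i18n-machine/ directory is hypothetical, and whether it must be listed before or after the human-translated i18n/ directory for human translations to win would need to be checked against how LocalisationCache merges the files):

```json
{
	"MessagesDirs": {
		"GrowthExperiments": [
			"i18n",
			"i18n-machine"
		]
	}
}
```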

Pros:

  • It will not pollute the TWN translation repository. Messages untranslated by humans will still be marked as untranslated, while real users still get the benefit of a machine translation.
  • It would probably allow us to remove this group when a certain user preference/query parameter is present, allowing us to disable machine translations for newcomers who speak English.

Cons:

  • Not sure?
Machine translations imported into TranslateWiki.net

The TranslateWiki.net interface recognizes several categories of messages:

  • Untranslated – for messages that are in TWN, but not yet translated by a translator
  • Fuzzy/Outdated – for messages that were translated by a translator, but the English source message was changed after the translation was made
  • Translated – for messages that were translated by a translator and are up to date
  • Verified – for messages that were translated by a translator, and verified by another translator

In addition to that, messages can be flagged as optional, but that is out of scope for this task.

Translations can be added into TWN in three ways:

  • Online translation – regular translation process made in the interface by translators
  • Offline translation – translators specifically flagged by TWN staff as offline translators may use https://translatewiki.net/wiki/Special:ImportTranslations to import translations in gettext/po format (docs; @Urbanecm_WMF has this flag on his TWN account)
  • Imported from external source – while it is not recommended, translations can be imported by directly modifying the JSON files in our codebase (e.g. cs.json).

Either using the offline translation feature or directly modifying the JSON files in our codebase sounds like a good way to add translations into TWN.
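
For illustration, the "directly modify the JSON file" route would just mean adding key/value pairs to the banana i18n file, e.g. in cs.json (the message key below is hypothetical):

```json
{
	"@metadata": {
		"authors": [
			"..."
		]
	},
	"growthexperiments-example-message": "…machine-translated Czech text…"
}
```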

There is no way to mark a translation as machine-translated, or to manually flag a message as untranslated. The Language-Team might consider adding such a feature for us. Considering it is possible to do on-the-fly translations or introduce an extra message group, I (@Urbanecm_WMF) would personally vote for pursuing those instead.

Pros:

  • Easy to do without changing any code

Cons:

  • No way to automatically detect which messages are translated by humans and which are translated by a machine
  • No way to disable autotranslation
  • Pollutes TWN repository with machine translation
Open questions
  • Should we add translations to TWN, or translate them on the fly?

Event Timeline

Restricted Application added a subscriber: Aklapper.

@Urbanecm_WMF is going to begin a research spike on this task, to think about possible ways to pursue this, open questions, and possible blockers.

I'm a bit skeptical about this. Translatewiki already incorporates machine translation (the translator has to accept it, which is not a trivial amount of time given how many messages we have, but compared to importing the translations and then having a community member review them, it wouldn't really be different). Most communities are pretty strongly opposed to machine translation without human oversight, so not sure why this would fare better than other attempts to utilize it.

import them to translatewiki.net, with a tag "to be checked" (if one exists)

There is a review workflow (although not really used, I think): once a translation has been created, it can be marked as reviewed. So "to be checked" is the default state and this requires no extra effort.

(The other option would be to mark these automated translations as "fuzzy", which means they do not get exported to MediaWiki until someone manually accepts them. But that wouldn't be really different from relying on Translatewiki's own machine translation suggestions.)

OTOH we'd maybe have to figure out a way of letting the software know these translations are "suspect", to avoid poisoning translation memory. Or maybe that can be done by importing the right way, I'm not really familiar with how translation memory works.

FYI, here is an interface example of a machine translation suggestion:

[Screenshot: translatewiki machine suggestion.png]

(The Apertium one; the other is translation memory.) The translator can click on it to copy it to the edit field.

I'm a bit skeptical about this. Translatewiki already incorporates machine translation (the translator has to accept it, which is not a trivial amount of time given how many messages we have, but compared to importing the translations and then having a community member review them, it wouldn't really be different). Most communities are pretty strongly opposed to machine translation without human oversight, so not sure why this would fare better than other attempts to utilize it.

That was exactly my position (albeit with different arguments) when I talked about this with Marshall in a meeting. We concluded that it doesn't hurt to try this, and we might abandon it if it turns out to be a bad idea. Making GE auto-translate on the fly when there is no translation, rather than falling back to English, would save some time for community members, though.

I've added my own notes to the task description about possible ways to implement this, if we want to ultimately go forward here.

The message cache is a file built on every appserver, so on-the-fly translation would mean calling the API once for every server, right? And every time the CDB files are rebuilt (i.e. every train deploy or scap)? Or coming up with some new custom cache layer. Sounds like way more trouble than it is worth.

A quick fact check:

or even manually flag a message as untranslated.

Assuming you mean translatewiki.net: outdated aka fuzzy messages are treated as untranslated in the translator interface. It's a limitation of the banana i18n format that those cannot be indicated. I believe (not verified) that adding !!FUZZY!! to the messages in the files would mark them outdated in translatewiki.net. But then !!FUZZY!! would show to the users too, so it is not an option unless you remove it somehow before displaying. Adding !!FUZZY!! in translatewiki.net does not have this problem, as our code strips it on export.
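
If !!FUZZY!! markers were committed into our own files, we would indeed have to strip them before display ourselves; a crude sketch (the message key is hypothetical):

```php
// Sketch only: strip a leading !!FUZZY!! marker before showing the message to readers.
$text = wfMessage( 'growthexperiments-example-message' )->text();
if ( strpos( $text, '!!FUZZY!!' ) === 0 ) {
	$text = substr( $text, strlen( '!!FUZZY!!' ) );
}
```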

The other option would be to mark these automated translations as "fuzzy", which means they do not get exported to MediaWiki until someone manually accepts them.

They do get exported, they just don't count as translated when calculating statistics.


I have another suggestion for you. Step one: do offline MT translation and commit the results into the repo as separate JSON files. Step two: load those messages in a way that human-made translations take precedence.

I see a few options for how to do this:

  • Use the wgMessagesDirs global and be careful about the loading order (simplest and most performant)
  • Use LocalisationCacheRecache/LocalisationCacheRecacheFallback to do the same. See the LocalisationUpdate extension for an example use of this hook (a rough sketch follows this list).
  • Use MessageCache::get to do this logic at runtime. See the WikimediaMessages extension for an example use of this hook.
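
For the second option, a rough sketch of a handler, modelled loosely on what the LocalisationUpdate extension does with this hook (the i18n-machine/ directory and merge details are assumptions to verify):

```php
<?php
// Sketch: overlay machine-translated messages underneath human translations
// when the localisation cache is rebuilt. Illustrative only.
class MachineTranslationFallbackHooks {
	/**
	 * LocalisationCacheRecache hook handler.
	 *
	 * @param LocalisationCache $lc
	 * @param string $code Language code being recached
	 * @param array &$cache Localisation data being built
	 */
	public static function onLocalisationCacheRecache( $lc, $code, array &$cache ) {
		$file = __DIR__ . "/i18n-machine/$code.json";
		if ( !file_exists( $file ) ) {
			return;
		}
		$machine = json_decode( file_get_contents( $file ), true ) ?: [];
		unset( $machine['@metadata'] );
		// The "+" operator keeps existing (human) translations and only fills gaps.
		$cache['messages'] = ( $cache['messages'] ?? [] ) + $machine;
	}
}
```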

Pros:

  • No need to involve translatewiki.net and deal with all the various problems (some listed in the comments above) that it would cause
  • You have full control

Cons:

  • No message tracking: you would have to take care yourself of updating the machine translations when you add or change messages
  • You need to do some initial exploration to get the proposed logic working

In general, I do not recommend using unedited machine translations. If you do that, I think those should be clearly marked as such, so that users know they are machine translations.

The message cache is a file built on every appserver, so on-the-fly translation would mean calling the API once for every server, right? And every time the CDB files are rebuilt (i.e. every train deploy or scap)? Or coming up with some new custom cache layer. Sounds like way more trouble than it is worth.

It's probably rebuilt on every appserver when doing a full scap sync. Full scap is avoided unless really necessary, as even rebuilding all translations from the JSON files themselves takes a lot of time (90% of full scap time is the localisation cache rebuild). I don't think a few messages from us would change that negatively, but obviously that's just a guess.

@Nikerabbit's suggestion of committing separate JSON files with machine translations and letting human translations take precedence sounds good: that way, translatewiki.net is unaffected (and translators can still use the machine translation aid, as it's displayed there, as @Tgr noted), and our interface looks translated.

I also think we need to somehow detect them when rendering, and add an icon or something to indicate it's a machine translation. No idea right now how to do it though; would need to research that.

Terminology note: MessageCache is the cache for messages customised on the wiki via the MediaWiki namespace. LocalisationCache is the cache for all messages (and other things) from the file system.

I also think we need to somehow detect them when rendering, and add an icon or something to indicate it's a machine translation. No idea right now how to do it though; would need to research that.

Maybe translating the interface could be a structured task that you need to finish to unlock the other task types :)

Wrt detection, I don't think we need to tell which individual messages are machine-translated (what would we do with that?). We need a global "translation needs review" flag instead; if that is true, we'd add a link somewhere on the wiki asking people (maybe the new users themselves) to help with translation. That can be handled in some low-tech way: just use a message for that, e.g. ask translators to set it to "-", and then check if it is disabled. And then we can go with the wgMessagesDirs approach (or !!FUZZY!! on translatewiki, which also seems reasonable to me, although maybe it would mess up the translation memory?).
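
A minimal sketch of that low-tech check, assuming a hypothetical control message key (growthexperiments-translation-reviewed is illustrative only):

```php
// Translators would set this message to "-" once the interface translation has
// been reviewed; Message::isDisabled() returns true for "-" or empty content.
$needsReview = !wfMessage( 'growthexperiments-translation-reviewed' )
	->inContentLanguage()
	->isDisabled();
if ( $needsReview ) {
	// Show a link somewhere on the wiki asking people to help with translation.
}
```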

If we really want to auto-detect, a horrible hack would be to add some invisible character like ZWJ to the end of the translations and check for the presence of that.

If we really want to auto-detect, a horrible hack would be to add some invisible character like ZWJ to the end of the translations and check for the presence of that.

By the way, this problem is very similar to the problem of knowing which language the translation is in when fallbacks are used. That causes issues like T268492: When translation falls-back, does not use fall-back language's plural rules.

Would it be possible to have a summary of ideas and findings for Wednesday? It would help Marshall and me think about the next steps concerning our translation strategy.

Would it be possible to have a summary of ideas and findings for Wednesday? It would help Marshall and me think about the next steps concerning our translation strategy.

Will write something!

Anything new on this front? At the moment, the most recent deployments have seen translation needs covered, but we may need this solution for future deployments.

MMiller_WMF changed the task status from Open to Stalled. (Mar 10 2021, 2:03 AM)

Thanks for checking on this. I think our interest in this idea has decreased, given the many challenges we identified and, as you say, the fact that our current deployment approach is resulting in communities translating quickly. I'm marking this as "Stalled".

Urbanecm_WMF triaged this task as Lowest priority.
Urbanecm_WMF subscribed.

Marking it as lowest, unassigning from me and removing from the sprint board, as I'm not actively working on this anymore.

Couldn't machine-translated messages (produced using another translation engine) be used to "pre-feed" a separate translation memory ("proposed"), appearing as a separate source on the right of the translation panel UI?

Then this source would be purged item by item, once the proposal and our existing translation memory have both been evaluated at the same time. The translator should not be able to paste anything directly into the message input box before voting on which of the input sources best qualifies. After submitting this vote, the voted option becomes selectable, so it can be clicked and pasted into the input box (where it can still be reviewed and corrected).

The results of the votes comparing sources should also be logged for analysis: this way we could find typical constructs that cause an engine to give the wrong hints, and maybe this could also feed a machine learning process that precomputes a relevancy score used to sort the proposed sources in the list of proposals to vote on. And if the proposal with the highest score still has a low pre-rating score (i.e. a high probability of error), it should not be possible to submit it to the input box with a simple click until the scores are updated by other votes.

Ratings could be based on several heuristics, for example:

  • detection of typical tvars or notable markup present in the source, whose modification should be kept to a minimum (lettercase may differ between a proposal and the original, but that would not bring a negative score if the markup is known to be case-insensitive);
  • pattern matching with regexps to detect markup tags that must match together: if they match in the source, they should also match in the proposal;
  • terms that are part of a common terminology used in other messages of the same message group or parent project, with some known variants such as plural marks, gender marks, or capitalisation of initials;
  • small negative scores when other recommended typographies apply, e.g. for dashes/hyphens with or without spacing; more negative scores for differences of diacritics against the project's terminology, fewer if only against the global cross-project terminology;
  • negative scores for large differences in the number of words, or a large difference between the ratios of words/numbers vs. punctuation.

Various metrics like these could be computed to evaluate the proposals and find the best one. Finally, the proposals that best match our common translation memory, trained by humans, would get higher scores.
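
Purely as illustration (no such scoring code exists), two of these checks might look like this:

```php
<?php
// Illustrative scoring heuristics for a machine-translated proposal.
// $source is the source message, $proposal the suggested translation.
function scoreProposal( string $source, string $proposal ): float {
	$score = 0.0;

	// Placeholders such as $1, $2 used in the source must also appear in the proposal.
	preg_match_all( '/\$\d+/', $source, $matches );
	foreach ( $matches[0] as $placeholder ) {
		if ( strpos( $proposal, $placeholder ) === false ) {
			$score -= 1.0;
		}
	}

	// Penalise large differences in word count (naive for non-Latin scripts).
	$sourceWords = str_word_count( $source );
	$proposalWords = max( 1, str_word_count( $proposal ) );
	$ratio = $sourceWords / $proposalWords;
	if ( $ratio > 2 || $ratio < 0.5 ) {
		$score -= 0.5;
	}

	return $score;
}
```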

If the proposals do not have a significant enough difference in ratings, none would be clickable directly: translators would only use them visually and would have to type the text into the input box. Even copy-pasting from the displayed proposals should be blocked (or at least delayed sufficiently), forcing translators to take their time to read the proposals and the source message and decide, with fewer errors, what to do. Remember the event last July, where a newly subscribed Spanish translator started clicking frenetically on the first proposals coming from the internal machine translation engine: such users would be slowed down and would not use TWN as a game, in a competition to get the highest translator score. Maybe humans will make some minor typos if they are forced to type instead of clicking directly on proposals (e.g. missing letters or accents), but at least their input would make better sense, and the slower input rate would force them to perform a real review. As well, if we click on one proposal, or on the button to copy the English source, the "submit" button on the input form should be disabled for at least 5 seconds.

But for now I consider that all machine translation engines generate too many errors. It is still interesting to see their suggestions, because they provide hints and protect humans from misreading the source message and then submitting a translation that reads well but means something unrelated (for example reading "rain" in the source as if it were "train", then assuming it refers to a guided vehicle on a railway, while also wondering whether it simply means "exercise").