Softcatalà translator - requested for integration as an MT service for CX
Closed, Resolved, Public

Description

Details of this system as follows:

Name of the translator "Softcatalà translator"

API Calls:
https://api.softcatala.org/v2/nmt/listPairs

Project: https://github.com/Softcatala/nmt-softcatala

Translation service: https://www.softcatala.org/traductor/

The translation models support the following language pairs:

  1. German (de) ↔ Catalan (ca)
  2. English (en) ↔ Catalan (ca)
  3. French (fr) ↔ Catalan (ca)
  4. Galician (gl) ↔ Catalan (ca)
  5. Italian (it) ↔ Catalan (ca)
  6. Japanese (ja) ↔ Catalan (ca)
  7. Dutch (nl) ↔ Catalan (ca)
  8. Occitan (oc) ↔ Catalan (ca)
  9. Portuguese (pt) ↔ Catalan (ca)
  10. Spanish (es) ↔ Catalan (ca)
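
As a rough illustration of what the list above implies for an integration, each bidirectional pair expands into two translation directions. The sketch below is only a hand-written model of that list; the names `PARTNER_LANGUAGES`, `directed_pairs` and `is_supported` are hypothetical, and the authoritative set of pairs is whatever the listPairs endpoint reports.

```python
# Language codes copied from the list above; each one pairs with Catalan in both directions.
CATALAN = "ca"
PARTNER_LANGUAGES = ["de", "en", "fr", "gl", "it", "ja", "nl", "oc", "pt", "es"]


def directed_pairs(pivot=CATALAN, partners=PARTNER_LANGUAGES):
    """Expand each bidirectional pair into two (source, target) tuples."""
    pairs = set()
    for lang in partners:
        pairs.add((lang, pivot))
        pairs.add((pivot, lang))
    return pairs


SUPPORTED = directed_pairs()


def is_supported(source, target):
    """Return True if the (source, target) direction appears in the list."""
    return (source, target) in SUPPORTED
```

With this model, `is_supported("nl", "ca")` holds, while a pair that does not involve Catalan, such as `("en", "es")`, does not.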

Event Timeline

Pginer-WMF subscribed.

Editors from Catalan Wikipedia have created a user script to access this service from Content Translation. This signals the interest in using it, but it also results in a sub-optimal integration, since the readaptation of styles and other rich content is not applied.

The translation tool from Softcatalà is not only the best for Catalan, but currently also the most up-to-date and thoroughly revised for the pairs with Aragonese and Occitan. Since these are neighboring languages, deploying it would extend the benefits to 3 minorized languages (1% of the total existing languages on Wikipedia).

Just to make sure everyone is on the same page: the Aragonese and Occitan translators at Softcatalà are Apertium under the hood, and should already be live through the existing Apertium integration.

Adding Softcatalà to the list would be a great improvement for the Catalan Wikipedia. Other translation tools often have trouble with our language, but Softcatalà uses models specifically made to improve on those cases.

It should be easy to integrate as well. It's based on Apertium, so it uses the same language tags (languagenames.json), and probably a very similar translateText function. The user script we made could also be used as a basis for the call prompt.

This project provides CTranslate2-optimized models. That is good news, because WMF's self-hosted neural machine translation service, MinT, can include such models.

Change 927610 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/machinetranslation@master] Add SoftCatala NMT model for en->ca

https://gerrit.wikimedia.org/r/927610

Change 927610 merged by jenkins-bot:

[mediawiki/services/machinetranslation@master] Add SoftCatala NMT model for en->ca

https://gerrit.wikimedia.org/r/927610

Change 929038 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update MinT to 2023-06-10-124931-production

https://gerrit.wikimedia.org/r/929038

Change 929038 merged by jenkins-bot:

[operations/deployment-charts@master] Update MinT to 2023-06-10-124931-production

https://gerrit.wikimedia.org/r/929038

Mentioned in SAL (#wikimedia-operations) [2023-06-12T06:54:26Z] <kart_> Updated MinT to 2023-06-10-124931-production (T284905)

This is deployed in production. When using CX (or SX) for en->ca, MinT will be listed as an additional MT provider. You can also set it as the default using the button next to the selector. The English->Catalan machine translation is provided by the SoftCatala model. Any feedback on performance or translation quality is welcome.

Hi. What is the state of deployment for the other language pairs like Dutch to Catalan? I just tried to use it via the Content Translation tool and it’s not available yet at all.

@Pginer-WMF SoftCatala is now the default model for en-ca translations, but it is not supported for other language pairs. I'm not sure there is an immediate plan to support other language pairs, either. Should this task be moved to the backlog?

We plan to re-run the report on machine translation service usage (T338606) by the end of September, and to propose a series of changes to default services and default models based on the data. As part of those we can include the use of Softcatalà for additional pairs. Running the report again a couple of months after the changes would give us more signals about their effects.
Meanwhile I also plan to check multiple samples with both NLLB-200 and Softcatalà models.

So I guess we can move this task to the backlog for now, and create a separate sub-task (T345371) to capture the sample-based comparison.

Pginer-WMF claimed this task.

Hi. What is the state of deployment for the other language pairs like Dutch to Catalan? I just tried to use it via the Content Translation tool and it’s not available yet at all.

In the last update for MinT we updated the models and now Softcatalà models are used as default for the languages they support in MinT. You can try how this works in Content Translation and in the MinT test instance. Feel free to share any feedback here or in the MinT talk page. Thanks!

Why is it that still the Content Translate is using the model from Google Translate as the default one, and not the MinT with the Softcatalà improvements?

It should not be that way, and it is worth remembering that most editors may not check which assistant they have activated. On top of that, most Catalan-language Wikipedians are not familiar with the "MinT" concept (nor with Elia, nor with Yandex). They are, however, familiar with the name "Softcatalà", as it is a well-known organization with a strong and widespread reputation. So it is virtually impossible for them to know that Softcatalà, the best model available to and from Catalan, is embedded under this other name.

I am afraid that the only way to make all this previous effort fully effective is to label "MinT" as "MinT - Softcatalà" for the pairs available in the Catalan language, and to set it as the default model. Otherwise, its impact will be close to none. In the meantime, people will not open the foldable panel or, if they do, they will keep Google, which goes against our values and organizational roots. Is it possible to implement such a path? Thanks!

Why is it that still the Content Translate is using the model from Google Translate as the default one, and not the MinT with the Softcatalà improvements?

Hi @xavidegr. We are a bit cautious when changing the MT defaults, since that can affect the experience of translators. Most users will rely on the default provided, and we run periodic analyses of MT usage. In the most recent MT usage report we'll be looking for non-default MT options that show significant use and good results, and where we find them we'll propose adjusting the defaults. In addition, if communities find any option promising, we can enable it for a given period of time.

One key aspect to consider in this case is that while Google Translate natively supports HTML, the underlying models used by MinT, including Softcatalà, work on plain text. For the translation of Wikipedia articles, MinT has to re-apply the formatting to the translated text, and that process is not perfect. So from the end user's perspective, there may be a percentage of additional formatting mistakes coming from MinT (which we are working to reduce) that may affect their preference at comparable translation quality.
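
To make the plain-text constraint above more concrete, here is a minimal sketch of one naive way to carry HTML markup across a text-only translator: translate each text node separately and copy the tags through unchanged. This is not MinT's actual algorithm; `SegmentTranslator`, `translate_html` and the toy dictionary translator are hypothetical names used only for illustration.

```python
from html import escape
from html.parser import HTMLParser


class SegmentTranslator(HTMLParser):
    """Translate text nodes one by one and copy tags through unchanged."""

    def __init__(self, translate):
        super().__init__(convert_charrefs=True)
        self.translate = translate  # any callable: plain text -> plain text
        self.out = []

    def handle_starttag(self, tag, attrs):
        rendered = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs
        )
        self.out.append("<{}{}>".format(tag, rendered))

    def handle_endtag(self, tag):
        self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        core = data.strip()
        if not core:
            self.out.append(data)  # whitespace-only node: keep as-is
            return
        # Keep the surrounding whitespace so words do not fuse with tags.
        lead = data[: len(data) - len(data.lstrip())]
        trail = data[len(data.rstrip()):]
        self.out.append(lead + escape(self.translate(core), quote=False) + trail)


def translate_html(html, translate):
    parser = SegmentTranslator(translate)
    parser.feed(html)
    parser.close()  # flushes any text still buffered by the parser
    return "".join(parser.out)


# Toy stand-in for a plain-text MT model (assumption: word-by-word lookup).
TOY_DICTIONARY = {"The": "El", "cat": "gat", "sleeps.": "dorm."}


def toy_translate(text):
    return " ".join(TOY_DICTIONARY.get(word, word) for word in text.split(" "))


result = translate_html("The <b>cat</b> sleeps.", toy_translate)
```

Because every text node is translated in isolation, the model loses sentence context, and a translation that reorders words across a tag boundary cannot be represented at all. Translating the whole sentence as plain text avoids that, but then the tags have to be mapped back onto the output afterwards, which is the kind of step where formatting mistakes can appear.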