
MinT invents phrases in translation from Catalan to Asturian
Open, Needs Triage | Public | BUG REPORT

Assigned To
None
Authored By
Zapipedia
Sep 27 2024, 10:22 PM
Referenced Files
F57623830: translate.wmcloud.org_(Wiki Tablet) (3).png
Oct 18 2024, 12:10 PM
F57623824: translate.wmcloud.org_(Wiki Tablet).png
Oct 18 2024, 12:10 PM
F57623828: translate.wmcloud.org_(Wiki Tablet) (2).png
Oct 18 2024, 12:10 PM
F57623832: translate.wmcloud.org_(Wiki Tablet) (4).png
Oct 18 2024, 12:10 PM
F57623826: translate.wmcloud.org_(Wiki Tablet) (1).png
Oct 18 2024, 12:10 PM
F57563497: image.png
Sep 27 2024, 10:33 PM
F57563472: image.png
Sep 27 2024, 10:22 PM

Description

Steps to replicate the issue (include links if applicable):

  • Start a translation from Catalan to Asturian. For example, articles Halima Sadaf Karimi or Hosai Ahmadzai.
  • Use MinT for translation
  • Check if the translation corresponds with the original text

What happens?:
The tool invents text that does not correspond to the source article.

  • In Halima Sadaf Karimi translation:
    • In the article in Catalan, it says "A women's rights activist, she was persecuted by the Taliban when they returned to power in 2021. She criticized the passivity of the Afghan government and the international community in the face of Taliban attacks in August 2021" (Activista pels drets de les dones, va ser perseguida pels talibans quan van tornar al poder el 2021. Va criticar la passivitat del govern afganès i de la comunitat internacional davant els atacs talibans l'agost del 2021).
    • The translation that the tool offers says "Her career as an actress was the most important in the history of television. Her first job was in 1990 at The Guardian, where she published a series of books about the Taliban war." ("La so carrera como actriz foi la más importante de la hestoria de la televisión. El so primer trabayu foi en 1990 en The Guardian, onde publicó una serie de llibros sobre la guerra de los talibanes").
  • In Hosai Ahmadzai translation:
    • The tool offers a nonsensical translated title, "El so padre, el so padre, yera un fíu de José", which means something like "His father, his father, was a son of José".
    • The introduction also has an invented phrase.

It seems like the tool is taking text from a different article or is directly hallucinating.
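
One rough way to flag cases like the first example is to compare the numbers in the source and in the translation, since the Catalan text mentions 2021 while the Asturian output mentions 1990. The sketch below is only an illustrative heuristic, not something MinT or Content Translation does; the function names are hypothetical, and the strings are shortened versions of the sentences quoted above.

```python
import re


def numeric_tokens(text: str) -> set[str]:
    """Return every run of digits (years, counts, ...) found in the text."""
    return set(re.findall(r"\d+", text))


def unsupported_numbers(source: str, translation: str) -> set[str]:
    """Numbers present in the translation that never occur in the source."""
    return numeric_tokens(translation) - numeric_tokens(source)


def dropped_numbers(source: str, translation: str) -> set[str]:
    """Numbers present in the source that are missing from the translation."""
    return numeric_tokens(source) - numeric_tokens(translation)


# Shortened excerpts of the sentences quoted in this report.
source = (
    "Activista pels drets de les dones, va ser perseguida pels talibans "
    "quan van tornar al poder el 2021."
)
translation = (
    "El so primer trabayu foi en 1990 en The Guardian, onde publicó una "
    "serie de llibros sobre la guerra de los talibanes"
)

print(unsupported_numbers(source, translation))  # {'1990'}
print(dropped_numbers(source, translation))      # {'2021'}
```

A check like this only catches invented or dropped figures. It would not detect the invented title in the Hosai Ahmadzai example, which contains no numbers.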

What should have happened instead?:

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

image.png (584×1 px, 235 KB)
image.png (1×1 px, 283 KB)

Event Timeline

Thanks for reporting with specific examples @Zapipedia.

MinT is based on several machine learning models, and their quality depends on factors such as the amount and quality of the data used to train them. Unlike other translation approaches, machine learning models that fail may be more prone to inventing content. We have encountered some of these issues, especially with short sentences, where the model tries to be "creative" to compensate for the lack of context. For those cases, we plan to support community-verified translations (T351748).

However, this case seems more related to the lack of data for Asturian compared to other languages (see table below). I checked translations of the same text into English and Spanish, using both Softcatalà (the default for the language pair) and NLLB-200 (the model supporting Catalan-to-Asturian), and neither shows similar issues when translating into English or Spanish.

In order to improve the situation, the Asturian community could:

  • Share open-licensed multilingual data that could be helpful for training better models. The data can be integrated into the Opus project to improve the quality of their translation models.
  • Participate in projects that generate such multilingual data to improve the models. These could be translating Wikipedia articles with Content Translation (corrections are automatically integrated into Opus) or providing translations in Tatoeba (also integrated with Opus).
  • Request that machine translation be adjusted in Content Translation: tighten the limits to require more editing on top of the initial machine translation, or ask for MinT to be disabled (although that also breaks the cycle in which corrected translations help improve future models).
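
To illustrate what a limit on unedited machine translation means in the last option, here is a minimal sketch that estimates how much of the initial machine translation survives in the published text. It is an assumption-laden illustration, not Content Translation's actual calculation: the word-level comparison, the function names, and the 0.85 threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def unmodified_fraction(machine_translation: str, published_text: str) -> float:
    """Approximate share of machine-translated words kept in the published text."""
    mt_words = machine_translation.split()
    final_words = published_text.split()
    if not mt_words:
        return 0.0
    matcher = SequenceMatcher(a=mt_words, b=final_words, autojunk=False)
    # Sum the sizes of all matching word runs (the final dummy block has size 0).
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return kept / len(mt_words)


def passes_edit_limit(
    machine_translation: str,
    published_text: str,
    max_unmodified: float = 0.85,  # hypothetical threshold, for illustration only
) -> bool:
    """True if the translator changed enough of the machine translation."""
    return unmodified_fraction(machine_translation, published_text) <= max_unmodified


mt = "one two three four"
print(passes_edit_limit(mt, "one two three four"))    # False: nothing was edited
print(passes_edit_limit(mt, "one two changed here"))  # True: half of the words changed
```

Lowering `max_unmodified` in this sketch corresponds to requiring more edits before a translation can be published.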
| Model used | Catalan → English | Catalan → Spanish | Catalan → Asturian |
| NLLB-200 | translate.wmcloud.org_(Wiki Tablet) (1).png (77 KB) | translate.wmcloud.org_(Wiki Tablet) (4).png (76 KB) | translate.wmcloud.org_(Wiki Tablet) (2).png (81 KB) |
| Softcatalà | translate.wmcloud.org_(Wiki Tablet).png (78 KB) | translate.wmcloud.org_(Wiki Tablet) (3).png (77 KB) | (no screenshot) |