
The NLLB-200 MT engine in MinT returns standard Arabic translation instead of Moroccan Darija in Moroccan Arabic Wikipedia
Open, MediumPublic

Description

A Moroccan community member reported that the initial translation produced by the MinT translation service deployed to Moroccan Arabic Wikipedia is not Moroccan Darija (ary); it is standard Arabic. This ticket will help us collect feedback from the Moroccan community on what to do. We have proposed the following options:

Option A
Check whether having the current MT is better than no MT, as we have seen other communities, such as Egyptian Arabic, opt for standard Arabic machine translation instead of none.

Option B
Explore whether there are open-source machine translation engines that work for Moroccan Arabic and that we can integrate into our self-hosted machine translation service (known as MinT).

Option C
Disable translation to Moroccan Arabic

Event Timeline

UOzurumba renamed this task from The NLLB-200 MT engine returns standard Arabic translation instead of Moroccan Darija in Moroccan Wikipedia to The NLLB-200 MT engine in MinT returns standard Arabic translation instead of Moroccan Darija in Moroccan Arabic Wikipedia. Jun 20 2023, 2:11 PM
UOzurumba updated the task description.
Pginer-WMF triaged this task as Medium priority. Jun 20 2023, 2:23 PM
Pginer-WMF updated the task description.
Pginer-WMF moved this task from Backlog to Adding languages on the MinT board.

I consulted with some colleagues who used the translation tool, and their experience seems equally negative. I hope the tool will be trained on texts from arywiki in the future so it can produce better results.

For clarification: the translation spewed by MinT is mostly MSA, with some Egyptian Arabic words or expressions thrown in sometimes. Some words come out correct, i.e. they're plausible Moroccan Darija words, but they're always words that exist in MSA or Egyptian too. My guess is either the dataset the MinT model for Darija is based on is mislabelled, or it's heavily contaminated by MSA and Egyptian content. It also doesn't use the full character set we're using on Wikipedia, e.g. for letters that don't exist in Standard Arabic (hard g, v, and p). Either way, the translation is mostly useless, and for the majority of the time I have to deactivate it and start with a blank slate.

This is not a case of using a conservative "etymological" spelling for words, i.e. closer to Standard Arabic. It's still an open debate in the Moroccan Darija community, including on Wikipedia, but even a correct translation with very conservative spelling would still be extremely useful, because practically we almost only use phonetic spelling for words that could get confused, or based on the most common spelling among native speakers, if it differs from the MSA spelling. Since Moroccan Darija also differs quite often in grammar and sentence structure from MSA, this is a case of word order, conjugation of verbs, choice of words and expressions, etc, being partly or completely off the mark.

Some concrete examples:

  • Several times MinT translated "because" as عشان. This is a purely Egyptian Arabic word, and its probability of occurring in a Moroccan Darija conversation or text is exactly 0%. The correct word to use would be حيت (etymological spelling حيث), or محيت or على قبل
  • Month names in Morocco are often different from those used in the Middle East. For example August should be translated as غشت not أغسطس and December as دجنبر not ديسمبر, etc
  • A large percentage of Darija vocabulary is Latin (mainly French or Spanish) or Amazigh (aka Berber) based, though the majority (probably more than 70%) is Arabic based. I don't think I've seen a single case where MinT tried to produce a word that is Latin or Amazigh based, even if it was used in the wrong context or not a Darija word to begin with. Honestly, I highly doubt there's any Moroccan Darija in the MinT dataset.
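The examples above suggest a very rough spot check that could be run on MT output before a human review. The following is only a sketch: the marker table contains just the three words cited in this comment, the function name is hypothetical, and naive substring matching will miss inflected forms and can match inside longer words.

```python
# Hypothetical marker table: non-Darija word -> Darija form suggested above.
# It contains only the examples cited in this comment, nothing more.
NON_DARIJA_MARKERS = {
    "عشان": "حيت",      # Egyptian Arabic "because"
    "أغسطس": "غشت",     # Middle Eastern name for August
    "ديسمبر": "دجنبر",  # Middle Eastern name for December
}


def flag_non_darija_markers(text):
    """Return (marker, suggested Darija form) pairs that occur in text.

    Uses plain substring matching, so it is a heuristic, not a dialect
    classifier.
    """
    return [
        (marker, suggestion)
        for marker, suggestion in NON_DARIJA_MARKERS.items()
        if marker in text
    ]
```

An empty result does not mean the text is Darija; it only means none of the listed markers appeared.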

Here's a short paragraph from my attempt to translate the English article for "Marc Andreessen":

During his career, Andreesen has worked at Netscape, Opsware, founded Andreessen Horowitz and invested in many successful companies including, Facebook, Foursquare, GitHub, Pinterest, LinkedIn and Twitter.

MinT translation:

خلال مهنته، عمل اندرييسن في نيتسكايب، أوبسواير، أسس اندرييسين هوروفيتز واستثمر في العديد من الشركات الناجحة بما في ذلك، فيسبوك، فورسكوير، جيت هوب، بينتيرست، لينكدين وتويتر.

My translation:

ف لكاريير ديالو، أندرييسن خدم ف نيتسكييپ، ؤپسوير، أسّس أندرييسن هوروڤيتز، و ستتمر ف بزاف د الشركات الناجحة بحال فيسبوك، فورسكوير، ڭيتهاب، پينتيريست، لينكدين، و تويتر.

Thanks for the very useful and detailed context, @Maurusian.
Based on that, I can think of a couple directions to explore: (a) inspect NLLB-200 to confirm the issues with Moroccan Arabic (ary) come from their training data, and (b) consider alternatives where an appropriate variant is provided instead.

Inspect NLLB-200

The dataset used for training NLLB-200 is available here. However, I have not found the data for Moroccan Arabic (ary). @santhosh may have a better idea of how to check for it.

Our test instance for MinT lets you try MinT translations more directly. You can try other Arabic-related languages among those supported by NLLB-200 to check whether any of them provides a translation that is more useful for Moroccan Arabic Wikipedia. Note that North Levantine Arabic (apc) and Najdi Arabic (ars) will not be visible in the selector until T336683 is resolved.

Consider alternatives

If Moroccan Arabic support cannot be provided using NLLB-200, we can consider the Opus project models instead. Based on their website, the data comes mainly from Wikipedia translations (sample), and Tatoeba sentences (Sample).

If the data used in Opus uses the proper variant for Moroccan Arabic, the translation quality will depend on the volume of training data available. Fortunately, both Wikipedia translations and Tatoeba sentences are open for contributions and can grow over time with the help of the community.

I added Moroccan Arabic to the list of languages to request pre-trained models to the Opus project (T343340)

The feedback we got from Moroccan Arabic speakers is that the Opus data samples from Wikipedia translations align very well with their language, while the samples from Tatoeba show more mixed results. Based on that, we are considering including Moroccan Arabic in the list of languages for which to request OpusMT translation models (T343340). We also encourage the community to get involved in Tatoeba, reviewing existing translations to Moroccan Arabic and contributing new ones, to help improve the translation quality of future models.

We think there is potential to provide OpusMT models for Moroccan Arabic and for those to improve with translations and corrections from the community. Having said that, the ultimate assessment of quality will only be possible once the models are available for the community to try.

Thanks for all your help in this process!

Hello @Pginer-WMF ! Thanks for your comment! I will include the summary I sent in the email here as well for reference:

The Translation Tool for ary will use MinT with the OpusMT models to generate automated translations (which we have the option to edit or deactivate and start from scratch)
The model is trained on two data sources: Tatoeba, and the output of our own translations using the Translation Tool. Therefore:

  • The more we use the Translation Tool, the better the generated results will become.
  • We should try to improve and fix translations on Tatoeba separately.

We've also discussed this development afterwards, and we agreed on the above two points. We will also encourage other contributors, both within the Wikimedia movement and outside it, to contribute to Tatoeba. A first step would be to summarize and translate the key points of the Tatoeba documentation from its wiki, so we can get started as soon as possible.
Additional point: there's a Moroccan developer who's working on a Darija version of ChatGPT (ChatGPT itself only produces ary text under duress, and quite imperfectly). I've tested the chatbot, and it produces pretty good results, though it has limited information and terms to use, and its answers are usually very brief, due to the small dataset. I don't know if there could be avenues for mutual support in this regard.

Hello @Pginer-WMF
I have a question concerning automatically generated titles in the translation tool. Sometimes when I try to translate to ary, a title is filled in the top box, which seems to be AI generated. Which dataset and/or engine is being used for this feature? Because, while the title is always wrong, it is usually wrong in interesting ways. For example, just now I started translating "List of countries' copyright lengths", and it generated the title قَائِمَة دْ الْمُدّة دْيَالْ الْحَقْقْ الْكْتَابْ. I notice here a few things:

  • it used the particles دْ and دْيَالْ (both meaning "of") correctly, and these are very endemic to Moroccan Darija (ary)
  • it translated "copyright" as الْحَقْقْ الْكْتَابْ, which seems to be from Arabic حقوق الكتاب (literally "book rights") but with shorter vowels. Somehow it learned that many Moroccan Darija words are similar to their Arabic equivalents, but with shorter or silent vowels (الْحَقْقْ is a word that doesn't exist in either Arabic or Darija by the way). The whole expression is also kinda strange, since Arabic uses حقوق النشر (literally "publication right"), so I'm not sure where it got that from.
  • it's using diacritics for all words, which is a practice we discourage in arywiki, and I think the same rule is applied in arwiki. The generated title normally should be: قائمة د المدة ديال الحقق الكتاب
  • the whole expression, though not grammatically wrong (except الحقق which should be حقوق), is not optimal or idiomatic, probably the equivalent of "list of countries by copyright length" would be better
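The diacritics point lends itself to a small post-processing step. The following is a minimal sketch, assuming that "diacritics" here means the Arabic short-vowel marks, tanween, shadda and sukun in the Unicode range U+064B to U+0652; the function name is hypothetical and nothing here reflects how the translation tool actually handles titles.

```python
import re

# Arabic harakat: tanween, fatha, damma, kasra, shadda and sukun
# (U+064B through U+0652). Base letters and hamza forms are left untouched.
HARAKAT = re.compile("[\u064B-\u0652]")


def strip_harakat(text):
    """Remove Arabic vowel marks, shadda and sukun from text."""
    return HARAKAT.sub("", text)


if __name__ == "__main__":
    generated_title = "قَائِمَة دْ الْمُدّة دْيَالْ الْحَقْقْ الْكْتَابْ"
    print(strip_harakat(generated_title))
```

This only removes the marks; it does not fix the word choice or idiom problems described in the other bullets.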

Thanks for your interest and curiosity on the project, @Maurusian.

Support for translating titles was added in T225494. From the task details:

This ticket proposes to adjust the translation title using the following methods when available (in order of priority):

  • Wikidata label in the target language. For example, the Wikidata item for Lemon has a label for the Ewe (ee) language despite the article not existing in such language. Thus, a translation from English to Ewe should have "mumue" as a title instead of "Lemon".
  • Machine translated version of the title. Machine translation can help to have an initial version of the title in the local language. Even if the title is not always correct, the translation errors can drive the user attention to it and help to encourage editing the title.

So the title translations could come from Wikidata (if a label is available) or from the default machine translation service (if the Wikidata label is missing). There is a pending task (T324452) to show a card similar to the one shown for paragraphs, which displays the current translation service with options to change it. This may add clarity about where the translations are coming from in this context.
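The priority order described above can be sketched as a simple fallback chain. This is an illustration only, not the tool's actual code: the function name, its parameters and the `machine_translate` callable are hypothetical, and falling back to the untranslated source title when neither source is available is an assumption.

```python
def suggest_title(source_title, target_lang, wikidata_labels, machine_translate=None):
    """Pick an initial title for a translation, in order of priority.

    wikidata_labels: mapping of language code -> Wikidata label for the item.
    machine_translate: optional callable (text, target_lang) -> translated text.
    """
    # 1. Prefer the Wikidata label in the target language, if one exists.
    label = wikidata_labels.get(target_lang)
    if label:
        return label
    # 2. Otherwise fall back to a machine translated version of the title.
    if machine_translate is not None:
        return machine_translate(source_title, target_lang)
    # 3. Assumption: with neither available, keep the source title as is.
    return source_title
```

With the example from the task, `suggest_title("Lemon", "ee", {"ee": "mumue"})` returns the Wikidata label rather than calling any translation service.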

Apart from exploring the OpusMT support for Moroccan Darija (T343340), we have identified another existing model that may support the language: MADLAD-400 open source translation model.

I compiled some example translations below:

Example 1
English (en): Imagine a world in which every single person on the planet is given free access to the sum of all human knowledge.
Moroccan Darija (ary): تخيل عالماً يتاح فيه لكل شخص على هذا الكوكب الوصول المجاني إلى مجموع كل المعارف البشرية.

Example 2
English (en): During his career, Andreesen has worked at Netscape, Opsware, founded Andreessen Horowitz and invested in many successful companies including, Facebook, Foursquare, GitHub, Pinterest, LinkedIn and Twitter.
Moroccan Darija (ary): عمل أندريسن خلال حياته المهنية في نيتسكيب وأوبسوير، وأنشأ شركة اندريسون هورويتز واستثمر أمواله بنجاح كبير من بينها الفيسبوك وفورسكوير وجيت هـاب وبرنستلين ولنكدين وتويتر. وقد تمكنت الشركة التي أسسه منذ عامين فقط على الأقل الآن (2013) إلى حد ما مما يجعل هذه الشركات تشكل مصدر إلهام لهذا العام أيضاً؛ فهي تقدم خدمات متقدمة لشركات الأعمال الصغيرة والمتوسطة والأفراد الذين يتطلعون للاستثمار بشكل أكبر أو أقل بالتكنولوجيا الحديثة مثل الإنترنت والتسويق عبر وسائل الإعلام الاجتماعية ومواقع الويب الخاصة بالمشاريع التجارية الإلكترونيّة والإنترنت للأعمال والخدمات المالية الرقميَّة الأخرى المرتبطَــى بها كأدوات للتجارِـــا الاجتماعيَّــــن والأنشطَّاء التجاريين المتنقليْنَ وغيرهم الكثيرون غيرهم...

Example 3
English (en): The Moon is Earth's only natural satellite. It orbits at an average distance of 384,400 km (238,900 mi), about 30 times Earth's diameter. The Moon always presents the same side to Earth, because gravitational pull has locked its rotation to the planet.
Moroccan Darija (ary): القمر هو الأقمار الصناعية الطبيعية الوحيدة للأرض. وهو يدور على مسافة متوسطها 384,095 كم (217،6% من قطر الأرض)، أي حوالي ثلاثين مرة أقطار الكرة الأرضيّة تقريباً ويكون دائما بنفس الجانب المتجه نحو كوكبنا لأن الجاذبيات تجعله مقيدًا في دورانٍ حول الكوكب نفسه ويظهر هذا الأمر بشكل واضح عندمـّا يكون قمرة القرص الشمسيّ بالقرب منه أو عند انحراف سطح قمره عن مدار الاتّجاهات الأخرى لكوكبه الذي يحيط به والقمريّ بالمدار المحاذِي له

Once T354666 is completed, Moroccan Darija will be available in the MinT test instance for the community to try more translations and evaluate quality.

Hello @Pginer-WMF
Unfortunately, all three examples in the table above are in Standard Arabic with some mistakes.

The quote by Jimmy Wales is actually translated on the "About" page on arywiki.

Imagine a world in which every single person on the planet is given free access to the sum of all human knowledge.

"تخيّل واحد لعالم فين كلا بنادم كيعيش فوق لأرض عندو أكصي ل مجموع لمعرفة لبشرية."

The second example is from the article about Marc Andreessen, which is also translated on arywiki, and I have provided a translation in ary above.

The dataset seems either mislabeled or was created with an engine that claims to generate Moroccan Darija, but actually generates Standard Arabic. In my experiments with ChatGPT, I've had this issue often, and I had to guide it very closely so that it generates ary text that is acceptable.

Thanks for the input, @Maurusian. Based on the information provided, MADLAD-400 does not seem promising as a viable option to support Moroccan Darija.

We'll have to wait for the results of our collaboration with the OpusMT team (T343340) to get an OpusMT model for the language. Meanwhile, if you know of any good, freely licensed multilingual resources that can be used for training the models (beyond Wikipedia translations and Tatoeba), feel free to share them, since they can help improve the quality of future OpusMT models.

Thanks!