
Enable Content and Section Translation for Cantonese Wikipedia
Open, Stalled, Medium, Public

Description

Content Translation has been used to translate more than a million articles across all languages. Content Translation is provided by default in 92 language Wikipedias, and Section Translation is available in 18 wikis.

Given the editor activity on Cantonese Wikipedia and its mobile usage, we consider that the community could benefit from having both Content and Section Translation enabled by default. As part of this process, we want to learn from the Cantonese editing community how well those tools suit their needs and identify potential improvements. We'll communicate with the editing community and only proceed with the enablement if there is no major concern.

Steps:

Event Timeline

KartikMistry subscribed.

@Pginer-WMF It seems the Cantonese vector model isn't available, so we cannot generate template parameter alignment for it.

Hello @KartikMistry, I remember announcing our intention to enable CX and SX on their Wikipedia in this ticket, and there was no objection. You can enable the tools on their wiki if there are no blockers now. Thank you!

Change 889656 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/mediawiki-config@master] Enable Section Translation in 9 Wikipedias

https://gerrit.wikimedia.org/r/889656

Change 889656 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable Section Translation in 9 Wikipedias

https://gerrit.wikimedia.org/r/889656

Mentioned in SAL (#wikimedia-operations) [2023-02-16T08:03:04Z] <kartik@deploy1002> Started scap: Backport for [[gerrit:889656|Enable Section Translation in 9 Wikipedias (T323825 T304865)]]

Mentioned in SAL (#wikimedia-operations) [2023-02-16T08:05:02Z] <kartik@deploy1002> kartik: Backport for [[gerrit:889656|Enable Section Translation in 9 Wikipedias (T323825 T304865)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-02-16T08:15:43Z] <kartik@deploy1002> Finished scap: Backport for [[gerrit:889656|Enable Section Translation in 9 Wikipedias (T323825 T304865)]] (duration: 12m 38s)

The tools seem to be enabled on desktop and mobile. However, on mobile there seems to be a redirection issue where accessing Cantonese gets users redirected to Amharic: https://zh-yue.m.wikipedia.org/wiki/Special:ContentTranslation

Change 890482 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/mediawiki-config@master] Section Translation: Fix language code for Cantonese Wikipedia

https://gerrit.wikimedia.org/r/890482

None of the concerns raised in the original thread has been addressed. If translation to English is needed, should it go here, or should it go to the new subthread?

For me, the most concerning thing here is that the change was communicated to us in a way that was hard to understand (perhaps the language was too “technical”? But I worked in tech and even I did not understand this was what the change was about). No one grasped the impact of the change, not even our admins.

I hope this change isn’t really going to be “permanent”; it has been variously described as “confusing” and “threatens small-language communities”, among other things.

I am aware of this notice. However, it failed to mention that the tool would be visible to all registered users by default. Content translation has long been available (enabled?) as a beta feature. I would have expressed my concerns had I known this was going to bring so many visible changes that were omitted entirely from the notice. And while I understand that there might be blockers beyond control, the 8-month delay between the ambiguous notice and the eventual deployment last week didn't help.

The main issue raised in the thread I linked above, and that many people find problematic, is the gray links entry point to CX, on the left sidebar in the language list, causing confused users to follow a link to a special page encouraging them to translate the page to a different language.

For some background, editors from the Chinese Wikipedia have translated countless pages previously exclusive to the Cantonese Wikipedia, which were crucial in driving up the disproportionately low traffic received by yuewiki due to search engines consistently favouring zhwiki content and outright stopping indexing yuewiki pages once content gets translated to zhwiki.

Thus, many editors find the recent change deeply concerning, given how this basically makes the previously dreaded operation much more accessible and visible.

On top of that, personally I don't think it is a good idea to enable it by default for all registered users. CX has its merits, and some editors have been using it to create numerous high-quality articles. But with CX's almost intrusive ubiquity, this change makes it too accessible, even to those who are unfamiliar with the local policies and style guides, putting an unnecessary burden on the already struggling editor community.

Change 890482 merged by jenkins-bot:

[operations/mediawiki-config@master] Section Translation: Fix language code for Cantonese Wikipedia

https://gerrit.wikimedia.org/r/890482

Mentioned in SAL (#wikimedia-operations) [2023-02-21T08:04:48Z] <kartik@deploy1002> Started scap: Backport for [[gerrit:890482|Section Translation: Fix language code for Cantonese Wikipedia (T304865)]]

Mentioned in SAL (#wikimedia-operations) [2023-02-21T08:09:33Z] <kartik@deploy1002> kartik: Backport for [[gerrit:890482|Section Translation: Fix language code for Cantonese Wikipedia (T304865)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-02-21T08:21:24Z] <kartik@deploy1002> Finished scap: Backport for [[gerrit:890482|Section Translation: Fix language code for Cantonese Wikipedia (T304865)]] (duration: 16m 36s)

Thanks for sharing your perspectives, @Al12si and @H78c67c.
We want to provide the tools that help the communities the most and improve them based on community input. Your comments are very valuable and I think it makes sense to keep an eye on the translation activity with those comments in mind in case adjustments are needed.
If the change becomes problematic, we are totally ok to revert it. However, I think it could be useful to observe the effect of a change for some time before making immediate changes.

For reference, in 2022 there were 129 articles created with Content Translation in Cantonese Wikipedia (this is 0.9% of total article creation in the wiki). From those, 5 were deleted (3.9% of deletions). These numbers suggest that the volume of translations has been quite low (~2.5 translations/week) compared to the general activity of the wiki (~285 articles/week). I think it makes sense to observe how those numbers change in a few weeks to understand if the increase of translations seems sustainable.
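
The weekly rate and percentages quoted above can be re-derived from the yearly totals. This is only a quick arithmetic check, assuming a 52-week year; the 285 articles/week figure is taken from the comment itself and is not recomputed from wiki data, and the variable names are illustrative.

```python
# Figures quoted in the comment above (2022, Cantonese Wikipedia).
cx_articles_2022 = 129        # articles created with Content Translation
cx_articles_deleted = 5       # of those, later deleted
wiki_articles_per_week = 285  # approximate overall article creation, as quoted

WEEKS_PER_YEAR = 52  # assumption used for the weekly averages

# Average number of translations published per week.
cx_per_week = cx_articles_2022 / WEEKS_PER_YEAR

# Share of the translated articles that were deleted.
cx_deletion_rate = cx_articles_deleted / cx_articles_2022

# Share of all new articles that came from Content Translation.
cx_share_of_creation = cx_articles_2022 / (wiki_articles_per_week * WEEKS_PER_YEAR)

print(f"{cx_per_week:.1f} translations/week")
print(f"{cx_deletion_rate:.1%} of translations deleted")
print(f"{cx_share_of_creation:.1%} of article creation")
```

Note that 5 out of 129 is about 3.9%, so the "3.9%" figure is consistent with the share of translated articles that were deleted.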

Regarding grey links, those are shown to users that frequently access a given language when the content is missing. For speakers of both Chinese and Cantonese, those will invite them to create content in both languages. While it may contribute to making content available in Chinese, it will also help get more content into Cantonese. Since the tool became available in 2015, there have been 328 translations from Cantonese to Chinese (~41 translations/year), which represent 1% of translations to Chinese. In the same period, there have been 798 translations from Chinese to Cantonese, which represent 60% of all Cantonese translations. It is unfortunate that search engines are hiding Cantonese content when it is also available in Chinese, but I don't think the answer should be to try to prevent knowledge from being available in more languages (in the light of allowing "anyone to access the sum of all human knowledge"). Especially where editors speaking both languages seem to be transferring more content from Chinese to Cantonese than the other way around.

As mentioned before, we are open to discussing how to best help communities with the translation tools. I was trying to share some data that could help put things in perspective. As we get more data on the effects of the change, I can share it to identify signs that may suggest any change to be made.

@Pginer-WMF I think something needs to be made clear, and for the record this should ideally go into some kind of WMF policy document because it seems even translation agencies and large university presses (I mean like OUP) don’t understand how “Chinese” (the macrolanguage) works.

When it comes to written Chinese, the term “speakers” is misleading and irrelevant, because many of us are native reader-writers, but not native “speakers” in the normal sense, because we are not taught to write in the language we speak.

In the specific example of Cantonese, in my birthplace, we are taught at an early age (around the time we’re also taught English) to essentially write in Mandarin (國語, ˉgwɔkˏjy). It’s a distinct regional variant (zh-HK) that can be quite different even from zh-TW and very different from zh-CN, but it’s never spoken — unless someone is reading from a script, for example. So when we talk about translation, people would ask, are you a native speaker? If we’re honest we’re not native speakers of our own regional variant, because it’s simply not spoken. So non-natives get to translate into our language. How ridiculous is that?

So what happens to our actual spoken language? It’s not written, except in informal contexts like chats and internet forums, or in scripts (for plays, ads, speeches etc.). It’s written and then discarded and dismissed.

Cantonese Wikipedia is one of those very few places where we actually get to write in our actual spoken language, in a form that has the potential to elevate it into something resembling formal recognition. The use of automatic translation, in ANY form, jeopardizes this.

Let me talk about that 0.9%. Automatic translation is basically shunned on yuewiki. Why? I’ve mentioned that Cantonese used to be never written down; if you know how automatic translation works, you’ll understand right away that any translation from or to Cantonese is bound to be of low quality. We have automatically translated articles that are in a semi-translated state, often with terminology (translator jargon for “vocabulary”) that’s either inconsistent with Cantonese conventions, or outright internally inconsistent. It’s a hassle to fix them (or, in translator jargon, to “post-edit” them). I started working on one several months ago; it’s still in a half-fixed, broken state.

As @H78c67c mentioned above (and as I alluded to earlier when I used the word “threatened”), Cantonese Wikipedia is, for some reason, deprioritized by search engines. The problem is so bad that sometimes when I search for something I’d see plagiarized content from content farms in the results, but not the original page from yuewiki. This is why we care so much about preventing zhwiki from copying yuewiki content. If Google prefers even content farms, it prefers zhwiki.

(Of course, if WMF can talk to Google about this problem and make them fix it, maybe we can change our views.)

I have essentially stopped editing zhwiki (as I mentioned, a native speaker of any language that falls under the zh umbrella is also a native reader-writer of some variant of zh, even if they’re not a native “speaker”). In the case of zhwiki, zh-CN, written in zh-Hans, is prevailing, and I see the voices of zh-TW and zh-HK users suppressed (or oppressed, if that term speaks more to you). yuewiki is an alternate outlet for zh-HK reader-writers, and I can tell you, after about half a year of editing on yuewiki, zh-HK now looks foreign to me, even though it’s one of my own native languages.

Sorry for the rant, but this really needs to be gotten across, and ideally WMF would have some sort of formal recognition of the problem.

For anyone interested here are 2 queries to filter the translated articles in 2022:

For the record, I managed to fix a section translation in the other direction (en>yue) today. It was just a 137-word stub and I must have spent like half an hour comparative-editing it, and at the end about 30% turned out to be wrong (and I’m still not sure if I missed anything). And I’m not talking about the kind of mistakes that you can look at it and see right away; I’m talking about the kind of mistakes that you have to read and reread the original multiple times to check if anything has been translated out of context or in the wrong sense.

This is a huge waste of editor time. IMHO section translation should not be enabled by default on *any* wiki. Please turn it off as soon as possible.

PS: “introducing errors into Wikipedia” is actually something mentioned in the complaint thread. I didn’t particularly agree with that view when I saw the comment initially, but after today’s comparative edit I would agree: Enabling section translation does look like an attempt to introduce errors into Wikipedia.

For the record, I managed to fix a section translation in the other direction (en>yue) today...

Thanks for your feedback, @Al12si. I'm sorry to hear about this experience. The tools to translate Wikipedia articles, like Content and Section Translation, are used by many editors across many languages at different points in time (some examples capturing successful experiences from 2015, 2016, 2018, and 2022). As with any tool, it can be misused in different ways and we work to improve the tools so that they encourage the creation of good content.

In the particular case of the mobile experience supported by Section Translation, it is in active development and there is room to improve it based on constructive feedback. We are open to hearing from the different communities. For example, as a follow-up to a campaign on Bengali Wikipedia, we launched a survey to learn more about how the tool was working for both editors using it and reviewers, and we incorporate the input in our plans for improving the tools.

For context, I provide below some of the steps that editors using the tool will go through when translating, and how editing the initial automatic translation is supported and encouraged:

[Four screenshots of the mobile Section Translation flow on test.m.wikipedia.org (from en to yue, page "Helianthus"), captioned in order:]
  • Tutorial encouraging review of initial translations
  • Suggested translation
  • Edit view
  • Feedback about how much the initial translation has been edited

Translation is not a new way to contribute. Our experience suggests that providing an integrated experience leads to better results than letting editors rely on external translation services. When we get both requests to enable and disable a tool it becomes impossible to support both, so we are really interested to hear in more detail about the kind of issues communities face (preferably with specific examples) and the kind of support that reviewers would find beneficial in order to improve the experience for everyone. Thanks!

This is probably out of scope of this particular thread, but seeing the above I've decided to revisit CX, now with machine translation, and all I can say is I'm shocked by how low the MT quality is.
Using NLLB-200:
Original:

The Hastings line is a secondary railway line in Kent and East Sussex, England, linking Hastings with the main town of Tunbridge Wells, and London via Tonbridge and Sevenoaks. Although primarily carrying passengers, the railway also serves a gypsum mine which is a source of freight traffic. SE Trains operates passenger trains on the line, and it is one of their busiest lines.
...
The Hastings line is built over the difficult, forested, and hilly terrain across the High Weald and sandstone Hastings Beds, necessitating the construction of eight tunnels between Tonbridge and the south coast seaside resort of Hastings. The SER was anxious to construct the line as economically as possible, since it was in competition with the LBSC to obtain entry into Hastings and was not in a strong financial position in the mid-1840s.

Translated:

黑斯廷斯線係英格蘭肯特郡同東蘇塞克斯郡嘅一條二級鐵路線,連接黑斯廷ছ同主要城鎮坦布里奇威爾斯,同埋通過東布里奇同塞文奧克斯連接倫敦。 儘管主要運載乘客,鐵路亦都為石膏礦提供服務,石膏礦係貨運交通嘅源頭。 SE 列車喺線上運營客運列車,係佢哋最繁忙嘅線路之一。
...
黑斯廷斯線建在高威爾德和黑斯廷士沙石床上,難以建造,森林和山地地,需要在東布里奇和南海岸海灘度假村黑斯廷茲之間建造八條 tunnels。 SER急於在經濟上盡可能建設這條線路,因為它與LBSC競爭,以獲得黑斯廷斯的入境,並且在1840年代中期並沒有強大的財務地位。

Immediately obvious is the out of place Bengali character randomly appearing, and NLLB-200's appalling inability to even generate the commas correctly. The first paragraph can use some copyediting, but the second paragraph is straight up nowhere close to Cantonese.

I apologise for hijacking this thread, but since we're on the topic of MT in CX and since I'm not sure where a better venue to report this would be, I'd like to point out that the FLORES-200 dataset for yue_Hant is, in fact, completely not yue_Hant. I am not entirely sure what NLLB-200 was trained on, but I downloaded the FLORES dataset used for benchmarking it, and of all the 2009 lines in yue_Hant.dev and yue_Hant.devtest, not a single line is in yue (0/2009; 0.00%; zero). This explains the extremely low-quality output, but not the half-width commas.
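
A count like the 0/2009 above could be approximated mechanically before a native reader reviews the data. The following is only a rough sketch: the marker list and both function names are illustrative assumptions, not a vetted linguistic resource, and a character heuristic can neither prove that a line is Cantonese nor replace human review.

```python
# Characters that are common in written Cantonese but rare in written Mandarin.
# Illustrative assumption only; a real check would need a reviewed list.
CANTONESE_MARKERS = set("嘅咗唔喺佢哋嘢咩冇啲嚟噉咁乜")


def looks_cantonese(line: str) -> bool:
    """Return True if the line contains at least one Cantonese marker character."""
    return any(ch in CANTONESE_MARKERS for ch in line)


def cantonese_share(lines):
    """Return (lines flagged as Cantonese, non-empty lines checked)."""
    non_empty = [ln for ln in lines if ln.strip()]
    hits = sum(looks_cantonese(ln) for ln in non_empty)
    return hits, len(non_empty)
```

Applied to the lines read from yue_Hant.dev and yue_Hant.devtest (for example with `open(path, encoding="utf-8")`), a result near zero hits would be consistent with the observation above, while flagged lines would still need to be read by a Cantonese reader-writer.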

I appreciate WMF's intention to make translating more accessible, but defaulting to this poor-quality MT service for newcomers is absolutely unacceptable. It's true that virtually any tool can be misused, but we shouldn't be offering this awful experience to new users to begin with. Can we limit this to experienced users, or at the very least make it opt-in instead of opt-out?

FWIW, many translation agencies (I’d say close to if not actually 100% of them) assume Cantonese = zh-HK, and yes, they call zh-HK “Cantonese” (yue). This might explain why a supposedly yue dataset would be 0% Cantonese.

Thanks for the input @H78c67c and @Al12si for the context. Input from native speakers of the languages we support is essential. When integrating external services, we don't have direct control on them. However, there are several things we can do (and community input will be super useful):

  • Share the feedback with the research team from Meta for them to consider how to improve their translation models. Based on early conversations about Yue, my understanding is that there may be two variants of Yue Chinese identified as Yue-Kouyu (used by Bing Translate) and Yue-Shumianyu (used by NLLB-200). Is this distinction relevant and aligned with your knowledge of the language? If that is the case, is Yue-Kouyu the variant used on Yue Wikipedia?
  • Adjust the limit system to ensure the initial machine translation is edited enough to compensate for the quality level provided for a specific language (more details about the limit system). By default, translations can be published with up to 95% of unedited machine translation. If you expect the initial translation for Yue to be edited by at least a given percentage, we can adjust the system to enforce that.
  • Adjust the defaults. In the case of Yue, the available options are: use NLLB-200 or use Chinese translations from Google Translate. We try to avoid the option to start from scratch as the default because, even in case of low MT quality, having the links added and pointing to the right articles, and similar adaptations, can be convenient. But we can try setting a different default.
  • Help systems to learn from user corrections. Even when machine translation quality is low, using Content Translation can help to improve it over time. The data from user corrections is publicly available and it has been integrated in the Opus project. Machine translation services using that repository will incorporate the improved translations.
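
To make the second option more concrete, a limit of this kind can be pictured as a threshold on how much of the machine translation survives unchanged in the published text. The sketch below is not the actual Content Translation implementation: the function names, the character-level comparison using Python's `difflib`, and the way the 95% default is applied are all assumptions for illustration.

```python
from difflib import SequenceMatcher


def unedited_fraction(mt_text: str, final_text: str) -> float:
    """Estimate the fraction of the MT output that survives in the final text."""
    if not mt_text:
        return 0.0
    matcher = SequenceMatcher(None, mt_text, final_text, autojunk=False)
    # Sum the lengths of all character runs shared by both texts.
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return kept / len(mt_text)


def can_publish(mt_text: str, final_text: str, max_unedited: float = 0.95) -> bool:
    """Allow publishing only if at most `max_unedited` of the MT is unchanged."""
    return unedited_fraction(mt_text, final_text) <= max_unedited
```

Under this sketch, requiring more editing for Yue would amount to passing a lower `max_unedited` value for that language.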

We are happy to try and learn which adjustments result in a more helpful experience for the Yue community across their translators, reviewers and readers. Any input on how you think it would be better to adjust the experience is very welcome.

  • Yue "Variants": This sounds like the exact same issue @Al12si pointed out above. It sounds like Yue-Kuoyu (國語; Mandarin Chinese word for "Mandarin") and Yue-Shumianyu (書面語; Mandarin Chinese word for "Written Language") are really just Mandarin Chinese (cmn). Editors on yue.wikipedia.org contributing in these are generally invited to contribute to zh.wikipedia.org instead, since that's where all the Mandarin/guoyu/"shumianyu" Wikipedia content are hosted. cmn is mutually unintelligible with yue.
  • Defaults: NLLB-200 isn't even generating commas properly, isn't in the right language and only generates plain text. I honestly see little value in keeping it at all. Instead of dealing with the useless stuff it generates it's probably easier to just start from scratch.

@Pginer-WMF: This shows the problem is even worse than what I expected: Even Microsoft has no clue as to what Cantonese is. @H78c67c is absolutely correct: Neither yue-guoyu nor yue-shumianyu is Cantonese; they are not even variants.

FYI: “yue-guoyu” (what I referred to as ˉgwɔkˏjy above) is exactly what I learnt at school. It is a form of Mandarin (zh-hk), albeit a distinct variant of it. It’s wrong to lump it together with the vastly different zh-cn, but it’s even more wrong to call it “Cantonese”, which it’s not.

As to “yue-shumianyu”, let me make an analogy. Scholars used to have to write in Latin, even if they spoke English. We could have called the Latin that an English scholar used to write en-mediaeval-academic; it’s still not a variant of English, it’s Latin, even if English academics wrote in a distinct variant of Latin. This is exactly what “yue-shumianyu” is. The correct code for this “variant” is zh.

@Pginer-WMF – As to “helping the MT”. I’m really sorry if you didn’t understand how much of a waste of time it is even though I mentioned my experience. It’s just not practical to fix any MT longer than a stub. Also, in case you’re not aware of this, we are literally working out basic things like orthography (I’d assume many other non-Mandarin Chinese languages are doing the same thing and I have reason to believe we’re actually quite far ahead of some of them). There is a reason we consider ourselves a small language community. The MT will not learn anything from user input even if our editors are superhuman.

@Pginer-WMF – At this point I’d suggest that not only should MT be disabled (not reverted to opt-in, but outright disabled), both data sets should be thrown out as invalid data, and any data in the MT should be nuked because they have been poisoned by the two invalid data sets. If WMF still believe MT had any merit for yue, everything should be restarted from scratch.

I apologise for a mistake I made above. I misread "Yue-Kouyu" as "Yue-Kuoyu". The so-called "Kouyu" (口語; Mandarin Chinese word for "Spoken Language") would indeed be closer to what's being used in yue.wikipedia.org. However, Bing does not seem to generate "Kouyu", but is slightly better than NLLB-200 since it does generate commas properly and does sprinkle some uniquely Yue words here and there. Otherwise it generates mostly cmn content.

I want to note that for individual, commonly used phrases, Bing does generate accurate responses more often, e.g.

| Original | Bing | Bing explanation | NLLB-200 | NLLB explanation |
| --- | --- | --- | --- | --- |
| You are welcome | 唔使客氣 | "You are welcome" ✅ | 你好,歡迎 | "Hello,welcome" ❌ |
| Good morning | 早晨 | "Good morning" ✅ | 您好,早上 | "Hello (honorific),morning (noun)" in cmn ❌ |
| Nice to meet you | 見到你好开心 | "I am very glad to see you" in yue-Hans ❌ | 歡迎認識你 | "Welcome to learn about you" in cmn ❌ |
| Hello | 你好 | "Hello" ✅ | 你好,你好 | "Hello,hello" ❌ |
| I'm thirsty | 我口渴喇. | "I'm thirsty。" ⚠️ wrong punctuation | 我渴了 | "I'm thirsty" in cmn ❌ |
| What is the Wi-Fi password? | Wi-Fi密碼係乜嘢? | "What's the Wi-Fi password?" ✅ | Wi-Fi 密碼係咩? | "What's Wi-Fi password?" ⚠️ wrong punctuation |
| How much does it cost? | 幾多錢? | "How much?" ✅ | 價格係幾多? | "How much is the price (economic term)?" ❌ |

I can only describe the NLLB translations as comical.

And of course, NLLB just fails completely when presented with anything more than the above.

Thanks for the additional clarifications and examples @H78c67c, and @Al12si.
Based on this, I understand that there is a huge distance between what NLLB-200 currently provides for Yue and what is useful to contribute to Yue Wikipedia. I think, based on all this, that it makes sense to disable NLLB-200 for Yue for now. I'll create a ticket to capture this.

On a related note, a long time ago, due to the lack of specific MT for Cantonese, it was requested to enable Google Translate support for Mandarin Chinese in Traditional script (zh-TW) when translating to Yue as a potentially useful starting point. Does it make sense to keep that? Would it make sense to enable a similar variant for NLLB-200 too (you can try it here by selecting "Chinese traditional")?

I also added a comment in the current request for Bing Translate (T90207#8692310) to mention that it would help Yue Wikipedia.

@Pginer-WMF – As to “helping the MT”. I’m really sorry if you didn’t understand how much of a waste of time it is even though I mentioned my experience. It’s just not practical to fix any MT longer than a stub.

Sorry for the confusion. I was listing all options. Learning from the corrections is an approach that makes sense only when the machine translation is provided for the same language as the wiki. Based on the information shared, it does not seem to be the case here.

@Pginer-WMF I have done some sloppy work on “translating” zh-TW to yue-HK using string substitution (you can probably find it on github, it’s there; it’s very project-specific). For sure a proper ML network would perform much better, but I’d be inclined to say it’s still going to be a very inexact art. This has to do with the difficulty in parsing Chinese.
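
For readers unfamiliar with the approach, string substitution of this kind can be sketched as a longest-match dictionary replacement. This is not the project-specific tool mentioned above: the function name and the tiny word table are assumptions chosen only for illustration.

```python
import re

# Hypothetical, tiny Mandarin-to-Cantonese word table (illustration only).
SUBSTITUTIONS = {
    "他們": "佢哋",
    "沒有": "冇",
    "什麼": "乜嘢",
    "的": "嘅",
    "是": "係",
    "不": "唔",
}


def substitute(text: str, table: dict = SUBSTITUTIONS) -> str:
    """Replace every table key found in the text, trying longer keys first."""
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: table[match.group(0)], text)


print(substitute("他們是學生"))  # each matched word is replaced independently
print(substitute("目的"))        # the "的" inside this word is replaced as well
```

The second call shows why the result is inexact: without word segmentation, the substitution cannot tell a grammatical particle from a character that merely forms part of another word.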

Google Translate is terrible. It lumps zh-CN, zh-HK and zh-TW together as if Chinese were a single unified language. So no, keeping Google Translate, even if we started with zh-TW, would not help. zh-TW > yue is going to be very inexact. I’ve in fact done some work involving some unknown variant of zh-Hant, from a large publisher; I almost had a heart attack because I thought the book was full of errors. It turned out the book was written in zh-TW (my impression is that the Taiwanese community in Canada is relatively small), so no, depending on context and whether context is even available, even zh-HK and zh-TW can be very different. And since zh-TW used to mean “any traditional Chinese”, we can’t really trust that zh-TW is really Taiwanese.

BTW, I won’t be so concerned with wrong punctuation, provided the MT engine, as a final step, converts the punctuation (most of the time punctuation can be converted). However, this can’t be said of the comma and quotation marks (and maybe also the period) – I’d be a lot more concerned about any MT engine that gets commas, quotation marks or periods wrong (this also applies to zh-TW and zh-HK, or any zh-Hant in general for that matter, if zh is ever going to be split).

In Chinese languages written using the traditional script, we distinguish between two kinds of commas: the so-called CJK comma (、) separates short items in a list, and the so-called full-width comma (,) functions as a normal comma. The English comma (,) for Chinese text is only used in zh-CN.

Also, we use traditional sinicized quotation marks (「」, 『』, called corner brackets in ja) for quotation marks. We do not use English quotation marks for Chinese text. (Again, English quotation marks for Chinese text are only used in zh-CN.)

So for punctuation, any MT chosen will need to distinguish between the two kinds of commas, and use the correct kind of quotation marks.

The normal period we use is the so-called CJK period (。). In theory, any sentence-ending period (if the MT is able to distinguish between sentence-ending periods and other kinds of periods) can be converted to a CJK period. In practice, you’ll need to be very careful because there is also the so-called full-width period (.) which is used in zh-TW as a separator.
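
The mechanical part of such a final conversion step might look like the sketch below. It is a simplified illustration under stated assumptions: the function name and the regular expressions are hypothetical, only the basic CJK Unified Ideographs range is considered, and the code cannot decide when a comma should become the list comma (、) rather than the full-width comma, nor handle the zh-TW separator period. Those cases still need a human or a much smarter model.

```python
import re

CJK = r"[\u4e00-\u9fff]"    # basic CJK Unified Ideographs only (simplification)
FULLWIDTH_COMMA = "\uff0c"  # the full-width comma
CJK_PERIOD = "\u3002"       # the CJK period


def normalize_punctuation(text: str) -> str:
    """Convert some half-width punctuation in Chinese text to traditional forms."""
    # A half-width comma right after a CJK character becomes a full-width comma.
    text = re.sub(rf"(?<={CJK}),\s*", FULLWIDTH_COMMA, text)
    # A period right after a CJK character, at the end of the text or before
    # whitespace, is treated as sentence-ending and becomes a CJK period.
    text = re.sub(rf"(?<={CJK})\.(?=\s|$)", CJK_PERIOD, text)
    # Paired English curly double quotes become corner brackets.
    text = re.sub(
        "\u201c([^\u201d]*)\u201d",
        lambda match: "\u300c" + match.group(1) + "\u300d",
        text,
    )
    return text
```

Commas and periods that follow Latin letters or digits (for example in "1.5") are deliberately left untouched.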

zh-CN also has different punctuation rules regarding commas and periods. zh-CN punctuation is closer to modern English; zh-HK, zh-TW and yue-Hant (we don’t accept yue-Hans on yuewiki) use punctuation that’s closer to French (as commonly used, not as prescribed) or closer to 19th century English. This is an additional reason why MT must not start from anything derived from zh-CN.

I created a ticket (T333835) to remove the current machine translation support provided by NLLB-200 and Google.

In terms of improving the translation support for Cantonese in the long term, it may be interesting for the Cantonese Wikipedia community to connect with the Opus project. They are collecting multilingual data with an open license that is used to train machine translation systems. I'm sure they will be very interested to (a) verify their data for Yue is actually Yue, and (b) incorporate new data sources you may suggest in order to expand the coverage for Yue. For now it seems that what Opus has is based on Wikipedia translations (sample), Mozilla localization (sample) and Tatoeba (sample):

[Image: Screenshot 2023-04-03 at 16.16.52.png]

The Opus project does look interesting. Thank you for the idea! I am curious how some poorly translated MT sentences and old (bad) revisions of translated sentences managed to get into their Wikimedia dataset, and I'd love to help provide some input to improve yue corpora.


The data from published translations using Content Translation has been integrated into the Opus project. Making good translations of Wikipedia articles using Content Translation will help expand the corpus with higher quality examples. I don't know if there is an option to remove data that could be identified as low quality, but providing more good quality examples will help.

In addition, contributing to other projects integrated in Opus, such as Tatoeba, may be another simple option (Mozilla localization seems a more specialized/complex task, but may be interesting for some too). Also, if you know of other similar projects that have multilingual resources that can expand Opus with more coverage for Cantonese, you can share that with the Opus team for them to integrate those.

We can consider exposing the Opus model for initial machine translation on a test instance to check if it is minimally useful (T333969).

While this is certainly interesting, as the Published Translations page already acknowledges,

When automatically aligning the sentences, it is good to remember that the translations do not necessarily match 1:1.

Looking through the Opus en-yue samples from Wikimedia, the issue is definitely present. Many longer translated sentences are not at all lined up with the original text.

This is something that should have been foreseen, as this is the defining quality that sets manual translations apart from the low quality MTs. Translations often don't line up the way ContentTranslation hopes to enforce. There should be some manual intervention or review to determine/confirm what context each sentence corresponds to, and not just let automated tools do their work, because the assumptions they work upon are fundamentally flawed.

Update: We've disabled machine translation to Cantonese Wikipedia (See: T333835)

@Pginer-WMF I noticed this diff article: https://diff.wikimedia.org/2023/06/13/mint-supporting-underserved-languages-with-open-machine-translation/, and I think it's worth reiterating that there is no such thing as the "yue-Shumianyu" "variant". "Shumianyu" is a cmn label for written Mandarin (cmn/zho), and cmn and yue are mutually unintelligible. Again I have been unable to locate any information about the NLLB training data for yue, so unfortunately I cannot comment directly on its validity; but seeing that the "yue" dataset for FLORES is entirely not yue, it would not surprise me that the NLLB data is equally invalid and in a completely different, mutually unintelligible language.

@Pginer-WMF I noticed this diff article: https://diff.wikimedia.org/2023/06/13/mint-supporting-underserved-languages-with-open-machine-translation/, and I think it's worth reiterating that there is no such thing as the "yue-Shumianyu" "variant". "Shumianyu" is a cmn label for written Mandarin (cmn/zho), and cmn and yue are mutually unintelligible.

Thanks for the reminder. Based on the above community input, we won't be using NLLB-200 to support Cantonese.
For Cantonese we are considering enabling the translation model based on OpusMT (T333969). For Cantonese the Opus project seems to be using data from Tatoeba (sample) and Mozilla localizations (sample). It would be great to get confirmation on whether the data corresponds to the Yue language. If the samples look promising or more evaluation is needed, we can enable the OpusMT support to try it in a more realistic scenario. In any case, if the feedback suggests that OpusMT is not useful, we are happy to remove it.

I also captured that integrating Bing (T90207) could be helpful for Yue, but that is an independent track.

KartikMistry changed the task status from Open to Stalled. Jul 3 2023, 6:18 AM

A user left a new comment on the village pump in response to @UOzurumba's comment, and I'd like to bring it up here so it doesn't get lost when archived. It's honestly audacious and incredibly disrespectful to treat the lack of response as community consensus to allow this to be pushed forward, especially when the original announcement was vague and ambiguous, and that the entire notice was never translated to Cantonese, denying users who don't know English from voicing their concerns. Sincerely I would like to ask everyone involved in that decision to be mindful of this when introducing and communicating any further changes.

A user left a new comment on the village pump in response to @UOzurumba's comment, and I'd like to bring it up here so it doesn't get lost when archived. It's honestly audacious and incredibly disrespectful to treat the lack of response as community consensus to allow this to be pushed forward, especially when the original announcement was vague and ambiguous, and that the entire notice was never translated to Cantonese, denying users who don't know English from voicing their concerns. Sincerely I would like to ask everyone involved in that decision to be mindful of this when introducing and communicating any further changes.

Thanks for sharing the comment on the ticket.

Content Translation was made available as a beta feature in 2015, and data shows it produced a positive impact. Since then, we have enabled the tool by default in 182 Wikipedias (out of over 300 Wikipedias). In each case we wanted to ping the communities to make them aware of the change and use it as an opportunity to hear about how the tool is working for them, to make adjustments as needed.

We can make more efforts to reach more people, explain changes better and involve more people of the community, but I don't consider our approach "incredibly disrespectful". We are approaching this announcement in a very honest way and are willing to adjust or revert any decisions based on the community input. For example, the communication with Cantonese Wikipedia has been useful to identify issues with the machine translation not providing support for the right language and we disabled it in response (T333835).

why the F you switch it off completely? this is a big step backward from T199523 and T258919.

i'm not gonna contribute to yuewp without the convenient aid of CX.

people can still machine translate with or without the tool, but users like me willing to make good use of it are now prevented from using CX which saves a lot of time on many timeconsuming trivial things including but not limited to formatting, syncing of references and categories...

We can make more efforts to reach more people, explain changes better and involve more people of the community, but I don't consider our approach "incredibly disrespectful". We are approaching this announcement in a very honest way and are willing to adjust or revert any decisions based on the community input.

I have no doubts your team is approaching the work here in good will, but the communication has proven ineffective, hence my previous comment, forwarding the local sentiment in hopes to push for more inclusive dialogue going forward.

Regarding machine translation, I am not opposed to re-enabling Google Translate to zh-tw for experienced users who know what they are doing. Google Translate works well enough for Mandarin, and can be helpful to users who are more familiar with Mandarin than other source languages.

Per community request, machine translation was disabled for Cantonese (T333835). The translation provided by NLLB-200 models was not considered useful. We are working to integrate OpusMT models (T333969) that also support Cantonese. These are based on Wikipedia translations (sample) from Cantonese Wikipedia and other sources such as Tatoeba (sample).

Once the OpusMT model is available, we are interested in hearing from the community about whether OpusMT support for Cantonese is useful as a starting point for translations or whether it is better to start from scratch.

In the case that OpusMT models are not useful, it would be good to get consensus on whether it is useful to expose the Traditional script variant of Chinese MT available in services such as Google Translate (T258919).