Remove Machine Translation from Content Translation on Indonesian Wikipedia
Closed, Resolved · Public

Description

Per https://id.wikipedia.org/w/index.php?title=Wikipedia:Warung_Kopi_(Teknis)&oldid=14962460#Mesin_penerjemah, the Indonesian Wikipedia community has decided to remove machine translation from the Content Translation tool/extension.

Related Objects

Mentioned In
T299636: Disable ContentTranslation for non-extended confirmed users on viwiki
T286636: Measure the number of wikis were translations are deleted more often than new articles
T246324: List top Wikipedias with high deletion ratios for Content Translation
T228971: Adjust the threshold for Indonesian to prevent publishing when overall unmodified content is higher than 70%
T222882: Article cannot be published without any visible sign other than console error
T222782: Adjust the threshold for Indonesian to prevent publishing when overall unmodified content is higher than 40%
T221930: Threshold to prevent publishing needs more precision
T221353: Make more strict the check for unmodified content for the whole document on Indonesian Wikipedia
Mentioned Here
T228971: Adjust the threshold for Indonesian to prevent publishing when overall unmodified content is higher than 70%
T222905: CX2: Restoring logic broken when article starts with empty paragraph
T222882: Article cannot be published without any visible sign other than console error
T215403: CX2: Don't apply unmodified content restrictions to translations published to the user namespace
T222779: Update unmodified content error and warning messages to distinguish them better
T222782: Adjust the threshold for Indonesian to prevent publishing when overall unmodified content is higher than 40%
T203377: CX2: Additional details for too much unmodified content error
T221930: Threshold to prevent publishing needs more precision
T221353: Make more strict the check for unmodified content for the whole document on Indonesian Wikipedia
T221359: Adjust publishing restrictions based on the number of paragraphs affected and previous translations by the user
T86700: Add more machine translation services (tracking)

Event Timeline

Thanks for reporting this, @Aldnonymous. Getting feedback about how our tools work on different languages is very useful. Before considering disabling machine translation, which would also affect users making good use of it, we want to explore how to better adjust the current system to prevent the publication of content that contains too much unreviewed machine translation. I started the conversation on Indonesian Wikipedia to get further feedback. The translated version of the message is shared below:

Thanks for sharing your concerns about the quality of translations. We want to help improve this situation, but disabling machine translation may not be the best option since it will also impact those users making good use of the tool. Instead, we would prefer to limit the use of the tool only for those users potentially misusing it.

Before considering disabling machine translation for everyone, we would like to adjust the current mechanisms to explore how much we can isolate the low-quality contributions from the positive ones. As we make adjustments it would be very useful to continue receiving feedback to better understand the current irregular flow of deleted translations, and determine what we can adjust to make the tool better.

In particular we would like to hear about the following questions:

  1. Your impression about the tracking category: is it working as expected? Is it not strict enough, leaving out many other problematic articles? Is it too strict, including articles that are not really problematic?
  2. Feedback about the current thresholds. Based on your experience, do you think it would help to show warnings more widely to users and/or limit more the options to publish when they don’t meet these thresholds?

Please help us get a better understanding of the current situation in order to inform our decisions.

For further context: Content translation keeps track of how much of the machine translation the user edits on each paragraph. It shows a warning when there is 80% or more of unmodified machine translation on any paragraph, and the translation is added to a tracking category (https://id.wikipedia.org/wiki/Kategori:Halaman_dengan_terjemahan_tak_tertinjau) if it is published without being edited further. For cases where there is 99% or more of unmodified content for the whole document, the user is prevented from publishing the translation.
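
As a rough illustration of the rules just described, the following Python sketch applies the 80% per-paragraph warning and the 99% whole-document block. How Content Translation actually measures unmodified content is not stated in this thread; the token counts, the `Paragraph` class, the `evaluate` function, and the token-weighted document ratio are all assumptions made for the example.

```python
from dataclasses import dataclass

# Percentages quoted in the comment above.
PARAGRAPH_WARNING = 0.80  # warn when a paragraph is >= 80% unmodified MT
DOCUMENT_BLOCK = 0.99     # block publishing when the document is >= 99% unmodified


@dataclass
class Paragraph:
    # Hypothetical measure: token counts of the machine-translated output.
    mt_tokens: int
    unmodified_tokens: int

    @property
    def unmodified_ratio(self) -> float:
        return self.unmodified_tokens / self.mt_tokens if self.mt_tokens else 0.0


def evaluate(paragraphs):
    """Return warnings, the document-level ratio, and the publishing decision."""
    warn = [i for i, p in enumerate(paragraphs)
            if p.unmodified_ratio >= PARAGRAPH_WARNING]
    total = sum(p.mt_tokens for p in paragraphs)
    unmodified = sum(p.unmodified_tokens for p in paragraphs)
    # Assumption: the whole-document figure is weighted by token count.
    document_ratio = unmodified / total if total else 0.0
    can_publish = document_ratio < DOCUMENT_BLOCK
    return {
        "warn_paragraphs": warn,
        "document_ratio": document_ratio,
        "can_publish": can_publish,
        # Published with warnings still present -> tracking category.
        "tracking_category": can_publish and bool(warn),
    }


result = evaluate([Paragraph(100, 90), Paragraph(100, 20)])
print(result)
```

In this example the first paragraph triggers a warning, but the translation can still be published and would land in the tracking category, which is the situation the community is reporting.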

Since each language is different, the above percentages may need adjustments for Indonesian. That is an iterative process that requires input from the Indonesian community, but we think it can help reduce the number of problematic articles without limiting those editors making good use of the tools.

Appreciate your help with this!

Hello, and sorry, but the concern is valid. We just don't have enough hands to monitor and fix those translations. This has already crippled more than half of our editors: instead of creating articles, they now seem to only fix bad translations. The translators are not experts in the subjects they try to translate, and the editors fixing the articles are not experts in those subjects either. Even after the translation is fixed, article quality is so sub-par that roughly 30% or more of them get deleted. The users who keep making these bad translations end up blocked, whether for bad translation, miscategorized vandalism, and/or cross-language disruption, and there are countless of them. I don't want to see this happen. The deletions and blocks continue to this day, and I am powerless to stop them. Again, sorry, and thank you for replying to us. I hope this problem will be resolved in the future.

Hi, the community here has decided to disable Machine Translation because it is creating many problems for us. The English Wikipedia has also implemented the same measure (disabling machine translation in content translation) for the same reason: "raw or lightly edited machine translations have long been considered by the English Wikipedia community to be worse than nothing."
The new feature of preventing the publication of unmodified translation does not really solve the most fundamental problem that we encounter here: inaccurate and unnatural translation. Machines cannot understand context. As a result, it often produces wrong or unnatural translation. So even if someone decides to modify some words in the translation to pass the technical hurdle, it often still contains wrong or unnatural translation.
As an example, this article is translated by using machine translation. The user has modified it enough to pass the technical hurdle, but the result is still very awkward and unnatural, and even inaccurate in some cases. As a result, the effort to nominate this article as a featured article has failed twice. See: https://id.wikipedia.org/wiki/Wikipedia:Artikel_pilihan/Usulan/Akra_(benteng) and https://id.wikipedia.org/wiki/Wikipedia:Artikel_pilihan/Usulan/Akra_(benteng)/2 .
We want to disable this feature to encourage users to translate in their own words and to take their time in doing so, so that the result will be of high quality. The best translations here are made manually, not by relying on machines. See Kleopatra, Abad Pertengahan, dahagi di atas Bounty, etc., which were translated manually rather than by machine; the results are outstanding and cannot be compared with low-quality machine translation.
So could you please just disable the machine translation, in accordance with the will of the community?
Thank you.

As another active editor/admin in the Indonesian Wikipedia, I unfortunately have to agree with Aldnonymous, Mimihitam, and the 100% consensus in our community talk page that this should be disabled.

The problem is that the resulting translation is worse than nothing at all. They're bad both in terms of accuracy (mistranslations) and style (the translated text does not look like something one would naturally write in the Indonesian language). As for "users making good use of it", so far I haven't found any. All of our high-quality articles are manually written, and I'm not sure if there's even one medium quality article that's machine generated. Just for curiosity I did try using the machine translation tool several times. I ended up getting frustrated because it's much harder to fix the poor translation than just starting fresh. You have to evaluate every word, decide whether they're good or bad, replace, and then only to find that the end result is still bad and doesn't read naturally. The only good part is that it copies the formatting (e.g. images, refs, categories, etc.) but this small advantage is far outweighed by the stress involved in trying to "fix" the translation, unfortunately.

That's just from a user's point of view. From the community point of view, the proliferation of these poorly translated articles causes a lot of burden on admins or patrollers. We have to inspect every article, evaluate it, and then argue or "prove" that it's bad (even though the badness is actually obvious), all of which take a lot of time. Creating a poorly translated article is relatively easy (just a few clicks), but patrolling and evaluating them takes much more time, so it's a losing battle. We'd rather have our best writers write new articles rather than wasting their time inspecting and fighting poor translations.

So, please respect the decision of the community here. If you'd like more specific feedback, feel free to reach out to me either in my id.wp or en.wp talk page. I'll be happy to talk to you constructively. Thanks!

Also it is worth noting that there are more than 250 pages in https://id.wikipedia.org/wiki/Kategori:Halaman_dengan_terjemahan_tak_tertinjau and the number is still increasing due to the availability of machine translation feature.

We don't have enough human resources to deal with these low-quality translations; we have enough problems on our hands.

So please just disable the feature in accordance with the community consensus.

Thank you.

Thanks for the replies. I agree that the presence of “raw or lightly edited machine translation” is very problematic in the ways all of you described. We all want to prevent those bad translations. The main point of our proposal is precisely to prevent them, while not preventing also “heavily and carefully edited machine translation”.

I also agree that the current thresholds are not working right now, but that is the reason that we propose to adjust them. Getting the adjustments right from the beginning is hard since we are not native Indonesian speakers, but we can quickly iterate and adjust these based on your feedback.

By adjusting the thresholds we can enforce that only heavily edited machine translation is allowed for publishing. What we need to know is, based on the current quality of machine translation, how heavily a paragraph needs to be edited to no longer be considered “raw or lightly edited machine translation”.

Looking at the stats, there are reasons to believe that, while machine translation is misused in many translations that get deleted, it also seems to be used well in many other cases. Bad translations are very visible and problematic, but disabling machine translation completely also carries the risk of losing the good translations that started from it. This is the main reason we want to make initial iterations adjusting the thresholds based on your feedback first, evaluate the results, and then consider more drastic measures.

Also it is worth noting that there are more than 250 pages in https://id.wikipedia.org/wiki/Kategori:Halaman_dengan_terjemahan_tak_tertinjau and the number is still increasing due to the availability of machine translation feature.

Thanks @Mimihitam. Since you may have been looking through some of these articles, it would be very useful for us to know about the following:
Based on your experience, do you see that most of the translations added to the category are problematic (and should be deleted)? Would it make sense for Content translation to prevent users from publishing all of these? or there are also good articles that have been classified as unreviewed by mistake in the category (and should not be deleted)?

@Pginer-WMF why are you really really insisting on maintaining this feature? For us, the solution is clear and simple: disable machine translation.

"It seems to also to be used well in many other cases" and "good translations that have also started from it" are only a figment of your imagination. We, the actual part of the community, have tried to tell you here the reasons why this is simply not working, and that the best translations in the Indonesian Wikipedia are made manually (never with machine translation, none of them managed to reach featured article status), yet you simply ignore our feedback because you are so hesitant with this feature (I don't know why you are so madly in love with it). Sorry if I'm being stern, but it's just annoying when we have clearly stated that the solution that we want is to disable machine translation, yet you seem to try to force us to accept this horrible feature. As a note, not all admins have the time to patrol these machine translations, many of them managed to slip away from our sight and now are part of the statistics that you are using.

While we're at it, I have just deleted this page because the translation is very unnatural, even after the user has passed the technical threshold: https://id.wikipedia.org/wiki/Kingdom_of_al-Abwab

Clearly whatever technical threshold you are proposing will not solve the most fundamental problem posed by machine translation: the result is simply too unnatural and inaccurate. This is the main reason that the community has decided to remove this feature, just like the English Wikipedia. So please just remove the feature, thank you. We want to encourage users to take their time and use their own words to produce high-quality translations. Machine translation does not produce quality, it produces mediocrity.

@Mimihitam, I'll try to provide some clarifications below:

"It seems to also to be used well in many other cases" is only a figment of your imagination.

Last week, according to the stats, 149 translations were created with Content translation; of those, 29 have been deleted (19%). Of the other 120 translations, I expect several to be using machine translation, since feedback suggests that it is used in a lot of the translations and it is provided by default.

As a reference, during 2018 on Indonesian Wikipedia the deletion ratio was 9% for articles created with Content Translation, and 12.6% for new articles started from scratch.
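
For reference, the weekly figure quoted above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the `deletion_ratio` helper is a hypothetical name, and the inputs are the counts stated in the comment (149 created, 29 deleted).

```python
def deletion_ratio(deleted: int, created: int) -> float:
    """Share of created pages that were later deleted."""
    if created <= 0:
        raise ValueError("created must be positive")
    return deleted / created


created, deleted = 149, 29
weekly = deletion_ratio(deleted, created)
print(f"deleted last week: {weekly:.0%}")           # deleted last week: 19%
print(f"translations not deleted: {created - deleted}")  # translations not deleted: 120
```

The ratio only counts pages that have already been deleted, and a single week is a small sample compared with the 2018 yearly figures of 9% and 12.6%.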

We have tried to tell you here the reasons why this is simply not working, and that the best translations in the Indonesian Wikipedia are made manually, yet you simply ignore our feedback.

I'm not ignoring your feedback. I'm trying to find solutions that solve the problem you describe without introducing another problem at the other end. Leaving out those making good use of machine translation seems problematic, and I think it is worth trying to solve the problem without leaving these people out.

Sorry if I'm being stern, but it's just annoying when we have clearly stated that the solution that we want is to disable machine translation, yet you seem to try to force us to accept this horrible feature.

I understand that it is frustrating and time-consuming to review low-quality articles, and I appreciate the clarity of the feedback. I consider this a high priority to focus on, and we'll try to act as quickly as possible, but I want to clarify that I'm not trying to make you accept the current feature, but asking you to help us improve it.

While we're at it, I have just deleted this page because the translation is very unnatural, even after the user has passed the technical threshold: https://id.wikipedia.org/wiki/Kingdom_of_al-Abwab

The current technical thresholds are not suiting the needs of Indonesian Wikipedia community. What I propose is to adjust them. The goal is for the adjusted thresholds to prevent translations like this one from being published in the first place.

Clearly whatever technical threshold you are proposing will not solve the most fundamental problem posed by machine translation: the result is simply too unnatural and inaccurate. This is the main reason that the community has decided to remove this feature, just like the English Wikipedia.

Well, a hypothetical threshold that requires changing 100% of the machine translation would have the same effect as disabling it, since it would require users to rewrite the contents completely (even when machine translation correctly translates a short section title such as "History" as "Sejarah"). This seems excessive, but it illustrates that what I'm proposing can get as close as needed to obtaining the same results. What I'm trying to find is a better-adjusted threshold that prevents problematic translations while leaving room for good use of machine translation.

Last week, according to the stats, 149 translations were created with Content translation, from those 29 have been deleted (19%). From the other 120 translations, I expect several of those to be using machine translation since feedback suggests that it is used by a lot of the translations and is provided by default.

As a reference, during 2018 on Indonesian Wikipedia the deletion ratio was 9% for articles created with Content Translation, and 12.6% for new articles started from scratch.

Once again, as I have stated, we admins don't have the time to patrol all of the pages, since we also have to prove which part of it is actually bad. As a result, many pages slipped away from our sight.

But you should know that one of the main contributors to your statistics was this user: https://id.wikipedia.org/w/index.php?limit=50&title=Istimewa%3AKontribusi+pengguna&contribs=user&target=Adesio2010&namespace=&tagfilter=&start=&end=

Nobody watched her activities for a long time until I realized that she made this really poor machine translation: https://id.wikipedia.org/wiki/Prancis_Vichy (as you can see by the template on the top of the page).

I proceeded to delete some of her articles also. After I gave her a warning, she has renounced machine translations and now she is translating manually with a much better quality.

This is why you should not rely too much on these statistics and think that you know better than the Indonesian community. Many bad translations simply slipped away, and honestly, as HaEr48 has pointed out, we don't have enough resources to deal with this problem, which is why we want to disable this feature.

I'm not ignoring your feedback. I'm trying to find solutions that solve the problem you describe without introducing another problem at the other end. Leaving out those making good use of machine translation seems problematic, and I think it is worth trying to solve the problem without leaving these people out.

By writing this, you are once again ignoring our feedback, because we have told you about the bigger problems caused by the machine translation feature, which far outweigh the imaginary "good use of machine translation" that you keep insisting on. As I've told you many times, the result of machine translation, even if it is modified, is very mediocre, and it only becomes good if a user uses his/her own words.

Well, a hypothetical threshold that requires changing 100% of the machine translation would have the same effect as disabling it, since it would require users to rewrite the contents completely (even when machine translation correctly translates a short section title such as "History" as "Sejarah"). This seems excessive, but it illustrates that what I'm proposing can get as close as needed to obtaining the same results. What I'm trying to find is a better-adjusted threshold that prevents problematic translations while leaving room for good use of machine translation.

Which is why it is better to simply disable machine translation. The only threshold that will work to deal with the problem of unnatural and inaccurate translation caused by machine translation is 99% threshold, which would be too onerous. Even if you set it at 50 to 80%, it would puzzle new contributors who are unaware of this limitation. It's better to just disable it and let them use their own words. At the end, in the Indonesian Wikipedia, we believe that translating is not mere decoding (which is what the machines are doing). It is an art of retelling a story to ensure that it would be understandable by speakers who don't speak English at all.

This is frustrating. I don't have time to watch all these garbage translations, and it's a known fact that machine translation has been garbage for a long time. If only 9% of them are said to have been deleted, that is because the deletion tag or the report (nomination for deletion) is still there and has not been processed. As I said previously, we don't have enough time to deal with garbage translations; most of us are still busy writing our own articles. I plead with you: please just remove machine translation. The community has spoken and decided not to use it at all, and the consensus is clear. Thank you.

This is why you should not rely too much on these statistics and think that you know better than the Indonesian community. Many bad translations simply slipped away, and honestly, as HaEr48 has pointed out, we don't have enough resources to deal with this problem, which is why we want to disable this feature.

Thanks for surfacing the patrol coverage aspect. It makes perfect sense to consider that deletion ratios do not account for content that has not been patrolled. I'm not claiming to know the community better than anyone; on the contrary, I'm just trying to better understand the problem with the tools I have at hand, including talking with community members to learn more about the issues.

For this particular aspect, I assume that the non-patrolled pages are those marked as "unchecked page" in Recent Changes. I made a query for the pages created with Content Translation in the last two weeks and found that 60 were marked as "unchecked" out of 116. That's 36% of unpatrolled pages. If we revisit the numbers from last week I mentioned above to remove the 36% of unpatrolled pages, there would still be 77 pages out of the 149 pages created (51%) which have been reviewed and not deleted. Are those numbers aligned with your impressions? Did you expect more or fewer pages to go unpatrolled?
These are just rough numbers on a particular period, and may not be accurate, but I think they are helpful to identify which additional aspects we need to consider to better characterize the problem.
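
The adjustment described above can be retraced with a short Python sketch. The `reviewed_and_kept` helper is a hypothetical name, and the 36% unpatrolled share is taken exactly as stated in the comment. Note that 60 of 116 works out to about 52% rather than 36%, so one of the quoted figures may be misreported; the sketch does not try to guess which one was intended.

```python
def reviewed_and_kept(created: int, deleted: int, unpatrolled_share: float):
    """Estimate pages that were both reviewed and not deleted.

    Applies the unpatrolled share to the pages that survived deletion,
    following the reasoning in the comment above.
    """
    surviving = created - deleted
    kept = round(surviving * (1 - unpatrolled_share))
    return kept, kept / created


kept, share = reviewed_and_kept(created=149, deleted=29, unpatrolled_share=0.36)
print(kept, f"{share:.1%}")   # 77 51.7%

# The raw counts quoted for the two-week query, for comparison.
print(f"{60 / 116:.1%}")      # 51.7%
```

The estimate assumes that unpatrolled pages are spread evenly among the pages that were not deleted, which the thread does not confirm.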

Well, a hypothetical threshold that requires changing 100% of the machine translation would have the same effect as disabling it, since it would require users to rewrite the contents completely (even when machine translation correctly translates a short section title such as "History" as "Sejarah"). This seems excessive, but it illustrates that what I'm proposing can get as close as needed to obtaining the same results. What I'm trying to find is a better-adjusted threshold that prevents problematic translations while leaving room for good use of machine translation.

Which is why it is better to simply disable machine translation. The only threshold that will work to deal with the problem of unnatural and inaccurate translation caused by machine translation is 99% threshold, which would be too onerous. Even if you set it at 50 to 80%, it would puzzle new contributors who are unaware of this limitation. It's better to just disable it and let them use their own words. At the end, in the Indonesian Wikipedia, we believe that translating is not mere decoding (which is what the machines are doing). It is an art of retelling a story to ensure that it would be understandable by speakers who don't speak English at all.

I'm not sure why you consider that a 99% threshold is the only one that could possibly work. I think it is worth trying some threshold that is high enough but still allows people to use machine translation properly. Even if that does not happen much now, making it the only possible option will make it more likely to happen in the future (which seems better than preventing it).

It is also unclear to me that users would prefer not having machine translation rather than have it with limits to edit it significantly. In my experience, having some form of machine translation is a common expectation and when it has not been present we got requests to enable it (T86700).

For this particular aspect, I assume that the non-patrolled pages are those marked as "unchecked page" in Recent Changes. I made a query for the pages created with Content Translation in the last two weeks and found that 60 were marked as "unchecked" out of 116. That's 36% of unpatrolled pages. If we revisit the numbers from last week I mentioned above to remove the 36% of unpatrolled pages, there would still be 77 pages out of the 149 pages created (51%) which have been reviewed and not deleted. Are those numbers aligned with your impressions? Did you expect more or fewer pages to go unpatrolled?

We have not patrolled those which belong to the "unchecked page" category and also those which are OUTSIDE of it. Why are you assuming that pages that do not fall into that category would automatically be great???? We can see that you have a certain bias for machine translation here. As we have repeated here countless times, machine translation is unnatural and inaccurate. We don't have enough resources to patrol hundreds of machine translations in a month, which is WHY we decided to disable this feature. Why can't you respect this? You think that you know better than the community, it's just exasperating. Are you the overlord of the Indonesian Wikipedia now???

I'm not sure why you consider that a 99% threshold is the only one that could possibly work. I think it is worth trying some threshold that is high enough but still allows people to use machine translation properly. Even if that does not happen much now, making it the only possible option will make it more likely to happen in the future (which seems better than preventing it).

It is also unclear to me that users would prefer not having machine translation rather than have it with limits to edit it significantly. In my experience, having some form of machine translation is a common expectation and when it has not been present we got requests to enable it

Should I repeat this with a bold?

Machine translation does not solve the most fundamental problem we face: inaccurate and unnatural translation. It does not matter what kind of technical threshold you want to implement, manual translation is always better than machine translation.

This page was published without any "unchecked page" category in it, and yet it has a really bad translation that a maintenance template has been put on top of it: https://id.wikipedia.org/wiki/Prancis_Vichy

Indonesian Wikipedians are also not expecting machine translation to be available. They want it TO BE REMOVED. Now you really think that you know better than the community!!

This will be my last comment (until the community consensus is respected), I'm a bit fed up of these replies, just a suggestion, DO NOT SUPERVOTE COMMUNITY CONSENSUS. Good bye.

While we're speaking here, this user has just spammed 8 ARTICLES in a day, not to mention the previous articles that slipped away from the admin's sight: https://id.wikipedia.org/wiki/Istimewa:Kontribusi_pengguna/Ardfeb (all have just been deleted)

We don't have enough resources to deal with this, which is why we have decided to disable machine translation.

As I mentioned above, what we are proposing is to disable the use of machine translation for those users misusing it. I think that is aligned with the request of the community.

We can raise the limits as high as needed to prevent unreviewed translations from being published, but I've not heard the case for preventing those translations that make good use of machine translation (even if we assume they are a small minority now).

These are two high priority changes we plan to make to the system:

We expect these to significantly improve the current situation. If they do not have the desired effects we'll evaluate the next steps, which can include more drastic measures such as disabling machine translation for some or all users. Your feedback is important in this process.

@Pginer-WMF what the community wants is high-quality translation, clear and simple, and also a means to ensure that admins are not overwhelmed by horrible translations, as is currently the case thanks to your machine translation feature. This is why the community voted to disable this feature.

We don't want the Indonesian Wikipedia to become a google translate version of the English Wikipedia.

If you want to use the technical threshold as a middle ground, I would suggest the limit to be at least 30%, since our community rules clearly state that lightly edited machine translation is strictly prohibited. What we want is a translation that retells the story, not a mere robotic decoding.

It's either 30% or disable machine translation.

Please respect the community's decision.

If you want to use the technical threshold as a middle ground, I would suggest the limit to be at least 30%, since our community rules clearly state that lightly edited machine translation is strictly prohibited. What we want is a translation that retells the story, not a mere robotic decoding.

Thanks for the feedback. I adjusted the proposal in T221353 to use the 30% margin as a reference.
Once the threshold is updated we can try how this works in practice, check the effect in published and deleted translations, and hear about your impressions for further adjustments if needed.

To be honest, I'm still not happy about the subverting of the community's will here. But I think we can at least live with it for the time being, and monitor how effective this will be in stopping bad translations.

The initial adjustment of the thresholds just went live (T221353). It would be very helpful if you could help us check the following aspects to better understand the impact on Indonesian content:

  • Evaluate if the change is effective to avoid problematic translation. Review the translations created with Content translation focusing on those created after the change (April 23 7pm Jakarta time), and report cases of translations published with too much unreviewed machine translation, or the lack of them.
  • Evaluate if it is still possible to create good translations. Try to create a proper translation using machine translation. For testing purposes you can publish under your user namespace by selecting the settings icon next to the publish button. After editing the initial translation enough to make it read naturally in Indonesian you should be able to publish without errors.

Based on our initial tests, it seems the limits may now be a bit too strict, making it hard to translate elements such as infobox templates where it is OK to keep part of the contents, such as file names or numbers, unmodified. But we want to hear about your experience, since feedback from those speaking the language is really useful.

Thanks!

Hello Pginer-WMF, first I have to apologize for my tone / harsh words, I do hope we can work together in the future.

Now about this issue, Mimihitam seems to agree with 30%; we will keep monitoring this. Thank you.

Thanks for your message, @Aldnonymous.
We are learning from this process and exploring ways to identify these issues to adjust the tool behaviour before a community is flooded with too many translations to review. In any case, understanding the quality of the translations created is something we cannot do alone, so thanks for providing feedback and do not hesitate to continue doing so.

@Pginer-WMF Are you sure that the 30% threshold has been set? I tried it myself. I only changed one or two words in the first sentence and the translation was published! See https://id.wikipedia.org/wiki/Serangan_John_Brown_ke_Harpers_Ferry

@Pginer-WMF

This 30% threshold does not work at all. I tried it again in https://id.wikipedia.org/wiki/Kekuasaan_Venesia_di_Kepulauan_Ionia and I only changed 1 word, and it got published easily.

If you are wondering why we really want this feature to be shut down, see https://id.wikipedia.org/w/index.php?diff=15026001&oldid=15025985&title=Serangan_John_Brown_ke_Harpers_Ferry&type=revision

A good translator will never translate "Brown's party of 22" into "Partai Brown 22", because in English, that actually means "Brown 22 political party"!! This clearly demonstrates how horrible machine translation is.

Since this option has failed, please disable machine translation entirely as was agreed by the community in the first place. Do not supervote the community consensus. If you still refuse to do so, then clearly @Pginer-WMF thinks that he is the supreme overlord of the Indonesian Wikipedia who knows better than the community.

Thank you

Tag: @Aldnonymous @HaEr48

Hi @Mimihitam, thanks for testing and reporting.

I looked into that example and there is an error in the way the threshold is calculated for the whole document. It is failing heavily for the case you described. The threshold was originally designed for a different purpose (preventing the most prominent vandalism) where the error was less visible, and we quickly adapted it to respond to the problems of the Indonesian community, in order to prevent the wiki from being flooded with low-quality translations as soon as possible.

Regarding the overall effect of the change: since the threshold was updated, apart from your test translations, there have been 4 articles created using Content translation. Even if the threshold is malfunctioning now, the volume of translations seems to have reduced (in previous weeks stats show 50-100 translations per week). This decrease in the number of translations seems to be helping a bit to make the situation less critical, but I agree that this is not enough.

I captured the issue with the way this threshold is calculated (T221930) and marked it as high priority. I'd have preferred the system to be perfectly adjusted in the first iteration, but some iteration and feedback is needed in any collaboration before reaching conclusions. Since the system is not reliably counting the number of modifications, we didn't really have the chance to evaluate the content produced when we enforce that MT does not go beyond 30%.

As I mentioned in T219851#5122876, we are also in the process of making other improvements on how the thresholds work at the paragraph level (T221359) that will also help to prevent problematic translations. We'll try to make those adjustments without much delay, but we'll need feedback to understand the effect and catch issues. Your feedback is essential in this process. Thanks for sharing it.

@Pginer-WMF
What a bunch of bollocks. Someone just posted this unedited translation easily https://id.wikipedia.org/wiki/High-voltage_direct_current

Please disable machine translation as was agreed by the community. We've run out of patience. We are already fed up because you are supervoting the consensus, and now your so-called threshold does not even work..

The calculation of the total percentage of unmodified machine translation has been updated to make it more reliable (T221930). We welcome you to give it another try and report any issues you may find. In particular:

  • Evaluate if the change is effective to avoid problematic translation. Review the translations created with Content translation focusing on those created after the change (May 1st onwards), and report cases of translations published with too much unreviewed machine translation, or the lack of them.
  • Evaluate if it is still possible to create good translations. Try to create a proper translation using machine translation. For testing purposes you can publish under your user namespace by selecting the settings icon next to the publish button. After editing the initial translation enough to make it read natural in Indonesian you should be able to publish without errors.

We still have improvements planned before we consider this intervention to be complete, but getting early feedback is very useful in the process. Upcoming improvements include the following:

  • T203377: Explain the limits in more detail to let affected users know why they cannot publish.
  • T221359: Make the thresholds at the paragraph level more strict by considering the number of paragraphs with unmodified translation, and the number of previously deleted translations by the user.

Thanks!

@Pginer-WMF I tried to translate with my own words without using any machine translation, and CT still won't let me publish it. Your feature is so buggy, please just disable machine translation as mandated by the community in the first place.

Can you share in more detail what you did and which was the article you were trying to translate?
I tried to reproduce the issue but was not able to do so:

  • I started a new test article, selected the option to "start with an empty paragraph", wrote my own text, and was able to publish.
  • Then I used machine translation for another article and replaced most of it with my own text (just writing "test" for the example) and was able to publish it too.

Thanks!

@Pginer-WMF I've tried it again, and this time it works https://id.wikipedia.org/wiki/Joseph_Buttigieg

So I suppose the threshold is now working well, and it is as it should be.

But could you make one more adjustment to the feature? For the "add translation" button, could you set the default to be the source version instead? After that the translator can choose if he/she wants to switch to machine translation.

We only need this adjustment and if it is done, we can consider this case to be solved and we won't have to discuss this again.

Thanks.

Hi @Pginer-WMF. I just wrote my feedback there. Sorry for the late response.

Thanks for reporting this, @Aldnonymous. Getting feedback about how our tools work on different languages is very useful. Before considering disabling machine translation, which will affect also users making good use of it, we want to explore how to better adjust the current system to prevent the publication of contents that contain too much unreviewed machine translation. I started the conversation in Indonesian Wikipedia to get further feedback.

@Pginer-WMF the threshold is already working well. The only final adjustment that needs to be made is that source version should become the default instead of machine translation. Thank you.

Thanks for your reply, @williamsp.

Currently we increased the limits to make sure translations can only be published when 70% or more of the initial machine translation has been modified by the user. That is, accepting only 30% or less of unmodified content. In this way users making a good use of machine translation are still able to use it.
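The rule described above can be sketched as follows. This is only an illustration under stated assumptions: the whitespace-style tokeniser, the function names, and the choice to measure the unmodified share as the fraction of machine-translated tokens that survive in the user's text are all hypothetical, and this is not the actual Content Translation code.

```python
import re
from collections import Counter

# Assumed limit from the comment above: at most 30% of the MT output may remain unmodified.
UNMODIFIED_LIMIT = 0.30


def tokenise(text):
    """Split text into lowercase word tokens (hypothetical tokeniser)."""
    return re.findall(r"\w+", text.lower())


def unmodified_fraction(mt_text, user_text):
    """Fraction of machine-translated tokens that still appear in the user's text."""
    mt_tokens = Counter(tokenise(mt_text))
    user_tokens = Counter(tokenise(user_text))
    total = sum(mt_tokens.values())
    if total == 0:
        return 0.0
    # Multiset intersection: a token counts as kept at most as often as it occurs in both texts.
    kept = sum((mt_tokens & user_tokens).values())
    return kept / total


def can_publish(mt_text, user_text, limit=UNMODIFIED_LIMIT):
    """True when the unmodified share does not exceed the limit."""
    return unmodified_fraction(mt_text, user_text) <= limit
```

Under these assumptions, a translation that keeps 3 of 10 machine-translated words would pass, while one that keeps 4 of 10 would be blocked.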

Please, can you try making some translations and report how the limit is working for you? Let us know if you have any problem trying to publish a good translation. That would be very useful to adjust the limits further if needed.

Thanks!

It is good to hear that the adjusted threshold seems to work. Thanks for the feedback, @Mimihitam.

Before declaring this resolved, I'd recommend observing the content produced with the new configuration for a while and hearing some more feedback. The reason is that Wikipedia content is very diverse, and what works for some articles may require adjustments to support other cases. For example, the current threshold may or may not work for articles with abundant math formulas or Latin names of plants, which may require fewer modifications of the original content for a good translation.

I think it is better to focus first on making sure that articles created using machine translation as a starting point can be published when edited appropriately, and then, consider which approach to start is the most useful for most editors to make it the default (machine translation or copying the source contents).

@Pginer-WMF : Hi, I just tried to publish my translation (Asuka Langley Soryu from en to id, Translation ID: 648911), but it won't let me, even after I changed the publish destination to my own user namespace. The "Publishing..." button stays grey after clicking, and the Developer Console says this:

In Firefox:

TypeError: string is undefined load.php:2293:192
  jQuery:
  tokenise
  getUnmodifiedMTPercentageInTranslation

In Chrome:

Uncaught TypeError: Cannot read property 'match' of undefined
  at Object.mw.cx.TranslationTracker.static.tokenise (<anonymous>:872:206)
  at MwCXTranslationTracker.<anonymous> (<anonymous>:881:901)
  at Array.forEach (<anonymous>)
  at MwCXTranslationTracker.mw.cx.TranslationTracker.getUnmodifiedMTPercentageInTranslation (<anonymous>:881:774)
  at MwCxTranslationController.mw.cx.TranslationController.checkForMTAbuse (<anonymous>:869:815)
  at MwCxTranslationController.mw.cx.TranslationController.publish (<anonymous>:866:183)
  at VeInitMwCXTarget.oo.EventEmitter.emit (<anonymous>:68:486)
  at VeInitMwCXTarget.ve.init.mw.CXTarget.onPublishButtonClick (<anonymous>:347:461)
  at VeUiCXPublishTool.ve.ui.CXPublishTool.onSelect (<anonymous>:357:477)
  at OoUiBarToolGroup.OO.ui.ToolGroup.onMouseKeyUp (<anonymous>:126:569)

Besides that, when I copy a translation that has already passed the threshold and paste it into a new paragraph, it is detected as 100% unmodified, and I cannot edit the pasted text anymore. When I reload the page, it reverts. So it doesn't let me move part of a paragraph (split it) into a new paragraph. Can you check it? Thank you.
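The console errors above suggest that a tokeniser was handed an undefined string. The sketch below is a Python analogue of that failure mode, not the extension's JavaScript code; every name in it is hypothetical. It shows how a single section with missing text can abort a whole-document calculation, and how skipping such sections avoids the crash. Whether skipping is the right fix is an assumption made for the example.

```python
import re


def tokenise(text):
    # Analogous to calling .match on undefined in JavaScript:
    # re.findall raises TypeError when text is None.
    return re.findall(r"\w+", text)


def count_tokens_unsafe(sections):
    """Fails with TypeError as soon as one section has no text."""
    return sum(len(tokenise(section)) for section in sections)


def count_tokens_safe(sections):
    """Skips sections whose text is missing instead of crashing."""
    return sum(len(tokenise(section)) for section in sections if isinstance(section, str))


# Example data: the middle section stands in for a wrongly restored, empty section.
sections = ["Asuka adalah pilot", None, "Unit 02"]
```

With this data, `count_tokens_unsafe(sections)` raises `TypeError`, while `count_tokens_safe(sections)` counts only the two sections that have text.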

@Pginer-WMF : Hi, because I still can't publish the Asuka Langley Soryu translation mentioned before, I tried to translate another article: en:Banana flour to id:Tepung pisang (Translation ID: 653570). Once again, I cannot publish it, but this time there is a visible error in the UI: Your translation contains 77% unmodified text. So, I think the overall threshold is too high... (But on the home page, when I hover over the article in the list, it shows only 26% from machine translation. Why are the numbers different?)

Each of the paragraphs already passes the threshold (no more warnings), but I get the previous error when I try to publish, even if the destination is my user namespace. Is there any way to publish it? Copying directly from CX into VE gives me some weird span tags and escaped characters. How can I extract the wikitext from CX so I can publish it manually?

Thanks.

Thanks for the feedback, @williamsp.
Based on your comments, I proposed to increase the threshold from 30% to 40% (T222782), and reevaluate. This should provide a bit more room for cases where the machine translation was good enough for not requiring very heavy changes, and avoids the need of less clean workarounds.

Currently there are two different thresholds: one for the whole document, which prevents you from publishing, and one for individual paragraphs, which is just a warning (it lets you publish but adds the contents to a tracking category for the community to review).

I also proposed clarifying the messaging to reduce confusion (T222779). This would help to distinguish each case, although it will still be possible to have no individual paragraph with more than 80% of unmodified content (the threshold for the per-paragraph warning) and still have 70% of unmodified content for the whole document.
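To make the interaction between the two limits concrete, here is a small sketch. It assumes each paragraph can be summarised by two hypothetical counts (total tokens and unmodified tokens) and uses the figures mentioned in this thread: a per-paragraph warning above 80% and a whole-document block above 30%. The function and constant names are illustrative, not taken from the extension.

```python
PARAGRAPH_WARNING_LIMIT = 0.80  # per-paragraph warning, publishing still allowed
DOCUMENT_BLOCK_LIMIT = 0.30     # whole-document limit, publishing blocked above it


def check_translation(paragraphs, document_limit=DOCUMENT_BLOCK_LIMIT):
    """paragraphs: list of (total_tokens, unmodified_tokens) tuples.

    Returns (blocked, warned_paragraph_indexes, document_share).
    """
    warned = [
        index
        for index, (total, unmodified) in enumerate(paragraphs)
        if total and unmodified / total > PARAGRAPH_WARNING_LIMIT
    ]
    total_tokens = sum(total for total, _ in paragraphs)
    unmodified_tokens = sum(unmodified for _, unmodified in paragraphs)
    share = unmodified_tokens / total_tokens if total_tokens else 0.0
    return share > document_limit, warned, share


# Every paragraph is 70% unmodified: none triggers the 80% warning,
# yet the document as a whole is 70% unmodified and is blocked.
blocked, warned, share = check_translation([(100, 70), (50, 35), (50, 35)])
```

This is the case described above: no paragraph-level warning is shown, but the document-level check still prevents publishing.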

We have plans for not applying unmodified content restrictions to translations published to the user namespace (T215403). This would allow some of the more advanced workflows by experienced editors as detailed in the ticket. However, we would like to adjust the thresholds first. Having the limits applied to the user namespace is very convenient for testing purposes (e.g., checking whether a bad translation could be published with no risk of it leaking to the main namespace).

@Pginer-WMF : Thank you for creating solutions for the second problem (T219851#5166329) by lowering the threshold for the whole document and removing the limitation for the user namespace. So, until those solutions are released, I guess I can only leave my translation there without being able to publish it, right?

Besides that problem, did you already check my first problem (T219851#5163519)? There is no visible error or warning. When I click "Publish", it just gets stuck. But in the console I found "Uncaught TypeError: Cannot read property 'match' of undefined", as mentioned in the earlier post. Do you know why this happened and how to resolve it?

NB: You can move this problem to a new task, if you think it is unrelated to this task.

Thanks.

As a compromise, please set "copying the source content" as default, because the problems usually stem from new users.

Thanks. It is useful to hear that user expertise may be an aspect worth considering. We'll take that into account for both analysing the results and adjusting the thresholds.

Regarding the default values ("Copying from source" or using machine translation), I'd recommend making that decision after we complete the process of adjusting the thresholds (there are still a few changes in progress: T221359, T222782, T222779). That way, the community can make a decision based on a better-adjusted version of machine translation support and the contents created with it (i.e., how many translations were created with/without machine translation, and how many of those were deleted). Once that evaluation is complete, we can set the default approach to the one that works best for Indonesian Wikipedia.

Waiting is one option. That would be very useful to test whether the new adjustments work as expected for such case. Considering the review and deployment cycles, the changes should be available in about a week.

The alternative is to edit the translation further so that it contains more text written by you and less of the original text. But if the translation has been edited to the point where you consider it reads well in Indonesian, I think it is better to wait and use it as a test case.

I created a new task for this one (T222882). I was not able to reproduce it at first, but experienced the error later. The exact cause was not obvious after the initial tests, but I think there is enough info in the ticket for engineers to investigate further. Feel free to add any other details you consider relevant in the ticket.

Thanks!

The console error that you see is the reason publishing is stuck. @Pginer-WMF created T222882 to track the error, and I think the wrong restoring of sections described in T222905 is the cause. From your previous comments, it seems you have been going back to the dashboard and loading the draft translation again. Can you confirm that sections were restored wrongly, as described in T222905, when you encountered the error you're talking about? Wrong section restoration means that there is an empty section on top, against which some content is restored, as you can see here:

restoring-fail.png (709×1 px, 55 KB)

Hi, @Petar.petkovic. Although the error mentions getUnmodifiedMTPercentageInTranslation, it seems that this error is not related to the stricter MT threshold on idwiki, since it also happened on eswiki (T221930#5169176). So, I think we can discuss this problem further in the new task created by @Pginer-WMF: T222882.

Since the first set of adjustments was applied on May 1st, these are all the translations published by editors not directly involved in this conversation in these 10 days:

Some considerations:

  • These articles were published with different percentages of machine translation, from 0% to 30%.
  • Most (3 out of 4 articles) were started using machine translation, while only one was not using it at all.
  • Initial assessment by a native Indonesian speaker considered the content of these articles to be understandable with only minor issues, making no distinction in quality among them (including the article published with no machine translation at all and the one with 30% machine translation). Note that the analysis did not cover the last one in the list, since it was published more recently.
  • The volume of content produced (4 articles in 10 days) seems manageable for reviewers. Note that there were 25 additional translations by editors directly involved in this conversation that we consider to be motivated by testing the feature, not representing the general translation activity (and we can assume they are good translations).
  • No translation has been deleted so far. This is an indicator of the content quality being acceptable to the community, although it is not always accurate, since editors may not have had a chance to review some of these new pages yet, and articles can be deleted or kept for different reasons.

Four articles are too few to generalize these observations. We need to keep observing the content produced, but I wanted to share the first impressions so that anyone can add their own point of view and help us analyze this further.

Thanks!

Recently a user reported being prevented from publishing a perfectly reviewed translation that still had 68% machine translation. I've asked for more details to identify whether that may be caused by the topic belonging to a particular knowledge area where machine translation requires fewer changes.

Hi, I am actually having trouble with the threshold too. But I use the workaround mentioned in T222882#5169992: adding a lot of "test" words to one of the translated paragraphs, publishing, then re-editing the published article to delete the "test"s. Since June, only 1 article (Kukusan Bambu) out of the 7 I created using the tool passed the threshold. You can check it here.

Thanks for the feedback. Can you tell us what percentage of unmodified machine translation was preventing you from publishing those articles? Based on that we can adjust the thresholds: T228971: Adjust the threshold for Indonesian to prevent publishing when overall unmodified content is higher than 70%

Sorry, but I have already forgotten the exact unmodified MT percentage for each article. But if I remember correctly, it was around 40%-50%.

Pginer-WMF claimed this task.

Six months have passed since the last adjustments to the limits. Comparing the six months before (January – June 2019) and after (August 2019 – January 2020), the deletion ratio for translations created with Content Translation has decreased significantly, from 13% to 3%, while the deletion ratio for articles created from scratch remained stable at ~8% in both periods:

|  | Before (January – June 2019) | After (August 2019 – January 2020) |
| --- | --- | --- |
| Published translations | 1K | 607 |
| Deleted translations | 134 (13%) | 19 (3%) |
| Published non-translations | 49.7K | 42.5K |
| Deleted non-translations | 3.9K (7.8%) | 3.4K (8%) |
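As a quick check of the percentages in the table, the deletion ratio can be recomputed from the rounded counts. This is a minimal sketch: the function name is hypothetical, and "1K", "49.7K", and similar values are taken at face value as 1,000, 49,700, and so on, so the results are approximate.

```python
def deletion_ratio(deleted, published):
    """Deleted items as a percentage of published items, rounded to one decimal."""
    return round(100 * deleted / published, 1)


# Counts as reported in the table (rounded values, so results are approximate).
translations_before = deletion_ratio(134, 1000)
translations_after = deletion_ratio(19, 607)
articles_before = deletion_ratio(3900, 49700)
articles_after = deletion_ratio(3400, 42500)

print(translations_before, translations_after)  # 13.4 3.1
print(articles_before, articles_after)          # 7.8 8.0
```

The recomputed values agree with the rounded percentages reported in the table.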

Given the above numbers and the feedback received during this period, it seems that the current adjustment is preventing a significant number of low-quality translations from being published. So I'll mark the ticket as resolved. If there are other issues in the future, feel free to reopen it or create a new one.

Thanks to everyone involved for their collaboration!