
Disable machine translation in Content Translation Tool on Lithuanian Wikipedia
Open, Medium, Public

Description

Hello,

Our community has decided that machine translation in Content Translation Tool should be completely disabled.

Hopefully, our decision will not be overridden.

Event Timeline

ngkountas triaged this task as Medium priority. Thu, May 9, 8:25 AM
ngkountas moved this task from Needs Triage to MT on the ContentTranslation board.

Thanks for sharing the request. We want to be especially cautious with requests to disable a feature. While some editors may find issues with it, the fact that it is frequently used suggests that other editors find it useful.
We are happy to adjust the tool so that it better serves the whole community.

To add some context, I'll share some data. Data never shows the whole picture, but combined with your knowledge of the Lithuanian community it can give us a better understanding (and serve as a reference after making any adjustment).

Lithuanian Wikipedia is one of the few cases where articles created with Content Translation are deleted more often than those started from scratch without the tool. For example, during 2023, 28% of the articles created with Content Translation were deleted, while only 5% of the articles created from scratch were deleted. This is not the case for most wikis, and it would be great to understand which particular factors contribute to this (from the request, I assume that the low quality of machine translation may be one, but it would be great to hear more details).

Looking at the recent activity (last 2 years), I've noticed a particular spike in February 2023 with over 100 translations, whereas the number of translated articles in other months is usually in the 20-30 range. Does anyone know what may have caused this spike? Were any big events or campaigns happening at the time?

translations-across-all-wikis-2024-05-09T09-37-33.342Z.jpg (384×774 px, 73 KB)

Looking at the distribution by user edit count, this activity was driven by experienced users (users with more than 10K edits).

monthly-translations-by-user-edit-count-bucket-2024-05-09T09-35-29.614Z.jpg (376×815 px, 81 KB)

Based on the above, it makes sense to apply some adjustments. We can consider:

  • Evaluate MinT to check whether it provides better translation quality than the current default translation service (Google). You can check translations in the MinT test instance and read more about MinT on the project page. For Lithuanian, MinT uses the NLLB-200 model, which has provided better results for some communities such as Icelandic, but that may or may not be the case for Lithuanian. Any input on how MinT works for Lithuanian, how it compares to Google Translate, and whether the quality is good enough to help avoid low-quality translations will be appreciated.
  • Prevent misuse during events. If we identify that low-quality content is generated as part of contests or similar events, we can try to improve how the tool supports those cases instead of disabling features for all users across the board. We are working on an approach to limit the fast creation of unreviewed translations (T331023) that can help in this context. We can pilot this for Lithuanian if the community thinks it may be helpful for their specific issues (is there a sense of whether the low-quality translations come from a small group of users producing translations quickly in a short amount of time?).
  • Restrict access to machine translation. It would be very useful to understand the effect of encouraging users to do more manual translation. I'd recommend doing so gradually, to minimize the impact on those making good use of the tool and to learn more with each specific change. One step could be to make the limit system stricter. The limit system requires users to modify at least a set percentage of the automatic translation. The threshold could be adjusted from accepting 100% unedited machine translation down to 0% (equivalent to forcing users to write content from scratch without machine translation). We can also consider offering machine translation as a non-default option to communicate that its use is discouraged.
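As a rough illustration of the kind of threshold described in the last point, here is a minimal sketch. It is not the actual Content Translation implementation: the function names `unmodified_ratio` and `is_allowed`, the word-level comparison, and the `max_unmodified` parameter are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def unmodified_ratio(machine_text: str, published_text: str) -> float:
    """Share of machine-translated words that survive in the published text.

    Hypothetical helper: compares whitespace-separated words and returns
    a value between 0.0 (nothing kept) and 1.0 (everything kept).
    """
    mt_words = machine_text.split()
    published_words = published_text.split()
    if not mt_words:
        return 0.0
    matcher = SequenceMatcher(a=mt_words, b=published_words, autojunk=False)
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return kept / len(mt_words)


def is_allowed(machine_text: str, published_text: str, max_unmodified: float) -> bool:
    """Accept a translation only if the unmodified share is within the limit.

    max_unmodified = 1.0 accepts fully unedited machine translation;
    max_unmodified = 0.0 accepts only text that keeps none of it.
    """
    return unmodified_ratio(machine_text, published_text) <= max_unmodified
```

Under these assumptions, lowering `max_unmodified` step by step (for example from 1.0 towards 0.0) would correspond to the gradual tightening suggested above, with each step giving a chance to observe the effect before going further.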

Feel free to share any additional details. They will be very useful for supporting not only Lithuanian but also other communities that may be in similar situations.
Thanks!

@Pginer-WMF: Thanks for a more detailed look into this. Let me provide some context behind this decision of the community:

  • The vast majority of articles produced using the Content Translation tool are of very poor quality. They are mostly created by new users who put little to no effort into making the content meaningful. It's highly unlikely to be a training or education problem: usually such auto-translated articles are posted without any editing or constructive follow-up. Their authors are arguably "one-day editors" who just played with the UI and walked away. Some try to flood the wiki with a bunch of articles (hence your >5 edit count) without any meaningful effort to edit them. As far as I know, we currently have one editor who has been producing good-quality articles using the Content Translation tool. @Homo_ergaster can better comment on the spikes and patterns.
  • The value of such auto-translated articles is generally very low, because: 1) they are not coherent, and often have broken sentence structure and wildly off terminology; 2) it often takes less time to rewrite such articles than to fix them; 3) other editors and administrators waste time deleting such articles or debating whether they can still be salvaged. We have a {{hopeless}} template (literally) for them.
  • As you noted yourself, the data doesn't show the whole picture. I can give several examples of why the data may be skewed: 1) some articles remain because other editors put in the effort to rewrite them; 2) some articles remain because they get cut down to a stub (just a few sentences); 3) some are marked as "hopeless" and just stay like this for years. After all, it depends on the editors and administrators who volunteer to do something about them. Some editors have observed that other wikis have a lot of auto-translated articles which barely make any sense. Therefore, a lower deletion ratio doesn't necessarily mean Content Translation produces better results there: it may merely mean less proactive effort by that community. In other words, a bunch of barely meaningful text produced in those Wikipedias just stays there and nobody cares. This is especially the case with small Wikipedia communities.

The request is not necessarily about complete removal of the tool. However, encouraging new users to use it is counter-productive and we argue that it should at least be hidden or disabled by default. We can still retain the option to enable the tool in the "Preferences -> Appearance" section. It would actually be useful for our sole productive user of the tool. In general, those users who can contribute meaningful content are also more likely to figure out how to enable some tools.

P.S. From a quick glance, for the sample text, the NLLB-200 model produces a significantly worse result than Google Translate (it didn't even spell "Jazz" correctly).

Some auto-translated articles are created by foreigners who often write about their villages, local celebrities and so on. I'm not sure about the spike you mentioned, but it's likely that a significant part of the articles were written by users who don't speak Lithuanian. Also, I'm under the impression that many auto-translated articles are written by children or teenagers who just play with the UI and don't have any interest in sticking around and actually learning something.

Indeed, we pretty much have only one editor who has been producing good enough articles using the Content Translation tool. Given the circumstances, it would be reasonable if the tool was disabled by default.

And even that user (https://lt.wikipedia.org/wiki/Naudotojas:ArunasG) rewrites a lot of text before posting it (compare https://en.wikipedia.org/w/index.php?oldid=1211409903 and https://lt.wikipedia.org/w/index.php?title=Messerschmitt_Me_321_Gigant&oldid=7267980), then edits it a lot, and the result is very average (plenty of syntax and hyperlink errors which others need to fix).

The translator is useful only in two cases. The first is when you have very similar languages and only need to change the text a little (I used it to translate articles from Lithuanian to Samogitian (:sgs:/bat-smg), which is a dialect or a very close language). But Lithuanian has no large similar languages: English, Russian, German, Spanish, and all the other big languages from which one could translate articles are significantly different, and the closest one, Latvian, is quite different, and hardly anyone speaks it in Lithuania. The second case is importing any sort of big data sets, like tables and templates, where you want to copy formatting and translated hyperlinks. But even that doesn't work well: it is impossible to transfer only the part of an article which contains the data set, and it doesn't work if your article is in the List namespace.

To sum up, the translator can sometimes be useful, but only to experienced users, and it does not help new or random users at all. As I monitor many small Wikipedias, I see that it is mostly used by spammers who want to promote things in languages they don't speak, or by irresponsible users who want to boost their Wikipedia's article count. I had to delete a bunch of such bad auto-translations from Guarani Wikipedia (:gn).