
Explore using Google's CLD3 or similar in AbuseFilter to suggest likely input content language
Open, Needs Triage, Public

Description

Sometimes users add content in the 'wrong' language for the wiki they are editing. Most often this is English-language spam added to a non-English wiki, but it can also be people not fully understanding that there are, for example, multiple language editions of Wikivoyage, and that content about Rome written in Italian belongs on the Italian Wikivoyage, not the English one. It would be great for AbuseFilter to provide a detected_edit_language variable that filters could compare to wiki_language, perhaps to warn users that their edit might be a mistake. On some wikis, e.g. Commons, content in multiple languages is correct, so this should be a feature that individual wikis can choose to use.

Google's CLD3 library might be a relatively cheap way of assessing the likely language of the input?
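
To make the proposal a bit more concrete, here is a rough Python sketch of the comparison logic only; it is not how AbuseFilter works today. Everything in it is an assumption for illustration: the `Detection` record, the `detect` callable (which a real integration might back with CLD3 or a similar library), the `multilingual_wiki` flag, and the length and probability thresholds. The idea is that detected_edit_language stays empty unless the detector is reasonably confident, so short or ambiguous edits never trigger a warning.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Detection:
    """Hypothetical result shape for a language detector."""
    language: str       # language code such as "en" or "it"
    probability: float  # detector confidence in [0, 1]
    is_reliable: bool   # detector's own reliability flag


# Any function mapping text to a Detection (or None) can be plugged in.
Detector = Callable[[str], Optional[Detection]]


def detected_edit_language(
    added_text: str,
    detect: Detector,
    min_chars: int = 50,           # assumed threshold, would need tuning
    min_probability: float = 0.7,  # assumed threshold, would need tuning
) -> Optional[str]:
    """Return the primary language subtag, or None if not confident."""
    text = added_text.strip()
    if len(text) < min_chars:
        return None  # too little text to guess reliably
    result = detect(text)
    if result is None or not result.is_reliable:
        return None
    if result.probability < min_probability:
        return None
    # Compare on the primary subtag only, so "en-GB" is treated as "en".
    return result.language.split("-")[0].lower()


def should_warn(
    added_text: str,
    wiki_language: str,
    detect: Detector,
    multilingual_wiki: bool = False,
) -> bool:
    """True when the edit confidently looks like a different language."""
    if multilingual_wiki:
        return False  # e.g. Commons, where mixed languages are expected
    detected = detected_edit_language(added_text, detect)
    if detected is None:
        return False
    return detected != wiki_language.split("-")[0].lower()


if __name__ == "__main__":
    # Stand-in detector that always answers "en"; a real one would
    # call into CLD3 or similar.
    def fake_detector(text: str) -> Detection:
        return Detection(language="en", probability=0.95, is_reliable=True)

    sample = "Buy cheap watches online today with free worldwide shipping!"
    print(should_warn(sample, "it", fake_detector))
```

Keeping the detector behind a plain callable means the warning logic can be tested without the native library installed, and the detector could be swapped for another one if CLD3 turns out not to be a good fit.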