
Accept CAPTCHA responses with diacritics removed
Open, Normal, Public

Description

With a fix for bug 5309, such as the one discussed at https://gerrit.wikimedia.org/r/121255/, it’s entirely possible that a user might get a CAPTCHA with illegible diacritics. Diacritics in Latin alphabets can look identical to one another when distorted, for example i í ì ỉ, or ó ơ.

For better usability, ConfirmEdit should display a CAPTCHA containing diacritics but require the user to enter the characters without diacritics. There’s a third-party module called Unidecode that does a decent job of accent folding.
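As a rough illustration of what accent folding means here, the sketch below uses only Python's standard `unicodedata` module rather than Unidecode. The function name `strip_diacritics` is made up for this example and is not part of ConfirmEdit or Unidecode; it simply decomposes each character and drops the combining marks.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks after canonical decomposition (NFD).

    Hypothetical helper for illustration only; not ConfirmEdit code.
    """
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is non-zero for combining marks such as
    # acute, grave, hook above, and horn.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


print(strip_diacritics("i í ì ỉ"))  # i i i i
print(strip_diacritics("ó ơ"))      # o o
```

This approach only removes marks that Unicode treats as separable. A letter such as đ has no canonical decomposition and passes through unchanged, so a real implementation would need an extra mapping (or a library like Unidecode) for those cases.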

One tradeoff would be that such CAPTCHAs might be easier for a bot to crack. There’s also the issue that a character like Ê might be considered a base letter in one language (as in Vietnamese) but a letter with a diacritic in another (Portuguese).


Version: unspecified
Severity: enhancement
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=63217
https://github.com/mitsuhiko/babel/issues/89

Details

Reference
bz63216

Event Timeline

bzimport raised the priority of this task from to Normal.
bzimport set Reference to bz63216.
bzimport added a subscriber: Unknown Object (MLST).
mxn created this task. Mar 28 2014, 8:36 AM

I'm not sure about the "only" part: for usability it's better if the system accepts the answer with or without diacritics. Otherwise I may enter all the diacritics correctly and have my solution rejected for no reason.

When implementing this we're probably going to use some standard Unicode solution for case folding and diacritics/accent folding.
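A minimal sketch of that kind of lenient comparison, assuming Python's built-in `str.casefold()` for case folding and `unicodedata` for dropping combining marks. The names `fold` and `answer_matches` are invented for this example. Because both sides are folded before comparing, a response with every diacritic typed correctly and a response with none are both accepted.

```python
import unicodedata


def fold(text: str) -> str:
    """Case-fold, then remove combining marks, for lenient comparison."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def answer_matches(expected: str, response: str) -> bool:
    """Compare a CAPTCHA word and a user response after folding both."""
    return fold(expected) == fold(response.strip())


print(answer_matches("Tiếng Việt", "tieng viet"))  # True
print(answer_matches("Tiếng Việt", "Tiếng Việt"))  # True
print(answer_matches("Tiếng Việt", "tieng vien"))  # False
```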

Yes, this is absolutely necessary. Not only might the diacritics be illegible, but some users may not have a keyboard that lets them enter diacritics at all.

I am not sure how to implement the folding, and it may even be language-dependent. For example, users may enter 'ö' as 'o' or as 'oe', or 'đ' as 'đ', 'ð', 'd' or 'dj'. A possibility is to simply avoid words with diacritics, which should be possible for most languages.

In the future, when non-Latin CAPTCHAs are implemented, the same should apply across alphabets (e.g. it should be possible to enter a Cyrillic CAPTCHA in the Latin alphabet).

mxn added a comment. Aug 28 2016, 9:59 PM

A possibility is to simply avoid words with diacritics, which should be possible for most languages.

That will probably help with Western European languages. Unfortunately, it would also exclude the vast majority of Vietnamese words, leaving a list too short to serve as an effective CAPTCHA word list.

Libraries like ICU have diacritic-folding facilities that should make it possible to accept o for ở and both o and oe for ö, as long as the source language is known (which it is in this case). Other scripts would be supported to some extent, though transliteration is a far messier problem than diacritic folding.
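To make the language-dependent part concrete, here is a standard-library sketch that does not use ICU. It expands a CAPTCHA word into the set of spellings to accept, given the source language. The table `LANGUAGE_VARIANTS`, its language keys, and the function names are all assumptions for illustration; the entries are taken from examples mentioned in this task, not from any authoritative folding data.

```python
import itertools
import unicodedata

# Hypothetical per-language alternatives, beyond simply dropping marks.
LANGUAGE_VARIANTS = {
    "de": {"ö": ("o", "oe")},
    "hr": {"đ": ("d", "dj")},
    "vi": {"đ": ("d",)},
}


def strip_marks(text: str) -> str:
    """Remove combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accepted_answers(word: str, lang: str) -> set:
    """Return every lowercase spelling to accept for a CAPTCHA word."""
    table = LANGUAGE_VARIANTS.get(lang, {})
    word = unicodedata.normalize("NFC", word.lower())
    # Per character: the original, its mark-stripped form, and any
    # language-specific alternatives from the table.
    choices = [
        sorted({ch, strip_marks(ch), *table.get(ch, ())}) for ch in word
    ]
    return {"".join(combo) for combo in itertools.product(*choices)}


def response_ok(word: str, lang: str, response: str) -> bool:
    """Check a user response against the accepted spellings."""
    normalized = unicodedata.normalize("NFC", response.strip().lower())
    return normalized in accepted_answers(word, lang)


print(sorted(accepted_answers("schön", "de")))  # ['schoen', 'schon', 'schön']
print(response_ok("đường", "vi", "duong"))      # True
```

The set grows multiplicatively with the number of foldable characters, which is tolerable for short CAPTCHA words but would need a smarter comparison for long strings. It also accepts mixed answers, where one character is typed with its diacritic and another without.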

Restricted Application added a subscriber: Florian. Aug 28 2016, 9:59 PM
In T65216#2589692, @mxn wrote:

Libraries like ICU have diacritic-folding facilities that should make it possible to accept o for ở and both o and oe for ö

Yes please. (Can we update the task summary? That "only" makes me shudder.)

Platonides renamed this task from Only accept CAPTCHA responses with diacritics removed to Accept CAPTCHA responses with diacritics removed. Apr 18 2017, 12:21 AM