Localize captcha images
OpenPublic

Description

The captcha software should generate captchas in languages other than English at
non-English projects, depending on the locale. I've seen some generated captchas
at the Vietnamese Wikipedia that would definitely confuse Vietnamese-speakers
(can't remember the words exactly), because of things like r's and n's smooshed
up right next to each other, so it looks like an m, except to an English user
who happens to know a word that has "rn" instead. The user might have to *guess*
because the English words really don't follow Vietnamese spelling rules. We've
recently had users complaining to the sysops of not being able to read captcha
images, presumably for this reason.

An advantage to localizing the captchas would be that it might reduce the impact
of spambots at non-English projects. As far as I know, there isn't yet a
captcha-defeating bot that understands Vietnamese or Basque or Quechua.

For now, I'm only proposing localizing for most languages that use the Latin
alphabet, because requiring users to respond to a captcha in Thai or Arabic
would exclude a lot of legitimate interwiki users. And users of other scripts
tend to have the means of entering in Latin-based characters. Also, for
languages that use diacritical marks, we should generate the words with or
without the marks (not sure which) and modify
[[MediaWiki:Captcha-createaccount]], asking the user to enter in the word
without diacritical marks of any kind.

Once Latin-based alphabets are out of the way, it'd be a good idea to localize
for other writing systems as well, but provide a Latin-based alternative, per
Neil Harris' suggestion [1].

These localized captcha strings should *not* be stored in the MediaWiki:
namespace, nor anywhere easily accessible to the public, because bot writers
could easily write language-aware bots using such information. For wordlists, we
could start by using open-source lexicons, such as OpenOffice.org's [2]. We
should also contact ambassadors of non-English projects, asking them for help
compiling sufficiently long lists of their own.

[1] http://mail.wikimedia.org/pipermail/wikien-l/2006-March/042263.html
[2] http://lingucomponent.openoffice.org/spell_dic.html


Version: unspecified
Severity: normal
URL: https://www.mediawiki.org/wiki/CAPTCHA
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=41675
https://bugzilla.wikimedia.org/show_bug.cgi?id=62960
http://code.google.com/p/googlefontdirectory/issues/detail?id=297
https://bugzilla.osafoundation.org/show_bug.cgi?id=13081

bzimport added a subscriber: wikibugs-l.
bzimport set Reference to bz5309.
mxn created this task.Via LegacyMar 21 2006, 8:41 PM
bzimport added a comment.Via ConduitJun 20 2006, 12:43 PM

mimouni.mohamed wrote:

If the captcha source code is in PHP, the functions that generate images
from character strings only support ANSI encoding. That means that
Arabic characters, for example, cannot be displayed.

bzimport added a comment.Via ConduitApr 20 2007, 9:00 PM

adziura+wiki wrote:

I think that Polish Wikipedia users want localized captcha images. It would be
better for new users.

bzimport added a comment.Via ConduitMay 1 2008, 9:22 PM

eip wrote:

This would be very useful in Russian Wikipedia too. Of course the words have to be in Cyrillic alphabet.

gpaumier added a comment.Via ConduitDec 28 2009, 9:26 PM

Removed URL since it was not relevant to this bug (probably due to a rebuilding of the archives)

Amire80 added a comment.Via ConduitNov 1 2010, 2:55 PM

I am surprised that it came up only now, but now there is demand for this in the Hebrew Wikipedia, too.

Bawolff added a comment.Via ConduitNov 1 2010, 6:44 PM

Created attachment 7775
(naive) patch to make captcha.py work with unicode

(In reply to comment #1)

If the captcha source code is in PHP, the functions that generate images
from character strings only support ANSI encoding. That means that
Arabic characters, for example, cannot be displayed.

With some very minor changes to the script this is not true. For example I just generated a bunch of Hebrew captchas (just taking random words off the main page of [[he:]]) and some Ukrainian captchas (because it was the only non-Latin language that had a word list that's just an apt-get away).

My very minor changes included disabling the regex check that words don't match /[^a-z]/. Presumably other languages would need equivalent checks, and checks to avoid words with diacritical marks (since those would, I presume, be hard to see in captchas).

p.s. I don't know Python, so the very minor changes in my example might not be the "proper Python way".
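
For illustration only, here is a minimal sketch of what a script-neutral replacement for the /[^a-z]/ check might look like. It is not taken from the attached patch; the helper name `is_usable_word` and the length limits are assumptions made up for this example.

```python
import unicodedata

def is_usable_word(word, min_len=4, max_len=8):
    """Return True if `word` looks safe to render in a captcha.

    Hypothetical helper: the name and the length limits are illustrative.
    """
    if not (min_len <= len(word) <= max_len):
        return False
    # Decompose so that precomposed letters such as "é" become "e" + U+0301.
    decomposed = unicodedata.normalize("NFD", word)
    for ch in decomposed:
        category = unicodedata.category(ch)
        # Reject combining marks (Mn/Mc/Me), i.e. diacritics.
        if category.startswith("M"):
            return False
        # Accept only lowercase or caseless letters (Ll/Lo); this drops
        # digits, punctuation, spaces and capitalised words.
        if category not in ("Ll", "Lo"):
            return False
    return True
```

Note that NFD also splits letters such as Cyrillic "й" into a base letter plus a mark, so a check like this would wrongly reject ordinary Ukrainian or Russian words; a real filter would probably need per-language exceptions.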

Attached: captcha.diff

Bawolff added a comment.Via ConduitNov 1 2010, 6:53 PM
*** Bug 19229 has been marked as a duplicate of this bug. ***
bzimport added a comment.Via ConduitNov 6 2011, 6:28 PM

a.d.bergi wrote:

(In reply to comment #0)

For now, I'm only proposing localizing for most languages that use the Latin
alphabet, because requiring users to respond to a captcha in Thai or Arabic
would exclude a lot of legitimate interwiki users.

We also could use the uselang attribute (and user setting) instead of locale, then this wouldn't be a problem. But I guess the bigger problem then is to find a captcha generator for exotic alphabets.

bzimport added a comment.Via ConduitDec 12 2011, 7:00 PM

xenondwb wrote:

I have an idea how this problem could be solved:

MediaWiki should have a default stock of words, in case the wiki doesn't contain enough words (e.g. 500 words).
Then, every time a captcha is to be displayed, a script fetches a random article and picks two random words from it. These words will be in the target language, because they come from articles in the language the user wants.
Then a script would place those two words onto an image, distort them a bit and display them to the user.
The user would now have the task of solving the captcha.

But there are some problems:

  • As mentioned: if the wiki does not have enough words, it can't create really random captchas. So a default stock of words should probably be included, but this could be a design problem.
  • Non-Unicode characters would also be a problem. It should probably be coded from scratch, instead of merging five million totally different existing solutions.
  • For big pages this could be a performance problem.

And the biggest problem: it would take some time to create all this new code. Also, I don't know if that would really be better than the existing solution.

And there is one design consideration: this would only be a good solution for big wikis, because there it would be hard to predict the words selected for the captcha, as might be possible with smaller wikis.
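
The word-picking step described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `pick_captcha_words`, its parameters and the 4-letter minimum are invented here, and fetching the random article is left out entirely.

```python
import random
import re

def pick_captcha_words(source_text, fallback_words, count=2, min_pool=500, rng=random):
    """Pick `count` random words from wiki text (hypothetical helper).

    If the text yields fewer than `min_pool` distinct words, the default
    word list is mixed in, as suggested for small wikis.
    """
    # \w+ matches runs of Unicode word characters; isalpha() then keeps
    # only tokens made purely of letters (no digits or underscores).
    candidates = {
        token.lower()
        for token in re.findall(r"\w+", source_text)
        if token.isalpha() and len(token) >= 4
    }
    if len(candidates) < min_pool:
        candidates |= set(fallback_words)
    # Sort so that a seeded rng gives reproducible picks.
    return rng.sample(sorted(candidates), count)
```

Scripts that rely on combining vowel signs would need a different tokenizer, since this one splits words at every combining mark.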

bzimport added a comment.Via ConduitMay 25 2012, 3:16 AM

sumanah wrote:

Adding i18n keyword.

bzimport added a comment.Via ConduitMay 30 2012, 3:43 PM

everton137 wrote:

Hi, while working for WMF on the Wikipedia Education Program, I've seen a lot of new editors, most of them students, having a lot of difficulty with the English CAPTCHA while editing.

I think this is a very important issue for Wikipedia in other languages. I've changed its importance to "high".

555 added a comment.Via ConduitJun 1 2012, 2:40 AM

It's a shame that even single implementations are very backlogged.

Does the development team really think that the Vector skin and a WYSIWYG editing interface are what will help most with editor retention?

Somewhere I recently said that the language barrier had been solved on Wikimedia, leaving only the issue of the non-Wikipedia projects. But unfortunately I was very wrong.

When this bug was opened, the Wikimedia paid staff was very small. Now it's a bit larger. But still not a single word from any of the tech people, nor from the volunteer ones...

[[:m:User:555]]

Matanya added a comment.Via ConduitJul 24 2012, 12:59 PM

Where does this stand?

Rillke added a comment.Via ConduitJul 26 2012, 11:30 AM

(In reply to comment #12)
Yes, WYSIWYG, article feedback and MoodBar are far more important than some key-issues. The reason? Here it is: Jimmy and the remaining board and Sue are native English speakers so it isn't prioritized. It's not what they are seeing when they are editing Wikipedia. We prefer designing a nice new en.wp main page investing thousands of dollars into questionable campus ambassadors, ...

So even lots of other simple bugs will be never fixed.

Ironholds added a comment.Via ConduitJul 26 2012, 1:56 PM

I'm terribly sorry to see the delay with this :(.

Well, just to be clear, we've not designed a nice new en.wp page - that's a community decision! - and of the 10 board members, half are ESL speakers. Localisation and services to non-enlang projects are things we're focusing more and more on; we've got a dedicated internationalisation team, for example.

On the rest of your examples - I think there's some confusion here as to who does what. Localisation and bug-fixing the "core" software is divided between the internationalisation team and the "Platform" sub-department of Engineering. Things like the visual editor or the feedback tool are the responsibility of the Features Engineering team. So there isn't really one set of things being prioritised by staffers over the other, because they're each handled by different sets of people :).

A more likely issue is that, well, things get lost in Bugzilla :(. Furthermore, there are a lot more bugs than there are developer hours to deal with them - take a look at https://bugzilla.wikimedia.org/weekly-bug-summary.cgi?tops=10&days=365 to see what I mean. Compared to the profile of the software, we really don't have a massive engineering team overall - and that's not down to the board, that's down to our comparatively small budget organisation-wide, which they can't really do anything about.

However! if you'll look above you'll see that Sumana (our awesome Engineering Community Manager) has added the localisation keyword, which should bring this problem to the attention of the localisation team, and I'm going to do my best to make sure they're reached - either to deal with the request, provide some kind of ETA on dealing with it or, if they can't solve the issue, explain what the problem is. They're great people, and I'm confident as both a staffer and a long-term editor that this will get resolved one way or another :).

Solstag added a comment.Via ConduitJul 26 2012, 5:58 PM

Ni!

Thanks for the very informative message Oliver.

I think it is good for us to get really upset when a bug that *affects the
experience of every single new editor* in many Wikipedias has had no
meaningful progress after 6 years since being reported, despite several
comments here and even face-to-face to staff members.

At the same time, it is important for us to get the facts straight about who is
responsible for what, like you described.

However, "bugs get lost" is also not a good explanation for what goes on here.
It's not even an explanation at all.

There are only 17 MediaWiki bugs with equal or more votes than this one, and
that number only grows to 25 if considering every product on this bugzilla:
https://bugzilla.wikimedia.org/buglist.cgi?votes_type=greaterthaneq&query_format=advanced&list_id=132785&votes=24&resolution=---&resolution=LATER&resolution=DUPLICATE&product=MediaWiki

Some of those 17 don't even count as they are already solved or have equivalent
functionality implemented, but some partial issue keeps them from going away.

Yet some of those are, similar to this one, also in a completely stalled state
for no good reason, despite a lot of people contributing to point out how
important they are and suggest solutions. Red interwiki links is probably my
favorite (Bug #11).

Wikimedia's tech team needs to improve how they prioritize work based
on community input.

And the board is also at fault for not requiring or developing themselves a
clear policy about that.

My impression is that they might be comfortable relying mainly on commissioned
studies of usability and participation, overlooking that most of those are
statistically questionable or based on unrealistic assumptions. Not meaning
they are not useful, they are useful and necessary, just limited. They won't
reveal the whole story by themselves, and sometimes not even the crucial facts.

So here we are, despite continuous community input, six years into a relatively
simple bug that affects every single new editor of Wikipedia in several
languages.

Thanks again Oliver for replying and looking after, and Sumana, now let us
hope the right people get to read this.

Hugs,

Ni!

Ironholds added a comment.Via ConduitJul 26 2012, 6:07 PM

This is actually one of my prime concerns; that we prioritise primarily based on "how big a deal, technically, a bug is" rather than the potential impact on the community. Bugzilla has one metric, and it's largely used for technical importance. But I'm confident the new Bugmeister, whoever they will be, can start making progress in this area :). At the moment we're without a bugmeister completely (which may go some way to explaining how even highly-voted bugs are falling through the cracks, although I appreciate this is older than the bugmeister position).

Nemo_bis added a comment.Via ConduitJul 26 2012, 9:34 PM

Adding bug 32695 as blocker because it might be the solution, by fetching the correct Wikisource.

bzimport added a comment.Via ConduitJul 26 2012, 10:09 PM

sumanah wrote:

(In reply to comment #16)

Wikimedia's tech team needs to improve how they prioritize work based
on community input.

Yes, the WMF absolutely does need to do better at incorporating community input into our work prioritization. Guillaume Paumier, Rob Lanphier, and I presented a talk about this a few weeks ago: https://wikimania2012.wikimedia.org/wiki/Submissions/Transparency_and_collaboration_in_Wikimedia_engineering and I know Oliver and other folks have talked about and worked on it as well, but there's a ways to go.

On that more general topic, I strongly recommend that you join the wikitech-ambassadors mailing list https://lists.wikimedia.org/mailman/listinfo/wikitech-ambassadors and bring up your concerns from comment # 16, so we can talk about them in a group that includes more community members and Foundation folks, including Guillaume and Rob.

But on this particular issue (localising CAPTCHAs) I'm cc'ing Alolita Sharma, Director of Engineering (Internationalization and R&D), and Siebrand Mazeland, Product Manager of the Localisation team, hoping for their input.

Thanks, Al-Scandar Solstag!

Bawolff added a comment.Via ConduitJul 27 2012, 12:42 PM

Hmm I like the idea of comment 9.

Some issues:
*Swear words - people get angsty when "fuck", etc is in their captcha. (This is probably a minor consideration)
*complex characters - Unicode characters in and of themselves are not a problem. (Some wikis have words not in their native script, but that's the minority, and can be resolved with a "request new captcha") More concerning is Diacritics. Diacritics are small, and may be hard to see when messed with by the captcha algorithm (although a native speaker might know what the word is and be able to fill in the diacritics). I'm doubtful that a captcha of ɓ b will look very different.

However, with that said, perhaps we should just do some testing to see if that's really an issue. Maybe it's less of an issue to a non-native speaker than using English captchas is.

*Actual coding - we'd need to be able to generate captchas from php, presumably in real time. Not a major issue, but requires coding efforts. (Or I suppose we could get the word list once, and generate the captchas one off with the current script)


We should also evaluate the effectiveness of our captchas. The captcha program was written a while ago. Since then there's been advances in getting text out of images. Lots of third party wikis report captchas not being all that effective against spam. Perhaps our captchas aren't actually doing anything.

Nemo_bis added a comment.Via ConduitJul 27 2012, 5:22 PM

(In reply to comment #20)

We should also evaluate the effectiveness of our captchas. The captcha program
was written a while ago. Since then there's been advances in getting text out
of images. Lots of third party wikis report captchas not being all that
effective against spam. Perhaps our captchas aren't actually doing anything.

AFAIK it's already proven to be completely broken, see http://lists.wikimedia.org/pipermail/wikitech-l/2011-November/056078.html (maybe while implementing the proposed new method we could also get it to use the right dictionaries).
There's quite a chance that our captchas are discouraging only good faith editors, especially non-English speaking.

Pginer-WMF added a comment.Via ConduitAug 3 2012, 6:42 AM

As part of an email conversation related to this topic, I made some mockups to illustrate some captcha ideas that could be less problematic for non-English speakers, improve the general UX, and rely on images from Commons.

Based on tagging parts of a panorama picture with the appropriate word (in the UI language or Basic English words).

Based on finding, from a set of similar images, the ones that fit a specific criterion (with an image also describing the criterion).

Based on finding the image that is different from a set of images.

These captchas will probably generate new problems on the technical side, require adjustments to reduce the chance of a machine solving them, or may just be unfeasible to generate, but I wanted to provide these ideas in case anybody else can use them as a base, improve on any technical weaknesses they may have, and make them at least as hard for a machine to solve as text-based captchas are.

A page at MediaWiki.org has been created to gather ideas and feedback: https://www.mediawiki.org/wiki/Requests_for_comment/CAPTCHA

Bawolff added a comment.Via ConduitAug 3 2012, 11:50 AM

As others have said on the mailing list, I fear such captchas would not only be easier for bots to solve than the current solution (once they've had a little time to adjust), but also would be harder to localize unless the number of such captcha challenges were extremely small.

Nikola_Smolenski added a comment.Via ConduitJan 8 2013, 10:45 AM

Note that some users may not have an appropriate keyboard to enter the captcha in their language. Aside from captcha generation in various languages, fuzzy comparison with the answer is needed as well.
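
As a rough sketch of what such a fuzzy comparison might mean in practice, the following folds case, whitespace and combining diacritics before comparing. The names `fold` and `answers_match` are hypothetical, and this is just one possible notion of "fuzzy", not an existing ConfirmEdit feature.

```python
import unicodedata

def fold(text):
    """Reduce text to a case-insensitive, mark-free, whitespace-free form."""
    # casefold() is a stronger lower(); NFKD splits letters from their marks.
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    # Drop every character with a non-zero combining class (the diacritics).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Remove all whitespace so stray spaces do not matter.
    return "".join(stripped.split())

def answers_match(expected, given):
    """Hypothetical lenient comparison of the captcha word and the reply."""
    return fold(expected) == fold(given)
```

This only helps with letters that decompose into a base letter plus marks. A letter such as Vietnamese "đ" has no such decomposition, so a user without that key would still fail unless an explicit mapping table were added.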

Qgil added a comment.Via ConduitApr 1 2013, 8:00 PM

fyi there is a proposal from the Language team for a mentored project about

Multilingual, usable and effective captchas
http://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#Multilingual.2C_usable_and_effective_captchas

I have some reservations about featuring that project to Google Summer of Code or Outreach Program for Women participants, but I'm willing to be proven wrong. Reasons:

  • Unclear buy-in from the community or the maintainers. The whole CAPTCHA topic is messy, with several discussion threads, a RFC, a prototype, and potential plans. We don't have a clear plan for captchas. There hasn't been enough feedback about captchas based purely on images without any text, as this project proposes.
  • Bug 32695 - Review and Deploy Wikicaptcha. It is there, waiting for feedback.
  • I'm not a CS anything and I could be perfectly wrong, but the project feels too ambitious for three months, both with the amount of work required and the skills needed.

With all this I see the risk of failure bigger than wished for a GSOC project, either because students will most likely lack the time/skills or because even a complete GSOC project would have a hard time ending up merged in our codebase.

Feedback welcome.

Bawolff added a comment.Via ConduitApr 1 2013, 9:20 PM

The key word in the gsoc proposal that I like is research. My problem with most captcha proposals is that they promote someone's pet idea without any citations to back up their theory.

This does seem to be much more research oriented than most gsoc projects.

Qgil added a comment.Via ConduitApr 2 2013, 11:33 PM

Sure, research is great. But before proposing someone to do a 3 month research on this subject I would like to have confidence that this research is welcome and there is an interest from the MediaWiki / ConfirmEdit maintainers in changing the status quo.

Reading the feedback in various channels, it is easier to find disbelief in captchas as a solution altogether.

bzimport added a comment.Via ConduitMar 20 2014, 8:25 PM

aalekh1993 wrote:

Over the past few months there has been active development of "Multilingual, usable and effective captchas" for GSoC 2014. But currently it seems that there is no primary technical mentor for the project. Therefore I request all members to please consider becoming part of this project as primary technical mentor.

Qgil added a comment.Via ConduitMar 22 2014, 6:14 PM

Let's move the GSoC 2014 discussion to

Bug 62960 - Prototype CAPTCHA optimized for multilingual and mobile

gerritbot added a comment.Via ConduitMar 29 2014, 8:02 AM

Change 121255 had a related patch set uploaded by Nemo bis:
Make captcha.py produce images in arbitrary language

https://gerrit.wikimedia.org/r/121255

Nemo_bis added a comment.Via ConduitMar 29 2014, 8:33 AM

Plans for the ultimate solution are being discussed at bug 62960.

In the meanwhile, as workaround, we're testing making images in all languages with words taken from Wiktionary. For technical details please read and comment on https://gerrit.wikimedia.org/r/121255
You can see samples at https://www.dropbox.com/sh/i2af7xvn4y593gc/-RRtFyoJji/captchas

In my testing the images seem rather good, mainly depending on the availability of a good font. DejaVu is a well known high quality font covering most languages and DejaVuSans-Bold seems to work well for the languages it covers: https://sourceforge.net/p/dejavu/code/HEAD/tree/trunk/dejavu-fonts/langcover.txt

Caveats:

  • We still have to handle RTL languages. Results probably don't make any sense now.
  • We've not yet made the blacklist multilingual but it's not too hard, ignore the bad words if any.
  • We still have to figure out how to exclude confusable words. It's not impossible, there is a Unicode library for that (but not for python perhaps). See bug 63216.
  • Of 165 languages for which Amgine gave me "big" dictionaries, 20 were not in DejaVu and for 10 I used FreeSerif instead. Those are lower quality. We may end up using [[mw:ULS]] font repo with some hacks, if many languages need it; or we could just skip them: I wonder if a captcha in e.g. Gujarati or Japanese will ever make sense.
  • Security fixes demanded by http://cdn.ly.tl/publications/text-based-captcha-strengths-and-weaknesses.pdf will be in a separate patch. They're several small things that someone familiar with PIL can do easily enough in the existing code. One of them is "printing" each letter separately with some aspect variations, which may solve some problems with ligatures too.
whym added a comment.Via ConduitMar 30 2014, 10:16 AM

(In reply to Nemo from comment #32)

Plans for the ultimate solution are being discussed at bug 62960.

In the meanwhile, as workaround, we're testing making images in all
languages with words taken from Wiktionary. For technical details please
read and comment on https://gerrit.wikimedia.org/r/121255
You can see samples at
https://www.dropbox.com/sh/i2af7xvn4y593gc/-RRtFyoJji/captchas

In my testing the images seem rather good, mainly depending on the
availability of a good font. DejaVu is a well known high quality font
covering most languages and DejaVuSans-Bold seems to work well for the
languages it covers:
https://sourceforge.net/p/dejavu/code/HEAD/tree/trunk/dejavu-fonts/langcover.txt

I think zh-* (Chinese) variants are mistakenly included. They are not claimed to be covered by the font, and many substitute squares (tofus) appear in your samples.

Nemo_bis added a comment.Via ConduitMar 30 2014, 10:29 AM

(In reply to Yusuke Matsubara from comment #33)

I think zh-* (Chinese) variants are mistakenly included.

Indeed; deleted. If someone thinks a captcha in CJK locales makes sense and/or has ideas on how to support them, please share.

Nemo_bis added a comment.Via ConduitMar 30 2014, 11:01 AM

(In reply to Siebrand Mazeland from comment #35)

ISO code got does not make sense, tofu at
https://www.dropbox.com/sh/i2af7xvn4y593gc/a6Kz0eSXZ4/captchas/got#f:image_5edb52ac_e04341dd3d25c8f8.png

Right. Maybe https://www.gnu.org/software/freefont/coverage.html lies? I'm getting more and more inclined to only use DejaVu. For the languages it doesn't support we'd need to ensure native speakers like the font (e.g. by using ULS fonts) but it's also quite hard to design image distortions that make sense with those scripts.
If you know one of the following languages please speak up!

  • bn Bengali
  • chr Cherokee
  • gu Gujarati
  • hi Hindi (Devanagari script)
  • mr Marathi (Devanagari script)
  • sa Sanskrit (Devanagari script)
  • ml Malayalam
  • si Sinhala/Sinhalese
  • ta Tamil
  • th Thai 1%

Missing in FreeFont too:

  • am Amharic
  • bo Tibetan
  • ja Japanese
  • km Central Khmer
  • kn Kannada
  • ko Korean
  • my Burmese (Myanmar)
  • pa Panjabi/Punjabi
  • te Telugu
  • ug Uyghur 87%
  • ur Urdu 92%
Nemo_bis added a comment.Via ConduitMar 30 2014, 11:07 AM

Sorry for the double message; another idea I had is that some of those languages don't have an OCR, as Wikisource folks painfully know (for instance Malayalam). Maybe for such languages we could just disable distortions, given bots are unlikely to parse them on their own anyway.
Cf. http://finereader.abbyy.com/recognition_languages/

NiharikaKohli added a comment.Via ConduitMar 30 2014, 12:41 PM

I went through the pictures for CAPTCHAs in Hindi. They're mostly understandable, except that in a few of the images it's impossible to distinguish the characters. Hindi has quite a few similar-looking characters differing just by a small line or a dot.

For example, the middle character is not-recognizable in https://www.dropbox.com/sh/i2af7xvn4y593gc/050a6S-21C/captchas/hi#lh:null-image_76947daa_e5d5575a79755d28.png

But mostly they read just fine.

Mormegil added a comment.Via ConduitMar 30 2014, 1:40 PM

I must say the Czech (cs) version is better than I’d expect. The only issue seems to be diacritics: especially the difference between i/í is practically indistinguishable after the distortion. For most words, you can probably tell from context, but in some cases, both versions would make correct words (e.g. https://www.dropbox.com/sh/i2af7xvn4y593gc/bSYQyGEMBH/captchas/cs#lh:null-image_6d80659d_e7c8421a61605559.png can be both “dobyti” and “dobytí”). Removing all words with “í” would probably be enough, ignoring the difference between “í” and “i” would be perfect, but I guess having some (low) nonzero expected error rate would be acceptable as well.
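
The first option above (dropping words that become ambiguous once diacritics are lost) could be generalised beyond "í" along these lines. This is a sketch only; `strip_marks` and `drop_ambiguous` are made-up names and nothing like this is claimed to exist in captcha.py.

```python
import unicodedata
from collections import defaultdict

def strip_marks(word):
    """Remove combining diacritics, e.g. "dobytí" becomes "dobyti"."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def drop_ambiguous(words):
    """Keep only words whose mark-free form is unique in the word list."""
    groups = defaultdict(set)
    for word in words:
        # Normalise to NFC so the same word in two encodings counts once.
        groups[strip_marks(word)].add(unicodedata.normalize("NFC", word))
    return [word for word in words if len(groups[strip_marks(word)]) == 1]
```

With this approach a word like "město" stays in the list because no "mesto" competes with it, while both "dobyti" and "dobytí" are removed.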

mxn added a comment.Via ConduitApr 1 2014, 8:35 AM

For Vietnamese, 27 of the images contain a piece of tofu instead of a second word; 2 images contain more than one piece. It’s odd, because this font clearly supports the Vietnamese half of Latin Extended Additional. The high distortion is problematic and probably unnecessary, because Vietnamese OCR is still pretty rudimentary, with little support for diacritics. As it is, though, a different font may help with many of the following legibility challenges:

ú or ủ?
https://www.dropbox.com/sh/i2af7xvn4y593gc/v_jkVCy5Xg/captchas/vi/image_6876ca11_cc6e08a95ea5b935.png

ẽ or ế?
https://www.dropbox.com/sh/i2af7xvn4y593gc/rfc7TwizAo/captchas/vi/image_432bfc9d_d02d9707bcb0a02b.png

If I didn’t know this font used two-story a’s, I’d see ã instead of ỗ:
https://www.dropbox.com/sh/i2af7xvn4y593gc/CGUSde4hfC/captchas/vi/image_5cbb4b12_976cd14e4e332a23.png

ú or ứ?
https://www.dropbox.com/sh/i2af7xvn4y593gc/Svdiq4ZLS5/captchas/vi/image_c90d4c3d_6b4e7a877b3e79dc.png

d or đ?
https://www.dropbox.com/sh/i2af7xvn4y593gc/Ah4yviImWT/captchas/vi/image_ae9020dd_0a618ab7494104fd.png

Nemo_bis added a comment.Via ConduitApr 1 2014, 10:03 AM

Tofu is because of things like [[wikt:裘]] and [[wikt:意見]] being in the dictionary. As with Malayalam issues reported on mailing list, I'm unsure how to handle such "extraneous" "words" for all languages; though in this and the Serbian's case we could "just" check the dictionary is in the main language's script (if we know the language code...).

About vi, I was reading earlier this morning on Gentium: «version of the font with redesigned diacritics (flatter ones) to make it more suitable for use with stacking diacritics, and for languages such as Vietnamese». http://scripts.sil.org/cms/scripts/page.php?item_id=Gentium_faq&_sc=1#5d25a5da
How many languages have such complex diacritics and is there some generic enough font? I doubt we can exclude words with diacritics, we'd only have 500 left out of thousands in vi's case. I'm uploading a new attempt with Arimo font, please check if it's any better.
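
A crude version of the "is this dictionary entry in the main language's script" check mentioned above might look like the sketch below. It guesses the script from the first token of each character's Unicode name, which is an assumption of convenience (the Python standard library does not expose the real Unicode Script property); `script_of` and `in_script` are hypothetical names.

```python
import unicodedata

def script_of(ch):
    """Rough script guess from the Unicode character name.

    For example 'LATIN', 'CYRILLIC', 'HEBREW' or 'CJK'. This is a
    heuristic, not the official Unicode Script property.
    """
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"

def in_script(word, expected_script):
    """Return True if every base character of `word` is in the script."""
    for ch in word:
        # Combining marks are named "COMBINING ..."; skip them.
        if unicodedata.combining(ch):
            continue
        if script_of(ch) != expected_script:
            return False
    return bool(word)
```

A Vietnamese word list could then be filtered with something like `[w for w in words if in_script(w, "LATIN")]`, which would drop entries such as 意見; the mapping from language code to expected script would still have to come from somewhere.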

mxn added a comment.Via ConduitApr 1 2014, 10:21 AM

(In reply to Nemo from comment #41)

Tofu is because of things like [[wikt:裘]] and [[wikt:意見]] being in the
dictionary. As with Malayalam issues reported on mailing list, I'm unsure
how to handle such "extraneous" "words" for all languages; though in this
and the Serbian's case we could "just" check the dictionary is in the main
language's script (if we know the language code...).

Yep, that’s what’s required for Vietnamese then.

About vi, I was reading earlier this morning on Gentium: «version of the
font with redesigned diacritics (flatter ones) to make it more suitable for
use with stacking diacritics, and for languages such as Vietnamese».
<http://scripts.sil.org/cms/scripts/page.php?item_id=Gentium_faq&_sc=1#5d25a5da>
How many languages have such complex diacritics and is there some generic
enough font?

Among Latin alphabets that we’ll be displaying, Vietnamese is a bit of a special case for stacking diacritics. GentiumAlt’s flatter diacritics allow it to fit Vietnamese on a standard-height line at the cost of some legibility. If anything, we need more exaggerated diacritics that can survive the distortions.

I doubt we can exclude words with diacritics, we'd only have
500 left out of thousands in vi's case.

Right, the whole point of this exercise is to include the diacritics. :-)

bzimport added a comment.Via ConduitApr 1 2014, 1:49 PM

wmf.amgine3691 wrote:

Comments regarding Sinitic captchas in the third paragraph of this revision: https://en.wiktionary.org/w/index.php?title=User_talk%3AWyang&action=historysubmit&diff=26075006&oldid=26066452

Nasirkhan added a comment.Via ConduitApr 2 2014, 9:03 AM

Hi,
Bengali (bn) text is not displaying properly. All the conjuncts are misplaced, which is why almost none of the images represents an actual word. In some images (example: https://www.dropbox.com/sh/i2af7xvn4y593gc/7fTaoyiaSb/captchas/bn#lh:null-image_a060ec4f_d15f04bc689bb980.png) parts of the characters are missing because of the padding/border.

I am not sure whether it is a problem with the font or not. But if you can tell me the name of the font, I can test that.

Nasir Khan Saikat

mxn added a comment.Via ConduitApr 2 2014, 9:11 AM

(In reply to Nemo from comment #41)

I'm uploading a new attempt with
Arimo font, please check if it's any better.

Yes, it’s better. The only severe ambiguity I ran into was:

h or n? Knowing the word, it’s n, but it sure looks like h:
https://www.dropbox.com/sh/i2af7xvn4y593gc/-GesxDHeX9/captchas/vi-arimo/image_03be064f_70c0338194b8dca2.png

Another issue for Vietnamese: the ̃ and ̉ diacritics can look like each other when stacked over ̂ and distorted. The southern dialect merges the two tones into ̉, so southerners won’t always be able to rely on the words they know to resolve the ambiguity. I’ve asked the Vietnamese Wikipedia community for feedback on this issue: [[vi:Wikipedia:Thảo luận#Việt hóa các hình CAPTCHA]].

Finally, many Vietnamese Wikipedia users rely on an IME script embedded via a gadget, but gadgets are disabled at [[Special:UserLogin/signup]]. We’d need to port the (rather complex) IME to ULS to keep the signup form accessible. Otherwise, as others have mentioned on the mailing lists, there will have to be an option to fall back to an English CAPTCHA.

Nikola_Smolenski added a comment. Via Conduit, Apr 2 2014, 11:19 AM

Suggestion regarding Bengali and similar scripts: they do not have to be distorted as much. This is because OCR for these scripts is less developed than OCR for the Latin alphabet, and I doubt spammers will be willing to go to that much trouble for relatively small Wikipedias. If we notice that the captchas are being defeated, more distortion could be added.

bzimport added a comment. Via Conduit, Apr 2 2014, 4:21 PM

sumanah wrote:

Also see comments at http://lists.wikimedia.org/pipermail/wikitech-ambassadors/2014-April/thread.html#644 about Swedish, French, Bengali, Romanian, and Catalan.

zeljkofilipin added a comment. Via Conduit, Apr 3 2014, 8:47 AM

Croatian (hr) is _completely_ broken. For example, 80% of the captchas (or so) are completely (or half) in Cyrillic. Some older people will be able to read them, but almost nobody in Croatia will be able to enter Cyrillic text, since that is not an official script here.

zeljkofilipin added a comment. Via Conduit, Apr 3 2014, 8:55 AM

Serbian (sr) is strange too. I think both Latin and Cyrillic are official there, but isn't it strange to ask people to change input method (from Latin to Cyrillic, and vice versa) in the middle of a captcha, like here?

https://www.dropbox.com/sh/i2af7xvn4y593gc/ocKv1yBPuf/captchas/sr#lh:null-image_e373536e_2ecbd37b76d67185.png

The above is not the only example.

zeljkofilipin added a comment. Via Conduit, Apr 3 2014, 9:08 AM

Bosnian (ba) has completely Cyrillic CAPTCHAs, as far as I can see, but according to Wikipedia "Standard Bosnian uses a Latin alphabet."[1]

Željko

1: https://en.wikipedia.org/wiki/Bosnian_language

Nemo_bis added a comment. Via Conduit, Apr 3 2014, 9:19 AM

ba is not Bosnian https://translatewiki.net/wiki/Portal:Ba

Thanks for these comments, but we're already aware of the mixed/wrong script issues: they were the first thing people brought to our attention, so there is no need for more examples.
http://thread.gmane.org/gmane.org.wikimedia.mediawiki.i18n/846

As previously said (see comment 41), we'll rely on the ICU interface to Unicode data to remove mixed-script words and (where possible) secondary/wrong scripts for each language. Problems with the source dictionary (en.wiktionary.org) should be dealt with by editing said wiki.
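To illustrate the kind of filtering meant here, below is a minimal Python sketch. It is not the ICU-based implementation: it assumes the first word of a character's Unicode name ("LATIN", "CYRILLIC", ...) is a good enough stand-in for the Script property, and the function names `scripts_in`, `keep_word` and `filter_wordlist` are made up for this example.

```python
import unicodedata


def scripts_in(word):
    """Return the set of script labels used by the letters of `word`.

    Assumption: the first word of the Unicode character name approximates
    the script. A real implementation would query the Script property
    (for example through ICU) instead.
    """
    scripts = set()
    for ch in word:
        if not ch.isalpha():
            continue  # ignore combining marks, digits and punctuation
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split(" ")[0])
    return scripts


def keep_word(word, allowed_script):
    """True only if every letter of `word` belongs to `allowed_script`."""
    return scripts_in(word) == {allowed_script}


def filter_wordlist(words, allowed_script):
    """Drop mixed-script words and words written in another script."""
    return [w for w in words if keep_word(w, allowed_script)]
```

For a Croatian list one would call `filter_wordlist(words, "LATIN")`, which drops both fully Cyrillic entries and entries that mix the two alphabets. Serbian would need a decision on which of its two scripts to keep.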

mxn added a comment. Via Conduit, Apr 3 2014, 11:03 AM

So far, the general sentiment from the Vietnamese Wikipedia community has been that the added difficulty of distinguishing diacritics vastly outweighs any readability improvements from using actual Vietnamese words instead of English words or random letters. Moreover, there is skepticism that the wiki even has a problem with CAPTCHA-solving bots. These are gut feelings rather than hard data, of course, but I can imagine a couple changes that would mitigate the community's concerns:

1a. Minimize or eliminate distortions in Vietnamese. High-quality OCR solutions like Google's already have enough difficulty with clear, undistorted Vietnamese text.
1b. Alternatively, strip diacritics *before* display and accept diacritic-less input. There would likely be no change in difficulty for bots, but Vietnamese users would still be able to employ their knowledge of Vietnamese spelling patterns.

2. Provide an option to solve a standard English CAPTCHA. (Not sure what the default should be.) Many websites that require CAPTCHAs offer some alternative for accessibility; Vietnamese CAPTCHAs with diacritics would be insurmountable to those with declining eyesight.
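
For option 1b, a rough Python sketch of the stripping step follows. It assumes Unicode NFD decomposition plus a manual mapping for đ/Đ (which has no canonical decomposition) is sufficient for Vietnamese. The helper names `strip_diacritics` and `answers_match` are hypothetical and not part of any existing CAPTCHA extension.

```python
import unicodedata

# đ/Đ do not decompose under NFD, so they are mapped by hand.
_EXTRA = str.maketrans({"đ": "d", "Đ": "D"})


def strip_diacritics(text):
    """Remove tone and vowel marks, e.g. before rendering the image."""
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" covers the combining marks produced by decomposition.
    base = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", base.translate(_EXTRA))


def answers_match(expected, typed):
    """Compare a typed answer to the expected word, ignoring diacritics,
    letter case and surrounding whitespace."""
    return (strip_diacritics(expected).strip().casefold()
            == strip_diacritics(typed).strip().casefold())
```

With this, a word stored as "đường" would be displayed as "duong", and both "duong" and "đường" would be accepted as input.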

Nullzero added a comment. Via Conduit, Apr 7 2014, 1:51 PM

IMHO, for the Thai language, the pictures are very blurred. Although some can be guessed easily, the rest need a lot of effort. In some cases, it is impossible to determine the correct word at all.

Nullzero added a comment. Via Conduit, Apr 8 2014, 3:24 PM

Results from [[th:WP:HELPDESK#CAPTCHA]] from Thai Wikipedia: S: 0, O: 6, N: 0

Comments:

Nullzero: See the above comment

G(x): Too hard to read

Taweetham: (1) Too hard to read (2) Contains swear words (3) Not convenient for interwiki users (4) The Thai language is complex; he doesn't know whether the software will generate words that are impossible to enter

BlackKoro: Unable to read

Lerdsuwa: Can't distinguish between "ท" and "ห", "ล" and "ส"

Aristitleism: (1) Very hard to read (2) Contains swear words (3) Contains some obsolete characters that no one uses anymore, such as "ฦ". These obsolete characters are also hard to find on a Thai keyboard.
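
The swear-word and obsolete-character complaints could be handled when the wordlist is built. Here is a minimal Python sketch, with loud assumptions: only "ฦ" comes from the feedback above, "ฃ" and "ฅ" are added here as further obsolete Thai consonants, and the blocklist is a hypothetical placeholder that would have to be supplied (and kept private) by native speakers.

```python
# Obsolete characters: "ฦ" is from the feedback above; "ฃ" and "ฅ" are
# an assumption added for this sketch.
OBSOLETE_THAI = frozenset("ฃฅฦ")

# Hypothetical placeholder; a real list would come from native speakers.
BLOCKLIST = frozenset()


def usable(word, blocklist=BLOCKLIST):
    """True if `word` is not blocklisted and has no obsolete characters."""
    if word in blocklist:
        return False
    return not any(ch in OBSOLETE_THAI for ch in word)


def filter_wordlist(words, blocklist=BLOCKLIST):
    """Keep only the words that pass `usable`."""
    return [w for w in words if usable(w, blocklist)]
```

This does nothing for the legibility complaints (for example "ท" vs. "ห"), which depend on the font and the distortion rather than on the wordlist.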

RandomDSdevel added a comment. Via Conduit, May 2 2014, 6:45 PM

(In reply to Minh Nguyễn from comment #52)

Moreover, there is skepticism that the wiki even has a problem with
CAPTCHA-solving bots. These are gut feelings rather than hard data, of
course, but I can imagine a couple changes that would mitigate the
community's concerns:

Could some hard data be found on whether or not the Vietnamese Wikipedia has ever had any problems with CAPTCHA-solving bots?

