Prototype CAPTCHA optimized for multilingual and mobile
Open, Normal, Public

Description

This is an enhancement request related to https://www.mediawiki.org/wiki/Special:PermanentLink/1151094#Multilingual.2C_usable_and_effective_captchas

There has been a lot of discussion about the effectiveness of captchas and the amount of investment we should put into this area. However, the fact is that captchas are deployed in Wikimedia projects, and they are used for combating spam and other misuse on a daily basis. It is also a fact that the current (text-based) approach is not optimal for desktop, and clearly problematic for mobile, where reading and typing become more difficult. Also, the current model is based on English-language / Latin-script text strings, which puts a majority of users at a disadvantage.

For all these reasons it is worth investigating a way forward, keeping the current approach of using captchas. (It is also worth considering full alternatives departing from the CAPTCHA techniques, but please discuss them elsewhere)

Any proposal in this direction should comply with these requirements:

  • A clear solution for sourcing CAPTCHA content automatically from big pools of free text/files. The system proposed cannot rely on manual selection or other types of extra human work.
  • Non-discriminatory to users depending on their language.
  • Usable in a mobile context.

Version: unspecified
Severity: enhancement
URL: https://www.mediawiki.org/wiki/CAPTCHA
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=32695
https://bugzilla.wikimedia.org/show_bug.cgi?id=5309
https://github.com/mitsuhiko/babel/issues/89

Details

Reference
bz62960

Event Timeline

bzimport raised the priority of this task from to Normal. Nov 22 2014, 2:54 AM
bzimport set Reference to bz62960.
bzimport added a subscriber: Unknown Object (MLST).
Qgil created this task. Mar 22 2014, 6:12 PM

aalekh1993 wrote:

Hello,

In line with the points above, I would like to propose the following approaches, which are also described in my proposal:

  • "Captcha for identifying Odd one out", "Image Rotation Based Captcha" and effects produced using PHP libraries can help us combat spam bots with a success rate above 25%, primarily because easy logical questions are likely to be solved by humans but not by bots. Image-rotation-based authentication is best described in this research paper: http://googleresearch.blogspot.in/2009/04/socially-adjusted-captchas.html .

Also, wisely chosen effects have provided resistance against large image databases such as TinEye and Google Images (I came to this conclusion by experimenting with images).

  • My idea is to extract captcha images from Wikidata, requiring no extra human effort. In addition, a text-based captcha sourced from free text files would not improve the user experience much compared to the present captcha (which is really frustrating).

  • In my proposal I describe an indexing system that would provide users with globally acceptable images; it is designed to improve its performance over time. Additional localization can be performed using the Apertium machine translation API: https://www.apertium.org

  • "Clipart based Captcha" and "Image based captcha" would be a big advantage on mobile devices, since a "text based captcha" is difficult to input and verify using a mobile keyboard.

All the approaches named above are explained in detail in my proposal: https://www.mediawiki.org/wiki/User:AalekhN/GSoC_proposal_2014

Your suggestions and questions are welcome :)

wmf.amgine3691 wrote:

First, this is a subject near and dear to my heart, and I've produced a couple captchas for ConfirmEdit over the years, and tried every existing module. The top three WMF modules, imo, are:

  • Asirra is currently the best, bar none. One of the properties I managed, which had reasonably high traffic and had been a bot honey pot at one point (which was why it ended up under my management), was experiencing approximately 300 Asirra failures per hour, for 3+ years, with 7 incidents of spam, of which 6 were humans and for the last we couldn't determine whether it was human or bot. However, it exposes users to Microsoft's data collection without informed consent.
  • Questy captcha is second, beating recaptcha. When the question pool is regularly cycled with locally unique questions this system is excellent. Its drawbacks are, in order of highest to lowest cost, the cultural/language specificity of easily answered questions, the maintenance cost in generating unique local questions, and the lack of analysis of failures (which questions fail most often, which questions are answered by spambots, etc.)
  • ReCaptcha, by Google, which falls into the 24% failure rate class. In addition to its high failure rate, ReCaptcha scripts 'donate' cpu cycles from users to processing Google's book scanning efforts, as well as exposing users to Google's data collection processes, neither with informed consent.

General comments regarding the GSOC proposal:

  • None of these modules appear to have considered accessibility prior to the implementation. Particularly for people with visual impairments which do not require audio options, 'effects' type modules may present an inordinately high challenge.
  • For the 'odd one out' style, consider a text such as "Select the [category]" or "Select all that are not [category]". The category will almost always be a plural noun, which makes the message more easily translated. The message "Select the images that is different in this group" is ambiguous, unless there is only one image which is different.
  • In general, I would suggest you avoid clip art. While clip art attempts to be trans-lingual and -cultural, it often is not. E.g. a steadily reducing population has ever used a wood-and-graphite pencil; many cultures have never had them.
  • While I like the image rotation conceptually, too many images may have more than one appropriate 'up'. Is a plane 'up' if it is flying horizontally ( -> ), or if it is pointing at the top of the screen ( ∆ )? Imagine this character pointing any direction: ✈︎ (AIRPLANE Unicode: U+2708 U+FE0E, UTF-8: E2 9C 88 EF B8 8E) Indexing enough images to be viable is possible, but expensive.

I would suggest you focus on a two-part module:

  • A categorical odd one out, with indexed ('tagged') images drawn from Commons (possibly via Wikidata api, about which I know nothing.)
  • An image tagger ("Please give three categories for this image:") with images drawn from random or targeted categories on Commons, which will be used to expand your pool of indexed images.
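As a rough illustration of the categorical odd-one-out part, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `TAGGED_POOL` contents, the `build_odd_one_out` helper and the challenge layout are hypothetical, and a real module would draw tagged images from Commons rather than from a hard-coded dictionary.

```python
import random

# Hypothetical tagged pool (category -> image names); a real pool would come
# from indexed Commons images, which this sketch does not fetch.
TAGGED_POOL = {
    "birds": [f"bird_{i}.jpg" for i in range(10)],
    "bridges": [f"bridge_{i}.jpg" for i in range(10)],
    "bicycles": [f"bicycle_{i}.jpg" for i in range(10)],
}


def build_odd_one_out(pool, size=8, rng=None):
    """Return a challenge with size-1 images from one category and one from another."""
    rng = rng or random.Random()
    # Pick two distinct categories: the majority one and the odd one.
    majority, odd = rng.sample(sorted(pool), 2)
    odd_image = rng.choice(pool[odd])
    images = rng.sample(pool[majority], size - 1) + [odd_image]
    rng.shuffle(images)
    return {
        # Message template suggested above; [category] is filled in and
        # translated per language.
        "message": "Select all that are not [category]",
        "category": majority,
        "images": images,
        "answer": images.index(odd_image),  # kept server-side only
    }


challenge = build_odd_one_out(TAGGED_POOL, rng=random.Random(0))
print(challenge["category"], challenge["images"], challenge["answer"])
```

Because the prompt only needs a plural category noun, the same challenge can be shown in any language for which the category label has a translation.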

I would strongly encourage you to plan your module to be implemented on 3rd party wikis. Figure out how to get them to build indexes for WMF, because there are far more mediawiki users outside of WMF than inside it.

Nice write-up, thank you! On Asirra, how long since you last managed a wiki using it? It seems there may have been developments in the ability to beat it, along the lines of the original papers proposing it and of http://dx.doi.org/10.1145/1455770.1455838 .

shashank_jaiswal wrote:

As per the above points:
I have proposed the idea at https://www.mediawiki.org/wiki/User:Shashank2016/Image_puzzle_CAPTCHA

  • An arbitrary (or properly indexed/sorted) image taken from the MediaWiki database is split into an (n x m) matrix using GD; the pieces are then rotated and set as the captcha image. A spam bot may need to attempt validation up to (n*m)! times. If n>3 and m>3, the total number of possible combinations would be more than (3*3)! = 362880, so a bot would clearly need more than 362880 validation attempts.
  • The CAPTCHA is automatically created from big pools of image files from MediaWiki. There won't be any manual selection of images or other types of extra human work.
  • Proper indexing of image files from MediaWiki can be done, and it can be restricted to region-based local images if required.
  • Yes, it is non-discriminatory to users depending on their language (multilingual).
  • Usable in a mobile context: this can be quite interactive for touchscreen phone users. For other phones we can go with simpler ones like "Simple Image based Captcha with numbers or alpha-Numeric multi-Case character written on it along with some lines and similar effects using GD".
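The arithmetic behind the (n*m)! figure can be checked with a short Python sketch. The `arrangements` function and its `rotations` parameter are hypothetical names introduced here; the rotation factor is an added assumption (each tile independently taking one of a fixed number of orientations) and is not part of the count given above.

```python
from math import factorial


def arrangements(n, m, rotations=1):
    """Count tile orderings of an n x m grid, optionally times per-tile rotations."""
    tiles = n * m
    # (n*m)! orderings; each tile may additionally take `rotations` orientations.
    return factorial(tiles) * rotations ** tiles


print(arrangements(3, 3))               # 362880, i.e. (3*3)!
print(arrangements(4, 4))               # orderings for the smallest grid with n>3, m>3
print(arrangements(3, 3, rotations=4))  # assumption: four quarter-turn orientations
```

This only measures the search space for blind guessing; it says nothing about attacks that reassemble the image by matching tile edges.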

Thanks.

wmf.amgine3691 wrote:

(In reply to Nemo from comment #3)

Nice write-up, thank you! On Asirra, how long since you last managed a wiki
using it? It seems there may have been developments in the ability to beat
it, along the lines of the original papers proposing it and of
http://dx.doi.org/10.1145/1455770.1455838 .

I am currently managing 5 mediawikis, all using the WMF ConfirmEdit Asirra module. None of these are high-traffic, none have a history of being over-run by bots. My estimate of spam attempts per hour is based on accidentally unprotecting one wiki during a server migration last month: about 5 per hour. Currently, zero spam incidents using Asirra in the past year.

wmf.amgine3691 wrote:

(In reply to shashank jaiswal from comment #4)

As per the above points:
I have proposed the idea at
https://www.mediawiki.org/wiki/User:Shashank2016/Image_puzzle_CAPTCHA

  • An arbitrary (or properly indexed/sorted) image taken from the MediaWiki database is split into an (n x m) matrix using GD; the pieces are then rotated and set as the captcha image. A spam bot may need to attempt validation up to (n*m)! times. If n>3 and m>3, the total number of possible combinations would be more than (3*3)! = 362880, so a bot would clearly need more than 362880 validation attempts.
  • The CAPTCHA is automatically created from big pools of image files from MediaWiki. There won't be any manual selection of images or other types of extra human work.
  • Proper indexing of image files from MediaWiki can be done, and it can be restricted to region-based local images if required.
  • Yes, it is non-discriminatory to users depending on their language (multilingual).
  • Usable in a mobile context: this can be quite interactive for touchscreen phone users. For other phones we can go with simpler ones like "Simple Image based Captcha with numbers or alpha-Numeric multi-Case character written on it along with some lines and similar effects using GD".

Thanks.

js-based game captchas (Are You A Human) are most-easily solved by reverse-engineering the js and determining how it decides you've solved it. XRumer includes several modules to solve a few hundred of them. The other method is to hire out the game-playing to subcontractors.

With the image rotations, you will need to determine which images have a high rate of failure among known humans trying to identify the correct orientation, and remove them from the pool. As far as I am aware this quality assurance step cannot be automated. (One possible method for distributing this task is to ask all logins to optionally identify 'up' for an image, and those which are auto-confirmed accounts and choose to answer can be assumed to be human. Figuring out how to do this task before and after user login should be quite the challenge.)
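Once answers from accounts assumed to be human have been collected, the pruning step itself could look roughly like the Python sketch below. The `ImageStats` record, the `prune_pool` helper and both thresholds are assumptions chosen for illustration; collecting trustworthy human answers remains the hard part described above.

```python
from dataclasses import dataclass


@dataclass
class ImageStats:
    attempts: int = 0  # answers from accounts assumed to be human
    failures: int = 0  # of those, how many chose the wrong 'up'

    @property
    def failure_rate(self) -> float:
        return self.failures / self.attempts if self.attempts else 0.0


def prune_pool(stats, max_failure_rate=0.25, min_attempts=20):
    """Keep images unless enough human answers show a high failure rate."""
    keep = {}
    for image, s in stats.items():
        if s.attempts >= min_attempts and s.failure_rate > max_failure_rate:
            continue  # ambiguous orientation for humans: drop from the pool
        keep[image] = s
    return keep


stats = {
    "lighthouse.jpg": ImageStats(attempts=40, failures=2),
    "airplane.jpg": ImageStats(attempts=40, failures=22),
    "new_upload.jpg": ImageStats(attempts=3, failures=2),
}
print(sorted(prune_pool(stats)))  # ['lighthouse.jpg', 'new_upload.jpg']
```

Images with too few answers are kept rather than judged early, so a handful of mistakes does not remove a usable image.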

Likewise, I believe you will need to add quality monitoring to any other captcha method. Selecting random images from the millions of files available on Commons may seem like an easy solution, but it will result in contextually ineffective problems: the goal of the captcha is to let humans in easily, not to keep bots out. If the humans do not get through easily, the site will not be used and the purpose for the captcha is lost.

The captcha may be non-discriminatory based on language, but does it work for someone who cannot see a full spectrum of color? Almost all photo effects which cause distortion of the initial image will be exceptionally difficult for people with even mild levels of dyslexia. I can introduce you to several wikimedians with degrees of visual impairment who will be unable to solve several of your proposed captchas.

(In reply to Amgine from comment #5)

I am currently managing 5 mediawikis, all using the WMF ConfirmEdit Asirra
module. None of these are high-traffic, none have a history of being
over-run by bots. My estimate of spam attempts per hour is based on
accidentally unprotecting one wiki during a server migration last month:
about 5 per hour. Currently, zero spam incidents using Asirra in the past
year.

Very useful, thanks. By the way, you could enable the debuglog "captcha" to have a full log to grep for stats, I think (as WMF does).

As a reminder, we've talked a lot about non-text captchas above, but the text captcha solutions can be still considered:

  • wikicaptcha with Wikisource OCR (bug 32695) could do as well as reCAPTCHA, which according to Bursztein et al. was broken, but still 24 times less than Wikimedia's (fancy)captcha;
  • "just" making a few hundred language-specific dictionaries and improving the text image generation to be more similar to Google's (as per the same Bursztein et al.) would improve fancycaptcha dramatically and solve bug 5309 for Wikimedia at least.

aalekh1993 wrote:

Hello, first of all thank you nemo_bis and Amgine for your points; they helped me draft various solutions for the project.

1) As Amgine raised the need to "add quality monitoring to any other captcha method", please have a look at the Image Indexing System I described in my proposal: https://www.mediawiki.org/wiki/User:AalekhN/GSoC_proposal_2014#Image_Indexing_System . As a key part of the project, this indexing system will be designed to improve over time, and will remove images that are not globally recognizable (multilingual) or are irrelevant. It will downrate the images that users reload (while reloading the captcha).

2) I have enhanced the "annotations type captcha" approach to help us build indexes for images. As an advantage, we can now use these indexes to build captcha questions without applying effects to images, so these questions can be solved easily by visually impaired users. A more detailed explanation of this enhancement is here: https://www.mediawiki.org/wiki/User:AalekhN/notes#Indexing_Annotations_Type_Captcha

For the 'odd one out' style, consider a text such as "Select the [category]" or "Select all that are not [category]". The category will almost always be a plural noun, which makes the message more easily translated. The message "Select the images that is different in this group" is ambiguous, unless there is only one image which is different.

My idea for "odd one out" questions is to present the user with 2 images which are different from the group, so it won't be difficult for users to pick the 2 odd images out of the group.

As a reminder, we've talked a lot about non-text captchas above, but the text captcha solutions can be still considered:

  • wikicaptcha with Wikisource OCR (bug 32695) could do as well as reCAPTCHA, which according to Bursztein et al. was broken, but still 24 times less than Wikimedia's (fancy)captcha;
  • "just" making a few hundred language-specific dictionaries and improving the text image generation to be more similar to Google's (as per the same Bursztein et al.) would improve fancycaptcha dramatically and solve bug 5309 for Wikimedia at least.

Earlier I did analyze the project from the text captcha point of view, but found the following cons with it:

*) It provides almost the same solution as ReCaptcha, which is currently easily breakable (24% of the time).
*) Words used are mostly English and Latin, hence not multilingual.
*) It provides the same user experience as ReCaptcha, hence not user friendly.
*) I thought of it as more of a replacement for ReCaptcha, but it does not offer a solution for an effective captcha.

5) I also considered a solution for visually impaired/blind users, but after advice from community members I postponed the idea to be developed in a later phase. The idea I presented is mentioned here: https://www.mediawiki.org/wiki/User:AalekhN/notes#For_blind_and_visually_impaired_users

Qgil added a comment. Mar 25 2014, 6:24 AM

Just a line to say that mobile developer Juliusz Gonera has volunteered to co-mentor this feature with Pau and Emufarmers. Thank you!

The priority is to assess the GSoC candidates with microtasks and whatever evaluation is required to evaluate them. There is not much time left.

wmf.amgine3691 wrote:

Very brief, as I am heading to sleep; I will answer more completely tomorrow:

(In reply to Aalekh Nigam from comment #8)

As a reminder, we've talked a lot about non-text captchas above, but the text captcha solutions can be still considered:

  • wikicaptcha with Wikisource OCR (bug 32695) could do as well as reCAPTCHA, which according to Bursztein et al. was broken, but still 24 times less than Wikimedia's (fancy)captcha;
  • "just" making a few hundred language-specific dictionaries and improving the text image generation to be more similar to Google's (as per the same Bursztein et al.) would improve fancycaptcha dramatically and solve bug 5309 for Wikimedia at least.

Earlier I did analyze the project from the text captcha point of view, but
found the following cons with it:
*) It provides almost the same solution as ReCaptcha, which is currently
easily breakable (24% of the time).

There are additional problems with ReCaptcha as noted in comment #2, which Fancy Captcha obviates.

*) Words used are mostly English and Latin, hence not multilingual.

I am currently working on parsing wiktionary dumps by language. The estimated term pool will be > 20 million in 1300+ languages; not all of these will be useful as some are not represented by scripts (e.g. American Sign Language) and others are not represented in easily available fonts (e.g. Bhasa dialects).

*) It provides the same user experience as ReCaptcha, hence not user
friendly.

Strongly agree. However, it is popular amongst WMF devs.

*) I thought of it as more of a replacement for ReCaptcha, but it does not
offer a solution for an effective captcha.

Keep in mind that tools to solve text-based captcha are focused on solving ReCaptcha; other distortion models are as easily solved in theory but in practice are not.

<wave @ EmuFarmers>

wmf.amgine3691 wrote:

@Nemo_bis: I have 58 dictionaries[1] based on the list of active wikipedias[2]. I actually parsed out 1320 dictionaries based on their language headers in en.Wiktionary. Some of these should, perhaps, be merged for certain wikipedias (e.g. Turkish, Kurdish, and Turkmen for tr.WP), while others would require more complex parsing to be created (zh and zh-classical.) If you have a better list of active languages you would like I can work with that.

The url for the dictionaries will expire in a week.

[1] https://cloud.saewyc.ca/public.php?service=files&t=5e21ead0d34daa49576a44cc89c31def
[2] https://meta.wikimedia.org/wiki/Wikipedia/Versions
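To illustrate how per-language dictionaries might be merged for one wiki and used to pick a captcha word, here is a minimal Python sketch. The `DICTIONARIES` entries are a few placeholder words, not the contents of the dictionaries linked above, and `WIKI_DICTIONARIES`, `word_pool` and `pick_word` are hypothetical names; only the tr.WP grouping follows the example given in the comment.

```python
import random

# Placeholder entries only; the real dictionaries parsed from en.Wiktionary
# are far larger and are not reproduced here.
DICTIONARIES = {
    "Turkish": ["kitap", "deniz"],
    "Kurdish": ["heval", "çiya"],
    "Turkmen": ["kitap", "deňiz"],
    "English": ["river", "window"],
}

# Assumed mapping from wiki language code to the dictionaries it should draw on.
WIKI_DICTIONARIES = {
    "tr": ["Turkish", "Kurdish", "Turkmen"],
    "en": ["English"],
}


def word_pool(wiki_code):
    """Merge all dictionaries configured for a wiki, dropping duplicate words."""
    pool = set()
    for language in WIKI_DICTIONARIES.get(wiki_code, []):
        pool.update(DICTIONARIES.get(language, []))
    return sorted(pool)


def pick_word(wiki_code, rng=None):
    """Pick one captcha word for the given wiki."""
    rng = rng or random.Random()
    pool = word_pool(wiki_code)
    if not pool:
        raise LookupError(f"no dictionary configured for {wiki_code!r}")
    return rng.choice(pool)


print(word_pool("tr"))
print(pick_word("tr", random.Random(0)))
```

Cases that need more complex parsing, such as zh and zh-classical, are outside what this sketch covers.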

wmf.amgine3691 wrote:

(In reply to Aalekh Nigam from comment #8)

1) As Amgine raised the need to "add quality monitoring to any other captcha
method", please have a look at the Image Indexing System I described in my
proposal:
https://www.mediawiki.org/wiki/User:AalekhN/GSoC_proposal_2014#Image_Indexing_System .
As a key part of the project, this indexing system will be designed to
improve over time, and will remove images that are not globally recognizable
(multilingual) or are irrelevant. It will downrate the images that users
reload (while reloading the captcha).

The method described there - when images are reloaded the images are downrated - may not be a good model. Bots which are attempting to break through may reload images continuously in order to capture your entire pool of images. They may also reload several times sequentially in hopes of retrieving an image they 'know' before making a random guess.

It would be preferable to discover if a user trying to pass a captcha and failing, succeeds on a second attempt. If the user succeeds on a second attempt, the previous incorrect choice should be downrated. This model is preferred because the success on the second attempt proves the user is a human, which means the failed first attempt was by a human indicating that previous captcha was difficult for a human to solve.

Over time this should result in captchas known to be difficult for humans to solve being downrated out of the pool.
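A minimal Python sketch of this second-attempt model follows. The `SecondAttemptDownrater` class, its session identifiers, the penalty and the pool floor are all hypothetical choices made for the example; the only rule taken from the description above is that a failure is counted against a captcha when the same user passes on the next attempt, and that reloads are never scored.

```python
from collections import defaultdict


class SecondAttemptDownrater:
    """Downrate a captcha only when a failure is followed by a success."""

    def __init__(self, penalty=1):
        self.penalty = penalty
        self.scores = defaultdict(int)  # captcha id -> rating (0 is neutral)
        self._last_failed = {}          # session id -> captcha id just failed

    def record(self, session_id, captcha_id, passed):
        previous = self._last_failed.pop(session_id, None)
        if passed:
            if previous is not None:
                # The success suggests a human, so the earlier failure was
                # probably a human struggling with that captcha.
                self.scores[previous] -= self.penalty
        else:
            # Remember only the most recent failure; reloads never reach here.
            self._last_failed[session_id] = captcha_id

    def active_pool(self, captcha_ids, floor=-3):
        """Return the captchas whose rating is still above the floor."""
        return [c for c in captcha_ids if self.scores[c] > floor]


rater = SecondAttemptDownrater()
rater.record("session-a", "img7", passed=False)
rater.record("session-a", "img9", passed=True)
print(rater.scores["img7"], rater.scores["img9"])  # -1 0
```

A bot that only ever fails, or that reloads repeatedly to harvest images, leaves every rating unchanged under this scheme.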

aalekh1993 wrote:

The method described there - when images are reloaded the images are downrated -
may not be a good model. Bots which are attempting to break through may reload
images continuously in order to capture your entire pool of images. They may
also reload several times sequentially in hopes of retrieving an image they
'know' before making a random guess.
It would be preferable to discover if a user trying to pass a captcha and
failing, succeeds on a second attempt. If the user succeeds on a second
attempt, the previous incorrect choice should be downrated. This model is
preferred because the success on the second attempt proves the user is a human,
which means the failed first attempt was by a human indicating that previous
captcha was difficult for a human to solve.
Over time this should result in captchas known to be difficult for humans to
solve being downrated out of the pool.

Thanks for these great points; they are noted and will be implemented in the project.

Now, two things I want to mention about this project to make the context of the proposal clearer:

1) The point about creating an array of categories for retrieving images from Wikidata is mentioned here: https://www.mediawiki.org/wiki/User:AalekhN/notes#Selecting.2FCustomizing_categories_for_images_to_be_displayed

2) My idea for selecting two unrelated categories from Wikidata is mentioned here: https://www.mediawiki.org/wiki/User:AalekhN/notes#How_to_make_these_categories_Unrelated

A request to all community members: please assign all of us (participants) a microtask for the project.

aalekh1993 wrote:

Two points to mention:

1) An edit has been made on how to make categories unrelated, which reads:

""In order to make categories unrelated we can make a super-set of the categories. For example, categories such as artists and astronauts can be grouped under the super-category of humans; similarly, an array can be made comprising unrelated categories such as “people”, “animals”, “machines”, etc. Moreover, this array of unrelated categories can be modified by the administrators of a wiki to add more categories according to their needs.""

2) A point was made by a mentor regarding user experience with the "odd one out" question. It was proposed to use 2 odd options out of the 8 given, but using two options can make it hard for users to determine the odd ones; hence I would like to propose using only one odd option out of the eight provided in the question. I request all members to give their feedback on this.
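The super-category idea in point 1 could be sketched in Python as below. The `SUPER_CATEGORIES` mapping and the `pick_unrelated_pair` helper are assumptions for illustration; the category names echo the examples in the quoted note, and a wiki administrator would be expected to edit the mapping.

```python
import random

# Hypothetical, admin-editable mapping: super-category -> member categories.
SUPER_CATEGORIES = {
    "people": ["artists", "astronauts"],
    "animals": ["birds", "cats"],
    "machines": ["cars", "locomotives"],
}


def pick_unrelated_pair(super_categories, rng=None):
    """Pick two categories that belong to different super-categories."""
    rng = rng or random.Random()
    # Sampling two distinct super-categories guarantees the pair is unrelated
    # as far as this mapping is concerned.
    super_a, super_b = rng.sample(sorted(super_categories), 2)
    return rng.choice(super_categories[super_a]), rng.choice(super_categories[super_b])


majority, odd = pick_unrelated_pair(SUPER_CATEGORIES, rng=random.Random(0))
print(majority, odd)
```

With one odd option out of eight, the first category of the pair would supply seven images and the second would supply the single odd image.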

In addition, URL obfuscation can be applied to make the image URL unretrievable from the page source; also, the use of Apertium described in the proposal can be replaced by translatewiki.net.

Thank you

Could anyone provide any evidence for the claims that:

  • ReCaptcha failure rate is 24%;
  • ReCaptcha is not user friendly;
  • ReCaptcha donates cpu cycles from users to Google and exposes users to Google's data collection processes without informed consent?

From what I can see, ReCaptcha is not less user friendly than any other textual captcha, and information about its purpose is visible when one clicks on the question mark visible with every ReCaptcha.

(In reply to Nikola Smolenski from comment #15)

From what I can see, ReCaptcha is not less user friendly than any other textual captcha,

http://emufarmers.com/recaptcha1.jpg
http://emufarmers.com/recaptcha2.jpg
http://emufarmers.com/recaptcha3.jpg

Google is fully aware that they're impossible, which is why if they don't think you're a bot they now just give you a house number to digitize instead.

http://emufarmers.com/fancycaptcha1.png is not ideal, but there's a chance a human might be able to solve it.

(In reply to Nikola Smolenski from comment #15)

Could anyone provide any evidence

http://cdn.ly.tl/publications/text-based-captcha-strengths-and-weaknesses.pdf
(brought to our attention by the author of fancycaptcha with http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/56387 )

(In reply to Emufarmers from comment #16)

Google is fully aware that they're impossible, which is why if they don't
think you're a bot they now just give you a house number to digitize instead.

And of course they don't use it for their own products.

wmf.amgine3691 wrote:

(In reply to Nikola Smolenski from comment #15)

Could anyone provide any evidence for the claims that:

  • ReCaptcha donates cpu cycles from users to Google and exposes users to

Google's data collection processes without informed consent?

https://www.google.com/recaptcha/whyrecaptcha

"It's Useful. Why waste the effort of your users? reCAPTCHA helps to _digitize books_." [link to http://www.google.com/recaptcha/learnmore]

(In reply to Emufarmers from comment #16)

(In reply to Nikola Smolenski from comment #15)

From what I can see, ReCaptcha is not less user friendly than any other textual captcha,

http://emufarmers.com/recaptcha1.jpg
http://emufarmers.com/recaptcha2.jpg
http://emufarmers.com/recaptcha3.jpg

Your examples show highly distorted text, and the vast majority of recaptchas are not so distorted. Even so, if these are more difficult to solve than other captchas, they are equally user-friendly. They only require users to enter the text that they see.

(In reply to Nemo from comment #17)

(In reply to Nikola Smolenski from comment #15)

Could anyone provide any evidence

http://cdn.ly.tl/publications/text-based-captcha-strengths-and-weaknesses.pdf

That link states that they had a 10-24% success rate on CNN's and Digg's captchas and that "Only Google and Recaptcha resisted to our attack attempts"; that is, against Recaptcha they had a success rate of 0%.

(In reply to Emufarmers from comment #16)

Google is fully aware that they're impossible, which is why if they don't
think you're a bot they now just give you a house number to digitize instead.

And of course they don't use it for their own products.

That is not true, I just tried to create a new e-mail account at GMail, and they use Recaptcha.

(In reply to Amgine from comment #18)

(In reply to Nikola Smolenski from comment #15)

Could anyone provide any evidence for the claims that:

  • ReCaptcha donates cpu cycles from users to Google and exposes users to

Google's data collection processes without informed consent?

https://www.google.com/recaptcha/whyrecaptcha
"It's Useful. Why waste the effort of your users? reCAPTCHA helps to
_digitize books_." [link to http://www.google.com/recaptcha/learnmore]

Yes, and that page is exactly the way that the users are informed about the way their CPU cycles are donated, so that they may form their consent.

Nikola, I'd like to reply to your comments and to clarify mine, but all this is off topic here. Please raise your doubts on [[mw:Talk:CAPTCHA]] and let's discuss there without hijacking this bug.

aalekh1993 wrote:

Just an update about the need for an image indexing system, as raised in a mail here:
http://lists.wikimedia.org/pipermail/wikitech-l/2014-April/075577.html

aalekh1993 wrote:

An important read about deciphering ReCaptcha with an algorithm created by Google: http://techcrunch.com/2014/04/16/googles-new-street-view-image-recognition-algorithm-can-beat-most-captchas/

wmf.amgine3691 wrote:

(In reply to Aalekh Nigam from comment #22)

An important read about deciphering ReCaptcha with an algorithm created by
Google:
http://techcrunch.com/2014/04/16/googles-new-street-view-image-recognition-algorithm-can-beat-most-captchas/

Another reason to use concept photos ("Select the $tag images") rather than decipher the script. It is more easily multilingual, and less susceptible to machine interpretation imo.

It does require humans to properly index images, which can also be integrated into the captcha as a self-improving algorithm.

Qgil added a comment. Sep 12 2014, 9:11 AM

I have removed this project from https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects because of lack of consensus. If/when there is a project clearly formulated that has community support, then we can post in this page.

aalekh1993 wrote:

Much needed step. In future, please also make sure to remove any project that does not have community support, as it takes a good amount of effort and dedication on the student's part to research the project, and last-minute discussion about feasibility affects the morale of the student... Although I still believe the project is awesome :)

Nemo_bis set Security to None. Dec 11 2014, 8:52 AM
Nemo_bis removed a subscriber: Nemo_bis.
Jdlrobson moved this task from Needs triage to Triaged on the Mobile board. Jan 21 2015, 12:10 AM
Nemo_bis updated the task description. Mar 14 2017, 3:53 PM
Qgil removed a subscriber: Qgil. Mar 15 2017, 5:32 PM