Review and Deploy Wikicaptcha
Open, Public

Description

Author: sumanah

Description:
Idea: Write a version of reCAPTCHA (for use by ConfirmEdit) that uses document images that have been processed by MediaWiki's ProofreadPage extension for WikiSource. In other words, a CAPTCHA that feeds data to ProofreadPage to augment its OCR processing. Some existing code to build on: http://lists.wikimedia.org/pipermail/wikitech-l/2011-November/thread.html#56121 (Neil Harris & ConfirmEdit)
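
A minimal sketch of how such a CAPTCHA could feed readings back, assuming the reCAPTCHA-style pairing of a known control word with an unknown word. The names `record_answer` and `accepted_reading` and the thresholds are hypothetical; nothing here is taken from ConfirmEdit or ProofreadPage.

```python
from collections import Counter

def record_answer(votes, word_id, control_ok, answer):
    """Count an answer for an unknown word only if the control word was solved."""
    if control_ok:
        votes.setdefault(word_id, Counter())[answer.strip()] += 1

def accepted_reading(votes, word_id, min_votes=3, min_share=0.6):
    """Return the majority reading once enough users agree, else None.

    min_votes and min_share are illustrative thresholds, not tuned values.
    """
    counter = votes.get(word_id)
    if not counter:
        return None
    total = sum(counter.values())
    reading, count = counter.most_common(1)[0]
    if total >= min_votes and count / total >= min_share:
        return reading
    return None

votes = {}
for control_ok, answer in [(True, "Londra"), (True, "Londra"),
                           (True, "Londru"), (False, "spam"), (True, "Londra")]:
    record_answer(votes, "w1", control_ok, answer)
print(accepted_reading(votes, "w1"))
```

Only accepted readings would be candidates for correcting the OCR text; how they would reach the Page: namespace is left open here.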


Version: unspecified
Severity: enhancement
URL: https://wikimania2012.wikimedia.org/wiki/Submissions/Wikicaptcha:_a_ReCAPTCHA-like_solution_for_Wikisource
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=62960

bzimport added a subscriber: wikibugs-l.
bzimport set Reference to bz32695.
bzimport created this task. Via Legacy, Nov 28 2011, 10:28 PM
Nemo_bis added a comment. Via Conduit, Nov 29 2011, 6:37 PM

This has been discussed a few times and a proof of concept was produced: http://lists.wikimedia.org/pipermail/wikisource-l/2011-February/000939.html
If I remember correctly, starting from a properly mapped DjVu it's not so difficult to identify the words which need to be checked, extract the corresponding (portion of) image and put the new text back in the DjVu.
It's way less obvious how to translate the activity on a Page: to the corresponding DjVu page and vice versa.

bzimport added a comment. Via Conduit, Aug 30 2012, 6:31 PM

sumanah wrote:

Alex, is wikicaptcha, in its current form, ready for a deployment review? Or is it still in an experimental/prototype phase? It would probably be good to clarify that in the README at https://github.com/CristianCantoro/wikicaptcha .

Am cc'ing Andrea Zanni (Aubrey).

Thanks for working on this!

bzimport added a comment. Via Conduit, Sep 17 2012, 3:13 AM

sumanah wrote:

Alex, it looks like WikiCAPTCHA awaits a design review https://www.mediawiki.org/wiki/WMF_Project_Design_Review_Process before we can move forward with deploying it on Wikimedia sites. Just wanted to let you know. Thanks.

Qgil added a comment. Via Conduit, Mar 25 2013, 12:49 AM

This is a very nice idea! What is the status? Would a Google Summer of Code project help getting a MediaWiki extension running and polished, ready to be used in any MediaWiki enabled site?

Another question would be whether this extension is put in use in Wikimedia sites.

If the idea makes sense and there is at least one mentor available I would like to push it as a candidate to

http://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects

and move it to https://www.mediawiki.org/wiki/Summer_of_Code_2013#Project_ideas

Bawolff added a comment. Via Conduit, Apr 1 2013, 9:09 PM

(In reply to comment #3)

> Alex, it looks like WikiCAPTCHA awaits a design review
> https://www.mediawiki.org/wiki/WMF_Project_Design_Review_Process before we
> can move forward with deploying it on Wikimedia sites. Just wanted to let you
> know. Thanks.

The code looks to be an early prototype. I only did a five minute read through but it looks to be a proof of concept, not a feature complete implementation.

Open questions about this whole idea:
* How would data propagate back to Wikisource?
* Is this even effective as a captcha?
** The dataset used to generate the images is publicly available. It is unclear whether the dataset is large enough to stop someone from simply downloading the entire thing.
** An attacker could add entries to the dataset. I'm not sure how exploitable that is, but it is concerning.
** It's unclear this will actually prevent spam. Computers do not get bored. Even with 1% getting through, it would not be effective. This is using texts that OCR software marked as low confidence, which sounds significantly weaker than what reCAPTCHA does according to Wikipedia, and I've heard rumours that reCAPTCHA is not entirely effective. (Not sure if this is true.)

Nikola_Smolenski added a comment. Via Conduit, Jun 23 2013, 11:38 AM

smolensk wrote:

To answer the open questions:

(In reply to comment #5)

> *how would data propagate back to wikisource.

I don't see that it is practically possible to propagate data back to Wikisource. Rather, this would be used to perform initial OCR for Wikisource, perhaps primarily for works where machine-based OCR would be ineffective.

> *is this even effective as a captcha

I don't see that it would be any less effective than the current captcha.

> **the dataset used to generate the images are publicly available. It is
> unclear that the dataset is large enough that someone downloading the entire
> thing wouldn't happen.

The actual dataset used on Wikipedia doesn't need to be publicly available.

> **an attacker could add entries to the dataset. I'm not sure how exploitable
> that is, but it's something that is concerning

I don't see how an attacker could add entries to the dataset. The actual dataset used on Wikipedia would probably be tightly controlled.

> **it's unclear this will actually prevent spam. Computers do not get bored.
> Even
> with 1% getting through, it would not be effective. This is using texts that

I don't see that it would be any less effective than the current captcha.

Nemo_bis added a comment. Via Conduit, Jun 24 2013, 11:32 AM

(In reply to comment #6)

> > **its unclear this will actually prevent spam. Computers do not get bored.
> > Even
> > with 1% getting through, it would not be effective. This is using texts that
>
> I don't see that it would be any less effective than the current captcha.

Anything less than the current 25 % failure would be an improvement, though over 1 % a captcha is considered broken (according to the paper on [[mw:CAPTCHA]]).
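
To make the two figures concrete, here is a small arithmetic sketch. It reads "failure" as the share of automated attempts that get past the captcha; the attempt count of 10,000 is an arbitrary illustration, not a measured value.

```python
def expected_successes(attempts, pass_rate):
    """Expected number of automated attempts that get through."""
    return round(attempts * pass_rate)

def attempts_per_success(pass_rate):
    """Average number of tries a bot needs for one success."""
    return 1 / pass_rate

# 0.25 is the current rate quoted above; 0.01 is the "broken" threshold.
for rate in (0.25, 0.01):
    print(rate, expected_successes(10_000, rate), round(attempts_per_success(rate)))
```

Since a bot can retry at essentially no cost, even the lower rate still lets a steady stream of attempts through, which is the point made in comment 5.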

Jaredzimmerman-WMF added a comment. Via Conduit, Jun 27 2013, 9:34 PM

This is a low-priority roadmap feature; the Product and Design teams would welcome community support.

Please contact me for design review when a prototype is ready for review by the UX team.

Alex_brollo added a comment. Via Conduit, Jun 27 2013, 10:33 PM

alex.brollo wrote:

I'm exploring a new and IMHO interesting path: to ignore the DjVu text layer and to parse the abbyy.xml file instead (both to extract the bare text layer and some interesting parameters). This file (really heavy and discouraging at first glance) is published by Internet Archive in its file download area.

The interesting thing is that this heavy file contains both the coordinates of words and an interesting 'wordPenalty' parameter, something like an "uncertainty score" for the whole word; there is also a character-by-character "certainty score".
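
A rough sketch of that extraction, for illustration only. The sample below is a hypothetical, heavily simplified stand-in for abbyy.xml: real files are namespaced and far larger, and the attribute names other than `wordPenalty` (`l`, `t`, `r`, `b`, `wordStart`, `charConfidence`) are assumptions about the schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature sample; not copied from a real abbyy.xml file.
SAMPLE = """<page>
  <line>
    <charParams l="10" t="5" r="18" b="20" wordStart="true" wordPenalty="0" charConfidence="99">d</charParams>
    <charParams l="19" t="4" r="22" b="20" wordStart="false" wordPenalty="0" charConfidence="97">i</charParams>
    <charParams l="30" t="3" r="40" b="20" wordStart="true" wordPenalty="12" charConfidence="60">L</charParams>
    <charParams l="41" t="8" r="49" b="21" wordStart="false" wordPenalty="12" charConfidence="35">o</charParams>
  </line>
</page>"""

def extract_words(xml_text):
    """Group per-character entries into words with box, penalty and lowest confidence."""
    words = []
    current = None
    for char in ET.fromstring(xml_text).iter("charParams"):
        text = (char.text or "").strip()
        if not text:
            current = None  # a whitespace entry ends the current word
            continue
        l, t, r, b = (int(char.get(k)) for k in ("l", "t", "r", "b"))
        conf = int(char.get("charConfidence", "0"))
        if current is None or char.get("wordStart") == "true":
            current = {"text": "", "box": (l, t, r, b),
                       "penalty": int(char.get("wordPenalty", "0")),
                       "min_conf": conf}
            words.append(current)
        # Grow the word's bounding box to cover this character.
        cl, ct, cr, cb = current["box"]
        current["box"] = (min(cl, l), min(ct, t), max(cr, r), max(cb, b))
        current["text"] += text
        current["min_conf"] = min(current["min_conf"], conf)
    return words

# Words with a non-zero penalty are the ones worth a human look.
uncertain = [w for w in extract_words(SAMPLE) if w["penalty"] > 0]
print(uncertain)
```

The bounding box is what would allow the matching image region to be cut out of the scan.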

I'm sharing scripts with http://www.mediawiki.org/wiki/User:Rtdwivedi, who is MUCH more skilled than me, since the idea is to upload the text layer from the abbyy.xml file and to wrap uncertain words in a span tag, making them easy to fix with VisualEditor. A test output of the extraction scripts can be seen on any page of http://it.wikisource.org/wiki/Indice:Ricordi_di_Londra.djvu, where words with a wordPenalty > 0 are red; unfortunately VisualEditor doesn't currently run on Wikisource, but you can test the resulting code with VisualEditor in a Wikipedia sandbox.
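
The span wrapping could look roughly like this. It is a sketch, not the actual script used on it.wikisource: the function name and the CSS class `ocr-uncertain` are hypothetical, and the input is assumed to be (word, wordPenalty) pairs already extracted from abbyy.xml.

```python
import html

def wrap_uncertain(words, css_class="ocr-uncertain"):
    """Join words into a line of text, wrapping those with wordPenalty > 0 in a span."""
    parts = []
    for text, penalty in words:
        escaped = html.escape(text)  # keep stray < or & from breaking the markup
        if penalty > 0:
            parts.append(f'<span class="{css_class}">{escaped}</span>')
        else:
            parts.append(escaped)
    return " ".join(parts)

print(wrap_uncertain([("Ricordi", 0), ("di", 0), ("Londra", 7)]))
# prints: Ricordi di <span class="ocr-uncertain">Londra</span>
```

Styling that class in red would reproduce the effect visible on the test pages.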

I presume that similar scripts, using abbyy.xml files, could extract lists of uncertain words and their images from the abbyy.xml file and the related scans, and feed a CAPTCHA engine.

My suggestion is to ask Rtdwivedi for comments; personally I feel curious, bold and sometimes lucky, but very far from being a "programmer".

bzimport added a comment. Via Conduit, Jul 7 2013, 6:07 AM

ellydwivedi2093 wrote:

Hi everyone,

As Alessandro said, the words chosen for the CAPTCHA from the DjVu layer should be chosen on the basis of their confidence level. The confidence level of words would be decided by the ProofreadPage extension itself. Words with a high penalty would be used for the CAPTCHA. I would also suggest not using the words in their complete form, but mixing two high-penalty words together. Presently, the ProofreadPage extension doesn't have the facilities to do so. The spell checker (which would use the word penalty) would be implemented after the integration with VisualEditor has been done.
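
One possible reading of this suggestion, sketched under assumptions: words arrive as (word, penalty) pairs, a hypothetical threshold `min_penalty` decides what counts as "high", and each challenge shows two such words together. ProofreadPage offers no such API today, as noted above.

```python
import random

def build_challenges(words, min_penalty=10, seed=None):
    """Pair up high-penalty words so each challenge shows two of them."""
    rng = random.Random(seed)
    candidates = [word for word, penalty in words if penalty >= min_penalty]
    rng.shuffle(candidates)
    # Walk the shuffled list two at a time; an odd word out is left unused.
    return [(candidates[i], candidates[i + 1])
            for i in range(0, len(candidates) - 1, 2)]

sample = [("Londra", 14), ("di", 0), ("Ricordi", 11), ("strada", 25),
          ("nebbia", 10), ("il", 0), ("Tamigi", 30)]
print(build_challenges(sample, min_penalty=10, seed=1))
```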

greg added a comment. Via Conduit, Aug 29 2013, 6:29 PM

Hello, this is a quasi-automated-but-not-really message:

I am reviewing all tracking bugs for extensions to review and deploy to WMF servers. See the list here:
https://bugzilla.wikimedia.org/showdependencytree.cgi?id=31235&hide_resolved=1

The [[mw:Review queue]] page lists the steps necessary to complete the review. I have copied them below and done some initial filling out based on what I can easily glean from this bug and any obvious linked sources. If I miss something or state something false, please do correct me.

Also, if you haven't yet done so, please review the information on and linked to from:
https://www.mediawiki.org/wiki/Writing_an_extension_for_deployment

TODO/Check list

Extension page on mediawiki.org: no?
Bugzilla component: no?
Extension in Gerrit: in github, please transfer to gerrit
Design Review: not yet done (see comment 8)
Architecture/Performance Review: some
Security Review: no?
Screencast (if applicable): no
Community support: seems to be beginning (at least in some of the tech community)

Other than the obvious things above that are 'no's, what else can I/WMF help with here to move it along?

Jaredzimmerman-WMF added a comment. Via Conduit, Aug 30 2013, 11:56 PM

Is there a "working" prototype somewhere, where the functionality can be tested (without setting up a development environment), that Design can evaluate?

Bawolff added a comment. Via Conduit, Aug 31 2013, 5:35 AM

(In reply to comment #12)

> Is there a "working" prototype that the functionality can be testing
> somewhere
> (without setting up a development environment) that Design can evaluate.

I'm not exactly sure why a design review would be needed at this stage. The design is probably going to look very much like what the current captcha looks like, since the proposal is mostly to replace the backend, not the front end.

/me still thinks my questions in comment 5 aren't sufficiently answered. I'd like answers to the tune of "we know this will be a good idea because of X", not "we think we couldn't possibly do worse than the current system, because the current system sucks so much" (which I wouldn't bet on). Heck, I'd even settle for a concrete description (something that could actually be evaluated) of what folks working on this even plan to do.

Nemo_bis added a comment. Via Conduit, Aug 31 2013, 11:05 AM

I'm not sure this was clear enough, but there isn't any developer working on this project. The contributions Cristian and Alex can make are what they already did and mentioned: making a proof of concept and investigating specifications for interaction with Wikisource, DjVu and so on.

bzimport added a comment. Via Conduit, Dec 1 2013, 3:46 PM

vladjohn2013 wrote:

Hi, this project is still listed at https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#Multilingual.2C_usable_and_effective_captchas

Should this project still be listed on that page? If not, please remove it. If it still makes sense, then it could be moved to the "Featured projects" section if it has community support and mentors.

bzimport added a comment. Via Conduit, Feb 17 2014, 6:02 PM

aalekh1993 wrote:

Hello,
I have been a frequent contributor to MediaWiki, and as part of that I am looking for a project for the upcoming Google Summer of Code 2014. Building on bug 32695 and the prototype developed by Pginer, my idea for the project is:

"Develop a captcha service based on Wikimedia Commons images. The service will pose a custom question built from two random keywords. Once random keyword 1 is selected, a question will be generated along with images for it fetched from the Commons database; a few images fetched for keyword 2, unrelated to the question, will be shown alongside them. We can also make use of image annotations, as mentioned here (https://commons.wikimedia.org/wiki/File:Vitraux_de_la_basilique_Notre-Dame,_Genève_23.jpg)."

I therefore ask you all to comment on my idea, as I am really interested in working on this challenging but amazing project :).

Eagerly waiting for your reply.

Aklapper added a comment. Via Conduit, Feb 17 2014, 6:57 PM

(In reply to Aalekh Nigam from comment #17)

> I therefore request you all to place comment into my idea regarding the
> project as i am really interested to work for this challenging but amazing
> project :) .

This should probably go on a wiki page where you explain your idea and where people can comment. Bugzilla might not be the best place for a lengthy discussion. Feel free to paste a link here as a comment.

Nemo_bis added a comment. Via Conduit, Feb 17 2014, 7:26 PM

Also, Aalekh, this bug is about Wikisource (scanned books) images; a CAPTCHA based on Commons images would need a separate Bugzilla report.

bzimport added a comment. Via Conduit, Feb 17 2014, 7:40 PM

aalekh1993 wrote:

Actually, this was a simple idea for one way to handle the project. Since Commons is part of the wiki family, my idea is that it might act as a database for the various captcha options mentioned by pginer in http://pauginer.tumblr.com/post/33445896205/captcha-ideas

Qgil added a comment. Via Conduit, Mar 13 2014, 2:41 PM

Aalekh, your proposal is still missing in Google Melange. Please submit it there as a draft linking to your wiki page. In any case, we will evaluate your proposal in mediawiki.org. Thank you!

bzimport added a comment. Via Conduit, Mar 20 2014, 3:10 PM

aalekh1993 wrote:

Over the past few months there has been active development of "Multilingual, usable and effective captchas" for GSoC 2014. But currently it seems that there is no primary technical mentor for the project. I therefore ask all members to please consider becoming part of this project as primary technical mentor.

Qgil added a comment. Via Conduit, Mar 22 2014, 6:13 PM

Let's move the GSoC 2014 discussion to

Bug 62960 - Prototype CAPTCHA optimized for multilingual and mobile

Nemo_bis awarded a token. Via Web, Dec 12 2014, 8:26 AM
Ricordisamoa awarded a token.
