Automatically detect spambot registration using machine learning (like invisible reCAPTCHA)
Open, Needs Triage, Public

Description

NOTE: details of this task are tracked in the AICaptcha project.

Wikimedia's captchas are fundamentally broken: they keep users away but allow robots in. While they can filter out the most stupid spambots, they are easily breakable with off-the-shelf tools. (T141490) At the same time, they take significant effort and often multiple tries for a human to solve (research), and are especially bad for people with visual impairments (T6845) and those who don't speak English or don't even use Latin script (T7309). Our captcha stats (T152219) show a failure rate of around 30% (and that does not count users who don't even submit the form; there is about one captcha submission per hundred captcha displays, but we don't know to what extent that's crawlers/spambots).

AI could help to build something like reCAPTCHA (that does not violate our privacy policy): a two-tier system where users are given a trivial test (click the button - could even be integrated into clicking the usual button), the system collects as much information (timing, mouse movements, browser details etc) as possible and makes a judgement; suspicious users are given a harder test (which could just be a regular captcha, but if we can generate questions based on image recognition or other hard-for-robots-easy-for-humans tasks, even better). Maybe even make the first test invisible, like Google does with invisible reCAPTCHA (where the easy test is basically just clicking the registration button).
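A minimal sketch of the two-tier flow described above, in Python. Everything here is an assumption made for illustration: the `SignupSession` fields, the feature names, the hand-picked weights and the 0.5 threshold are hypothetical, and `suspicion_score` only stands in for whatever trained model the project would actually use.

```python
from dataclasses import dataclass
from math import hypot
from typing import Dict, List, Tuple

# Hypothetical sample format: (timestamp_seconds, x, y) per recorded mouse position.
MouseSample = Tuple[float, float, float]


@dataclass
class SignupSession:
    form_shown_at: float              # when the signup form was rendered
    submitted_at: float               # when the registration button was clicked
    mouse_samples: List[MouseSample]  # mouse positions recorded on the form
    key_events: int                   # number of key presses seen on the form


def extract_features(session: SignupSession) -> Dict[str, float]:
    """Turn the raw signals (timing, mouse movement, typing) into numbers."""
    samples = session.mouse_samples
    # Total distance travelled by the pointer between consecutive samples.
    path = sum(
        hypot(x2 - x1, y2 - y1)
        for (_, x1, y1), (_, x2, y2) in zip(samples, samples[1:])
    )
    return {
        "fill_seconds": session.submitted_at - session.form_shown_at,
        "mouse_path_px": path,
        "mouse_samples": len(samples),
        "key_events": session.key_events,
    }


def suspicion_score(features: Dict[str, float]) -> float:
    """Placeholder for a trained classifier; the weights are guesses, not learned."""
    score = 0.0
    if features["fill_seconds"] < 2.0:   # form submitted implausibly fast
        score += 0.4
    if features["mouse_samples"] == 0:   # no pointer activity at all
        score += 0.3
    if features["key_events"] == 0:      # nothing was typed
        score += 0.3
    return score


def choose_challenge(session: SignupSession, threshold: float = 0.5) -> str:
    """First tier is invisible; only suspicious sessions get the harder test."""
    score = suspicion_score(extract_features(session))
    return "hard_captcha" if score >= threshold else "none"
```

In this sketch a session with ordinary typing and mouse activity passes without seeing any test, while a session that submits the form almost instantly with no input events is sent to the regular captcha.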

See also

Related Objects

Mentioned In
T204615: Generate new Captcha word list for prod
T178463: Automatically detect spambot registration using machine learning like invisible reCAPTCHA (Vinitha V S)
T181952: Requesting access to EventLogging data for Vinitha
AICaptcha
T179635: Allow captchas to be stacked
T174528: Tool for displaying a user contribution summary
T177034: Outreachy microtask: write a CAPTCHA plugin that can fall back to another algorithm
T177033: Outreachy microtask: analyze sample mouse movement data and extract feature vectors
T158296: Translation outreach: User guides on MediaWiki.org
T174874: Create a standalone Wikimedia CAPTCHA service
Z610: Outreachy T158909 (AI CAPTCHA) questions
T175331: Outreachy microtask: collect captcha data from signup page (#2)
T175330: Outreachy microtask: collect captcha data from signup page (#1)
T141490: Deploy improved FancyCaptcha
Mentioned Here
T178814: Proposal: Automatically detect spambot registration using machine learning (like invisible reCAPTCHA)
T178697: Proposal: Automatically detect spambot registration using machine learning (like invisible reCAPTCHA)
T178565: Proposal: Automatically detect spambot registration using machine learning (like invisible reCAPTCHA)
T178463: Automatically detect spambot registration using machine learning like invisible reCAPTCHA (Vinitha V S)
T177034: Outreachy microtask: write a CAPTCHA plugin that can fall back to another algorithm
T177033: Outreachy microtask: analyze sample mouse movement data and extract feature vectors
Z610: Outreachy T158909 (AI CAPTCHA) questions
T100706: Revamp anti-spamming strategies and improve UX
T175330: Outreachy microtask: collect captcha data from signup page (#1)
T175331: Outreachy microtask: collect captcha data from signup page (#2)
T174528: Tool for displaying a user contribution summary
T86460: testwiki provides a vector to create production accounts (valid on all of CentralAuth) without a CAPTCHA
T6845: CAPTCHA doesn't work for blind people
T7309: Localize captcha images
T141490: Deploy improved FancyCaptcha
T152219: Statistics on Captcha success/failure rate

Hi, looked through all Outreachy projects, found this one interesting.

Will start with microtasks in a day or two. Same with the application.

I haven't worked on any open source projects before, and this is going to be my first one if selected. Will the lack of earlier open source contributions be a drawback?

Change 383714 had a related patch set uploaded (by Sam0410; owner: Sam0410):
[mediawiki/core@master] [DO NOT MERGE] Outreachy Task T158909

https://gerrit.wikimedia.org/r/383714

srishakatux added a comment. Edited Oct 12 2017, 12:32 AM

@tkasarla Hello! And thanks for your interest in participating in Outreachy. Remember that you have less than two weeks before the application deadline, and you may not have enough time to go through all the application steps https://www.mediawiki.org/wiki/Outreachy/Participants#Application_process_steps. If you want to be ambitious and give it a shot, great! If not, and you are interested in contributing to Wikimedia projects, please see https://www.mediawiki.org/wiki/New_Developers. Also, not having earlier open source contributions does not matter :)

Tgr updated the task description. (Show Details)Oct 15 2017, 12:52 AM
Tgr added a comment. Oct 15 2017, 8:32 AM

Hi all!

If you are already working on a microtask/application: the deadline is October 23, and you have to finalize your application by then, and finish at least one microtask, but you can continue working on the other microtasks afterwards (probably for one week or so). Keep that in mind when prioritizing what to work on.

If you haven't started yet and are still in the process of looking for a project: this one is a bit crowded already. In the list of projects some have the comment "Needs More Applicants"; you'll probably have an easier time with those.

Tgr updated the task description. (Show Details)Oct 16 2017, 3:59 AM
Tgr updated the task description. (Show Details)
SAM0410 removed a subscriber: SAM0410.Oct 17 2017, 12:42 PM

Hi,
I am Vinitha. I heard about the Outreachy program recently. I am excited to get involved with the MediaWiki project (spambot registration detection). I have started working on microtasks and will post updates soon :)

@Groovier Hello and thanks for your interest! Check @Tgr's comment

If you haven't started yet and are still in the process of looking for a project: this one is a bit crowded already. In the list of projects some have the comment "Needs More Applicants"; you'll probably have an easier time with those.

If you would like to contribute to Wikimedia projects, check our New Developers guide https://www.mediawiki.org/wiki/New_Developers.

Change 384753 had a related patch set uploaded (by Groovier1; owner: Groovier1):
[mediawiki/core@master] T158909 Automatically detect spambot registration using machine learning: Tracking mouse click position on the create your account button

https://gerrit.wikimedia.org/r/384753

I have raised a code review at https://gerrit.wikimedia.org/r/384753 . Please take a look

Tgr updated the task description. (Show Details)Oct 17 2017, 9:44 PM
Tgr added a subscriber: SAM0410.
Tgr updated the task description. (Show Details)Oct 18 2017, 2:21 AM

Great job getting started on fixing this!!!

Hi @Tgr,

Thanks for the comments. I have made the changes and raised a fresh code review. Kindly review.

Groovier added a comment. Edited Oct 18 2017, 5:54 AM

@UpsandDowns1234 Thank you :)

@Tgr I have incorporated the changes in the review. The lint tests are failing on my current code. I will push the corrected one immediately. Sorry for the trouble.

Tgr updated the task description. (Show Details)Oct 18 2017, 6:10 AM
Tgr updated the task description. (Show Details)Oct 18 2017, 9:10 AM
Tgr updated the task description. (Show Details)Oct 18 2017, 9:13 AM
Groovier added a comment. Edited Oct 18 2017, 10:33 AM

Hi @srishakatux, @Tgr

It took me quite some time to set up Vagrant and solve the proxy issues, hence the delay in doing the tasks. Now that I have set up the basic requirements and have loved this work, I am keen to continue working on this. I have worked on the first task and I understand that I have to work on more tasks to gain more insight into this project.

  1. I think my next step would be to focus on solving other tasks and, in parallel, read and research more about the issue and solutions. Is there anything else I should be taking care of?
  2. I have filled in the application form (only the eligibility part) so that I can confirm my eligibility. I am not sure how to complete the detailed questions in the application form as of now. Is it OK for me to fill those in later?

Thank you

Tgr updated the task description. (Show Details)Oct 18 2017, 8:19 PM
Tgr updated the task description. (Show Details)Oct 18 2017, 8:26 PM
SAM0410 removed a subscriber: SAM0410.Oct 18 2017, 9:01 PM
Tgr updated the task description. (Show Details)Oct 19 2017, 2:18 AM
Tgr updated the task description. (Show Details)
Tgr added a subscriber: SAM0410.
Tgr removed a subscriber: SAM0410.
Tgr added a comment. Oct 19 2017, 8:30 AM

Hi all, a reminder that you have to finish your application / project proposal, and publish the non-eligibility-related part of it as a Phabricator task, by the application deadline (Oct 23). See application step #9.

We'll mostly look at the Phabricator version of your proposals (the outreachy.org forms are a pain to read, have no rich text or change tracking) so make sure everything you consider important is present there. (Except for the eligibility-related information; that's only needed in the outreachy.org form.)

You can work on microtasks a little longer if you want (I'd guess until end of October but I don't know the exact date).

Tgr added a comment. Oct 19 2017, 8:31 AM

@Groovier see above, you should focus on the application for now.

Tgr updated the task description. (Show Details)Oct 19 2017, 6:47 PM
Tgr updated the task description. (Show Details)Oct 21 2017, 7:27 AM

Hello @Tgr , I just had a quick question!
Should the application proposal as Phabricator task be like application template mentioned in application step #9 or the Outreachy proposal application template non-eligibility part?
Because the questions are contrasting in some aspects.

Tgr added a comment. Oct 21 2017, 11:52 PM

@Nehagup when in doubt go with the Wikimedia application form; it was made specifically for this purpose. The questions there seem to me pretty similar to the ones in the Outreachy form, though.

Tgr updated the task description. (Show Details)Oct 22 2017, 12:02 AM
Tgr updated the task description. (Show Details)Oct 23 2017, 7:27 PM
Tgr updated the task description. (Show Details)Oct 23 2017, 7:41 PM
Tgr updated the task description. (Show Details)Oct 27 2017, 12:43 AM
Tgr updated the task description. (Show Details)Oct 28 2017, 10:03 AM
Tgr updated the task description. (Show Details)Oct 29 2017, 8:05 PM
Tgr moved this task from Next to Pending on the User-Tgr board.Oct 31 2017, 12:01 AM
Tgr added a comment. Nov 2 2017, 10:07 AM

Thanks all for contributing! The selection process has ended; the results will be published on Nov 9. If you would like to continue working on any open tasks, or contribute code in some other way, you are welcome to do so and I will provide code review if time permits, but it will not influence the selection.

If you *don't* want to finish a gerrit patch, please use the Abandon button so it's not marked as needing review anymore.

Thank you so much @Tgr and @awight. This means a lot to me.. :) Looking forward to working closely with you all.

Change 384753 abandoned by Gergő Tisza:
[DO NOT MERGE] Outreachy micotask T158909

Reason:
Abandoning all Outreachy microtask related changesets; the application period is over. For contributing outside Outreachy, see https://www.mediawiki.org/wiki/New_Developers and https://www.mediawiki.org/wiki/How_to_become_a_MediaWiki_hacker .

https://gerrit.wikimedia.org/r/384753

Tgr updated the task description. (Show Details)EditedNov 11 2017, 8:23 PM

Archiving information related to the Outreachy application period into a comment.

Outreachy information

Skills needed: basic PHP/JS (for collecting data / integrating with the machine learning system), Python, machine learning
Mentors: @Tgr, @awight
Microtasks:

Please see these portals for more about how to apply to work on a MediaWiki project through Outreachy:
https://www.mediawiki.org/wiki/Outreachy/Round_15
https://www.mediawiki.org/wiki/Outreachy/Participants

user; eligibility; task1; task2; task3; task4; CI whitelist; proposal
@Groovier: link; c384753; c385845; github in progress; c387080 in progress; added; T178463
@Kamsuri5: link; c377044 in progress; github in progress; T178814
@Nehagup: link; c379990; c381787; github in progress; c382974 in progress; added; T178565
@Sagorika1996: link; github in progress
SAM0410: link; c382842 in progress; c383714 in progress; github in progress; in progress
@Smarita: link; c380466; c383765; github in progress; c383299 in progress; added; T178697
@Sofmonk: link; c382155 in progress; c382717 in progress; github in progress
@Veenasankar: c377031 in progress
Tgr added a comment. Nov 11 2017, 8:24 PM

Congrats @Groovier on being accepted and thanks everyone else for participating!

Tgr updated the task description. (Show Details)Nov 11 2017, 8:27 PM

Change 383714 abandoned by Gergő Tisza:
[DO NOT MERGE] Outreachy Task T158909

https://gerrit.wikimedia.org/r/383714

Tgr edited projects, added AICaptcha; removed Patch-For-Review.
Tgr updated the task description. (Show Details)Jan 1 2018, 9:20 PM
Krinkle added a subscriber: Krinkle.

Mentioned in SAL (#wikimedia-cloud) [2018-01-11T17:50:28Z] <tgr> added Groovier1 to project members for T158909

ToBeFree added a subscriber: ToBeFree.
Tgr updated the task description. (Show Details)Apr 1 2018, 2:03 PM
Tgr updated the task description. (Show Details)Apr 1 2018, 2:08 PM
Tgr added a comment. Apr 1 2018, 2:10 PM

The Outreachy project has concluded (a presentation of the results is available). The original goal was not doable in three months (as it turns out, most spambots do not even try to emulate the keyboard / mouse, and the remaining ones are too few to produce enough data in a couple of weeks), but we learned a couple of useful things about spambots and have some longer-term plans on how to address them. This task will live on as a volunteer project. Thanks Vinitha for taking it so far!

Good to hear the results. Congratulations Vinitha on the research done.

Lofhi added a subscriber: Lofhi.Apr 3 2018, 2:51 PM

@Tgr I'm guessing this task should not still live under Outreach-Programs-Projects? I am boldly removing the tag as we are cleaning up this workboard and planning on killing Possible-Tech-Projects.

ToBeFree rescinded a token.Jul 7 2018, 4:21 AM
ToBeFree awarded a token.