
Deploy improved FancyCaptcha
Open, Needs Triage, Public

Description

In 2014, I investigated FancyCaptcha's resistance to OCR. I found that it had essentially no resistance, that it could be trivially broken by open source software without image preprocessing or OCR engine configuration.

In these two changes, I implemented changes which were confirmed to defeat such naïve OCR attacks. Specifically, I tweaked the tunable parameters to improve distortion of the baseline, and added low-spatial-frequency noise and a gradient to defeat thresholding.
These changes were never deployed to WMF. I propose now doing so.
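
The anti-thresholding idea can be sketched roughly as follows. This is an illustrative NumPy sketch, not the actual captcha.py change; the function name and parameter values are made up for the example:

```python
import numpy as np

def add_gradient_and_noise(grey: np.ndarray, seed: int = 0) -> np.ndarray:
    """Overlay a left-to-right grey gradient plus low-spatial-frequency
    noise on a greyscale captcha image (2-D uint8 array, 0 = ink)."""
    rng = np.random.default_rng(seed)
    h, w = grey.shape
    # A linear gradient spanning half the grey range defeats any single
    # global threshold: a cut-off that works on the left fails on the right.
    gradient = np.tile(np.linspace(0, 128, w), (h, 1))
    # Low-spatial-frequency noise: a coarse random grid, upsampled
    # block-wise so the noise varies slowly across the image.
    cell = 16
    small = rng.uniform(-40.0, 40.0, (h // cell + 1, w // cell + 1))
    noise = np.repeat(np.repeat(small, cell, axis=0), cell, axis=1)[:h, :w]
    out = grey.astype(np.float64) + gradient + noise
    return np.clip(out, 0, 255).astype(np.uint8)
```

The point is that after this overlay, no single threshold separates ink from background across the whole image.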

Here is some representative output:

[Old vs. new captcha image samples]

The procedure to regenerate the captcha image set is documented at https://wikitech.wikimedia.org/wiki/Generating_CAPTCHAs

Event Timeline


Could we add some background distortion as well, in addition to that grey "rainy" background, such as lines crossing the words, etc.?

I wonder if we should escalate this task and restrict its visibility? (not sure if we want spammers to know which are our next steps) I'm not sure, hence just asking.

It would be cool if we could plan to measure various metrics when we roll this out (Does it have an effect on how many new users sign up/complete edits? Does it actually reduce spam?). Not sure if we already have systems in place to measure that sort of thing.

Peachey88 added a subscriber: Reedy. Aug 1 2016, 7:35 AM

I wonder if we should escalate this task and restrict its visibility? (not sure if we want spammers to know which are our next steps) I'm not sure, hence just asking.

I think we should roll out what we have now, since it's public anyway. Has it been scheduled? Can the normal SWAT team run the needed scripts, or do we need Operations/a higher-level team member to handle the rollout?

We could probably open another security level ticket to discuss newer improvements if they are desired.

Quoting myself from 2014:

Note, if the authors of the paper above are right, «Use collapsing or lines: Given the current state of the art, using any sort of complex background as an anti-segmentation technique is considered to be insecure. Using lines or collapsing correctly are the only two secure options currently». Of course it's unpredictable what portion of Wikimedia wikis' spambots are using stupid captcha breakers à la tesseract; but if the attackers are few and expert they'll beat this version very quickly.

I think we should roll out what we have now, since it's public anyway. Has it been scheduled? Can the normal SWAT team run the needed scripts, or do we need Operations/a higher-level team member to handle the rollout?

From what I can tell, it should just be a normal maintenance script.

The page on wikitech mentions running it as sudo for the apache user, but given we now store things in swift, that's probably not even needed anymore.

Even ignoring the improvements Tim made, we should probably run the script anyway. The current captchas are from 2013. Re-running the script at regular intervals prevents an attacker from decoding a significant portion of the captchas and then re-using the answers. Assuming the --fill parameter on the page at wikitech is the one that was used, and we only have 10,000 captchas floating around, an attacker would only have had to decode about 100 captchas before that attack starts to make sense (birthday paradox), which is really not very much. We should probably use a --fill parameter higher than 10,000 when regenerating the captchas.

Tgr added a subscriber: Tgr. Aug 3 2016, 10:28 PM

It would be cool if we could plan to measure various metrics when we roll this out (Does it have an effect on how many new users sign up/complete edits? Does it actually reduce spam?). Not sure if we already have systems in place to measure that sort of thing.

https://grafana.wikimedia.org/dashboard/db/authentication-metrics should show the direct effects on registration/login.

Tgr added a comment. Aug 3 2016, 10:33 PM

Assuming the --fill parameter on the page at wikitech is the one that was used, and we only have 10,000 captchas floating around, an attacker would only have had to decode about 100 captchas before that attack starts to make sense (birthday paradox), which is really not very much. We should probably use a --fill parameter higher than 10,000 when regenerating the captchas.

I don't think the birthday paradox is applicable here. It says that after 100 random picks from a range of 10K, you have a fair chance at a collision. In other words, after every 100 cracked captchas the attacker gets one for free, which is not a big deal. (Not disagreeing with your wider point, just nitpicking :)

I don't think the birthday paradox is applicable here. It says that after 100 random picks from a range of 10K, you have a fair chance at a collision. In other words, after every 100 cracked captchas the attacker gets one for free, which is not a big deal. (Not disagreeing with your wider point, just nitpicking :)

The idea of this attack is that if you crack 100 captchas, you will get a captcha you have solved before on 1% of requests. And since we don't rate-limit reloads of the login page, that means you can automatically reload the page until you get a captcha in your dictionary. In other words, solving sqrt(N) captchas allows a brute force attack of sqrt(N) strength.
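
As a back-of-envelope model of that attack (illustrative only, not from the task): with a pool of N captchas of which k are already solved, each page load serves a known captcha with probability k/N, so an attacker who reloads freely waits on average N/k loads per free solve.

```python
import random

def expected_reloads_per_hit(pool_size: int, solved: int) -> float:
    """Mean number of page reloads until a captcha from the attacker's
    solved dictionary is served (geometric distribution, mean N/k)."""
    return pool_size / solved

def simulate_attack(pool_size: int, solved: int,
                    trials: int = 5000, seed: int = 1) -> float:
    """Monte Carlo check of the same quantity. Captcha IDs
    0..solved-1 stand in for the attacker's cracked dictionary."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        reloads = 1
        while rng.randrange(pool_size) >= solved:
            reloads += 1
        total += reloads
    return total / trials
```

With the 10,000-image pool and 100 cracked captchas this comes out to roughly 100 reloads per free solve, which is why a larger --fill value (and rate-limiting page reloads) raises the attacker's cost.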

I don't think it's a big deal either, since the security of our captcha is a total joke. If the captcha can easily be broken by OCR, the attacker has no incentive to implement complex solutions to avoid doing OCR.

PleaseStand added a subscriber: PleaseStand. Edited Aug 27 2016, 4:48 AM

The procedure to regenerate the captcha image set is documented at https://wikitech.wikimedia.org/wiki/Generating_CAPTCHAs

To create enough new captchas so that the rough total is, say, 10000, run the following command (on terbium):

sudo -u apache mwscript extensions/ConfirmEdit/maintenance/GenerateFancyCaptchas.php aawiki --wordlist=/home/aaron/words --font=/usr/share/fonts/truetype/freefont/FreeMonoBoldOblique.ttf --blacklist /home/aaron/badwords --fill=10000 --verbose

Is it still necessary to specify --blacklist, now that one is provided with the extension and used by default? Is the word list still located at /home/aaron/words?

Also, if WMF has not been running the script regularly, $wgCaptchaDeleteOnSolve is probably not enabled. GenerateFancyCaptchas.php assumes the setting is enabled: to determine how many CAPTCHA images need to be generated, it subtracts the return value of FancyCaptcha::estimateCaptchaCount() from the specified number. Thus, simply running the script with the same options would cause few or no images to be generated. In any case, the script does not delete existing images.
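
In other words, the top-up behaviour described above looks roughly like this (a Python paraphrase for illustration; the real script is PHP, extensions/ConfirmEdit/maintenance/GenerateFancyCaptchas.php, and the function name here is made up):

```python
def captchas_to_generate(fill_target: int, existing_count: int) -> int:
    """GenerateFancyCaptchas.php-style top-up: generate only the gap
    between --fill and the current pool estimate (the script's use of
    FancyCaptcha::estimateCaptchaCount()); existing images are never
    deleted by the script itself."""
    return max(fill_target - existing_count, 0)

# Pool already at the target (e.g. $wgCaptchaDeleteOnSolve disabled,
# nothing ever removed): running with the same --fill generates nothing.
assert captchas_to_generate(10_000, 10_000) == 0
# Pool drained to 9,400 (e.g. with delete-on-solve enabled):
assert captchas_to_generate(10_000, 9_400) == 600
```

This is why simply re-running the documented command would not refresh a stale pool: the old 2013 images would remain and the script would see no gap to fill.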

Hi. What's the status of this? Is there any patch to merge? Thanks.

Hi. What's the status of this? Is there any patch to merge? Thanks.

There's some talking going on behind the scenes with various people. Don't worry, we haven't forgotten about this.

MER-C added a subscriber: MER-C. Sep 29 2016, 8:35 AM
Reedy added a comment. Nov 4 2016, 6:57 PM

Is it still necessary to specify --blacklist, now that one is provided with the extension and used by default? Is the word list still located at /home/aaron/words?

Yes, the blacklist that the WMF uses has more words in it.

They are in that location for now, but ops have put them into the private puppet repo, and as part of T150029 they will be staged on disk at /etc/fancycaptcha/words and /etc/fancycaptcha/badwords

Hi guys. Today we've got a nasty bunch of spambots registering. Can we move forward with this? Thanks!

Reedy added a comment. Jul 2 2017, 4:51 PM

Hi guys. Today we've got a nasty bunch of spambots registering. Can we move forward with this? Thanks!

AFAIK we're still waiting for "the community" to decide we can deploy this improved version

Not sure offhand which ticket had the discussion, or if it was onwiki or whatever

FWIW, it's easy enough to switch over, and I don't have a problem actioning it. The question is who needs to give the sign off

I've locked 200 spambots myself today. What's the best way to achieve consensus for this? RfC linked in tech news?

[ - acl* ; public task, sorry ]

I'd say if @Bawolff / Security-Team and @tstarling are okay with it, we could deploy it. We aren't sure whether it will make any difference, but even if it doesn't, that would give us a hint, I think.

[ - acl* ; public task, sorry ]

I'd say if @Bawolff / Security-Team and @tstarling are okay with it, we could deploy it. We aren't sure whether it will make any difference, but even if it doesn't, that would give us a hint, I think.

So in the past, there's been some disagreement over:

  • Whether community consensus is needed (i.e. an RFC on Meta) to deploy the new changes
  • What the effect would be on registration of real human users, and more importantly, whether we can really measure it effectively
  • What the effect is on spam bot registration, and whether we can measure it
  • What the effect is on the percentage of successful solves by real humans, and whether we can measure it

Maybe a short term deployment would be in order - deploy for say a week and see how that affects spam bots, see if users complain that captchas are harder to solve, and after one week re-evaluate.

In any case, I don't really have strong opinions and @Reedy is the member of Security-Team who knows the most about captcha stuff, so I defer to him.

Reedy added a comment. Jul 2 2017, 5:55 PM

The stats are crappy, see T157735 and some vague numbers in T152219

Since today we've been literally flooded again with spambots, I suggest that we deploy them for one or two weeks and see if anything bad happens. That will also allow us to gather some stats/numbers and see if they have any effect on counter-spam activities. Once deployed, I suggest we inform User-notice so people are aware. Does that sound right? Could we have it deployed in today's SWAT or earlier? Regards.

Tgr added a comment. Edited Jul 3 2017, 10:41 AM

Waiting for users to complain is not a good strategy - power users can deal with it and non-power users don't complain, just leave quietly. Once I accidentally deployed a bug to a major product aimed at non-power users which completely broke it on enwiki in a major browser (10%+ user share) and it took a week to receive the first complaint. It will be much worse for something that specifically targets new user registrations.

Re stats, made a dashboard for convenience: https://grafana.wikimedia.org/dashboard/db/captcha-failure-rates

Is it really immune to thresholding?

It looks pretty readable to me after a simple threshold filter in GIMP. Untested, as I don't have any OCR software installed to test it against.

Reedy added a comment. Edited Jul 3 2017, 5:21 PM

Is it really immune to thresholding?

It looks pretty readable to me after a simple threshold filter in GIMP. Untested, as I don't have any OCR software installed to test it against.

What options etc. was that? In Tim's original post...

was proposed to be how they'd look. Which looks quite a bit different to what you've posted :)

It's probably worth noting that Tim made the changes in September 2014, so nearly 3 years ago. OCR software will have improved too in that time...

https://github.com/wikimedia/mediawiki-extensions-ConfirmEdit/commits/master/captcha.py

Since today we've been literally flooded again with spambots, I suggest that we deploy them for one or two weeks and see if anything bad happens. That will also allow us to gather some stats/numbers and see if they have any effect on counter-spam activities. Once deployed, I suggest we inform User-notice so people are aware. Does that sound right? Could we have it deployed in today's SWAT or earlier? Regards.

I'd advise against running it in a SWAT window. It takes a long time to run the generation script, though it should be somewhat quicker after Florian's fix for T157734.

I could do it in the Security deploy window, as we have a longer window tonight.

@Reedy that's just threshold set at ~60

deploy for say a week and see how that affects spam bots

+1: a short test may tell us whether it's already useless, although it won't be able to tell us whether the spambots' OCR will be adapted in a few days more.

CAPTCHA is useful, somewhat. I still remember that not so long ago it was disabled as a test on mediawiki.org and they had to switch it back within hours due to the sudden increase in spambot registration. It is better to have it for now until a better solution is found. I know you'll wave hands at me, but maybe we should implement a system like reCAPTCHA, which seems to be working well (at least most sites I visit have been switching from old systems to that new one, so it might indicate some success...). Note the "might". I know our privacy policy won't allow us to use reCAPTCHA directly unless it is possible not to submit user data to Google, so maybe we could work on creating a MediaWiki extension or update what we currently have?

Maybe Milimetric could help gather accurate stats for the test period, so if we see strange peaks of captcha failures we can investigate them?

tstarling added a comment. Edited Jul 4 2017, 10:31 AM

Is it really immune to thresholding?

It looks pretty readable to me after a simple threshold filter in GIMP. Untested, as I don't have any OCR software installed to test it against.

It's not that readable to a computer. Adding the gradient took the tesseract success rate from ~10% to <0.1%. I tried preprocessing the images with various thresholds before feeding them to tesseract, and it could generally only get a few of the letters, in the region of the image where the threshold happened to be optimal.
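
For reference, the preprocessing being discussed is a global binarisation swept over several cut-offs (an illustrative NumPy sketch; the OCR step itself, e.g. feeding each result to tesseract, is omitted):

```python
import numpy as np

def threshold(grey: np.ndarray, level: int) -> np.ndarray:
    """Global binarisation, like GIMP's Threshold tool: pixels darker
    than `level` become ink (0), the rest background (255)."""
    return np.where(grey < level, 0, 255).astype(np.uint8)

def threshold_sweep(grey: np.ndarray, levels=range(32, 224, 16)):
    """Try many cut-offs. Against a gradient background each cut-off
    is only 'right' in one horizontal band of the image, which is why
    the OCR engine recovered just a few letters per attempt."""
    return [threshold(grey, t) for t in levels]
```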

Reedy added a comment. Edited Jul 4 2017, 10:33 AM

OOI, was that on a recent version of tesseract? Or was that when your changes were made in 2014?

Just thinking if we should be looking to make further tweaks before trying to use it, and similarly, if newer versions have a better success rate than ones from 3 years ago

Note that OTRS volunteers already receive messages from people who can't even read the current captcha.

OOI, was that on a recent version of tesseract? Or was that when your changes were made in 2014?

Just thinking if we should be looking to make further tweaks before trying to use it, and similarly, if newer versions have a better success rate than ones from 3 years ago

It was the packaged version of whatever Ubuntu I was using in 2014, presumably Trusty, in which case it was Tesseract 3.03. The current stable version is 3.05, just a minor update. The current git master is termed "4.0 alpha" and includes a "new neural network system based on LSTMs, with major accuracy gains", so that may indeed produce different results.

Note that the new FancyCaptcha can be broken with the old Tesseract with a few minutes' work, by just subtracting the gradient (which is fixed), or by using edge detection instead of thresholding. The point is to require those few minutes' work. There's a fair chance the spammers have already done something along those lines, I'm not guaranteeing that this will work.
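
The two attacks mentioned can be sketched in a few lines (illustrative NumPy only; it assumes the attacker has reconstructed the constant gradient, which the source says is feasible with a few minutes' work):

```python
import numpy as np

def subtract_fixed_gradient(grey: np.ndarray, gradient: np.ndarray) -> np.ndarray:
    """If the gradient added at generation time is the same for every
    image, subtracting it restores an easily thresholdable image."""
    return np.clip(grey.astype(np.float64) - gradient, 0, 255).astype(np.uint8)

def horizontal_edges(grey: np.ndarray) -> np.ndarray:
    """Crude edge detection (horizontal finite difference): a slow
    gradient changes by only a couple of grey levels per pixel, while
    an ink/background transition jumps far more, so letter outlines
    survive even without removing the gradient."""
    return np.abs(np.diff(grey.astype(np.float64), axis=1))
```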

I'm somewhat interested in setting up a honeypot and using it to test some new ideas against real spambots. I think distorted text is a dead-end, it's not a long-term development direction. So I'm not really interested in doing further tweaks to FancyCaptcha.

Note that OTRS volunteers already receive messages from people who can't even read the current captcha.

I would be fine with just turning it off. But it seemed pretty pointless to deter only the humans and allow the bots; we should at least be able to deter both, right?

Real people can be added to the captcha-exempt global group temporarily to let them pass the captchas until they are no longer required to solve them. I never received many complaints in that sense, though.

Real people can be added to the captcha-exempt global group temporarily to let them pass the captchas until they are no longer required to solve them. I never received many complaints in that sense, though.

That's ridiculous. a) You can't add unregistered users to that group, which is the main case where the system requires a captcha. b) You are saying "if you wish to edit and can't read the captcha, you must find somebody to do it for you the first time and apply for exemption", which will make users say "in that case I do not want to edit at all" (because they do not love Wikipedia at that moment).

I've received complaints from Wikimedia Czech Republic's instructor for seniors' courses. In a significant number of cases the instructor must solve the captcha instead of the trainee, because they simply can't read it. I really do not think that this will a) decrease the number of spambots or b) decrease the number of users stopped by the captcha.

I do not think it is a good idea to add a new captcha which is less readable than the current one.

From my experience, Wikipedia's captcha is one of the hardest to solve.

Tgr added a comment. Jul 4 2017, 3:33 PM

I know our privacy policy won't allow us to use reCaptcha directly unless it is possible not to submit user data to Google, so maybe we could work in creating a MediaWiki extension or update what we currently have?

See T158909: Automatically detect spambot registration using machine learning (like invisible reCAPTCHA) .

Reedy added a comment. Jul 4 2017, 3:39 PM

That's ridiculous. a) You can't add unregistered users to that group, which is the main case where the system requires a captcha. b) You are saying "if you wish to edit and can't read the captcha, you must find somebody to do it for you the first time and apply for exemption", which will make users say "in that case I do not want to edit at all" (because they do not love Wikipedia at that moment).

I've received complaints from Wikimedia Czech Republic's instructor for seniors' courses. In a significant number of cases the instructor must solve the captcha instead of the trainee, because they simply can't read it. I really do not think that this will a) decrease the number of spambots or b) decrease the number of users stopped by the captcha.

I do not think it is a good idea to add a new captcha which is less readable than the current one.

From my experience, Wikipedia's captcha is one of the hardest to solve.

Plus, the account request workflow is clunky, at best, for people who can't (for accessibility issues etc) complete our captchas as it is

Nice to have someone with the opinion that they're unreadable, so worth digging into it a bit (I should point out that I'm vaguely neutral about it)

Just to be clear, it's the lack of readability, rather than a lack of localised words? (i.e. in English, not in Czech; which would seem strange when so many Czech people speak English pretty well, so I'm not sure if reading is quite so much of an issue)

And by "seniors" you presumably do mean senior citizens? Which I guess is a group that commonly has vision issues? Not that that should detract from any reasoning etc.

FWIW I don't find this hard to solve/read, but I do have good vision.

I'm not saying they are totally unreadable (or unsolvable). I'm saying that they are solvable if you have good vision, and that they are a significant accessibility issue which should be solved. What about QuestionCaptcha? It will take some time to come up with some easy questions, and we should switch them periodically too, but I think this would stop spambots totally. Or at least significantly decrease the number of spambots.

Yeah, I mean senior citizens.

It will take some time to come up with some easy questions, and we should switch them periodically too, but I think this would stop spambots totally. Or at least significantly decrease the number of spambots.

It will stop non-targeted spam bots. But unless you make, say, over a million questions, it won't stop people intentionally targeting Wikipedia. Additionally, the questions have to be easy enough that everyone can answer them, including non-English speakers.

That's true. We can localise them too, as well as the interface.

That's true. We can localise them too, as well as the interface.

Not without publicly disclosing what the questions/answers are. Which may work for a spam bot not specifically targeting us, but if they are trying to target us, then they would just take all the questions.

Ultimately, the current captcha is the worst possible compromise: it's hard to read for humans, easy to read for machines. We could go in two possible directions: make the captchas easier for humans, since bots can already read them, or make them harder so bots can't read them.

Reedy added a comment. Jul 4 2017, 4:42 PM

Google's NoCaptcha implementation seems to be getting popular. No idea how successful it is.

The other option being the "select all the dogs" type photo ones... which would be nice with some way of feeding back the data for categorisation usage or similar on Commons

Urbanecm added a comment. Edited Jul 4 2017, 4:46 PM

This seems like a great option. There is one known problem: our current privacy policy prohibits just using NoCaptcha. Maybe we can create our own alternative?

Malyacko removed a subscriber: Malyacko. Jul 4 2017, 4:50 PM
Tgr added a comment. Jul 4 2017, 5:16 PM

I would be fine with just turning it off. But it seemed pretty pointless to deter only the humans and allow the bots; we should at least be able to deter both, right?

Maybe instead of making the captcha harder to read, we could make it easier and see if we can find something that still deters the simplistic spambots it deters now, with less collateral damage?

That's ridiculous. a) you can't add to that group unregistered users which is the main case the system requires captcha

No it is not. We receive requests from time to time to create accounts for people who cannot read the CAPTCHA. We create the accounts for them and add them to the global group temporarily. I guess people from the enwiki account creation team and their UTRS tool could provide some stats about how many account creation requests use that rationale (addendum: the confirmed local group can be used to exempt new users from solving the captcha). I concede it is not optimal, though, but what else can we do for now?

What is really ridiculous is to have volunteers' time absorbed exclusively in locking spambots, and doing so for years.

https://grafana.wikimedia.org/dashboard/db/authentication-metrics should show the direct effects on registration/login.

How interesting. The data from yesterday's API failures around 00:00 UTC also matches the quiet period we've seen in the abuse and spam blacklist logs, and more or less matches the time in which I locked 200 accounts and a similar number of IP addresses in a batch detected by our systems. This confirms our suspicion that these are automated programs. Maybe strengthening the CAPTCHA on API requests could be an option as well? We aren't talking simply about registration; they sometimes find a non-blacklisted domain or get around a filter to post actual spam to the wikis. Thanks.

I would be fine with just turning it off. But it seemed pretty pointless to deter only the humans and allow the bots; we should at least be able to deter both, right?

Maybe instead of making the captcha harder to read, we could make it easier and see if we can find something that still deters the simplistic spambots it deters now, with less collateral damage?

I think if we don't go in the harder direction, we should go in the easier direction. I somewhat suspect (but don't know) that the current captcha would be just as effective as writing on an image with no distortion. I still think it's worth deploying this new harder version, even if only for a short time, in order to determine whether it would actually be effective or not. We have very little information on how effective our various options are. We aren't going to find out unless we try.

Tgr added a comment. Jul 4 2017, 5:24 PM

The other option being the "select all the dogs" type photo ones... which would be nice with some way of feeding back the data for categorisation usage or similar on Commons

That's not exactly future-proof either; image recognition APIs like Google Vision are pretty accurate at telling what the thing in an image is. Also, we would need a secret source of image labels, and our projects are not really meant to provide secret things.

Tgr added a comment. Jul 4 2017, 5:28 PM

Maybe strengthening the CAPTCHA on API requests could be an option as well?

That would still affect the official Android/iOS apps at least. And it might or might not affect spambots (they don't necessarily use the API).

tstarling added a comment. Edited Jul 5 2017, 2:48 AM

Nobody has explained how to actually interpret the metrics we are collecting. If we deploy this and the failure rate goes up, what is the conclusion? Is it stopping bot edits or deterring humans? Everything is mixed together.

EDIT: I'm doing some more analysis myself and putting it on T152219.

Note that OTRS volunteers already receive messages from people who can't even read the current captcha.

I would be fine with just turning it off. But it seemed pretty pointless to deter only the humans and allow the bots; we should at least be able to deter both, right?

CAPTCHA is useful, somewhat. I still remember that not so long ago it was disabled as a test on mediawiki.org and they had to switch it back within hours due to the sudden increase in spambot registration.

This was https://gerrit.wikimedia.org/r/177494 and https://gerrit.wikimedia.org/r/177708 from December 2014. There are some related notes here: https://www.mediawiki.org/wiki/Extension:ConfirmEdit/FancyCaptcha_experiments.

Nobody has explained how to actually interpret the metrics we are collecting. If we deploy this and the failure rate goes up, what is the conclusion? Is it stopping bot edits or deterring humans? Everything is mixed together.

EDIT: I'm doing some more analysis myself and putting it on T152219.

The primary metric I wanted to look at is the number of newly registered accounts globally locked for being a spam bot. This would be a very direct measure of success in the short term. As you said somewhere else (not sure where), it would be difficult to get meaningful metrics on how readable captchas are unless we had a known pool of real humans, so I don't know about that side of it.

@Bawolff Do the metrics at T125132#3339987 help you in any way (note: they need to be adjusted to pull the last months)? Those showed the number of "spam-only account: spambot" locks on global accounts. Many of them are pretty new, although it is not strange that a spambot registers and stays dormant for some time before it tries to spam and gets caught by SpamBlacklist/AbuseFilter.

Ping. Status please?

Reedy added a comment. Feb 20 2018, 7:15 PM

Ping. Status please?

Same as before

Looks like T186244: Deploy AICaptcha data collection is getting some movement though

@Reedy What about a test on the deployment.wikimedia beta cluster? That wiki only gets spambot registrations. We could test there whether the new FancyCaptcha is of any help.

Reedy added a comment. Feb 20 2018, 7:25 PM

Only if we have a way of measuring it... Otherwise it's just guessing.

Also, the word lists for beta are much more limited... So is it a fair test?

Do we know if captchas are even being regenerated on beta? Is the cronjob deployed to do it? I'm guessing by there being 949... Probably not?

There's no deployment-terbium.. And deployment-tin's www-data user crontab doesn't have anything for regenerating captchas...

So presumably resolving that should be a pre-requisite?

I run maintenance scripts on deployment-tin absent a better place...

Reedy added a comment. Feb 20 2018, 8:08 PM

I run maintenance scripts on deployment-tin absent a better place...

T187826: Create mediawiki::maintenance server (aka terbium) in deployment-prep

Tgr added a comment. Feb 20 2018, 8:14 PM

The AICaptcha data might help differentiate between human and bot captcha failures, although right now the data does not include captcha success status (but that could be improved). It's not working reliably in beta though (EventLogging seems to be flaky there), plus I doubt you get many human registrations on deploymentwiki, or even the whole of beta.

So this has been stuck for a while. I decided to look into what else we can do with hopefully less contention.

I came up with https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/ConfirmEdit/+/446489 . In my testing it has similar resistance to tesseract as Tim's (in my test, Tim's seemed to be about 6% where this was 6.5%), but I think it will be a lot less controversial.

Example images can be found at https://tools.wmflabs.org/bawolff/captcha/setG/

https://deployment.wikimedia.beta.wmflabs.org/wiki/Special:RecentChanges should be a good place to start testing this new improved captcha system. Absolutely all accounts you see there are spambots (you can tell by the pattern). I don't think there would be any issues if we deploy those to deploymentwiki only and check how good they are (metrics, etc.?).

Yes, but:

check how good they are

We can only test the effectiveness against generic MediaWiki spambots, e.g. whether a) it's true that such spambots use tesseract or other OCR, and b) such new FancyCaptchas would make life harder for them. You can't really test how well they'd work in reality, because real spammers will presumably just solve all our captchas however hard they are (in the recent flood T212667, nearly 100% of captchas were solved correctly according to authentication metrics).