Statistics on Captcha success/failure rate
Open, Normal, Public

Description

It would be useful to have some stats on captcha breakages, etc.

Logging done by ConfirmEdit currently

Method
    log
Found usages  (4 usages found)
    Method call  (4 usages found)
        MediaWiki  (4 usages found)
            extensions/ConfirmEdit/SimpleCaptcha  (4 usages found)
                Captcha.php  (4 usages found)
                    SimpleCaptcha  (4 usages found)
                        passCaptcha  (3 usages found)
                            1165: $this->log( "passed" );
                            1171: $this->log( "bad form input" );
                            1176: $this->log( "new captcha session" );
                        passCaptchaLimited  (1 usage found)
                            1123: $this->log( 'User reached RateLimit, preventing action.' );
Reedy created this task. Dec 2 2016, 5:21 PM
Restricted Application added a subscriber: Aklapper. Dec 2 2016, 5:21 PM
Nuria added a subscriber: Nuria. Dec 5 2016, 4:55 PM

FYI, the analytics team doesn't have this data. For the most part we do not aggregate application data coming from MediaWiki, other than edits.

There is some of this data in editing: https://edit-analysis.wmflabs.org/compare/ and I imagine some would belong to account creation.

To be clear: analytics can hold this data once it is created, aggregate it, and preserve it if needed, but given that ownership of captchas is in limbo right now, I am not sure that precise data of this kind is being created.

Reedy added a comment. Edited Dec 5 2016, 5:08 PM

There are plenty of log files on fluorine that can be trivially parsed.

And it's not dependent on account creation, no. Anon users adding (many) URLs can often trigger captchas.

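Reedy added a comment. Feb 9 2017, 5:55 PM
#!/bin/bash

# Count ConfirmEdit log messages per archived daily captcha log.
files=( /a/mw-log/archive/captcha.log-201*.gz )
for file in "${files[@]}"
do
        filename="${file##*/}"          # strip the directory
        filenamenoext="${filename%.*}"  # strip the .gz extension
        filedate="${filenamenoext:12}"  # drop the "captcha.log-" prefix, leaving the date
        echo "$filedate"
        # Keep the message between "ConfirmEdit: " and ";", then count each distinct message.
        zgrep -Ei "ConfirmEdit: [a-z ]+;" "$file" | cut -d ';' -f 1 | cut -d ':' -f 5 | sort -n -r | uniq -c
done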

Running atm...
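
For reference, here is a minimal sketch of what the two cut stages in the script above pull out of a single line. The sample line is an assumption: a made-up timestamp, host and wiki prefix followed by "ConfirmEdit: <message>; <trigger>". It is not a real entry from captcha.log, and if the real prefix has a different number of colons, the field number passed to cut would need adjusting.

```shell
# Hypothetical log line; the prefix layout is assumed, not copied from captcha.log.
sample='2017-02-09 17:55:00 mw0000 examplewiki: ConfirmEdit: passed; hypothetical trigger text'

# Same two stages as the script: drop everything from the first ";" onwards,
# then take the fifth ":"-separated field (the timestamp contributes two colons).
msg=$(printf '%s\n' "$sample" | cut -d ';' -f 1 | cut -d ':' -f 5)

# Brackets make the leading space visible.
printf '[%s]\n' "$msg"
```

Note that the extracted message keeps its leading space, so uniq -c ends up printing the count followed by two spaces before the message text.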

Reedy updated the task description. Feb 9 2017, 6:00 PM
Reedy added a comment. Feb 9 2017, 6:57 PM

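<?php

// Convert the per-day counts printed by the shell script into CSV.
// Assumes every date line is followed by exactly three count lines.
$log = file( 'captcha-stats.txt' );

$lines = [];
$lines[] = "date, new captcha session, passed, bad form input";

for( $i = 0; $i + 3 < count( $log ); $i += 4 ) {
        $date = rtrim( $log[$i] );
        // Reset per date so a missing message type does not reuse the previous day's value.
        $started = $passed = $bad = 0;
        foreach( range( $i + 1, $i + 3 ) as $logEntry ) {
                $which = ltrim( $log[$logEntry] );
                $matches = null;
                // uniq -c output: count, two spaces, then the message.
                if ( !preg_match( '/(\d+)  ([a-z]+)/', $which, $matches ) ) {
                        continue;
                }
                switch( $matches[2] ) {
                        case 'new':
                                $started = $matches[1];
                                break;
                        case 'passed':
                                $passed = $matches[1];
                                break;
                        case 'bad':
                                $bad = $matches[1];
                                break;
                }
        }
        $lines[] = "{$date}, {$started}, {$passed}, {$bad}";
}

// implode() takes the separator first; the reversed order is not accepted by PHP 8.
file_put_contents( 'captcha-stats.csv', implode( "\n", $lines ) );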

Reedy added a comment. Feb 9 2017, 7:06 PM

So I stuck it in a Google Docs spreadsheet to make a pretty graph: https://docs.google.com/spreadsheets/d/1cJIKbu-V6IRcY_a8_SVZcxQeKeCWD_wx7NNbmJGySHI/edit?usp=sharing

I'm not really sure this shows us much, but we shall see. It might be more useful when we throw some more captchas into the mix, and also replace some of the old ones.

Reedy added a comment. Feb 9 2017, 8:33 PM

So we're completely missing the user rate limiting error messages, because that message contains a comma and a full stop, which the [a-z ]+ pattern doesn't match. The trailing . is going away in https://gerrit.wikimedia.org/r/336880

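#!/bin/bash

# Same as before, but the message pattern also allows "," and "." so the rate limit message is counted.
files=( /a/mw-log/archive/captcha.log-201*.gz )
for file in "${files[@]}"
do
	filename="${file##*/}"          # strip the directory
	filenamenoext="${filename%.*}"  # strip the .gz extension
	filedate="${filenamenoext:12}"  # drop the "captcha.log-" prefix, leaving the date
	echo "$filedate"
	zgrep -Ei "ConfirmEdit: [a-z,. ]+;" "$file" | cut -d ';' -f 1 | cut -d ':' -f 5 | sort -n -r | uniq -c
done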

Seems the amount is gonna be a fraction of that of the others...
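
To illustrate why the first pattern dropped the rate limit messages, here is a small sketch that runs both patterns over one made-up line. Only the message text is taken from the log() call quoted in the task description; the timestamp, host, wiki name and trigger text are hypothetical.

```shell
# Hypothetical log line; only the message text comes from the log() call in the task description.
line='2017-02-09 20:33:00 mw0000 examplewiki: ConfirmEdit: User reached RateLimit, preventing action.; hypothetical trigger text'

# grep -c prints the number of matching lines; "|| true" keeps a zero count
# (exit status 1) from aborting a script that runs under "set -e".
old=$(printf '%s\n' "$line" | grep -Eic 'ConfirmEdit: [a-z ]+;' || true)
new=$(printf '%s\n' "$line" | grep -Eic 'ConfirmEdit: [a-z,. ]+;' || true)

echo "old pattern: $old match(es), new pattern: $new match(es)"
```

With the old character class the match has to stop at the comma, where no ";" follows, so the line is never counted. Adding "," and "." to the class lets the match run through to the ";".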

<?php

$log = file( 'captcha-stats.txt' );

$lines = [];
$lines[] = "date, new captcha session, passed, bad form input, user reached rate limit";

for( $i = 0; $i < count( $log ); ) {
        $started = 0;
        $passed = 0;
        $bad = 0;
        $user = 0;
        $date = rtrim( $log[$i] );
        $i++;
        while( $i < count( $log ) ) {
                $which = ltrim( $log[$i] );
                if ( preg_match( '/^\d{8}$/', $which ) ) {
                        // Looks like a date, next!
                        break;
                }
                $matches = null;
                if ( !preg_match( '/(\d+)  ([a-zA-Z]+)/', $which, $matches ) ) {
                        // Not a "count  message" line; skip it.
                        $i++;
                        continue;
                }
                switch( $matches[2] ) {
                        case 'new':
                                $started = $matches[1];
                                break;
                        case 'passed':
                                $passed = $matches[1];
                                break;
                        case 'bad':
                                $bad = $matches[1];
                                break;
                        case 'User':
                                $user = $matches[1];
                                break;
                }
                $i++;
        }
        $lines[] = "{$date}, {$started}, {$passed}, {$bad}, {$user}";
}

// implode() takes the separator first; the reversed order is not accepted by PHP 8.
file_put_contents( 'captcha-stats.csv', implode( "\n", $lines ) );

Reedy added a comment. Edited Feb 9 2017, 8:39 PM

New graph!

Reedy added a parent task: Restricted Task. Feb 16 2017, 9:45 PM

[graph: F5660883]

Questions from the ignorant...

If we are getting zero reaching the limit, that means either 1) they don't retry enough, or 2) they aren't failing. How do we differentiate? Do we have data on the success rate after one or two failures?

Also, is "bad form input" a passed captcha? Can someone please explain it to me?

For clarity's sake, if bots are using the API, where are they captured?

FWIW, this isn't all of the logging for captcha stuff. Login uses different stats, à la https://grafana.wikimedia.org/dashboard/db/authentication-metrics?panelId=7&fullscreen&from=now-7d&to=now and VE editing seems to use something else too.

For where these logging entries come from:
https://github.com/wikimedia/mediawiki-extensions-ConfirmEdit/blob/master/SimpleCaptcha/Captcha.php#L1123
https://github.com/wikimedia/mediawiki-extensions-ConfirmEdit/blob/master/SimpleCaptcha/Captcha.php#L1163-L1178

See also T157735

It wouldn't surprise me if this edit logging is in fact incomplete or even completely wrong. It's hard to tell. Having all the different captcha stats in different places doesn't help.

Also, for $wgRateLimits:

		'badcaptcha' => [ // Bug T92376
			// Mainly for account creation by unregistered spambots.
			// A human probably gives up after a handful attempts to
			// register, but ip/newbie editing needs to be considered too.
			'ip' => [ 15, 60 ],
			'newbie' => [ 15, 60 ],
			// Mainly to catch linkspam bot edits. Account creations by users?
			// Some wikis request tons of captchas to users under 50 edits:
			// the limit needs to be higher than any human can conceivably do.
			'user' => [ 30, 60 ],
		],

IPs and newbies can do 15 captchas in 60 seconds; users can do 30 in 60. Maybe these limits are too lenient to be of any actual use?

"Bad form input" is indeed people entering captchas wrong. "Passed" means they got it right. "New captcha session" is as it sounds.

I guess we should probably start recording on which attempt (1st through to 15th) they managed to successfully defeat a captcha. In the current logging format that's a bit harder, but if we get the logging overhauled, we can pass it as a parameter, etc.

We also have no way of knowing whether the user didn't try any captchas, i.e. they gave up. Not sure how we'd do that, unless with the Job Queue or similar after a certain timeout.

For the API, not sure. If anything, editing captchas might come into the logging here; login/signup captchas would be in the other stats on Grafana. From a quick glance at SimpleCaptcha/Captcha.php https://github.com/wikimedia/mediawiki-extensions-ConfirmEdit/blob/master/SimpleCaptcha/Captcha.php#L63-L69 it looks like there's no logging, but that's not to say there isn't logging from elsewhere in the code...

Reedy added a comment. Apr 2 2017, 2:09 PM

Moved to mwlog1001

#!/bin/bash

# Same script as before; the archive now lives under /srv/mw-log on mwlog1001.
files=( /srv/mw-log/archive/captcha.log-201*.gz )
for file in "${files[@]}"
do
	filename="${file##*/}"          # strip the directory
	filenamenoext="${filename%.*}"  # strip the .gz extension
	filedate="${filenamenoext:12}"  # drop the "captcha.log-" prefix, leaving the date
	echo "$filedate"
	zgrep -Ei "ConfirmEdit: [a-z,. ]+;" "$file" | cut -d ';' -f 1 | cut -d ':' -f 5 | sort -n -r | uniq -c
done


Billinghurst added a subscriber: MarcoAurelio. Edited Jun 13 2017, 12:03 PM

[graph]

Thanks @Reedy. Such a low hit rate.

I am presuming that the spike is spambot activity, and it would be interesting if @MarcoAurelio and the stewards could have access to alerts on that sort of data so we could get measures to whack it. As that dirty sinusoidal pattern sort of matches editing patterns, have we looked to see where we have significant differences in the patterns? I am presuming we have an idea of the ratio of IP edits to logged-in edits, and that it presumably has a level of stability. Also, presumably we know on a per-country level whether we get edits from IPs or logged-in users, and maybe can spot some sort of difference. What sort of statistical data is available in that space?

tstarling added a subscriber: tstarling. Edited Jul 5 2017, 4:31 AM

With all the wikis in together, it's hard to distinguish between bots and humans. A more difficult captcha will certainly lead to a rise in the failure rate, but with everything in together, you can't tell whether the failures are more specific to bots than humans.

One possible solution to that is to pull out one subset of the logs which is relatively spammy, and another which is relatively human-dominated. Here are the failure rates for June, broken down by DB suffix:

Suffix        Pass      Fail      Failure rate
wiki          571946    170543    23%
wikimedia     403       452       53%
wikiversity   2191      3090      59%
wiktionary    6347      9057      59%
wikivoyage    1877      2680      59%
wikinews      2566      3813      60%
wikisource    2520      3970      61%
wikiquote     2114      3711      64%
wikibooks     5405      11637     68%

Notable wikis with low and high failure rates:

DB                 Pass     Fail    Failure rate
trwiki             1554     241     13.4%
dewiki             19837    3474    14.9%
cswiki             2806     527     15.8%
enwikinews         687      1567    69.5%
enwikibooks        3133     8171    72.3%
simplewiktionary   126      449     78.1%
miwiktionary       99       651     86.8%

I used the aggregation script P5672
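
P5672 itself isn't reproduced here. As a minimal sketch of the arithmetic only, the figures in the tables are consistent with failure rate = fail / (pass + fail). The helper name failure_rate is hypothetical and is not part of the aggregation script.

```shell
# failure_rate PASS FAIL: print fail / (pass + fail) as a percentage with one decimal.
# Hypothetical helper for illustration; not taken from P5672.
failure_rate() {
  LC_ALL=C awk -v pass="$1" -v fail="$2" 'BEGIN {
    total = pass + fail
    if (total == 0) { print "n/a"; exit }   # avoid dividing by zero
    printf "%.1f\n", 100 * fail / total
  }'
}

failure_rate 1554 241     # trwiki row
failure_rate 3133 8171    # enwikibooks row
```

These two calls print 13.4 and 72.3, matching the trwiki and enwikibooks rows.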

It's quite interesting to sort all wikis by their failure rate and then to plot the failure rate against cumulative count of total captcha attempts (pass plus fail):

You see that we have a broad plateau at 20% failure rate, presumed to be mostly humans, followed by a sharp rise, presumed to be mostly bots.

If we switched to a different CAPTCHA solution, we would want to see the height of the plateau remain the same, or be reduced. And we want to increase the slope in the bot-dominated part of the graph, around 85-100% cumulative count, so that the failure rate of spam-only wikis approaches 100%.
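
The sort-and-accumulate step described above can be sketched roughly as follows. This is an illustration under assumptions, not the script that produced the plot: the function name failure_curve and the "db pass fail" input layout are hypothetical, and the four input rows are taken from the per-wiki table above.

```shell
# failure_curve: read "db pass fail" lines on stdin and print, in order of rising
# failure rate: db, failure rate (%), cumulative share of all captcha attempts (%).
failure_curve() {
  # Stage 1: per-wiki attempt total and failure rate (wikis with no attempts are skipped).
  LC_ALL=C awk '$2 + $3 > 0 { printf "%s %d %.6f\n", $1, $2 + $3, $3 / ($2 + $3) }' |
    # Stage 2: order wikis from lowest to highest failure rate.
    LC_ALL=C sort -k3,3n |
    # Stage 3: accumulate attempts and express the running total as a share of the grand total.
    LC_ALL=C awk '
      { db[NR] = $1; total[NR] = $2; rate[NR] = $3; sum += $2 }
      END {
        for (i = 1; i <= NR; i++) {
          running += total[i]
          printf "%s %.1f %.1f\n", db[i], 100 * rate[i], 100 * running / sum
        }
      }'
}

curve=$(printf '%s\n' \
  'miwiktionary 99 651' \
  'trwiki 1554 241' \
  'cswiki 2806 527' \
  'dewiki 19837 3474' | failure_curve)
printf '%s\n' "$curve"
```

Plotting the second column against the third gives the kind of curve described: low-failure wikis that account for most attempts form the plateau, and the small, spam-heavy wikis sit in the steep tail near 100% cumulative count.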

Krenair added a subscriber: Krenair. Sep 3 2017, 6:28 PM
Tgr added a project: AICaptcha. Edited Mar 4 2018, 3:41 AM

We should clean these up a bit so they can serve as a validation of new captcha types:

  • differentiate between registrations which have JavaScript and ones which don't (seems like ~90% of spambots do not have JavaScript while most users do)
  • also between ones that use the API vs. the web interface, and mobile vs. desktop (most spambots seem to be using desktop web)
  • store username + captcha success rate in EventLogging so that it can be merged with block logs and editcounts later to get more accurate numbers for spambot vs. productive contributor captcha error rate

Then make the data easily available somewhere (Grafana? ReportUpdater?)

Please also don't ignore the appearance of the spambots in Special:log/spamblacklist and Special:Abuselog usually within 1 hour of creation, and say within 24 hours of creation (if we need a maximum endpoint).

Tgr added a comment. Mar 4 2018, 8:06 AM

Yeah, that's what I was trying to get at with the third bullet point.

  • store username + captcha success rate in EventLogging so that it can be merged with block logs and editcounts later to get more accurate numbers for spambot vs. productive contributor captcha error rate

This helps for those who get past the captcha, but what data should we collect on those who fail the captcha, to assess how many of them were legit users? At a minimum, I'd suggest making sure it's possible and easy to monitor anomalous variations per language, per country and per project. If, say, Japanese users start failing much less/more than UK users, then maybe something good/bad happened for users of non-Latin scripts or non-English speakers.

chasemp triaged this task as Normal priority. Tue, Sep 4, 4:08 PM
chasemp moved this task from Backlog to In Progress on the Security-Team board. Tue, Sep 4, 4:39 PM