
Collect AICaptcha data from WikimediaEvents extension
Closed, Resolved · Public

Description

Building on T183866: Stand-alone data collection, collect real user (or bot) data from the MediaWiki registration page with the WikimediaEvents extension. Once labeled (bot or human), this data can be used to train a machine learning classifier.

  • get patches ready
  • notify someone from Legal
  • notify someone from Security
  • notify someone from Performance (due to an OOUI dependency being involved) T185870
  • notify translators
  • make sure there is no significant negative impact on registrations T185870
  • figure out UI for mobile

Event Timeline

Groovier created this task.Jan 1 2018, 6:28 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJan 1 2018, 6:28 PM
Groovier moved this task from Backlog to Doing on the AICaptcha board.Jan 1 2018, 6:28 PM
Tgr updated the task description. (Show Details)Jan 1 2018, 7:46 PM
Legoktm renamed this task from WikiMedia Events extension for data collection to WikimediaEvents extension for data collection.Jan 2 2018, 1:03 AM
Tgr added a subscriber: Tgr.Jan 4 2018, 9:34 AM

We should probably be nice and not collect anything if navigator.doNotTrack is set.
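
A minimal sketch of such a check, assuming the collection code has a single entry point that can be gated. The function names are hypothetical, and the accepted values ("1", plus the legacy "yes" and the `msDoNotTrack` property) reflect how browsers have historically reported the preference rather than anything decided in this task.

```typescript
// Minimal shape of the navigator properties we read; declared locally so the
// sketch also type-checks and runs outside a browser.
interface NavigatorLike {
  doNotTrack?: string | null;
  msDoNotTrack?: string | null; // legacy vendor-prefixed name
}

// True when the user has expressed a Do Not Track preference.
// "0", "unspecified", null and undefined are all treated as "no preference".
function isDoNotTrackEnabled(nav: NavigatorLike): boolean {
  const value = nav.doNotTrack ?? nav.msDoNotTrack ?? null;
  return value === "1" || value === "yes";
}

// Hypothetical gate to call before attaching any input listeners.
function shouldCollectBehaviourData(nav: NavigatorLike): boolean {
  return !isDoNotTrackEnabled(nav);
}
```

In a browser this would be called as `shouldCollectBehaviourData(navigator)` before any listener is registered, so that nothing is recorded for users who opted out.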

Tgr updated the task description. (Show Details)Jan 21 2018, 12:32 AM
Tgr updated the task description. (Show Details)Jan 21 2018, 1:21 AM

Change 403622 had a related patch set uploaded (by Groovier1; owner: Groovier1):
[mediawiki/extensions/WikimediaEvents@master] Adding WikimediaEvents module for logging behaviour data

https://gerrit.wikimedia.org/r/403622

Change 404910 had a related patch set uploaded (by Gergő Tisza; owner: Groovier1):
[operations/mediawiki-config@master] Adding config for WikimediaEvents module for logging behaviour data

https://gerrit.wikimedia.org/r/404910

Change 403622 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Adding WikimediaEvents module for logging behaviour data

https://gerrit.wikimedia.org/r/403622

Tgr added a comment.Jan 29 2018, 4:49 AM

Some minor issues (that can be fixed in a separate patch):

  • the "find out more" link should probably open in a new window
  • submitting the data should happen in the submit handler, not the click handler (to make sure it happens after other checks that might prevent submit - e.g. an invalid email address)
  • the schema description should specify what units the various things are measured in
  • if there is an error on first submit (e.g. the contents of the two password fields are different) the user will submit the form twice, and the stats for the second submit will be abnormal (as all the fields are filled out already). Not sure if something can (or should) be done about that.
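
The submit-handler point above could look roughly like this. This is a sketch, not the code in the patch: `attachSubmitLogging` and `send` are hypothetical names, and the form and event types are reduced to the two members the sketch needs.

```typescript
// Reduced event and form shapes so the sketch runs without a DOM.
interface SubmitEventLike {
  defaultPrevented: boolean;
}
interface FormLike {
  addEventListener(type: "submit", listener: (event: SubmitEventLike) => void): void;
}

// Log from the form's submit event instead of the button's click event.
// `send` is a hypothetical callback that forwards the collected data.
function attachSubmitLogging(form: FormLike, send: () => void): void {
  form.addEventListener("submit", (event) => {
    // An earlier listener (e.g. client-side validation) may have cancelled
    // the submit; in that case nothing is sent.
    if (event.defaultPrevented) {
      return;
    }
    send();
  });
}
```

Listeners on the same element run in registration order, so this only sees a cancelled submit if it is attached after the validation handlers. Browsers also do not fire `submit` when built-in constraint validation fails (e.g. a malformed `type="email"` value), which a click handler would not notice.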
Tgr updated the task description. (Show Details)Jan 29 2018, 5:54 AM
Tgr added a subscriber: Halfak.Jan 29 2018, 6:23 AM

As discussed earlier, we should differentiate between account creations which happen through the API and ones which happen through the web interface but for which no behavior data is collected (e.g. the client is not a human or has JavaScript disabled). Probably the easiest way to do that is to add a flag to the ServerSideAccountCreation schema (ping @Halfak), which is recorded by the Campaigns extension.
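
To illustrate how such a flag would be used at analysis time, here is a rough sketch. It assumes the flag is a boolean called `isApi` and that account creation records can be matched to behavior data by a user ID; both the record shape and the join key are assumptions for illustration.

```typescript
// Assumed subset of an account creation record. `isApi` is the proposed
// flag; `userId` is a hypothetical join key to the behavior data.
interface AccountCreation {
  userId: number;
  isApi: boolean;
}

type CreationKind =
  | "api"
  | "web-with-behavior-data"
  | "web-without-behavior-data";

// Sort an account creation into one of the three cases discussed above.
function classifyCreation(
  record: AccountCreation,
  usersWithBehaviorData: ReadonlySet<number>
): CreationKind {
  if (record.isApi) {
    return "api";
  }
  return usersWithBehaviorData.has(record.userId)
    ? "web-with-behavior-data"
    : "web-without-behavior-data";
}
```

Web registrations that end up in the last bucket are the interesting ones: no JavaScript ran, which may itself be a signal.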

Tgr updated the task description. (Show Details)Jan 29 2018, 8:28 AM

Change 404910 merged by jenkins-bot:
[operations/mediawiki-config@master] Adding config for WikimediaEvents module for logging behaviour data

https://gerrit.wikimedia.org/r/404910

Mentioned in SAL (#wikimedia-operations) [2018-01-29T19:19:30Z] <niharika29@tin> Synchronized wmf-config/InitialiseSettings-labs.php: Adding config for WikimediaEvents module for logging behaviour data T183869 (duration: 00m 57s)

Mentioned in SAL (#wikimedia-operations) [2018-01-29T19:22:02Z] <niharika29@tin> Synchronized wmf-config/InitialiseSettings.php: Adding config for WikimediaEvents module for logging behavior data T183869 and beaureaucrats to add and remove accountcreator by default T185417 (duration: 00m 56s)

Tgr added a comment.Jan 29 2018, 10:09 PM

EventLogging seems broken in beta. Filed T185952: EventLogging broken in beta about that.

Tgr added a comment.Jan 29 2018, 10:19 PM

The patch adds ~50K (after gzip) of JavaScript to the registration page (since it pulls in OOUI core for the popup widget), which almost doubles the size of the page. @Krinkle any concerns about that? I assumed it's not a big deal since it has to be done sooner or later anyway due to T85853: Convert MW core login/create account pages to OOUI (Special:UserLogin / Special:CreateAccount). Are there any additional metrics to track because of that, beyond the ones mentioned in T185870: Monitor registration rates to make sure captcha changes have no negative effects?

(For now, the patch is only live on Beta. Also, I imagine we won't show the popup on mobile, since the mobile registration interface is very minimalistic on small screens and hides everything not absolutely needed for signing up.)

Tgr updated the task description. (Show Details)Jan 30 2018, 7:40 PM
elukey changed the status of subtask T185952: EventLogging broken in beta from Open to Stalled.Jan 30 2018, 11:28 PM

Change 407045 had a related patch set uploaded (by Groovier1; owner: Groovier1):
[mediawiki/extensions/WikimediaEvents@master] Adding WikimediaEvents module for logging behaviour data

https://gerrit.wikimedia.org/r/407045

Change 407045 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Adding WikimediaEvents module for logging behaviour data

https://gerrit.wikimedia.org/r/407045

Tgr updated the task description. (Show Details)Feb 2 2018, 12:58 AM
Tgr added a comment.Feb 2 2018, 1:26 AM

The new designs place the text/button in the captcha block, which might be missing. If there is no captcha (I think that only happens when an admin is creating an account for someone else) we should skip the logging altogether. If necessary, such registrations can be identified later via the isSelfMade field of the ServerSideAccountCreation schema.
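
A rough sketch of that guard. The selector is a placeholder, not the real class name of the captcha block, and `initLogging` and `start` are hypothetical names.

```typescript
// Reduced document shape so the sketch runs without a DOM.
interface DocumentLike {
  querySelector(selector: string): unknown;
}

// Placeholder selector; the real captcha block selector is not given here.
const CAPTCHA_BLOCK_SELECTOR = ".hypothetical-captcha-block";

// Start logging only when the captcha block exists on the page.
// Returns whether logging was started.
function initLogging(doc: DocumentLike, start: () => void): boolean {
  if (doc.querySelector(CAPTCHA_BLOCK_SELECTOR) === null) {
    return false; // no captcha, e.g. an admin creating an account for someone else
  }
  start();
  return true;
}
```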

Krinkle renamed this task from WikimediaEvents extension for data collection to Collect AICaptcha data from WikimediaEvents extension.Feb 2 2018, 8:06 PM

Change 407989 had a related patch set uploaded (by Groovier1; owner: Groovier1):
[mediawiki/extensions/Campaigns@master] Adding isApi field to log if the account is created using API.

https://gerrit.wikimedia.org/r/407989

Change 407989 merged by jenkins-bot:
[mediawiki/extensions/Campaigns@master] Adding isApi field to log if the account is created using API.

https://gerrit.wikimedia.org/r/407989

Change 408351 had a related patch set uploaded (by Gergő Tisza; owner: Groovier1):
[mediawiki/extensions/Campaigns@wmf/1.31.0-wmf.17] Adding isApi field to log if the account is created using API.

https://gerrit.wikimedia.org/r/408351

Change 408352 had a related patch set uploaded (by Gergő Tisza; owner: Groovier1):
[mediawiki/extensions/WikimediaEvents@wmf/1.31.0-wmf.17] Adding WikimediaEvents module for logging behaviour data

https://gerrit.wikimedia.org/r/408352

Change 408358 had a related patch set uploaded (by Gergő Tisza; owner: Groovier1):
[mediawiki/extensions/WikimediaEvents@wmf/1.31.0-wmf.17] Adding WikimediaEvents module for logging behaviour data

https://gerrit.wikimedia.org/r/408358

Change 408351 merged by jenkins-bot:
[mediawiki/extensions/Campaigns@wmf/1.31.0-wmf.17] Adding isApi field to log if the account is created using API.

https://gerrit.wikimedia.org/r/408351

Change 408352 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@wmf/1.31.0-wmf.17] Adding WikimediaEvents module for logging behaviour data

https://gerrit.wikimedia.org/r/408352

Change 408358 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@wmf/1.31.0-wmf.17] Adding WikimediaEvents module for logging behaviour data

https://gerrit.wikimedia.org/r/408358

TheDJ added a subscriber: TheDJ.EditedFeb 20 2018, 8:53 PM

I was unaware of this project and I have two questions:
1: can we use this to finally solve our problem with the accessibility of captchas for blind users ? T6845
2: the machine learning part, can we make sure we not only identify ‘normal’ users, but also keyboard navigation and screenreader users, or at least be careful to not mark them as bots ?

Tgr added a comment.Feb 21 2018, 12:16 AM

> 1: can we use this to finally solve our problem with the accessibility of captchas for blind users ?

If the machine learning part works well, the captcha can be made into an optional step, which is only shown to users that don't seem human. That would help blind users as well.

> 2: the machine learning part, can we make sure we not only identify ‘normal’ users, but also keyboard navigation and screenreader users, or at least be careful to not mark them as bots ?

Is there a way to get data from such users specifically?
My hope is that the typing part (since everyone has to type during registration) is similar enough in all three cases and dissimilar enough from bots that all three get grouped together by the classifier. But we don't have any way currently to check if that really happens.

TheDJ added a comment.Feb 21 2018, 1:55 AM

> Is there a way to get data from such users specifically?
> My hope is that the typing part (since everyone has to type during registration) is similar enough in all three cases and dissimilar enough from bots that all three get grouped together by the classifier. But we don't have any way currently to check if that really happens.

We could reach out to some groups I guess. I expect them to have widely different but still rather recognizable patterns.

  1. Mouse control by eye movement (rapid, erratic)
  2. Keyboard control by blind users (rapid navigation between controls, 0 or only accidental mouse usage [touch devices can differ here perhaps])
  3. Keyboard control by alternative input devices (probably rather slow UI operation)
  4. For typing specifically, we might see text-to-speech or password manager style behavior, or slower spelling.

Some observations from the data collected using the InputDeviceDynamics and ServerSideAccountCreation schemas from 9 Feb 2018 to 21 Feb 2018 are recorded below:

  1. 91318 registrations were made by users (this may include bots, but no such activity has been identified yet).
  2. 1091 accounts have been labeled as spambots by the spam detection team.
  3. Of these, we could collect registration data for 126 bots.
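
For a sense of scale, the counts above can be turned into ratios. This is a back-of-the-envelope sketch. It assumes the three counts cover the same period and are directly comparable, and it does not assume that the 1091 labeled accounts are a subset of the 91318 registrations.

```typescript
// Counts taken from the list above.
const registrations = 91318;
const labeledSpambots = 1091;
const spambotsWithData = 126;

// Plain ratio with a guard against a non-positive denominator.
function ratio(part: number, whole: number): number {
  if (whole <= 0) {
    throw new RangeError("whole must be positive");
  }
  return part / whole;
}

// Share of labeled spambots for which registration data was captured.
const coverage = ratio(spambotsWithData, labeledSpambots);
// Labeled spambots relative to the number of registrations.
const labeledShare = ratio(labeledSpambots, registrations);

console.log(`coverage: ${(coverage * 100).toFixed(1)}%`);
console.log(`labeled share: ${(labeledShare * 100).toFixed(1)}%`);
```

This gives a coverage of about 11.5% and a labeled share of about 1.2%, so any classifier trained on this data would see only 126 positive examples and a heavily imbalanced dataset.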

Change 414742 had a related patch set uploaded (by Groovier1; owner: Groovier1):
[mediawiki/extensions/WikimediaEvents@master] Modify AICaptcha data collection code

https://gerrit.wikimedia.org/r/414742

Krinkle removed a subscriber: Krinkle.Mar 3 2018, 2:03 AM
Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.EditedJul 31 2018, 5:27 PM
Krinkle added a subscriber: LGoto.

@Tgr @LGoto Since January, the AICaptcha metric collection module has been registered in production on all wikis.

It's also loaded on Special:CreateAccount.

Is there an end date for this campaign? The project appears stalled or postponed. If that is the case, could we deregister the WikimediaEvents module until it is needed?

Tgr closed this task as Resolved.Jan 10 2019, 8:14 PM
Tgr added a subscriber: Krinkle.

@Krinkle sorry for the lack of response. Loading of the JS code was disabled in September. The module was deregistered last week.

The first iteration of the project concluded in June(ish), which was the official end date. The collected data was not useful for bot identification, but gave us pointers on what to do differently (see T188067: Move input device dynamics logging to backend). A second iteration did not happen, due to lack of time on everyone's part, so I'm marking this resolved. (If somebody is interested, feel free to reopen.) Thanks again to @Groovier for all your work on this! I think this was useful exploration; it's just that the captcha problem is not top priority ATM.