For experimenting with machine-learning-based captchas, the current plan is to collect three streams of data:
* keyboard / mouse dynamics via client-side event logging with the [[https://meta.wikimedia.org/wiki/Schema:InputDeviceDynamics|InputDeviceDynamics]] schema
* information on the properties of the registration via server-side event logging with the [[https://meta.wikimedia.org/wiki/Schema:ServerSideAccountCreation|ServerSideAccountCreation]] schema
* information on spamming behavior via publicly available MediaWiki logs (block log etc.)
Since the experimentation will probably happen on a Cloud VPS machine, we want to anonymize the data before it leaves stat1006. The anonymization plan is as follows:
* join the InputDeviceDynamics and block log data streams by username (and whatever other means necessary)
* discard all generic EventLogging data (i.e. the fields not starting with `event_`)
* discard user names
* randomize row order to prevent correlating known registrations with unknown ones
(ServerSideAccountCreation is only used to filter out some irrelevant registrations, such as those made via the API.)
So the anonymized data set consists of all the statistical fields of InputDeviceDynamics (dwell time statistics, flight time statistics, mouse movement statistics, click timing statistics) plus a "did the community identify this user as a spambot" flag.
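The anonymization steps above can be sketched roughly as follows. This is only an illustration under assumptions, not the actual pipeline: it assumes the InputDeviceDynamics rows are already loaded as dictionaries, that the username is available in a field called `event_username` (a hypothetical name), and that the block log has already been reduced to a set of blocked usernames. The `is_spambot` key and the `anonymize` helper are likewise made up for this example.

```python
import random


def anonymize(idd_rows, blocked_usernames, username_field="event_username", rng=None):
    """Return anonymized copies of the rows, in random order.

    idd_rows: list of dicts, one per InputDeviceDynamics event (assumed shape).
    blocked_usernames: set of usernames found in the block log (assumed input).
    username_field: hypothetical name of the field holding the username.
    """
    # SystemRandom draws from the OS entropy source, so the shuffle
    # cannot be reproduced from a known seed.
    rng = rng or random.SystemRandom()

    result = []
    for row in idd_rows:
        # Join with the block log by username before the name is dropped.
        is_spambot = row[username_field] in blocked_usernames

        # Keep only the schema-specific fields (prefixed with event_),
        # which discards the generic EventLogging data, and drop the username.
        anon = {
            key: value
            for key, value in row.items()
            if key.startswith("event_") and key != username_field
        }
        anon["is_spambot"] = is_spambot
        result.append(anon)

    # Randomize row order so that position in the output says nothing
    # about registration time or identity.
    rng.shuffle(result)
    return result
```

The function builds new dictionaries rather than modifying the input rows, so the raw data on the stats machine stays untouched and only the returned list would be exported.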