For experimenting with machine-learning-based captchas, the current plan is to collect three streams of data:
- keyboard / mouse dynamics via client-side event logging with the InputDeviceDynamics schema
- information on the properties of the registration via server-side event logging with the ServerSideAccountCreation schema
- information on spamming behavior via publicly available MediaWiki logs (block log etc.)
Since the experimentation will probably happen on a Cloud VPS machine, we want to anonymize the data before it leaves stat1006. (To be clear, there are no plans to publish the data; these are precautions due to Cloud VPS being a low-security environment.) The anonymization plan is as follows:
- join the InputDeviceDynamics and block log data streams by username (and whatever other means necessary)
- discard all generic EventLogging data (i.e. the fields not starting with event_)
- discard user names
- randomize row order to prevent correlating known registrations with unknown ones (and make sure there are enough rows for the shuffling to be effective)
(ServerSideAccountCreation is only used to filter out some irrelevant registrations, such as the ones happening via the API.)
So the anonymized data set consists of all the statistical fields of InputDeviceDynamics (dwell time statistics, flight time statistics, mouse movement statistics, click timing statistics) plus a "did the community identify this user as a spambot" flag.
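The anonymization steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual script: the username field name (`event_username`), the output flag name (`is_spambot`), the `min_rows` threshold and the in-memory list-of-dicts representation are all hypothetical, and the join is done by username only.

```python
import random

# Hypothetical name of the InputDeviceDynamics field holding the user name.
USERNAME_FIELD = "event_username"


def anonymize(rows, blocked_usernames, min_rows=1000, rng=None):
    """Return shuffled, anonymized copies of InputDeviceDynamics rows.

    rows: list of dicts, one per logged registration (assumed layout).
    blocked_usernames: set of user names found in the block log.
    min_rows: hypothetical lower bound; with too few rows, shuffling
        does little to prevent correlating known registrations.
    """
    if len(rows) < min_rows:
        raise ValueError("too few rows to anonymize safely")
    # SystemRandom draws from the OS entropy source, so the resulting
    # order cannot be reproduced from a seed.
    rng = rng or random.SystemRandom()

    result = []
    for row in rows:
        # Keep only schema-specific fields (event_*), minus the user name.
        record = {
            key: value
            for key, value in row.items()
            if key.startswith("event_") and key != USERNAME_FIELD
        }
        # Join with the block log: was this user blocked by the community?
        record["is_spambot"] = row[USERNAME_FIELD] in blocked_usernames
        result.append(record)

    # Randomize row order so position carries no information.
    rng.shuffle(result)
    return result
```

Filtering with ServerSideAccountCreation (for example, dropping API registrations) would happen before this step and is left out of the sketch.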