Data anonymization plan for AICaptcha data
Closed, Resolved | Public

Description

For experimenting with machine learning based captchas, the current plan is to collect three streams of data:

  • keyboard / mouse dynamics via client-side event logging with the InputDeviceDynamics schema
  • information on the properties of the registration via server-side event logging with the ServerSideAccountCreation schema
  • information on spamming behavior via publicly available MediaWiki logs (block log etc.)

Since the experimentation will probably happen on a Cloud VPS machine, we want to anonymize the data before it leaves stat1006. (To be clear, there are no plans to publish the data; these are precautions due to Cloud VPS being a low-security environment.) The anonymization plan is as follows:

  • join the InputDeviceDynamics and block log data streams by username (and whatever other means necessary)
  • discard all generic EventLogging data (i.e. the fields not starting with event_)
  • discard user names
  • randomize row order to prevent correlating known registrations with unknown ones (and make sure there are a lot of rows)

(ServerSideAccountCreation is only used to filter out some irrelevant registrations such as the ones happening via API).

So the anonymized data set consists of all the statistical fields of InputDeviceDynamics (dwell time statistics, flight time statistics, mouse movement statistics, click timing statistics) plus a "did the community identify this user as a spambot" flag.
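
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual script: the `event_username` field, the `is_spambot` flag name, the `min_rows` threshold and the shape of the inputs (a list of dicts plus sets of usernames) are all hypothetical, and the join is done by username only.

```python
import random


def anonymize(events, blocked_usernames, api_usernames=frozenset(),
              min_rows=1000, rng=None):
    """Return shuffled rows holding only statistical fields plus a spambot flag.

    events            -- list of dicts, one per InputDeviceDynamics event
                         (assumed shape; 'event_username' is a hypothetical field)
    blocked_usernames -- set of usernames taken from the public block log
    api_usernames     -- set of usernames to skip, e.g. API registrations
                         found via ServerSideAccountCreation
    min_rows          -- hypothetical lower bound on the output size
    """
    # SystemRandom draws from the OS entropy source, so the order is not
    # reproducible from a seed.
    rng = rng or random.SystemRandom()
    rows = []
    for event in events:
        username = event["event_username"]
        if username in api_usernames:
            continue  # filter out irrelevant registrations
        # Keep only event_* fields, and drop the username itself.
        row = {
            key: value
            for key, value in event.items()
            if key.startswith("event_") and key != "event_username"
        }
        # Join with the block log: was this account blocked?
        row["is_spambot"] = username in blocked_usernames
        rows.append(row)
    if len(rows) < min_rows:
        raise ValueError("too few rows to export safely")
    # Randomize row order so rows cannot be matched to registration order.
    rng.shuffle(rows)
    return rows
```

Only the returned rows would leave stat1006; the username sets and the raw events stay behind.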

Event Timeline

Tgr created this task. Feb 11 2018, 5:23 AM
Tgr updated the task description.
Tgr added a subscriber: Nuria.

@Nuria does this sound reasonable?

Tgr updated the task description. Feb 11 2018, 5:25 AM
Nuria added a comment. Feb 13 2018, 8:00 PM

Since the experimentation will probably happen on a Cloud VPS machine, we want to anonymize the data before it leaves stat1006.

I do not understand this. If the tests are on Labs, how is the data making it to stat1006?

Tgr added a comment. Feb 13 2018, 8:03 PM

It's collected via EventLogging in production. By "experimentation" I meant experimenting with the training of various classifiers (we probably don't want to do that on stat1006).

So the anonymized data set consists of all the statistical fields of InputDeviceDynamics (dwell time statistics, flight time statistics, mouse movement statistics, click timing statistics) plus a "did the community identify this user as a spambot" flag.

This set of fields seems fine to keep.

Tgr closed this task as Resolved. Feb 23 2018, 3:52 AM
Tgr claimed this task.

This was agreed on; nothing more to do here for now. Can be reopened if the data schema changes.