Make a labelled dataset for analysing account reputation score
Open, Needs Triage, Public

Description

Background

We are building a user account reputation score as a signal to be used by anti-abuse features (e.g. rate limits, CAPTCHAs).

As outlined in T370895: Build a first draft of a user account reputation score calculation, we will use a set of data points, weighted and combined to produce a continuous score. We will need to be able to compare different weights and different ways of combining the scores in order to improve the accuracy of the score. In order to make these comparisons we will need a ground truth dataset.
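As a rough illustration of how a ground truth dataset lets us compare weightings, the sketch below scores each account as a weighted sum of its data points and ranks candidate weightings by AUC (the probability that a randomly chosen good-faith account outscores a randomly chosen bad-faith one). The weighted sum, the choice of AUC as the metric, and all function names are assumptions made for this example; T370895 may combine the data points differently.

```python
from typing import Dict, Sequence


def weighted_score(features: Sequence[float], weights: Sequence[float]) -> float:
    """Combine an account's data points into one score (assumed: weighted sum)."""
    return sum(f * w for f, w in zip(features, weights))


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a good-faith account (label 1) outscores a
    bad-faith account (label 0); ties count as half a win."""
    good = [s for s, y in zip(scores, labels) if y == 1]
    bad = [s for s, y in zip(scores, labels) if y == 0]
    if not good or not bad:
        raise ValueError("need at least one account of each label")
    wins = sum((g > b) + 0.5 * (g == b) for g in good for b in bad)
    return wins / (len(good) * len(bad))


def compare_weightings(
    rows: Sequence[Sequence[float]],
    labels: Sequence[int],
    candidates: Dict[str, Sequence[float]],
) -> Dict[str, float]:
    """Return the AUC of each named candidate weighting on the labelled data."""
    return {
        name: auc([weighted_score(r, w) for r in rows], labels)
        for name, w in candidates.items()
    }
```

An AUC near 1.0 means the weighting separates the two groups well; 0.5 is no better than chance.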

Dataset generation

At its simplest, we could define this as a binary classification problem, where users are classified into good-faith users and bad-faith users. In that case, we'd need a labelled dataset of good-faith and bad-faith accounts.

This dataset should be as large as feasible, and contain real accounts. It would therefore be most practical to find a way to label the accounts automatically (e.g. indefinitely blocked accounts might be treated as bad-faith, and sysops as good-faith).
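A minimal sketch of such an automatic labelling rule is shown below. The inputs `is_sysop` and `block_expiry`, the use of the string "infinity" to mark a block with no expiry, and the decision to let a block take precedence over sysop status are all assumptions for illustration, not an agreed definition.

```python
from typing import Optional


def heuristic_label(is_sysop: bool, block_expiry: Optional[str]) -> Optional[str]:
    """Return "bad-faith", "good-faith", or None when neither rule applies.

    block_expiry is None for accounts that are not blocked; the value
    "infinity" is assumed to mean a block with no expiry.
    """
    if block_expiry == "infinity":
        # Assumption: a block with no expiry outweighs sysop status.
        return "bad-faith"
    if is_sysop:
        return "good-faith"
    # Accounts matching neither rule are left out of the dataset.
    return None
```

Accounts for which the function returns None would simply be excluded, so the dataset only contains accounts with a reasonably clear signal.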

Each account in the dataset would be represented by a vector of the data points (defined in the project overview) and a label "bad-faith" or "good-faith".
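One possible shape for a single row of the dataset is sketched below. The class name, the `account_id` field, and the representation of the data points as a plain tuple of floats are assumptions; the actual data points are the ones defined in the project overview and are not reproduced here.

```python
from dataclasses import dataclass
from typing import Tuple

LABELS = ("good-faith", "bad-faith")


@dataclass(frozen=True)
class LabelledAccount:
    """One dataset row: an identifier, the feature vector, and the label."""

    account_id: str               # hypothetical; could be a pseudonymous ID
    features: Tuple[float, ...]   # one value per data point, in a fixed order
    label: str                    # "good-faith" or "bad-faith"

    def __post_init__(self) -> None:
        # Reject anything other than the two labels used in this task.
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")
```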

Questions about privacy:

  • Should we keep the dataset private?
  • Should we anonymize the accounts?

Questions about data to include:

  • Should we combine accounts from different wikis in one dataset?
  • Should we include IP users or temporary users?

Scope of this task

  • Define how to generate a set of bad-faith users and a set of good-faith users
  • Generate the dataset

Event Timeline

"This dataset should be as large as feasible"

Perhaps the Research Team could help us determine how large the dataset needs to be for statistically reliable results.
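As a rough starting point only, one common rule of thumb for estimating a proportion (for example, the fraction of accounts a score classifies correctly) is n = z² · p(1 − p) / e², where e is the desired margin of error. The sketch below applies it; the formula choice, the default values, and the function name are assumptions for this example and are not a recommendation from the Research Team.

```python
import math


def sample_size_for_proportion(
    margin_of_error: float,
    expected_proportion: float = 0.5,  # 0.5 is the most conservative choice
    z: float = 1.96,                   # z-value for roughly 95% confidence
) -> int:
    """Minimum sample size to estimate a proportion within the given margin."""
    if not 0 < margin_of_error < 1:
        raise ValueError("margin_of_error must be between 0 and 1")
    if not 0 < expected_proportion < 1:
        raise ValueError("expected_proportion must be between 0 and 1")
    n = (z ** 2) * expected_proportion * (1 - expected_proportion) / margin_of_error ** 2
    return math.ceil(n)
```

For a margin of error of 5 percentage points at the defaults this gives 385 accounts. The figure applies to each group being estimated separately, so it is a lower bound rather than a target.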

For "ground truth" data collection, we should be careful about the composition of the good-faith and bad-faith samples.

For example, we would not want most of the good-faith accounts to be sysop accounts. Otherwise, the algorithm we use may start to "cheat" by learning what a sysop looks like, and other types of good-faith accounts may not be identified effectively.
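One simple way to limit this effect is to cap how many accounts any single category can contribute before the dataset is assembled. The sketch below assumes each account comes with a category tag; the function name, the tuple layout, and the category names used in the test data are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def cap_per_category(
    accounts: Sequence[Tuple[str, str]],  # (account_id, category) pairs
    max_per_category: int,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Randomly downsample any category that exceeds max_per_category."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_category: Dict[str, List[str]] = defaultdict(list)
    for account_id, category in accounts:
        by_category[category].append(account_id)

    sampled: List[Tuple[str, str]] = []
    for category in sorted(by_category):
        ids = by_category[category]
        if len(ids) > max_per_category:
            ids = rng.sample(ids, max_per_category)
        sampled.extend((account_id, category) for account_id in ids)
    return sampled
```

Capping by count is the bluntest option; sampling each category to a target share of the total would be a refinement.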

It would also be very helpful to have a field in the "ground truth" dataset that indicates the type of each account, e.g. the reason (category) it is considered good-faith or bad-faith. This would be much harder to produce, though, so it is not a must-have.