Background
We are building a user account reputation score to serve as a signal for anti-abuse features (e.g. rate limits and CAPTCHAs).
As outlined in T370895: Build a first draft of a user account reputation score calculation, we will weight and combine a set of data points to produce a continuous score. To improve the accuracy of the score, we will need to compare different weights and different ways of combining the scores. Making these comparisons requires a ground truth dataset.
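To illustrate why a ground truth dataset is needed, here is a minimal sketch of comparing two sets of weights against labelled accounts. It is not the calculation from T370895: the data point names (`account_age`, `edit_count`), the normalization of each data point to the range 0 to 1, the weighted average, and the 0.5 threshold are all assumptions made for the example.

```python
from typing import Mapping, Sequence, Tuple

# One labelled account: (data points, True if good-faith).
# Data point names and values are hypothetical, assumed normalized to [0, 1].
LabelledAccount = Tuple[Mapping[str, float], bool]


def reputation_score(data_points: Mapping[str, float],
                     weights: Mapping[str, float]) -> float:
    """Combine data points into one score using a weighted average (an assumption)."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted_sum = sum(w * data_points.get(name, 0.0) for name, w in weights.items())
    return weighted_sum / total_weight


def accuracy(accounts: Sequence[LabelledAccount],
             weights: Mapping[str, float],
             threshold: float = 0.5) -> float:
    """Fraction of accounts whose thresholded score matches the ground truth label."""
    correct = 0
    for data_points, is_good_faith in accounts:
        predicted_good_faith = reputation_score(data_points, weights) >= threshold
        if predicted_good_faith == is_good_faith:
            correct += 1
    return correct / len(accounts)


# Tiny made-up ground truth dataset, for illustration only.
accounts = [
    ({"account_age": 0.9, "edit_count": 0.8}, True),
    ({"account_age": 0.1, "edit_count": 0.05}, False),
    ({"account_age": 0.0, "edit_count": 0.9}, True),
    ({"account_age": 0.7, "edit_count": 0.1}, False),
]

weights_a = {"account_age": 1.0, "edit_count": 1.0}
weights_b = {"account_age": 1.0, "edit_count": 3.0}

accuracy_a = accuracy(accounts, weights_a)
accuracy_b = accuracy(accounts, weights_b)
```

With labelled accounts available, candidate weightings can be ranked by a metric such as accuracy. Other metrics (precision, recall) may suit an imbalanced dataset better.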
Dataset generation
At its simplest, we could define this as a binary classification problem, where users are classified as either good-faith or bad-faith. In that case, we'd need a labelled dataset of good-faith and bad-faith accounts.
This dataset should be as large as feasible and contain real accounts. It would therefore be most practical to find a way to label accounts automatically (e.g. indefinitely blocked accounts might be bad-faith, sysops might be good-faith).
Each account in the dataset would be represented by a vector of the data points (defined in the project overview) and a label "bad-faith" or "good-faith".
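As a rough sketch of the record format and the automatic labelling idea, the following assumes the two example heuristics mentioned above (indefinitely blocked means bad-faith, sysop means good-faith). The class names, field names, and the rule that a block takes precedence over sysop rights are hypothetical choices for illustration, not decisions made in this task.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple

GOOD_FAITH = "good-faith"
BAD_FAITH = "bad-faith"


@dataclass(frozen=True)
class AccountInfo:
    """Hypothetical raw facts about one account, gathered before labelling."""
    is_sysop: bool
    is_indefinitely_blocked: bool
    data_points: Tuple[float, ...]  # vector of the project's data points


@dataclass(frozen=True)
class LabelledAccount:
    """One dataset row: the data point vector plus its label."""
    data_points: Tuple[float, ...]
    label: str


def label_account(account: AccountInfo) -> Optional[LabelledAccount]:
    """Apply the example heuristics; return None when neither applies."""
    # Assumption: a block outweighs sysop rights if both are present.
    if account.is_indefinitely_blocked:
        return LabelledAccount(account.data_points, BAD_FAITH)
    if account.is_sysop:
        return LabelledAccount(account.data_points, GOOD_FAITH)
    return None


def build_dataset(accounts: Iterable[AccountInfo]) -> List[LabelledAccount]:
    """Keep only the accounts that the heuristics can label."""
    labelled = (label_account(account) for account in accounts)
    return [row for row in labelled if row is not None]


# Made-up sample input, for illustration only.
sample = [
    AccountInfo(is_sysop=True, is_indefinitely_blocked=False, data_points=(0.9, 0.8)),
    AccountInfo(is_sysop=False, is_indefinitely_blocked=True, data_points=(0.1, 0.0)),
    AccountInfo(is_sysop=False, is_indefinitely_blocked=False, data_points=(0.5, 0.4)),
]
dataset = build_dataset(sample)
```

Accounts that match neither heuristic are left out rather than guessed at, which keeps the labels cleaner at the cost of a smaller dataset.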
Questions about privacy:
- Should we keep the dataset private?
- Should we anonymize the accounts?
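If the answer to the anonymization question is yes, one possible approach is to replace each user name with a keyed hash. The sketch below uses HMAC-SHA256 from the Python standard library; the function name `pseudonymize` and the idea of holding a secret key outside the dataset are assumptions for illustration. This is pseudonymization rather than full anonymization, since the data point vector itself could still help identify an account.

```python
import hashlib
import hmac


def pseudonymize(user_name: str, secret_key: bytes) -> str:
    """Return a stable, keyed pseudonym for a user name (hypothetical helper)."""
    digest = hmac.new(secret_key, user_name.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# Example with a made-up key and user name. A real key would be kept secret
# and stored separately from the dataset.
example_key = b"example-key-not-for-real-use"
pseudonym = pseudonymize("ExampleUser", example_key)
```

The same user name and key always give the same pseudonym, so rows can still be deduplicated or joined without storing the name itself.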
Questions about data to include:
- Should we combine accounts from different wikis in one dataset?
- Should we include IP users or temporary users?
Scope of this task
- Define how to generate a set of bad-faith users and a set of good-faith users
- Generate the dataset