Background
We are building a user account reputation score as a signal for anti-abuse features (e.g. rate limits, CAPTCHAs).
As outlined in T370895: Build a first draft of a user account reputation score calculation, we will use a set of data points, weighted and combined to produce a continuous score. To improve the accuracy of the score, we will need to be able to compare different weights and different ways of combining the data points. In T371876: Make a labelled dataset for analysing account reputation score, we will create a ground-truth dataset.
Treating this as a binary classification problem, we will also need an analysis pipeline that can:
- take our dataset as an input
- classify the accounts into good-faith and bad-faith
- tell us how accurate the classification was
We may want to be able to run the pipeline in a Jupyter notebook. We may want to make the pipeline modular so we can compare different classifiers.
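As a rough illustration of what "modular" could mean here, the sketch below treats the classifier as a function that is passed into the pipeline, so two candidates can be compared on the same labelled data. Everything in it is an assumption for illustration: the data point names (`edit_count`, `reverted_edits`), the example accounts, the two scorers, and the convention that a score below the threshold means bad-faith. Plain accuracy is used only as a placeholder for whichever evaluation score we choose.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: an account is a dict of numeric data points, and a
# scorer maps an account to a continuous reputation score.
Account = Dict[str, float]
Scorer = Callable[[Account], float]


def run_pipeline(
    dataset: List[Tuple[Account, bool]],  # (data points, is_bad_faith label)
    scorer: Scorer,
    threshold: float,
) -> float:
    """Classify every account and return the fraction classified correctly."""
    correct = 0
    for account, is_bad_faith in dataset:
        # Assumption: a reputation score below the threshold means bad-faith.
        predicted_bad_faith = scorer(account) < threshold
        if predicted_bad_faith == is_bad_faith:
            correct += 1
    return correct / len(dataset)


# Made-up accounts, not taken from the real labelled dataset.
dataset = [
    ({"edit_count": 120.0, "reverted_edits": 2.0}, False),
    ({"edit_count": 3.0, "reverted_edits": 3.0}, True),
    ({"edit_count": 40.0, "reverted_edits": 1.0}, False),
    ({"edit_count": 30.0, "reverted_edits": 5.0}, True),
]


def edits_minus_reverts(account: Account) -> float:
    """Hypothetical scorer: penalise reverted edits heavily."""
    return account["edit_count"] - 10 * account["reverted_edits"]


def edit_count_only(account: Account) -> float:
    """Hypothetical scorer: ignore reverts entirely."""
    return account["edit_count"]


# Run the same pipeline once per candidate scorer and collect the results.
results = {
    scorer.__name__: run_pipeline(dataset, scorer, threshold=10.0)
    for scorer in (edits_minus_reverts, edit_count_only)
}
print(results)
```

Because the scorer is just a parameter, the same cell can be re-run in a notebook with a new scorer without touching the rest of the pipeline.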
What we need to define
Classifier
Our plan in T370895: Build a first draft of a user account reputation score calculation defines the classifier as a set of weights to multiply by the data points for each user, and a way to combine the weighted data points that will produce a continuous, numeric score.
We could do something simple like summing the weighted data points into one number (perhaps finding a way to transform that into a percentage), or something more complex like using one of the common machine learning methods (e.g. neural networks). This may depend on how much we trust our dataset/model, and how transparent our users want the generation of the score to be.
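The simple option could look something like the sketch below: multiply each data point by its weight, sum the results, and squash the sum into the range 0 to 100. The logistic function is only one possible way to get a percentage and is an assumption here, as are the `midpoint` and `scale` parameters, the data point names and the weights.

```python
import math
from typing import Dict


def weighted_sum(data_points: Dict[str, float], weights: Dict[str, float]) -> float:
    """Multiply each data point by its weight and add the results together."""
    return sum(weight * data_points[name] for name, weight in weights.items())


def to_percentage(raw: float, midpoint: float = 0.0, scale: float = 1.0) -> float:
    """Map a raw score onto 0..100 with a logistic curve (one possible choice).

    `midpoint` is the raw score that maps to 50 and `scale` controls how
    quickly the curve saturates. Both are hypothetical tuning parameters.
    """
    x = (raw - midpoint) / scale
    # Numerically stable form, so large negative inputs do not overflow.
    if x >= 0:
        return 100.0 / (1.0 + math.exp(-x))
    return 100.0 * math.exp(x) / (1.0 + math.exp(x))


# Made-up weights and data points, for illustration only.
weights = {"edit_count": 0.05, "reverted_edits": -1.0}
account = {"edit_count": 40.0, "reverted_edits": 1.0}

raw_score = weighted_sum(account, weights)
print(raw_score, to_percentage(raw_score))
```

A weighted sum like this stays easy to explain, since each weight shows directly how much a data point moves the score.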
Score
We'll need an evaluation score for measuring the success of the classifier (e.g. F-score). This is separate from the reputation score itself. We may want to choose the evaluation score depending on whether precision (not misclassifying good-faith users as bad-faith) or recall (finding all the bad-faith users) is more important.
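To make the precision/recall trade-off concrete, here is a minimal sketch of the F-beta score, which generalises the F-score: `beta` below 1 favours precision and `beta` above 1 favours recall. It assumes bad-faith is the positive class, and the labels and predictions are made up.

```python
from typing import List, Tuple


def confusion_counts(labels: List[bool], predictions: List[bool]) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives).

    Assumption: True means bad-faith, which is treated as the positive class.
    """
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    return tp, fp, fn


def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta score: the weighted harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Made-up example: four bad-faith accounts, two of which are caught,
# and no good-faith account is misclassified.
labels = [True, True, True, True, False, False]
predictions = [True, True, False, False, False, False]

tp, fp, fn = confusion_counts(labels, predictions)
for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_beta(tp, fp, fn, beta), 3))
```

In this example precision is 1.0 and recall is 0.5, so the score drops as `beta` increases and recall is given more weight.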
Threshold
The evaluation score will depend on the exact threshold used for classification, so each run of the pipeline could output the evaluation score at a range of thresholds.
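One way to picture this is a threshold sweep: classify the same scored accounts at several thresholds and report the evaluation score for each. The sketch below uses F1 as the evaluation score and assumes that a reputation score below the threshold means bad-faith. The scores, labels and thresholds are made up.

```python
from typing import Dict, List


def f1_at_threshold(scores: List[float], labels: List[bool], threshold: float) -> float:
    """F1 for the bad-faith class when 'score < threshold' means bad-faith."""
    predictions = [score < threshold for score in scores]
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    if tp == 0:
        return 0.0
    # Equivalent to the harmonic mean of precision and recall.
    return 2 * tp / (2 * tp + fp + fn)


def sweep(scores: List[float], labels: List[bool], thresholds: List[float]) -> Dict[float, float]:
    """Evaluate the classification at every threshold in the list."""
    return {t: f1_at_threshold(scores, labels, t) for t in thresholds}


# Made-up reputation scores (0..1) and bad-faith labels.
scores = [0.1, 0.2, 0.35, 0.6, 0.8, 0.9]
labels = [True, True, False, True, False, False]

results = sweep(scores, labels, thresholds=[0.25, 0.5, 0.75])
best_threshold = max(results, key=results.get)

for threshold, f1 in results.items():
    print(threshold, round(f1, 3))
print("best:", best_threshold)
```

Reporting the whole sweep rather than a single number makes it easier to compare classifiers that peak at different thresholds.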
Once we have a reputation score whose accuracy we are confident in, we can define more buckets than just good-faith and bad-faith for real-world usage.
Scope of this task
- Build a pipeline that can be run repeatedly in order to compare the accuracy of different ways of producing an account reputation score