
Make a pipeline for optimising account reputation score
Open, Needs Triage, Public

Description

Background

We are building a user account reputation score as a signal to be used by anti-abuse features (e.g. rate limits and CAPTCHAs).

As outlined in T370895: Build a first draft of a user account reputation score calculation, we will use a set of data points, weighted and combined to produce a continuous score. To improve the accuracy of the score, we will need to be able to compare different weights and different ways of combining the weighted data points. In T371876: Make a labelled dataset for analysing account reputation score, we will make a ground truth dataset to compare against.

Treating this as a binary classification problem, we will also need an analysis pipeline that can:

  • take our dataset as an input
  • classify the accounts into good-faith and bad-faith
  • tell us how accurate the classification was

We may want the pipeline to be runnable in a Jupyter notebook, and modular so that we can compare different classifiers.
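The three steps listed above, with a swappable classifier, could be sketched roughly as follows. This is only an illustration: the `Account` and `Classifier` types, the `run_pipeline` function, the data point names (`edit_count`, `blocks`) and the toy data are all hypothetical, and bad-faith is assumed to be the positive class.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: an account is a dict of named data points, and a
# classifier returns True for bad-faith and False for good-faith.
Account = Dict[str, float]
Classifier = Callable[[Account], bool]


def run_pipeline(
    dataset: List[Tuple[Account, bool]],
    classifier: Classifier,
) -> Dict[str, int]:
    """Classify every labelled account and count the four possible outcomes."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for account, is_bad_faith in dataset:
        predicted_bad = classifier(account)
        if predicted_bad and is_bad_faith:
            counts["tp"] += 1  # bad-faith account correctly flagged
        elif predicted_bad:
            counts["fp"] += 1  # good-faith account wrongly flagged
        elif is_bad_faith:
            counts["fn"] += 1  # bad-faith account missed
        else:
            counts["tn"] += 1  # good-faith account correctly passed
    return counts


# Made-up example data; the real labelled dataset is the subject of T371876.
toy_dataset = [
    ({"edit_count": 120.0, "blocks": 0.0}, False),
    ({"edit_count": 3.0, "blocks": 2.0}, True),
    ({"edit_count": 1.0, "blocks": 0.0}, True),
    ({"edit_count": 40.0, "blocks": 1.0}, False),
]


def flag_if_blocked(account: Account) -> bool:
    return account["blocks"] > 0


def flag_if_few_edits(account: Account) -> bool:
    return account["edit_count"] < 10


# Because a classifier is just a function, comparing two of them is a loop.
for name, clf in [("blocked", flag_if_blocked), ("few_edits", flag_if_few_edits)]:
    print(name, run_pipeline(toy_dataset, clf))
```

Keeping the classifier as a plain callable means the same function can be run from a notebook cell or a script without changes.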

What we need to define

Classifier

Our plan in T370895: Build a first draft of a user account reputation score calculation defines the classifier as a set of weights to multiply by the data points for each user, and a way to combine the weighted data points into a continuous, numeric score.

We could do something simple like summing all the weighted data points into one number (perhaps finding a way to transform that into a percentage), or something more complex like using one of the common machine learning methods (e.g. neural networks). This may depend on how much we trust our dataset/model, and how transparent our users want the generation of the score to be.
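As a minimal sketch of the simple option, the snippet below multiplies each data point by its weight, sums the results, and squashes the sum into the range 0 to 1 with a logistic function. The logistic transform is just one assumed way of getting something percentage-like; the function names, data point names and weights are invented for illustration and are not taken from T370895.

```python
import math
from typing import Dict


def weighted_sum(data_points: Dict[str, float], weights: Dict[str, float]) -> float:
    """Multiply each data point by its weight and add the results together."""
    # Data points missing from an account are treated as 0 (an assumption).
    return sum(w * data_points.get(name, 0.0) for name, w in weights.items())


def to_unit_interval(raw: float) -> float:
    """Map any real number into (0, 1) with a logistic function (assumed choice)."""
    return 1.0 / (1.0 + math.exp(-raw))


# Illustrative weights only: positive for signals of good faith,
# negative for signals of bad faith.
weights = {"edit_count": 0.02, "blocks": -1.5}
account = {"edit_count": 50.0, "blocks": 1.0}

raw = weighted_sum(account, weights)  # 0.02 * 50 + (-1.5) * 1
score = to_unit_interval(raw)         # lies between 0 and 1
print(f"raw={raw:.2f} score={score:.1%}")
```

A sum like this stays transparent, because each data point's contribution to the final score can be read directly from its weight.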

Score

We'll need an evaluation score for measuring the success of the classifier (e.g. F-score). This is separate from the reputation score itself. We may want to choose the evaluation score depending on whether precision (not misclassifying good-faith users as bad-faith) or recall (finding all the bad-faith users) is more important.
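To make the trade-off concrete, here is a small sketch of precision, recall and the F-beta score computed from true positive, false positive and false negative counts. It assumes bad-faith is the positive class. The function names and example counts are illustrative. A beta below 1 favours precision and a beta above 1 favours recall.

```python
def precision(tp: int, fp: int) -> float:
    """Of the accounts flagged as bad-faith, the fraction that really are."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Of the truly bad-faith accounts, the fraction that were flagged."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall; beta=1 gives F1."""
    p = precision(tp, fp)
    r = recall(tp, fn)
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


# Made-up counts: 8 bad-faith accounts caught, 2 good-faith accounts
# wrongly flagged, 8 bad-faith accounts missed.
tp, fp, fn = 8, 2, 8
print(f"precision={precision(tp, fp):.2f} recall={recall(tp, fn):.2f}")
for beta in (0.5, 1.0, 2.0):
    print(f"F{beta:g} = {f_beta(tp, fp, fn, beta):.3f}")
```

With these counts precision (0.8) is higher than recall (0.5), so the score is highest at beta 0.5 and lowest at beta 2.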

Threshold

The evaluation score will depend on the exact threshold used to turn the continuous reputation score into a classification, so each run of the pipeline could output the evaluation score at a range of thresholds.
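One way to picture this is a sweep that recomputes precision, recall and F1 at several thresholds. The sketch below assumes that a lower reputation score means the account is more likely bad-faith, so an account is flagged when its score falls below the threshold. That direction, the `sweep_thresholds` name and the data are assumptions made for the example.

```python
from typing import List, Tuple


def sweep_thresholds(
    scored: List[Tuple[float, bool]],
    thresholds: List[float],
) -> List[Tuple[float, float, float, float]]:
    """Return (threshold, precision, recall, F1) for each threshold.

    `scored` holds (reputation_score, is_bad_faith) pairs. An account is
    flagged as bad-faith when its score is below the threshold (assumption).
    """
    results = []
    for t in thresholds:
        tp = sum(1 for s, bad in scored if s < t and bad)
        fp = sum(1 for s, bad in scored if s < t and not bad)
        fn = sum(1 for s, bad in scored if s >= t and bad)
        p = tp / (tp + fp) if (tp + fp) else 0.0
        r = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        results.append((t, p, r, f1))
    return results


# Made-up scores and labels, for illustration only.
scored = [
    (0.1, True), (0.3, True), (0.4, False),
    (0.6, True), (0.8, False), (0.9, False),
]
results = sweep_thresholds(scored, [0.2, 0.5, 0.7])
for t, p, r, f1 in results:
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f} F1={f1:.2f}")

# Pick the threshold with the highest F1 on this toy data.
best_threshold = max(results, key=lambda row: row[3])[0]
print("best threshold by F1:", best_threshold)
```

On this toy data, raising the threshold increases recall and lowers precision, which is the trade-off the choice of evaluation score has to settle.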

Once we have a reputation score whose accuracy we are confident in, we can define more buckets than just good-faith and bad-faith for real-world usage.

Scope of this task
  • Build a pipeline that can be run repeatedly in order to compare the accuracy of different ways of producing an account reputation score