This card is done when we have evaluated several potential methodologies and research questions around systemic bias in ORES and are ready to get to work (time permitting).
Some initial discussion: https://meta.wikimedia.org/wiki/Research_talk:Revision_scoring_as_a_service#Perpetuating_bias
The items below are not necessarily in chronological order. We might want to approach the problem of systemic bias in ORES from several perspectives.
Suggested work: Build knowledge that helps us see the general problem boundaries, and where we're excluding hypotheses.
- How to do a more systematic review? See "Procedures for Performing Systematic Reviews" (Kitchenham, 2004), http://people.ucalgary.ca/~medlibr/kitchenham_2004.pdf, a handbook for software engineers on making unbiased surveys of existing literature.
- What is known about systemic bias in research?
- Find writing about how hypotheses, classes, and labels are selected.
Identify our a priori assumptions
Suggested work: Pick apart our research plan. Find alternative hypotheses that should be tested.
What are our hypotheses? Which are we knowingly including and excluding? Which hypotheses are we not discriminating between? Call out prejudices and gaps, and explicitly list them. Try to state these in terms of hypotheses as they relate to the experiment.
(AI) Already Included, the hypothesis is tested by our experiment.
(NT) Needs Testing, we should add a feature or other design elements which help us discriminate whether this hypothesis might be true.
(WNT) Will Not Test, we've considered it but have decided not to pursue.
- (AI) Edit acceptance by the editor community may be related to the number of times curse words are used.
- (AI) Edit acceptance may be related to the number of times informal or familiar words are used.
- (AI) The editor's state of mind matters to what they are writing. The reverting editor's state of mind also matters.
- (AI) We're using indirect measurements of (original) editor frame of mind, by taking into account time of day and day of week.
- (NT) We aren't measuring anything about the editors doing the reverting or wp10 nomination and voting, only about the editor making a change under test.
- (NT) We should correlate against other editor observable properties, such as pace of work.
- (NT) We should look at correlations against other editor internal state, like their mood as seen through their own writing, and prior behavior.
- (NT) The editor has a self-perception of their edit; this is its own set of classifications.
- (AI) The revert was a correct decision, that we might want to emulate. (NT) Offer alternative hypotheses.
- (NT) Some reverts are helpful but others should have been avoided.
- (NT) Revert is a multiclass decision that we are rounding off to a binary.
- (NT) We care about drama caused by a revert. Measure it, and its effect on retention. Drama must be hand-keyed by a third party; self-reporting looks like it will be a bust.
- (NT) Since reverted is a subjective decision, we are perpetuating all existing biases. Testing this bias is easy: train one model on 2004 and another on 2014, then compare their predictions.
- (NT) Editors should be given a "why" menu for labeling and self-reporting their motivation. This is one way to discriminate a multiclass.
- (WNT) In the editor UI, revert decisions could be split into verdict and sentencing workflows, configured as a peer review mechanism for example. This is a reasonable way to begin unpacking revert: a revert is initiated by an editor's judgement of revision quality, and then there is a second decision whether to take the revert action.
- (AI) The wp10 scale is helpful, and is used correctly on average. (WNT) We aren't proposing any alternatives, yet.
- (NT) Normalize the wp10 scale by the number of articles in each class. Does it reduce any dimensions if we make the classifier's output more linear?
- Many more hypotheses about ORES can be explicitly eliminated as obviously wrong, but we need to look through the different hypothesis spaces systematically.
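The 2004-versus-2014 comparison suggested above can be sketched as follows. This is only an illustration under stated assumptions: the one-feature cutoff "model", the feature (imagine a badword count), and all data points are hypothetical stand-ins, not the ORES models or real revisions. The point is the shape of the test: train one model per period, score the same revisions with both, and measure how often they disagree.

```python
def train_cutoff_model(samples):
    """Toy stand-in for a real classifier (an assumption, not ORES code).

    samples: list of (feature_value, was_reverted) pairs.
    Learns a cutoff halfway between the two class means and predicts
    "reverted" for values above it. Assumes both classes are present and
    that reverted edits tend to have the higher feature value.
    """
    reverted = [x for x, y in samples if y]
    kept = [x for x, y in samples if not y]
    cutoff = (sum(reverted) / len(reverted) + sum(kept) / len(kept)) / 2
    return lambda x: x > cutoff


def disagreement_rate(model_a, model_b, revisions):
    """Fraction of revisions on which the two models predict differently."""
    differing = sum(1 for x in revisions if model_a(x) != model_b(x))
    return differing / len(revisions)


# Hypothetical training data for each period: (feature_value, was_reverted).
edits_2004 = [(0, False), (1, False), (4, True), (5, True)]
edits_2014 = [(0, False), (1, True), (2, True), (5, True)]

model_2004 = train_cutoff_model(edits_2004)
model_2014 = train_cutoff_model(edits_2014)

# Both models score the same evaluation revisions.
shared_eval = [0, 1, 2, 3, 4, 5]
rate = disagreement_rate(model_2004, model_2014, shared_eval)
print(f"models disagree on {rate:.0%} of revisions")
```

A high disagreement rate would suggest that revert norms drifted between the two periods. It does not say which period's norms are the better ones to emulate.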
Is the training data biased?
- Yes, it's recent.
- It's being fed to us by another process (?), which is not necessarily a random sample. Is this more instructive than a sample? 
- Revert decisions are made by biased humans.
- Revert is coded as a binary, but we could add classes and more dimensions.
How is the choice of classifier algorithm a bias?
- We've chosen supervised learning, in which we define inputs (causality), the set of classifications, training data, and features. We have encoded norms and assumptions in all of these design parameters, and in determining the hypotheses.
Are the chosen classifications biased?
- Yes, they grew out of existing norms. See how norms change.
- Can our classifier abstain from judgement? I read that probabilistic classification is good at that. If so, then we should actually have three categories, (reverted, not reverted, and undecided).
- Notice that e.g. ScoredRevisions has a threshold below which revisions are not highlighted as "likely needing revert".
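A minimal sketch of the three-category idea, assuming the classifier exposes a probability that a revision will be reverted. The function name and both threshold values are hypothetical and would need tuning against real scores; nothing here is taken from ORES or ScoredRevisions.

```python
def classify_with_abstention(p_reverted, lower=0.2, upper=0.8):
    """Map a revert probability to one of three categories.

    lower and upper are illustrative thresholds (assumptions). Anything
    strictly between them is treated as too uncertain to judge.
    """
    if not 0.0 <= p_reverted <= 1.0:
        raise ValueError("p_reverted must be a probability in [0, 1]")
    if p_reverted >= upper:
        return "reverted"
    if p_reverted <= lower:
        return "not reverted"
    return "undecided"


for p in (0.05, 0.5, 0.95):
    print(p, classify_with_abstention(p))
```

Widening the gap between the two thresholds makes the classifier abstain more often, trading coverage for fewer confident mistakes.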
What is our motivation and what are the objectives?
- As the metawiki project page explains, to provide a standardized scoring service, so bots are easier to port across wikis, easier to write and support, and new tools can flourish.
- Provide a library and a service.
- Make further research easier by publishing data sets and tools.
Identify the methodology used to build the existing system.
How are badwords and other reference data created?
- Start with AbuseFilter dump or other existing lists. E.g.:
- Native speaker hand codes.
Does our model match the real causality?
Suggested work: Model each causal step, write features and models to cover all inputs, and to infer things about latent variables.
What is the causal structure of our model? 
- Original edit:
- Inputs: state of article(s), identity of author, language, local time
- Latent: mood of author, experience of author, sources available
- Outputs: textual delta, time of edit
- Revert:
- Inputs: delta, initial (and final) state of article, identity of author, identity of editor, local time
- Latent: mood of editor, experience of editor, sources available, existing relationship with author, choice of wording and phrase
- Outputs: did revert? Stated reason for revert.
- wp10 assessment:
- Inputs: current state of article, identity of judge(s), current norms
- Latent: quality of other articles
- Outputs: article class, time of judgement
- Our scoring:
- Inputs: article and editor metadata as above, reference data, e.g. badwords. Choice of training data, choice of model
- Latent: model
- Outputs: article class or score, model revision
Statistically estimating what we don't know
Suggested work: Get diagnostic information about the health of our classifiers. Compare alternative models and output classes.
It looks like scoring_model.test already does some cross-validation on the model. We should also evaluate bias at every test point: plot the error on the training set and on the cross-validation set as functions of the number of training examples, for some set of training set sizes. Use the resulting learning curve to determine whether we have high bias or high variance, meaning we need to adjust our hypotheses.
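The learning-curve procedure can be sketched like this. Everything in it is an assumption made for illustration: the one-feature cutoff classifier stands in for the real model, the data is synthetic, and it does not call scoring_model.test or any other ORES code.

```python
import random


def fit_cutoff(samples):
    """Toy classifier: cutoff halfway between the two class means.

    samples: list of (feature_value, was_reverted). Both classes must appear.
    """
    pos = [x for x, y in samples if y]
    neg = [x for x, y in samples if not y]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def error_rate(cutoff, samples):
    """Fraction of samples where the prediction (x > cutoff) is wrong."""
    wrong = sum(1 for x, y in samples if (x > cutoff) != y)
    return wrong / len(samples)


def learning_curve(train, validation, sizes):
    """Return (n, training_error, validation_error) for each training size n."""
    curve = []
    for n in sizes:
        subset = train[:n]
        cutoff = fit_cutoff(subset)
        curve.append((n, error_rate(cutoff, subset), error_rate(cutoff, validation)))
    return curve


rng = random.Random(0)  # fixed seed so the sketch is repeatable


def make_samples(count):
    # Synthetic data: labels alternate so every prefix of size >= 2 has both
    # classes; "reverted" samples are drawn from a higher-mean distribution.
    return [(rng.gauss(2.0 if i % 2 else 0.0, 1.0), bool(i % 2)) for i in range(count)]


train, validation = make_samples(200), make_samples(200)
curve = learning_curve(train, validation, [2, 10, 50, 200])
for n, train_err, val_err in curve:
    print(f"n={n:3d}  train error={train_err:.2f}  validation error={val_err:.2f}")
```

Reading the curve: if both errors stay high and close together as n grows, that points to high bias (the hypothesis space is too restrictive). If training error is low but validation error stays well above it, that points to high variance.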
Do cluster analysis (unsupervised classification), and compare against our classes. See how our labels line up with the centroids of impartially chosen classes. For example, crude vandalism would stand out as its own cluster, but its centroid does not match reverted well.
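One way to compare clusters against our labels is a per-cluster label breakdown plus a purity score. The sketch below assumes the cluster assignments have already been produced by some unsupervised method (not shown), and the cluster ids and labels are made-up examples rather than real revision data.

```python
from collections import Counter, defaultdict


def label_breakdown(cluster_ids, labels):
    """Count how many items of each label fall into each cluster."""
    table = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        table[cluster][label] += 1
    return table


def purity(cluster_ids, labels):
    """Fraction of items that carry the majority label of their cluster.

    1.0 means every cluster maps cleanly onto one label; values near the
    share of the largest label mean the clusters say little about the labels.
    """
    table = label_breakdown(cluster_ids, labels)
    majority_total = sum(counts.most_common(1)[0][1] for counts in table.values())
    return majority_total / len(labels)


# Hypothetical cluster assignments and revert labels for eight revisions.
cluster_ids = [0, 0, 0, 1, 1, 1, 2, 2]
labels = ["reverted", "reverted", "reverted", "kept", "kept", "reverted", "kept", "kept"]

for cluster, counts in sorted(label_breakdown(cluster_ids, labels).items()):
    print(cluster, dict(counts))
print("purity:", purity(cluster_ids, labels))
```

Clusters with a mixed breakdown are the interesting ones: they mark regions where the reverted label does not line up with the structure the data shows on its own.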
Dietterich, Thomas G., and Eun Bae Kong. Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State University, 1995.