**This card is done when** we have evaluated several potential methodologies and research questions around systemic bias in ORES and are ready to get to work (time permitting).
Some initial discussion: https://meta.wikimedia.org/wiki/Research_talk:Revision_scoring_as_a_service#Perpetuating_bias
== Methodology ==
Not necessarily in chronological order. These are ways we might approach the problem of systemic bias in ORES.
=== Literature review ===
* How to do a more systematic review?
- Systematic review handbook for software engineers.
* What is known about systemic bias in research?
* How are the hypotheses selected?
* Classification and label selection.
=== Identify our a priori assumptions ===
What are our hypotheses? Which are we knowingly including and excluding? Which hypotheses are we not discriminating between? Call out prejudices and gaps, and explicitly list them. Try to state these in terms of hypotheses as they relate to the experiment.
* Edit acceptance by the editor community may be related to the number of times curse words are used.
* Edit acceptance may be related to the number of times informal or familiar words are used.
* The editor's state of mind matters.
- We're using indirect measurements of (original) editor frame of mind, by taking into account time of day and day of week.
- We aren't measuring anything about the editors doing the reverting or wp10 nomination and voting, only about the editor making a change under test.
- We should correlate against other editor observable properties, such as pace of work.
- We should look at correlations against other editor internal state, like their mood as seen through their own writing, and prior behavior.
- The editor has a self-perception of their edit; this is its own set of classifications.
* The revert was a correct decision that we might want to emulate.
- Offer alternative hypotheses to this. Some reverts are helpful but others should have been avoided.
- We care about drama caused by a revert. Measure it, and its effect on retention. Drama must be hand-keyed by a third party; self-reporting looks like it will be a bust.
- Since a revert is a subjective decision (it can be for cause or not), a model trained on reverts perpetuates whatever biases went into those decisions.
* We should give editors a "why" menu.
* And split a revert decision into verdict and sentencing, which could be reviewed by a third editor.
* The wp10 scale is helpful, and is used correctly on average. Not proposing any alternatives to this opinion.
* Many hypotheses about ORES can be eliminated by being obviously wrong, but we need to do this systematically.
Is the training data biased?
* Yes, it's recent.
* The human outcomes are biased.
* The revert outcomes are inappropriately mushed into a binary. WP10 is probably not well distributed, either. Should we try to expand to a continuous scale, and then look at the windows covered by each wp10 category? Or does that go against classification? Oof, we'd have to give each category a place on the scale, which will make a nonlinear space. Unless we normalize by the number that fall into each group?
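One possible reading of "normalize by the number that fall into each group" is to give each wp10 category a window on a [0, 1] scale whose width equals that category's share of the articles. The sketch below illustrates only that idea. The class names assume the English Wikipedia assessment scale, and the counts are made up for the example, not measured.

```python
from itertools import accumulate

# Assumed ordering of wp10 classes, lowest to highest quality.
WP10_ORDER = ["Stub", "Start", "C", "B", "GA", "FA"]


def class_windows(counts, order=WP10_ORDER):
    """Return {class: (lower, upper)} windows on a [0, 1] scale.

    Each window's width is the fraction of articles in that class,
    so the scale is uniform in "number of articles", not in quality.
    """
    total = sum(counts[name] for name in order)
    if total == 0:
        raise ValueError("no articles to place on the scale")
    upper_edges = [s / total for s in accumulate(counts[name] for name in order)]
    windows = {}
    lower = 0.0
    for name, upper in zip(order, upper_edges):
        windows[name] = (lower, upper)
        lower = upper
    return windows


# Hypothetical counts, for illustration only.
counts = {"Stub": 500, "Start": 300, "C": 100, "B": 60, "GA": 30, "FA": 10}
windows = class_windows(counts)
for name in WP10_ORDER:
    lo, hi = windows[name]
    print(f"{name:5s} [{lo:.2f}, {hi:.2f})")
```

With a skewed distribution like this one, the top classes end up squeezed into a very narrow slice of the scale, which is one way to see how unevenly the categories are populated.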
How is the choice of model algorithm a bias?
* We've chosen supervised learning, in which we define the inputs (causality) and the set of classifications, and encode some norms via our choice of training data and features.
Are the chosen classifications biased?
* Yes. They are defined by norms. One could argue that this is unbiased, but as norms change, the biases will be revealed. Compare a training set from the first few years of WP to the norms captured in recent data.
* Can we abstain from judgement? I read that probabilistic classification is good at that. If so, then we should actually have three categories: reverted, stays, and unknown.
What are our motivation and objectives?
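A minimal sketch of what abstaining could look like, assuming the model emits a probability that an edit will be reverted. The function name and the two thresholds are hypothetical and untuned; they only show how a middle band of probabilities becomes the "unknown" category.

```python
def classify_with_abstain(p_reverted, lower=0.2, upper=0.8):
    """Map a predicted revert probability to one of three labels.

    `lower` and `upper` are illustrative thresholds, not tuned values.
    Anything strictly between them is left unjudged.
    """
    if not 0.0 <= p_reverted <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    if p_reverted >= upper:
        return "reverted"
    if p_reverted <= lower:
        return "stays"
    return "unknown"


labels = [classify_with_abstain(p) for p in (0.05, 0.5, 0.93)]
print(labels)
```

Widening the gap between the thresholds trades coverage for confidence: more edits are left as "unknown", but the ones we do judge are the ones the model is surest about.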
Identify the methodology that built the existing system.
=== Does our model causally match reality? ===
What is the causal structure of the model? Make sure we are providing all the available inputs.
* Original edit:
- Inputs: state of article(s), identity of author, language
- Latent: mood of author, experience of author, sources available
- Outputs: textual delta, time of edit
* Revert:
- Inputs: delta, initial (and final) state of article, identity of author, identity of editor
- Latent: mood of editor, experience of editor, sources available, existing relationship with author, choice of wording and phrase
- Outputs: did revert? Stated reason for revert. Time of action.
* wp10 assessment:
- Inputs: current state of article, identity of judge(s), current norms
- Latent: quality of other articles
- Outputs: article class, time of judgement
* Our scoring:
- Inputs: article and editor metadata, reference data: badwords
- Latent: Choice of training data, choice of model
- Outputs: article class or score, model revision
* Cluster analysis (the unsupervised analogue of classification). Maybe we should run one to see how our labels line up with the centroids of impartially chosen classes. For example, crude vandalism would stand out as its own cluster.
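As a rough sketch of that comparison: cluster the edits without looking at the labels, then cross-tabulate cluster membership against the labels we already have. The code below uses a plain k-means (Lloyd's algorithm) seeded with the first k points, which is a simplification. The two features and all data values are invented for illustration and are not ORES features.

```python
import math
from collections import Counter


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm, seeded with the first k points."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return assignment, centroids


def cross_tab(assignment, labels):
    """Count how often each (cluster, label) pair occurs."""
    return Counter(zip(assignment, labels))


# Hypothetical 2-D features per edit, e.g. (badword ratio, share of text removed).
points = [(0.0, 0.1), (0.9, 0.8), (0.1, 0.0), (0.8, 0.9),
          (0.05, 0.05), (0.85, 0.95), (0.1, 0.1)]
labels = ["stays", "reverted", "stays", "reverted",
          "stays", "reverted", "reverted"]

assignment, centroids = kmeans(points, k=2)
table = cross_tab(assignment, labels)
for (cluster, label), n in sorted(table.items()):
    print(cluster, label, n)
```

In this toy data, one "reverted" edit lands in the cluster that otherwise holds only edits that stayed. Cells like that are where we would look for reverts that the features alone do not explain.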
=== Statistically computing our bias and variance ===
Looks like the scoring_model.test function already starts to do this. Elaborate with the functions below.
Evaluate bias at every test point. Plot the error on the training set and on the cross-validation set as functions of the number of training examples, for some set of training-set sizes. Use the resulting learning curve to determine whether we have high bias or high variance, meaning we need to adjust hypotheses.
Dietterich, Thomas G., and Eun Bae Kong. Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State University, 1995.
* Compare training bias and training error. Check learning curves.
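The learning-curve check can be sketched as follows. This does not use scoring_model.test; it substitutes a toy nearest-centroid classifier and synthetic one-dimensional data, both of which are assumptions made only to keep the example self-contained.

```python
import random


def fit_centroids(xs, ys):
    """Toy stand-in model: the mean feature value of each class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(model, x):
    """Predict the class whose centroid is closest to x."""
    return min(model, key=lambda y: abs(x - model[y]))


def error_rate(model, xs, ys):
    wrong = sum(predict(model, x) != y for x, y in zip(xs, ys))
    return wrong / len(xs)


def learning_curve(train, validation, sizes):
    """Return (m, training error, validation error) for each size m."""
    vx, vy = zip(*validation)
    rows = []
    for m in sizes:
        xs, ys = zip(*train[:m])
        model = fit_centroids(xs, ys)
        rows.append((m, error_rate(model, xs, ys), error_rate(model, vx, vy)))
    return rows


def sample(rng, n):
    """Synthetic data: two overlapping Gaussian classes."""
    data = []
    for _ in range(n):
        y = rng.random() < 0.5
        x = rng.gauss(1.0 if y else 0.0, 0.7)
        data.append((x, y))
    return data


rng = random.Random(0)
train, validation = sample(rng, 400), sample(rng, 200)
curve = learning_curve(train, validation, sizes=[10, 50, 100, 200, 400])
for m, train_err, val_err in curve:
    print(f"m={m:4d}  train={train_err:.3f}  validation={val_err:.3f}")
```

The usual reading is that if both errors level off at a similar, high value, the model is biased (underfitting) and more data will not help. If training error stays low while validation error stays well above it, the model has high variance.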