**This card is done when** we have evaluated several potential methodologies and research questions around systemic bias in ORES and are ready to get to work (time permitting).
Some initial discussion: https://meta.wikimedia.org/wiki/Research_talk:Revision_scoring_as_a_service#Perpetuating_bias
== Methodologies ==
We might want to approach the problem of systemic bias in ORES from several perspectives, not necessarily in chronological order.
=== Literature review ===
Suggested work: Build knowledge that helps us see the general problem boundaries, and where we're excluding hypotheses.
* How to do a more systematic review?
http://people.ucalgary.ca/~medlibr/kitchenham_2004.pdf
Procedures for Performing Systematic Reviews: a handbook for software engineers to make unbiased surveys of existing literature.
* What is known about systemic bias in research?
* Find writing about how hypotheses, classes, and labels are selected.
=== Identify our a priori assumptions ===
Suggested work: Pick apart our research plan. Find alternative hypotheses that should be tested.
What are our hypotheses? Which are we knowingly including and excluding? Which hypotheses are we not discriminating between? Call out prejudices and gaps, and explicitly list them. Try to state these in terms of hypotheses as they relate to the experiment.
> _Key_: (AI) Already Included, the hypothesis is tested by our experiment. (NT) Needs Testing, we should add a feature or other design which lets us discriminate whether this hypothesis might be true. (WNT) Will Not Test, we've considered it and have decided not to pursue.
* (AI) Edit acceptance by the editor community may be related to the number of times curse words are used.
* (AI) Edit acceptance may be related to the number of times informal or familiar words are used.
* (AI) The editor's state of mind matters to what they are writing. The reverting editor's state of mind also matters.
- (AI) We're using indirect measurements of (original) editor frame of mind, by taking into account time of day and day of week.
- (NT) We aren't measuring anything about the editors doing the reverting or wp10 nomination and voting, only about the editor making a change under test.
- (NT) We should correlate against other editor observable properties, such as pace of work.
- (NT) We should look at correlations against other editor internal state, like their mood as seen through their own writing, and prior behavior.
- (NT) The editor has a self-perception about their edit; this is its own set of classifications.
* (AI) The revert was a correct decision, that we might want to emulate. (NT) Offer alternative hypotheses.
- (NT) Some reverts are helpful but others should have been avoided.
- (NT) Revert is a multiclass decision that we are rounding off to a binary.
- (NT) We care about drama caused by a revert. Measure it, and its effect on retention. Drama must be hand-keyed by a third party; self-reporting looks like it will be a bust.
- (NT) Since reverted is a subjective decision, we are perpetuating all existing biases. Testing this bias is easy: train one model on 2004 and another on 2014, then compare their predictions.
* (NT) Editors should be given a "why" menu for labeling and self-reporting their motivation. This is one way to discriminate a multiclass.
* (WNT) In the editor UI, revert decisions could be split into verdict and sentencing workflows, configured as a peer review mechanism for example. This is a reasonable way to begin unpacking reverts: they are initiated by an editor's judgement of revision quality, which could be reviewed by a third editor, and then there is a second decision whether to take the revert action.
* (AI) The wp10 scale is helpful, and is used correctly on average. (WNT) We're not proposing any alternatives.
- (NT) Normalize the scale by the number of articles in each class, if it helps to make the classifier's output more linear.
* Many more hypotheses about ORES can be explicitly eliminated as obviously wrong, but we need to look through the different hypothesis spaces systematically.
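The 2004 versus 2014 comparison suggested in the list above might be sketched as follows. This is only an illustration under loud assumptions: `train_threshold_model` is a hypothetical stand-in for real model training (a single badword-count cutoff), `disagreement_rate` is a name invented here, and the sample rows are made up rather than drawn from any wiki.

```python
def disagreement_rate(model_a, model_b, revisions):
    """Fraction of revisions on which two models predict differently."""
    if not revisions:
        raise ValueError("need at least one revision")
    differing = sum(1 for rev in revisions if model_a(rev) != model_b(rev))
    return differing / len(revisions)


def train_threshold_model(labeled):
    """Hypothetical stand-in for training: learn a badword-count cutoff.

    `labeled` is a list of (badword_count, was_reverted) pairs. The cutoff is
    the midpoint between the mean count of reverted edits and of kept edits.
    """
    reverted = [x for x, y in labeled if y]
    kept = [x for x, y in labeled if not y]
    cutoff = (sum(reverted) / len(reverted) + sum(kept) / len(kept)) / 2
    return lambda badword_count: badword_count > cutoff


# Made-up samples standing in for the two eras (not real data).
sample_2004 = [(0, False), (1, False), (4, True), (6, True)]
sample_2014 = [(0, False), (1, True), (2, True), (3, True)]

model_2004 = train_threshold_model(sample_2004)
model_2014 = train_threshold_model(sample_2014)

# Score the same held-out revisions with both models and compare.
held_out = [0, 1, 2, 3, 4, 5]
rate = disagreement_rate(model_2004, model_2014, held_out)
print(f"models disagree on {rate:.0%} of held-out revisions")
```

A high disagreement rate on the same held-out revisions would suggest that the norms encoded in the labels have shifted between the two periods.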
Is the training data biased?
* Yes, it's recent.
* It's being fed to us by another process (?), which is not necessarily a random sample. Is this more instructive than a sample? [1]
* Revert decisions are made by biased humans.
* Revert is coded as a binary, but we could add classes and more dimensions.
How is the choice of classifier algorithm a bias?
* We've chosen supervised learning, in which we define inputs (causality), the set of classifications, training data, and features. We have encoded norms and assumptions in all of these design parameters, determining the hypotheses.
Are the chosen classifications biased?
* Yes. They are defined by norms. One could argue that this is unbiased, but they grew out of existing norms. See how norms change.
* Can our classifier abstain from judgement? I read that probabilistic classification is good at that. If so, then we should actually have three categories (reverted, not reverted, and undecided).
What is our motivation and what are the objectives?
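One way abstention could work with a probabilistic classifier is to leave a band around 0.5 undecided. The sketch below is a minimal illustration, not how ORES behaves; the function name and the `margin` parameter are assumptions made up for this example.

```python
def classify_with_abstention(p_reverted, margin=0.2):
    """Map a predicted probability of revert to one of three categories.

    `margin` is a hypothetical tuning parameter: probabilities strictly
    within `margin` of 0.5 are treated as too uncertain to call.
    """
    if not 0.0 <= p_reverted <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    if p_reverted >= 0.5 + margin:
        return "reverted"
    if p_reverted <= 0.5 - margin:
        return "not reverted"
    return "undecided"


for p in (0.05, 0.45, 0.95):
    print(p, classify_with_abstention(p))
```

Widening `margin` trades coverage for confidence: more edits fall into "undecided", and the two remaining categories should contain fewer borderline calls.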
* As the metawiki project page explains, to provide a standardized scoring service, so bots are easier to port across wikis, easier to write and support, and new tools can flourish.
* Provide a library and a service.
* Make further research easier by publishing data sets and tools.
Identify the methodology used to build the existing system.
How are badwords and other reference data created?
- Start with an abusefilter dump or other existing lists, e.g. [[ https://meta.wikimedia.org/wiki/Research_talk:Revision_scoring_as_a_service/2014#Badword_lists | m:Research talk:Revision scoring as a service/2014#Badword lists ]]
- Native speaker hand codes.
=== Does our model match the real causality? ===
Suggested work: Model each causal step, write features and models to cover all inputs, and to infer things about latent variables.
What is the causal structure of the model? Make sure we are providing all the available inputs. [2]
* Original edit:
- Inputs: state of article(s), identity of author, language, local time
- Latent: mood of author, experience of author, sources available
- Outputs: textual delta, time of edit
* Revert:
- Inputs: delta, initial (and final) state of article, identity of author, identity of editor, local time
- Latent: mood of editor, experience of editor, sources available, existing relationship with author, choice of wording and phrase
- Outputs: did revert? Stated reason for revert. Time of action.
* WP10:
- Inputs: current state of article, identity of judge(s), current norms
- Latent: quality of other articles
- Outputs: article class, time of judgement
* Our scoring:
- Inputs: article and editor metadata as above, reference data, e.g. badwords
- Latent: Choice of training data, choice of model
- Outputs: article class or score, model revision
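Checking that we provide all the available inputs could start with a simple coverage table. In the sketch below, the per-step inputs are copied from the list above, while `covered_by_features` is purely an assumption for illustration (it is not a statement about which features exist today), and `uncovered_inputs` is a hypothetical helper name.

```python
# Observable inputs for each causal step, copied from the list above.
CAUSAL_INPUTS = {
    "original edit": {"state of article", "identity of author", "language", "local time"},
    "revert": {"delta", "state of article", "identity of author",
               "identity of editor", "local time"},
    "wp10": {"state of article", "identity of judge", "current norms"},
}


def uncovered_inputs(causal_inputs, covered):
    """Return, per causal step, the inputs that no feature claims to cover."""
    gaps = {}
    for step, inputs in causal_inputs.items():
        missing = sorted(inputs - covered)
        if missing:
            gaps[step] = missing
    return gaps


# Assumption for illustration only: inputs that current features draw on.
covered_by_features = {"delta", "state of article", "identity of author", "local time"}

gaps = uncovered_inputs(CAUSAL_INPUTS, covered_by_features)
for step, missing in gaps.items():
    print(f"{step}: missing {', '.join(missing)}")
```

Each gap is a candidate (NT) item: either write a feature for it or record why it will not be tested.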
=== Statistically estimating what we don't know ===
Suggested work: Get diagnostic information about the health of our classifiers. Compare alternative models and output classes.
Looks like scoring_model.test already does some cross-validation on the model. Also evaluate bias from every test point [3]. Plot the error on the training set and on the cross-validation set as functions of the number of training examples, for some set of training set sizes [4]. Use the resulting learning curve to determine whether we have high bias or high variance, meaning we need to adjust hypotheses.
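The learning curve described above could be computed along these lines. This is a toy sketch under stated assumptions: it does not call scoring_model.test, the classifier is a hypothetical one-feature cutoff (`fit_cutoff`), and the data are synthetic. Only the shape of the procedure is meant to carry over.

```python
import random


def error_rate(cutoff, rows):
    """Fraction of rows where 'feature >= cutoff' disagrees with the label."""
    wrong = sum(1 for x, y in rows if (x >= cutoff) != y)
    return wrong / len(rows)


def fit_cutoff(train):
    """Toy classifier: choose the cutoff with the lowest training error."""
    candidates = sorted({x for x, _ in train}) + [float("inf")]
    return min(candidates, key=lambda c: error_rate(c, train))


def learning_curve(train, validation, sizes):
    """Return (size, training error, validation error) per training size."""
    curve = []
    for n in sizes:
        subset = train[:n]
        cutoff = fit_cutoff(subset)
        curve.append((n, error_rate(cutoff, subset), error_rate(cutoff, validation)))
    return curve


def make_rows(rng, n):
    """Synthetic rows: the label is True when a noisy feature exceeds 0.5."""
    rows = []
    for _ in range(n):
        x = rng.random()
        rows.append((x, x + rng.gauss(0, 0.1) > 0.5))
    return rows


rng = random.Random(0)
train_rows = make_rows(rng, 200)
validation_rows = make_rows(rng, 200)

for n, train_err, val_err in learning_curve(train_rows, validation_rows, [5, 20, 80, 200]):
    print(f"n={n:3d}  train_error={train_err:.2f}  validation_error={val_err:.2f}")
```

When reading the curve: if both errors stay high and close together as the training size grows, that points to high bias; if training error is low but validation error stays well above it, that points to high variance.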
Do cluster analysis (unsupervised classification), and compare against our classes. See how our labels line up with the centroids of impartially chosen classes. For example, crude vandalism would stand out as its own cluster, but its centroid does not match reverted well.
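The cluster comparison could be prototyped as below: cluster on a feature without looking at labels, then tabulate clusters against the reverted label. This is a minimal sketch with made-up numbers; `kmeans_1d` and `cluster_label_table` are hypothetical helpers written for this example, using plain k-means on a single feature.

```python
from collections import Counter


def kmeans_1d(values, centroids, iterations=20):
    """Plain k-means on one feature, starting from the given centroids."""
    centroids = list(centroids)
    assignment = []
    for _ in range(iterations):
        # Assign every value to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda k: abs(v - centroids[k]))
            for v in values
        ]
        # Move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, a in zip(values, assignment) if a == k]
            if members:  # keep the old centroid if a cluster empties out
                centroids[k] = sum(members) / len(members)
    return assignment, centroids


def cluster_label_table(assignment, labels):
    """Count how often each (cluster, label) pair occurs."""
    return Counter(zip(assignment, labels))


# Made-up badword counts per edit, with made-up reverted labels.
badword_counts = [0, 0, 1, 1, 2, 9, 10, 11]
reverted = [False, True, False, True, False, True, True, True]

assignment, centroids = kmeans_1d(badword_counts, centroids=[0.0, 5.0])
table = cluster_label_table(assignment, reverted)
for (cluster, label), count in sorted(table.items()):
    print(f"cluster {cluster} (centroid {centroids[cluster]:.1f}), reverted={label}: {count}")
```

In this toy data the high-badword cluster is entirely reverted, yet reverted edits also appear in the other cluster, which is the kind of mismatch between a cluster and the reverted label that the comparison is meant to surface.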
=== Notes ===
[1] http://www.ncbi.nlm.nih.gov/pubmed/8329602
[2] http://users.ics.aalto.fi/harri/thesis/valpola_thesis/node33.html
[3] http://www.cems.uwe.ac.uk/~irjohnso/coursenotes/uqc832/tr-bias.pdf
Dietterich, Thomas G., and Eun Bae Kong. Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State University, 1995.
[4] https://followthedata.wordpress.com/2012/06/02/practical-advice-for-machine-learning-bias-variance/
* Compare training bias and training error. Check learning curves.