**This card is done when** we have evaluated several potential methodologies and research questions around systemic bias in ORES and are ready to get to work (time permitting).
Some initial discussion: https://meta.wikimedia.org/wiki/Research_talk:Revision_scoring_as_a_service#Perpetuating_bias
Very embarrassing draft notes:
== Methodology ==
Not necessarily in chronological order. These are the main avenues of planned activity.
0) Literature review.
* How to do a more systematic review?
* What is known about systemic bias in research?
* How are the hypotheses selected?
* Classification and label selection.
1) Identify our a priori assumptions.
What are our hypotheses? Which are we knowingly including and excluding? Which hypotheses are we not discriminating between? Call out prejudices and gaps, and explicitly list them. Try to state these in terms of hypotheses as they relate to the experiment.
* Edit acceptance by the editor community may be related to the number of times curse words are used (see the feature sketch after this list).
* Edit acceptance may be related to the number of times informal or familiar words are used.
* The editor's state of mind matters.
- We're using indirect measurements of (original) editor frame of mind, by taking into account time of day and day of week.
- We aren't measuring anything about the editors doing the reverting or the wp10 nomination and voting, only about the editor making the change under test.
- We should correlate against other editor observable properties, such as pace of work.
- We should look at correlations against other editor internal state, like their mood as seen through their own writing, and prior behavior.
- The editor has a self-perception of their own edit; this is its own set of classifications.
* (need hypothesis) Drama caused by the revert should be considered. Measure it and its effect on retention. Drama would have to be hand-keyed; it doesn't look like it can be self-reported.
* The revert was a correct decision that we might want to emulate. Offer alternative hypotheses: some reverts are helpful and others should have been avoided.
* The wp10 scale is helpful, and is used correctly on average.
* There are an infinite number of hypotheses. Many can be eliminated by being obviously wrong, but we need to do the elimination systematically.
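A minimal sketch of the curse-word and informal-word hypotheses above as counting features. The word lists, the `edit_features` helper, and the plain-text input are placeholders, not the project's actual badword lists or feature-extraction code.

```python
# Hypothetical counting features for the curse-word / informal-word
# hypotheses; word lists and helper names are illustrative only.
import re

CURSE_WORDS = {"damn", "crap"}                # placeholder list
INFORMAL_WORDS = {"lol", "gonna", "wanna"}    # placeholder list

def count_in_vocabulary(added_text, vocabulary):
    """Count added tokens that fall in the given vocabulary."""
    tokens = re.findall(r"\w+", added_text.lower())
    return sum(1 for token in tokens if token in vocabulary)

def edit_features(added_text):
    return {
        "curse_word_count": count_in_vocabulary(added_text, CURSE_WORDS),
        "informal_word_count": count_in_vocabulary(added_text, INFORMAL_WORDS),
    }

print(edit_features("lol this is damn fine"))
# {'curse_word_count': 1, 'informal_word_count': 1}
```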
Is the training data biased?
* Yes, it's recent.
* The human outcomes are biased.
* The revert outcomes are inappropriately mushed into a binary. WP10 is probably not well distributed, either. Should we try to expand to a continuous scale, and then look at the windows covered by each wp10 category? Or does that go against classification? We'd have to give each category a place on the scale, which will make a nonlinear space, unless we normalize by the number that fall into each group.
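A sketch of the normalization idea in the bullet above: place each ordinal wp10 class on a continuous [0, 1] scale at the midpoint of its cumulative-frequency window, so the spacing is set by how many observations fall into each class. The class counts below are made up for illustration.

```python
# Map ordinal wp10 classes onto [0, 1] by cumulative share of observations;
# the counts below are invented for illustration.
from collections import OrderedDict

counts = OrderedDict([
    ("Stub", 5000), ("Start", 3000), ("C", 1500),
    ("B", 800), ("GA", 150), ("FA", 50),
])

total = sum(counts.values())
scale = {}
cumulative = 0
for label, n in counts.items():
    scale[label] = (cumulative + n / 2) / total   # midpoint of the class's window
    cumulative += n

print(scale)
# e.g. Stub ~0.24, Start ~0.62, ..., FA ~0.998
```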
How is the choice of model algorithm a bias?
* We've chosen supervised learning, in which we define the inputs (causality) and the set of classifications, and encode some norms via the choice of training data and features.
Are the chosen classifications biased?
* Yes. They are defined by norms. One could argue that this is unbiased, but as norms change, the biases will be revealed. Compare a training set from the first few years of WP to the norms captured in recent data.
2) Does our model causally match reality?
http://users.ics.aalto.fi/harri/thesis/valpola_thesis/node33.html
What is the causal structure of the model? Make sure we are providing all the available inputs.
* Original edit:
- Inputs: state of article(s), identity of author, language
- Latent: mood of author, experience of author, sources available
- Outputs: textual delta, time of edit
* Revert:
- Inputs: delta, initial (and final) state of article, identity of author, identity of editor
- Latent: mood of editor, experience of editor, sources available, existing relationship with author, choice of wording and phrase
- Outputs: did revert? Stated reason for revert. Time of action.
* WP10:
- Inputs: current state of article, identity of judge(s), current norms
- Latent: quality of other articles
- Outputs: article class, time of judgement
* Our scoring:
- Inputs: article and editor metadata, reference data: badwords
- Latent: Choice of training data, choice of model
- Outputs: article class or score, model revision
* Cluster analysis (the unsupervised analogue of classification). Maybe we should run one to see how our labels line up with the centroids of impartially chosen classes. For example, crude vandalism would stand out as its own cluster.
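A minimal sketch of that cluster-analysis check using scikit-learn's KMeans. The feature matrix and labels here are random placeholders; the real inputs would be the extracted edit features and the existing reverted/wp10 labels.

```python
# Cluster the feature vectors without labels, then see how the existing
# labels line up with the clusters. X and y below are placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))           # placeholder edit feature vectors
y = rng.integers(0, 2, size=1000)        # placeholder reverted / not-reverted labels

X_scaled = StandardScaler().fit_transform(X)
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_scaled)

# Overall agreement between impartially chosen clusters and our labels.
print(adjusted_rand_score(y, clusters))

# Per-cluster label counts; a distinct "crude vandalism" group would show
# up as a cluster dominated by one label.
for k in range(4):
    print(k, np.bincount(y[clusters == k], minlength=2))
```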
== 3) Statistically computing our bias and variance ==
Looks like the scoring_model.test function already starts to do this. Elaborate with the functions below.
Evaluate bias from every test point [1]. Plot the error on the training set and on the cross-validation set as a function of the number of training examples, for some set of training-set sizes [2]. Use the resulting learning curve to determine whether we have high bias or high variance, which would mean we need to adjust our hypotheses (see the sketch at the end of this section).
[1] http://www.cems.uwe.ac.uk/~irjohnso/coursenotes/uqc832/tr-bias.pdf
Dietterich, Thomas G., and Eun Bae Kong. Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State University, 1995.
[2] https://followthedata.wordpress.com/2012/06/02/practical-advice-for-machine-learning-bias-variance/
* Compare training bias and training error. Check learning curves.
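A minimal sketch of the learning-curve check described above, following [2]. The estimator and data are placeholders, not the actual reverted/wp10 models and training sets.

```python
# Plot training vs cross-validation error as a function of training-set
# size; estimator and data are placeholders.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

sizes, train_scores, cv_scores = learning_curve(
    RandomForestClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy",
)

plt.plot(sizes, 1 - train_scores.mean(axis=1), label="training error")
plt.plot(sizes, 1 - cv_scores.mean(axis=1), label="cross-validation error")
plt.xlabel("training examples")
plt.ylabel("error")
plt.legend()
plt.show()

# A large, persistent gap between the curves suggests high variance; two
# high, converging curves suggest high bias, i.e. the hypotheses/features
# need adjusting.
```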
== Notes ==
* Design a language registry: a declarative config, which still adapts to your system (e.g. "no svspell"); see the sketch below.
en
- badwords
-
-
* Existing imports should work. Also add static ISO getter for new way.
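A hypothetical sketch of what such a declarative registry could look like. The entries, file paths, and the `resolve` helper are illustrative assumptions, not the existing config or code.

```python
# Hypothetical declarative language registry that still adapts to the
# system it runs on (e.g. drops svspell when it isn't installed).
import shutil

LANGUAGE_REGISTRY = {
    "en": {
        "badwords": "lists/en_badwords.txt",   # placeholder path
        "spellcheck": "aspell",                # optional external utility
    },
    "sv": {
        "badwords": "lists/sv_badwords.txt",
        "spellcheck": "svspell",               # skipped if unavailable
    },
}

def resolve(iso_code):
    """Static ISO-code getter: return the declared config, dropping
    components the local system lacks."""
    config = dict(LANGUAGE_REGISTRY[iso_code])
    if config.get("spellcheck") and shutil.which(config["spellcheck"]) is None:
        config["spellcheck"] = None            # adapt: "no svspell"
    return config

print(resolve("sv"))
```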
* Current article importance rating is an input feature.
* Operates on one language at a time; dataset and model Makefile paths are hardcoded.
* Write a reference GUI.
* Want to convert noise to signal by finding potentially causal inputs.
* Run many models, with different goals and target classifications.
* Semi-supervised learning, with a small amount of labeled data: http://cogcomp.cs.illinois.edu/papers/HannekeRo04.pdf
* Survey of semi-supervised learning (2005): http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.99.9681&rep=rep1&type=pdf
* Systematic review handbook for software engineers: http://tests-zingarelli.googlecode.com/svn-history/r336/trunk/2-Artigos-Projeto/Revisao-Sistematica/Kitchenham-Systematic-Review-2004.pdf
=== Questions ===
* Is there a reason we're shying away from unsupervised methods?
- http://users.ics.aalto.fi/harri/thesis/valpola_thesis/node34.html
* Unsupervised models don't do deep hierarchies
* In unsupervised learning, the input observations are also caused by latent variables. This does model our system more accurately.
* What are the models?
- File format: pickled support vector and random forest models.
* Have they stopped learning? Wouldn't we need ongoing labeling to continue learning?
- https://en.wikipedia.org/wiki/Model_selection#Criteria_for_model_selection
* Is there an action that is the reverse of reverting? Vouching for a fact?
* Added/removed words assume the word as the unit. Can we generalize beyond two- and three-word phrases, to punctuation and spacing? (See the tokenization sketch at the end of this section.)
* Are segments a sequence of words, or generalized tokens?
* Are we getting the root of the word?
* Explain how training data is gathered. Which revisions, historical or recent?
- It has been based on the past year; we should look at trends as well, though.
* How are badwords lists created?
- Start with an abusefilter dump or another overly long list.
- A native speaker hand-codes it.
- Existing lists: e.g. [[https://meta.wikimedia.org/wiki/Research_talk:Revision_scoring_as_a_service/2014#Badword_lists|m:Research talk:Revision scoring as a service/2014#Badword lists]]
* Wikilabels
* Perhaps by focusing arbitration, oversight, and mediation down to a smaller group, we're actually doing harm.
* What are the opportunities for continued ML using feedback such as human entry of wp10 ratings, reverts, and labels?
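The tokenization sketch referenced above: treat segments as generalized tokens (words, whitespace, punctuation) and take a crude root of each word. The regex and the suffix-stripping rule are illustrative assumptions, not the project's actual tokenizer or stemmer.

```python
# Generalized tokens (words, whitespace, punctuation) plus a crude
# suffix-stripping "root"; regex and stemming rule are illustrative only.
import re

TOKEN_RE = re.compile(r"\w+|\s+|[^\w\s]", re.UNICODE)

def tokenize(text):
    return TOKEN_RE.findall(text)

def crude_root(token):
    return re.sub(r"(ing|ed|s)$", "", token) if token.isalpha() else token

tokens = tokenize("Reverted edits, really?")
print(tokens)  # ['Reverted', ' ', 'edits', ',', ' ', 'really', '?']
print([crude_root(t.lower()) for t in tokens if t.strip()])
# ['revert', 'edit', ',', 'really', '?']
```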
= Potential biases =
* Since the reverted label is a subjective decision (it can be for cause or not), we are perpetuating all of its biases.
- We should give editors a "why" menu.
- And split a revert decision into verdict and sentencing, which could be reviewed by a third editor.
* Feature selection excludes some hypotheses. Cover any imaginable hypothesis with features.
* If we use training data from one wiki to test another, we have imposed the first wiki's norms.
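A sketch of checking that last concern: train on one wiki's labeled sample and evaluate on another's; a large accuracy drop hints that we would be imposing the source wiki's norms. The data here is synthetic and the model is a stand-in.

```python
# Train on one wiki, evaluate on another; synthetic placeholder data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_src, y_src = make_classification(n_samples=2000, n_features=20, random_state=1)
X_dst, y_dst = make_classification(n_samples=500, n_features=20, flip_y=0.2, random_state=2)

X_tr, X_te, y_tr, y_te = train_test_split(X_src, y_src, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

print("in-domain accuracy :", accuracy_score(y_te, model.predict(X_te)))
print("cross-wiki accuracy:", accuracy_score(y_dst, model.predict(X_dst)))
```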
= Investigations =
* What are the guidelines for creating new features? It seems like the more, the merrier?
* Will we need a new ML model capable of finding the behavioral clusters? We could use SVC if we define the classifications.
* Are we utilizing all inputs effectively? How is "log" decided upon?
* More features:
* Editor mood: recently did a similar type of work. Can we represent this as connectivity? Simplest to just take pre and post samples.
* Editor mood: got in discussion of labeled class around the time of this edit
* Editor pace: how long did they take to make this edit, and what is their average pace during a window around this time? (See the sketch after this list.)
* Editor connectivity
* [Hand-key] both edit and revert.
* Time of day
* Time of year
* Reverted words and phrases currently appear in the article
* Cause for revert (self-reported)
* Cause for revert (keyed or classified)
* Article category
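The editor-pace sketch referenced in the list above. The `pace_features` helper, the contribution history, and the window size are placeholders for illustration.

```python
# Pace features for the edit under test: seconds since the editor's
# previous edit, and the mean gap in a window around it. The timestamps
# are made-up Unix epochs for one editor.
def pace_features(timestamps, index, window=5):
    """timestamps: sorted edit times (seconds) for one editor;
    index: position of the edit under test."""
    this_gap = timestamps[index] - timestamps[index - 1] if index > 0 else None
    lo, hi = max(1, index - window), min(len(timestamps), index + window + 1)
    gaps = [timestamps[i] - timestamps[i - 1] for i in range(lo, hi)]
    return {
        "seconds_since_previous_edit": this_gap,
        "mean_gap_in_window": sum(gaps) / len(gaps) if gaps else None,
    }

history = [0, 300, 360, 4000, 4100, 4160]   # made-up edit times
print(pace_features(history, index=4))
# {'seconds_since_previous_edit': 100, 'mean_gap_in_window': 832.0}
```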
* https://en.wikipedia.org/wiki/Model_selection
* How to select training data?
- Real-world sample is better than equally distributed representatives: http://www.ncbi.nlm.nih.gov/pubmed/8329602
* Learning
- We have to define a cost function so the machine knows what is optimal. For supervised learning, it's just related to whether we matched the classification.
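A tiny sketch of that point: two standard classification cost functions from scikit-learn, with made-up labels and predictions.

```python
# Zero-one loss on hard classifications vs log loss on predicted
# probabilities; labels and predictions are made up.
from sklearn.metrics import log_loss, zero_one_loss

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]               # hard classifications
y_prob = [0.9, 0.2, 0.4, 0.8, 0.1]     # predicted P(class == 1)

print(zero_one_loss(y_true, y_pred))   # 0.2: fraction misclassified
print(log_loss(y_true, y_prob))        # penalizes confident mistakes more
```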
What are we using? (See the sketch after this list.)
* Scikit-learn
- random forest (wp10 models)
- naive bayes
* gaussian NB
* multinomial NB
* Bernoulli NB
* http://scikit-learn.org/stable/modules/naive_bayes.html
* A decent classifier, but a bad estimator; contentious:
- http://stats.stackexchange.com/questions/71330/are-posterior-probabilities-from-a-naive-bayes-classifier-reliable
* The Optimality of Naive Bayes
- support vector classifier
* linear kernel (reverted models)
* rbf kernel
* Do not directly give probabilities without a more expensive calculation.
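The sketch referenced above: the listed model families as they appear in scikit-learn. The feature matrix is a placeholder and the hyperparameters are defaults, not the ones used by the actual wp10/reverted models.

```python
# The model families listed above, on placeholder data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
# (MultinomialNB is omitted only because it needs non-negative count
# features, which this synthetic matrix doesn't provide.)

models = {
    "random forest (wp10-style)": RandomForestClassifier(random_state=0),
    "gaussian NB": GaussianNB(),
    "bernoulli NB": BernoulliNB(),
    # probability=True adds the more expensive calibration step (Platt
    # scaling) needed to get probabilities out of an SVC
    "linear SVC (reverted-style)": SVC(kernel="linear", probability=True, random_state=0),
    "rbf SVC": SVC(kernel="rbf", probability=True, random_state=0),
}

for name, model in models.items():
    model.fit(X, y)
    print(name, model.predict_proba(X[:1]))
```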
https://en.wikipedia.org/wiki/Artificial_neural_network#Training_issues
Classification is a type of supervised learning; the model learns to predict from a fixed set of classes.
The reverted model is currently doing probabilistic classification. It can give a confidence, or abstain from judgement (see the sketch below).
Reducing the manual entry load gives us access to high quality classification over a larger test set.
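A minimal sketch of the confidence/abstain behaviour described above; the threshold and probabilities are illustrative, not the model's actual settings.

```python
# Act on the model's call only when its confidence clears a threshold;
# otherwise abstain and leave the revision for human review.
def classify_or_abstain(probability_reverted, threshold=0.8):
    if probability_reverted >= threshold:
        return "reverted"
    if probability_reverted <= 1 - threshold:
        return "not reverted"
    return None  # abstain

for p in (0.95, 0.55, 0.05):
    print(p, classify_or_abstain(p))
```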
TODO:
* Document ores functions.
* Sketch the optional self-labeled revert feature. Need to give long-term feedback for bad labeling.
== Further reading ==
* Hypothesis space: http://www2.cs.kuleuven.be/publicaties/lirias/mypubs.php?unum=U0008122
* Theory choosing vs. hypothesis choosing: http://www.homooeconomicus.org/lib/getfile.php?articleID=202&PHPSESSID=eafcb72a9b2bd2cc1476492e0f3d06cc