**This card is done when** we have evaluated several potential methodologies and research questions around systemic bias in ORES and are ready to get to work (time permitting).
Some initial discussion: https://meta.wikimedia.org/wiki/Research_talk:Revision_scoring_as_a_service#Perpetuating_bias
Very embarrassing draft notes:
== Methodology ==
What are our hypotheses? What are we knowingly excluding? Call out prejudices and gaps.
* That edit acceptability may be related to the number of times curse words are used.
* That edit acceptability may be related to the number of times informal or familiar words are used. (See the counting sketch after this list.)
* The editor's state of mind matters.
- We're using indirect measurements of (original) editor frame of mind, by taking into account time of day and day of week.
- We aren't measuring anything about the editors doing the reverting or the wp10 nomination and voting, only about the editor making the change under test.
- We are not correlating against other editors' internal state, such as their mood as seen through their own writing, or their prior behavior.
- We aren't correlating other editors' observable properties, such as pace of work.
* Drama caused by the revert is not considered, nor its effect on retention.
* That the revert was a correct decision that we might want to emulate.
* That the wp10 scale is helpful and is used correctly on average.
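A minimal sketch of the kind of counting feature the first two hypotheses imply; the word lists and the helper name `usage_counts` are hypothetical stand-ins, not the actual revscoring feature definitions:

```
import re

# Hypothetical stand-in lists; real ORES badword lists are per-language and
# hand-reviewed (see the badwords notes further down).
BADWORDS = {"damn", "stupid"}
INFORMAL = {"lol", "gonna", "wanna"}

def usage_counts(added_text):
    """Count curse-word and informal-word occurrences in the text an edit added."""
    tokens = re.findall(r"\w+", added_text.lower())
    return {
        "badword_count": sum(t in BADWORDS for t in tokens),
        "informal_count": sum(t in INFORMAL for t in tokens),
    }

print(usage_counts("lol this is a stupid edit"))
# -> {'badword_count': 1, 'informal_count': 1}
```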
Is the training data biased?
* Yes, it's recent.
* The human outcomes are biased.
* The revert outcomes are inappropriately collapsed into a binary. WP10 is probably not well distributed, either. Should we try to expand to a continuous scale, and then look at the windows covered by each wp10 category? Or does that go against classification? We'd have to give each category a place on the scale, which makes a nonlinear space, unless we normalize by the number that fall into each group.
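One way to read the "windows" idea: place each wp10 category on a [0, 1] scale at the midpoint of a window whose width is its share of the training data. The class counts below are made up; the real distribution would come from the wp10 training set:

```
from collections import Counter

# Made-up class counts; the real distribution would come from the wp10 training set.
labels = ["Stub"] * 500 + ["Start"] * 300 + ["C"] * 120 + ["B"] * 50 + ["GA"] * 20 + ["FA"] * 10
order = ["Stub", "Start", "C", "B", "GA", "FA"]

counts = Counter(labels)
total = sum(counts.values())

# Each category covers a window of [0, 1] proportional to its frequency;
# place the category at the midpoint of its window.
scale = {}
cumulative = 0.0
for cls in order:
    width = counts[cls] / total
    scale[cls] = round(cumulative + width / 2, 3)
    cumulative += width

print(scale)
# e.g. Stub sits at 0.25 (it covers half the data), FA near 0.995
```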
How is the choice of model algorithm a bias?
* We've chosen supervised learning, in which we define the inputs (an assumed causal direction), the set of classifications, and encode some norms via our choice of training data and features.
Are the chosen classifications biased?
* Yes. They are defined by norms. One could argue that this is unbiased, but as norms change, the biases will be revealed. Compare a training set from the first few years of WP.
What is the causal structure of the model? Make sure we are providing all the available inputs. (A toy check follows the list below.)
* Original edit:
- Inputs: state of article(s), identity of author, language
- Mediating: mood of author, experience of author, sources available
- Outputs: textual delta, time of edit
* Revert decision:
- Inputs: delta, initial (and final) state of article, identity of author, identity of editor
- Mediating: mood of editor, experience of editor, sources available, existing relationship with author, choice of wording and phrase
- Outputs: did revert? Stated reason for revert. Time of action.
* wp10 assessment:
- Inputs: current state of article, identity of judge(s), current norms
- Mediating: other articles' quality
- Outputs: article class, time of judgement
* Our scoring:
- Inputs: article and editor metadata, reference data: badwords
- Mediating: Choice of training data, choice of model
- Outputs: article class or score, model revision
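A toy check of "are we providing all the available inputs", encoding the stages above with placeholder signal names (not actual feature names):

```
# Placeholder signal names for the stages above, used only to sanity-check
# whether our scoring stage is fed every observable signal the chain exposes.
STAGES = {
    "original_edit": {
        "inputs": {"article state", "author identity", "language"},
        "outputs": {"textual delta", "time of edit"},
    },
    "revert_decision": {
        "inputs": {"textual delta", "article state", "author identity", "editor identity"},
        "outputs": {"did revert", "stated reason for revert", "time of action"},
    },
    "wp10_assessment": {
        "inputs": {"article state", "judge identity", "current norms"},
        "outputs": {"article class", "time of judgement"},
    },
}

# What our scoring stage currently consumes (again, placeholder names).
OUR_SCORING_INPUTS = {"article state", "textual delta", "author identity", "badwords"}

observable = set().union(*(s["inputs"] | s["outputs"] for s in STAGES.values()))
missing = sorted(observable - OUR_SCORING_INPUTS)
print("Observable signals not yet fed to the model:", missing)
```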
== How to evaluate statistical bias ==
* Evaluate bias from every test point.
* Looks like the scoring_model.test function already starts to do this.
* Compare training bias and training error. Check learning curves.
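A minimal sketch of the learning-curve check, assuming the underlying models are scikit-learn estimators (the pickled SVC/random-forest files suggest so); synthetic data stands in for a labeled revert or wp10 set:

```
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for a labeled revert/wp10 dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

sizes, train_scores, cv_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5))

for n, tr, cv in zip(sizes, train_scores.mean(axis=1), cv_scores.mean(axis=1)):
    # A large, persistent gap between training and cross-validation accuracy
    # suggests variance (overfitting); two low, converged curves suggest bias.
    print(f"n={n:4d}  train={tr:.3f}  cv={cv:.3f}")
```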
== Notes ==
* Operates on one language at a time; dataset and model Makefile paths are hardcoded.
* Write a reference GUI.
=== Questions ===
* Is there a reason we're shying away from unsupervised methods?
* Unsupervised models don't do deep hierarchies
* In unsupervised learning, the input observations are also modeled as caused by latent variables. This models our system more accurately.
* What are models?
- File format: pickled support vector and random forest models.
* Have they stopped learning? Wouldn't we need ongoing labeling to continue learning?
* Is there an action that is the reverse of reverting? Vouching for a fact?
* Added/removed word features assume words as the unit. Can we generalize beyond two- and three-word phrases? Punctuation, spacing?
* Are segments a sequence of words, or generalized tokens?
* Are we getting the root of the word? (See the stemming sketch after this list.)
* Explain how training data is gathered. Which revisions, historical or recent?
- It has been based on the past year; we should look at trends as well, though.
* How are badwords lists created?
- Start with an abusefilter dump or another overly long list.
- A native speaker hand-codes it.
* Perhaps by concentrating arbitration, oversight, and mediation in a smaller group, we're actually doing harm.
* What are the opportunities for continued ML using feedback such as human entry of wp10, reverts, and labels?
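On the root-of-the-word question above: a minimal sketch using NLTK's Snowball stemmer. Whether revscoring's feature extraction already stems its tokens is exactly the open question, so this only illustrates the idea:

```
# Minimal stemming sketch with NLTK's Snowball stemmer (per language).
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
for word in ["reverting", "reverted", "reverts", "vandalism", "vandalized"]:
    # e.g. the first three all reduce to the root "revert"
    print(word, "->", stemmer.stem(word))
```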
= Potential biases =
* Since a revert is a subjective decision (e.g. it can be for cause or not), we are perpetuating all of its biases.
- Should give editors a "why" menu.
- And split a revert decision into verdict and sentencing, which could be reviewed by a third editor.
* Feature selection excludes some hypotheses. Cover any imaginable hypothesis with features.
* If we use training from one wiki to test another, we have imposed norms.
= Investigations =
* What are guidelines for creating new features? Seems like the more, the merrier?
* We'll need a new ML model capable of finding the behavioral clusters? Could use SVC if we define the classifications.
* Are we utilizing all inputs effectively? How is "log" decided upon?
* More features:
* Editor mood: recently did a similar type of work. Can we represent this as connectivity? Simplest to just take pre and post samples.
* Editor mood: got in discussion of labeled class around the time of this edit
* Editor pace, how long did they take to make this edit, what is their average pace during a window around this time?
* Editor connectivity
* [Hand-key] both edit and revert.
* Time of day
* Time of year
* Whether the reverted words and phrases currently appear in the article
* Cause for revert (self-reported)
* Cause for revert (keyed or classified)
* Article category
* How to select training data?
- Real-world sample is better than equally distributed representatives: http://www.ncbi.nlm.nih.gov/pubmed/8329602
- Have to define a cost function so the machine knows what is optimal. For supervised learning, it's just related to whether we matched the classification.
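A toy illustration of the cost-function point: for supervised classification the cost is just a function of whether predictions match the labels, and on a real-world (imbalanced) sample a plain 0-1 cost can look good for a useless predictor, which is one reason to weight the cost or inspect per-class error:

```
# For supervised classification, the cost is just a function of whether
# predictions match the labels (0-1 loss here).
def zero_one_cost(y_true, y_pred):
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [0] * 95 + [1] * 5      # e.g. ~5% of edits get reverted
always_keep = [0] * 100          # degenerate "never reverted" predictor

print(zero_one_cost(y_true, always_keep))   # 0.05: low cost, useless model
```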
What are we using?
- random forest - wp10 models
- naive bayes
* gaussian NB
* multinomial NB
* bernoulli NB
* Decent classifier, but a bad estimator. Contentious: see "The Optimality of Naive Bayes".
- support vector classifier
* linear kernel - reverted models
* rbf kernel
* Does not give probabilities directly without a more expensive calculation.
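On that last point, a sketch of the more expensive calculation in scikit-learn terms: SVC only exposes predict_proba when probability=True, which fits an extra Platt-scaling calibration step on top of the SVC itself:

```
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

plain = SVC(kernel="linear").fit(X, y)                         # no predict_proba
calibrated = SVC(kernel="linear", probability=True).fit(X, y)  # extra Platt-scaling fit

print(plain.decision_function(X[:2]))     # signed distances to the margin
print(calibrated.predict_proba(X[:2]))    # class probabilities, more costly to train
```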
Classification is a type of supervised learning; the model learns a fixed set of classes.
The reverted model is currently doing probabilistic classification. It can give a confidence, or abstain from judgement.
Cluster analysis is a well-established technique; it's the unsupervised analogue of classification.
Reducing the amount of manual labeling needed gives us access to high-quality classification.
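A minimal sketch of that unsupervised analogue: clustering per-edit feature vectors without any revert/wp10 labels; the blob data is synthetic, not real ORES features:

```
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic feature vectors standing in for per-edit features.
X, _ = make_blobs(n_samples=500, centers=4, n_features=8, random_state=0)

# Group edits by similarity without any revert/wp10 labels.
cluster_ids = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(cluster_ids[:20])
```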
* Document ores functions.
* Sketch the optional self-labeled revert feature. Need to give long-term feedback for bad labeling.