
**This card is done when** we have evaluated several potential methodologies and research questions around systemic bias in ORES and are ready to get to work (time permitting).

Some initial discussion: https://meta.wikimedia.org/wiki/Research_talk:Revision_scoring_as_a_service#Perpetuating_bias

Very embarrassing draft notes:

== Methodology ==

What are our hypotheses? What are we knowingly excluding? Call out prejudices and gaps.
* That edit acceptability may be related to the number of times curse words are used.
* That edit acceptability may be related to the number of times informal or familiar words are used.
* The editor's state of mind matters.
  - We're using indirect measurements of (original) editor frame of mind, by taking into account time of day and day of week.
  - We aren't measuring anything about the editors doing the reverting or wp10 nomination and voting, only about the editor making a change under test.
  - We are not correlating against other editor internal state, like their mood as seen through their own writing, and prior behavior.
  - We aren't correlating other editor observable properties, such as pace of work.
* Drama caused by the revert is not considered. Effect on retention.
* The revert was a correct decision, that we might want to emulate.
* The wp10 scale is helpful, and is used correctly on average.

Is the training data biased?
* Yes, it's recent.
* The human outcomes are biased.
* The revert outcomes are inappropriately mushed into a binary. Probably, WP10 is not well distributed, either. Should we try to expand to a continuous scale, and then look at the windows covered by each wp10 category. Or does that go against classification? Oof, we'd have to give each category a place on the scale, which will make a nonlinear space. Unless we normalize by the number that fall into each group?

How is the choice of model algorithm a bias?
* We've chosen supervised learning, in which we define inputs (causality) and the set of classifications, and encode some norms via choice of training data and features.

Are the chosen classifications biased?
* Yes. They are defined by norms. One could argue that this is unbiased, but as norms change, the biases will be revealed. Compare a training set from the first few years of WP.

http://users.ics.aalto.fi/harri/thesis/valpola_thesis/node33.html

What is the causal structure of the model? Make sure we are providing all the available inputs.
* Original edit:
  - Inputs: state of article(s), identity of author, language
  - Mediating: mood of author, experience of author, sources available
  - Outputs: textual delta, time of edit
* Revert:
  - Inputs: delta, initial (and final) state of article, identity of author, identity of editor
  - Mediating: mood of editor, experience of editor, sources available, existing relationship with author, choice of wording and phrase
  - Outputs: did revert? Stated reason for revert. Time of action.
* WP10:
  - Inputs: current state of article, identity of judge(s), current norms
  - Mediating: other articles' quality
  - Outputs: article class, time of judgement
* Our scoring:
  - Inputs: article and editor metadata, reference data: badwords
  - Mediating: choice of training data, choice of model
  - Outputs: article class or score, model revision

== How to evaluate statistical bias ==

http://www.cems.uwe.ac.uk/~irjohnso/coursenotes/uqc832/tr-bias.pdf
* Evaluate bias from every test point.
* Looks like the scoring_model.test function already starts to do this.

https://followthedata.wordpress.com/2012/06/02/practical-advice-for-machine-learning-bias-variance/
* Compare training bias and training error. Check learning curves.

== Notes ==
* Operates on one language at a time. Dataset and model Makefile paths are hardcoded.
* Write a reference GUI.

=== Questions ===
* Is there a reason we're shying away from unsupervised methods?
  - http://users.ics.aalto.fi/harri/thesis/valpola_thesis/node34.html
    * Unsupervised models don't do deep hierarchies.
    * In unsupervised learning, the input observations are also caused by latent variables. This does model our system more accurately.
* What are models?
  - . file format: pickled support vector and random forest
* Have they stopped learning? Wouldn't we need ongoing labeling to continue learning?
  - https://en.wikipedia.org/wiki/Model_selection#Criteria_for_model_selection
* Is there an action that is the reverse of reverting? Vouching for a fact?
* Added/removed words assume the unit of words. Can we generalize? Beyond two- and three-word phrases. Punctuation, spacing.
* Are segments a sequence of words, or generalized tokens?
* Are we getting the root of the word?
* Explain how training data is gathered. Which revisions, historical or recent?
  - Has been the past year; we should look at trends as well, though.
* How are badwords lists created?
  - Start with abusefilter dump or other overly long list.
  - Native speaker hand codes.
* Wikilabels
* Perhaps by focusing arbitration, oversight, and mediation down to a smaller group, we're actually hurting.
* What are the opportunities for continued ML using feedback such as human entry of wp10, revert, labels?

= Potential biases =
* Since reverted is a subjective decision, e.g. it can be for cause or not, we are perpetuating all biases.
  - Should give editors a "why" menu.
  - And split a revert decision into verdict and sentencing, which could be reviewed by a third editor.
* Feature selection excludes some hypotheses. Cover any imaginable hypothesis with features.
* If we use training from one wiki to test another, we have imposed norms.

= Investigations =
* What are guidelines for creating new features? Seems like the more, the merrier?
* We'll need a new ML model capable of finding the behavioral clusters? Could use SVC if we define the classifications.
* Are we utilizing all inputs effectively?
How is "log" decided upon?
* More features:
  * Editor mood: recently did a similar type of work. Can we represent this as connectivity? Simplest to just take pre and post samples.
  * Editor mood: got in discussion of labeled class around the time of this edit
  * Editor pace: how long did they take to make this edit, and what is their average pace during a window around this time?
  * Editor connectivity
  * [Hand-key] both edit and revert.
  * Time of day
  * Time of year
  * Reverted words and phrases currently appear in the article
  * Cause for revert (self-reported)
  * Cause for revert (keyed or classified)
  * Article category

https://en.wikipedia.org/wiki/Model_selection
* How to select training data?
  - Real-world sample is better than equally distributed representatives: http://www.ncbi.nlm.nih.gov/pubmed/8329602
* Learning
  - Have to define a cost function so the machine knows what is optimal. For supervised learning, it's just related to whether we matched the classification. What are we using?
* Scikit-learn
  - random forest (wp10 models)
  - naive bayes
    * gaussian NB
    * multinomial NB
    * bernoulli NB
    * http://scikit-learn.org/stable/modules/naive_bayes.html
    * decent classifier, but bad estimator. Contentious:
      - http://stats.stackexchange.com/questions/71330/are-posterior-probabilities-from-a-naive-bayes-classifier-reliable
    * The Optimality of Naive Bayes
  - support vector classifier
    * linear kernel (reverted models)
    * rbf kernel
    * do not directly give probabilities without more expensive calculation

https://en.wikipedia.org/wiki/Artificial_neural_network#Training_issues

Classification is a type of supervised learning: your thing learns a fixed set of classes. The reverted model is currently doing probabilistic classification. It can give confidence, or abstain from judgement.

Cluster analysis is a well established thing. It's the unsupervised analogue of classification.

Reducing the size of manual entry needed gives us access to high quality classification.

TODO:
* Document ores functions.
* Sketch the optional self-labeled revert feature. Need to give long-term feedback for bad labeling.
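
On the Methodology question of expanding wp10 to a continuous scale and normalizing by the number that fall into each group: a minimal sketch of one way that could look, not something ORES does. Each ordered class gets a window on [0, 1] whose width is its share of the labels, and its "place on the scale" is the window midpoint. The class order is English Wikipedia's assessment scale; the helper names and the label counts are made up for illustration.

```python
from collections import Counter

# Ordered wp10 classes, lowest to highest quality (English Wikipedia's scale).
WP10_ORDER = ["Stub", "Start", "C", "B", "GA", "FA"]


def class_windows(labels, order=WP10_ORDER):
    """Map each ordered class to a (low, high) window on [0, 1].

    Window width is the share of observations carrying that label, so the
    scale follows the empirical cumulative distribution over the classes.
    """
    counts = Counter(labels)
    total = sum(counts[c] for c in order)
    if total == 0:
        raise ValueError("no labels from the expected class list")
    windows = {}
    low = 0.0
    for c in order:
        high = low + counts[c] / total
        windows[c] = (low, high)
        low = high
    return windows


def class_positions(windows):
    """Place each class at the midpoint of its window."""
    return {c: (low + high) / 2 for c, (low, high) in windows.items()}


# Made-up label counts, for illustration only.
labels = (["Stub"] * 50 + ["Start"] * 30 + ["C"] * 10
          + ["B"] * 6 + ["GA"] * 3 + ["FA"] * 1)
windows = class_windows(labels)
positions = class_positions(windows)
```

With a skewed sample like this one, the rare top classes end up squeezed into a narrow band near 1.0, which is the nonlinearity worried about above made explicit.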
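
On "Evaluate bias from every test point": a minimal sketch, assuming we have a predicted score and a true 0/1 outcome for each test point. Statistical bias here is just the mean signed error, overall and per slice. The function names, the time-of-day slice key, and the numbers are all hypothetical; this is not what scoring_model.test does.

```python
from collections import defaultdict
from statistics import mean


def signed_errors(predicted, actual):
    """Signed error (prediction minus truth) for every test point."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return [p - a for p, a in zip(predicted, actual)]


def bias(predicted, actual):
    """Mean signed error: positive means we over-predict on average."""
    return mean(signed_errors(predicted, actual))


def bias_by_group(predicted, actual, groups):
    """Mean signed error within each slice of the test set."""
    if len(groups) != len(predicted):
        raise ValueError("groups must have one entry per test point")
    buckets = defaultdict(list)
    for err, group in zip(signed_errors(predicted, actual), groups):
        buckets[group].append(err)
    return {group: mean(errs) for group, errs in buckets.items()}


# Made-up scores and outcomes, for illustration only.
predicted = [0.9, 0.2, 0.7, 0.1]
actual = [1, 0, 0, 0]
time_of_day = ["night", "day", "night", "day"]

overall = bias(predicted, actual)
per_slice = bias_by_group(predicted, actual, time_of_day)
```

A slice whose bias is far from the overall figure is a candidate for a closer look, though with small slices the difference may just be noise.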
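
On probabilistic classification giving confidence or abstaining from judgement: a minimal sketch of what abstaining could mean, assuming the model hands us a probability that an edit will be reverted. The `decide` function and the 0.8 threshold are hypothetical, not part of ORES.

```python
def decide(p_reverted, threshold=0.8):
    """Turn a revert probability into (label, confidence).

    label is True (will be reverted) or False (will not) when the model is
    confident enough, and None when it abstains.
    """
    if not 0.0 <= p_reverted <= 1.0:
        raise ValueError("p_reverted must be in [0, 1]")
    if not 0.5 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0.5, 1]")
    # Confidence is the probability of whichever class is more likely.
    confidence = max(p_reverted, 1.0 - p_reverted)
    if confidence < threshold:
        return None, confidence
    return p_reverted >= 0.5, confidence


# Hypothetical probabilities, for illustration only.
decisions = [decide(p) for p in (0.95, 0.6, 0.1)]
```

Raising the threshold trades coverage for precision: more edits get no verdict, which leaves them for human review.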