
Spike: Methods for looking for systemic biases in ORES and consider how to neutralize them
Closed, Resolved · Public

Description

This card is done when we have evaluated several potential methodologies and research questions around systemic bias in ORES and are ready to get to work (time permitting).

Some initial discussion: https://meta.wikimedia.org/wiki/Research_talk:Revision_scoring_as_a_service#Perpetuating_bias

Methodologies

These are not necessarily in chronological order; we may want to approach the problem of systemic bias in ORES from several perspectives.

Literature review

Suggested work: Build knowledge that helps us see the general problem boundaries, and where we're excluding hypotheses.

  • How can we do a more systematic review? See Kitchenham, Procedures for Performing Systematic Reviews (http://people.ucalgary.ca/~medlibr/kitchenham_2004.pdf), a handbook for software engineers on making unbiased surveys of existing literature.
  • What is known about systemic bias in research?
  • Find writing about how hypotheses, classes, and labels are selected.

Identify our a priori assumptions

Suggested work: Pick apart our research plan. Find alternative hypotheses that should be tested.

What are our hypotheses? Which are we knowingly including and excluding? Which hypotheses are we not discriminating between? Call out prejudices and gaps, and explicitly list them. Try to state these in terms of hypotheses as they relate to the experiment.

Key:
(AI) Already Included: the hypothesis is tested by our experiment.
(NT) Needs Testing: we should add a feature or other design element that helps us discriminate whether this hypothesis might be true.
(WNT) Will Not Test: we've considered it but decided not to pursue it.

  • (AI) Edit acceptance by the editor community may be related to the number of times curse words are used.
  • (AI) Edit acceptance may be related to the number of times informal or familiar words are used.
  • (AI) The editor's state of mind matters to what they are writing. The reverting editor's state of mind also matters.
    • (AI) We're using indirect measurements of (original) editor frame of mind, by taking into account time of day and day of week.
    • (NT) We aren't measuring anything about the editors doing the reverting or wp10 nomination and voting, only about the editor making a change under test.
    • (NT) We should correlate against other editor observable properties, such as pace of work.
    • (NT) We should look at correlations against other editor internal state, like their mood as seen through their own writing, and prior behavior.
    • (NT) The editor has a self-perception of their edit; this is its own set of classifications.
  • (AI) The revert was a correct decision that we might want to emulate. (NT) Offer alternative hypotheses:
    • (NT) Some reverts are helpful but others should have been avoided.
    • (NT) Revert is a multiclass decision that we are rounding off to a binary.
    • (NT) We care about drama caused by a revert. Measure it, and its effect on retention. Drama would have to be hand-coded by a third party; self-reporting looks like it will be a bust.
    • (NT) Since the revert decision is subjective, we are perpetuating all existing biases. Testing this bias is easy: train one model on 2004 data and another on 2014 data, then compare their predictions (see the sketch after this list).
      • (NT) Editors should be given a "why" menu for labeling and self-reporting their motivation. This is one way to discriminate a multiclass.
      • (WNT) In the editor UI, revert decisions could be split into verdict and sentencing workflows, configured as a peer-review mechanism for example. This is a reasonable way to begin unpacking revert: reverts are initiated by an editor's judgement of revision quality, and then there is a second decision about whether to take the revert action.
  • (AI) The wp10 scale is helpful, and is used correctly on average. (WNT) We aren't proposing any alternatives, yet.
    • (NT) Normalize the wp10 scale by the number of articles in each class. Does it reduce any dimensions if we make the classifier's output more linear?
  • Many more hypotheses about ORES can be explicitly eliminated as obviously wrong, but we need to look through the different hypothesis spaces systematically.
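
The 2004-vs-2014 comparison mentioned above could start as simply as the sketch below. This is not ORES's actual training code: it assumes per-era feature matrices and revert labels have already been extracted (random arrays stand in for them here), and the gradient-boosted model is an arbitrary choice.

```
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Placeholder data: in practice these would be feature matrices and
# reverted/not-reverted labels extracted from 2004 and 2014 edits.
rng = np.random.default_rng(0)
X_2004, y_2004 = rng.normal(size=(500, 10)), rng.integers(0, 2, 500)
X_2014, y_2014 = rng.normal(size=(500, 10)), rng.integers(0, 2, 500)
X_eval = rng.normal(size=(200, 10))  # a shared sample of edits to score

model_2004 = GradientBoostingClassifier().fit(X_2004, y_2004)
model_2014 = GradientBoostingClassifier().fit(X_2014, y_2014)

# High disagreement on identical edits would suggest the "reverted" label
# drifts with community norms rather than reflecting a stable notion of damage.
disagreement = (model_2004.predict(X_eval) != model_2014.predict(X_eval)).mean()
print(f"Era-specific models disagree on {disagreement:.1%} of evaluation edits")
```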

Is the training data biased?

  • Yes, it's recent.
  • It's being fed to us by another process (?), which is not necessarily a random sample. Is this more instructive than a random sample? [1]
  • Revert decisions are made by biased humans.
  • Revert is coded as a binary, but we could add classes and more dimensions.

How is the choice of classifier algorithm a bias?

  • We've chosen supervised learning, in which we define the inputs (causality), the set of classifications, the training data, and the features. We have encoded norms and assumptions in all of these design parameters, and in doing so we have determined which hypotheses get tested.

Are the chosen classifications biased?

  • Yes, they grew out of existing norms. See how norms change.
  • Can our classifier abstain from judgement? Probabilistic classification is reportedly well suited to this. If so, then we should actually have three categories (reverted, not reverted, and undecided); a sketch of the idea follows.
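
A minimal sketch of that abstention idea, assuming an arbitrary 0.8 confidence threshold and placeholder data; a real version would use ORES's own features and calibrated probabilities.

```
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder training data: 0 = not reverted, 1 = reverted.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(500, 10)), rng.integers(0, 2, 500)
X_new = rng.normal(size=(20, 10))

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_new)  # columns follow clf.classes_: [0, 1]

def label(p, threshold=0.8):
    """Abstain unless one class is confidently predicted."""
    if p[1] >= threshold:
        return "reverted"
    if p[0] >= threshold:
        return "not reverted"
    return "undecided"

print([label(p) for p in proba])
```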

What is our motivation and what are the objectives?

  • As the metawiki project page explains: to provide a standardized scoring service, so that bots are easier to port across wikis and easier to write and support, and so that new tools can flourish.
  • Provide a library and a service.
  • Make further research easier by publishing data sets and tools.

Identify the methodology used to build the existing system.

How are badwords and other reference data created?

Does our model match the real causality?

Suggested work: Model each causal step; write features and models to cover all inputs and to infer things about latent variables. (A sketch that encodes these steps as data follows the list below.)

What is the causal structure of our model? [2]

  • Original edit:
    • Inputs: state of article(s), identity of author, language, local time
    • Latent: mood of author, experience of author, sources available
    • Outputs: textual delta, time of edit
  • Revert:
    • Inputs: delta, initial (and final) state of article, identity of author, identity of editor, local time
    • Latent: mood of editor, experience of editor, sources available, existing relationship with author, choice of wording and phrase
    • Outputs: did revert? Stated reason for revert.
  • WP10:
    • Inputs: current state of article, identity of judge(s), current norms
    • Latent: quality of other articles
    • Outputs: article class, time of judgement
  • Our scoring:
    • Inputs: article and editor metadata as above; reference data (e.g. badwords); choice of training data; choice of model
    • Latent: model
    • Outputs: article class or score, model revision
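
One way to make this causal structure actionable is to encode it as plain data and mechanically check which inputs and latent variables our features cover. The sketch below restates only the first two steps, and the covered_by_features set is a hypothetical placeholder.

```
# Two of the causal steps above, encoded as data (only a sketch).
CAUSAL_MODEL = {
    "original_edit": {
        "inputs": ["article_state", "author_identity", "language", "local_time"],
        "latent": ["author_mood", "author_experience", "sources_available"],
        "outputs": ["textual_delta", "edit_time"],
    },
    "revert": {
        "inputs": ["delta", "article_state", "author_identity",
                   "reverting_editor_identity", "local_time"],
        "latent": ["editor_mood", "editor_experience", "sources_available",
                   "author_editor_relationship", "wording_choice"],
        "outputs": ["did_revert", "stated_reason"],
    },
}

# Hypothetical: the variables our current feature set touches.
covered_by_features = {"textual_delta", "edit_time", "local_time"}

for step, spec in CAUSAL_MODEL.items():
    missing = set(spec["inputs"] + spec["latent"]) - covered_by_features
    print(step, "not yet covered:", sorted(missing))
```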

Statistically estimating what we don't know

Suggested work: Get diagnostic information about the health of our classifiers. Compare alternative models and output classes.

It looks like scoring_model.test already does some cross-validation on the model. We should also evaluate bias at every test point [3], plot the error on the training set and on the cross-validation set as a function of the number of training examples for a range of training-set sizes [4], and use the resulting learning curve to determine whether we have high bias or high variance, which would mean we need to adjust our hypotheses.
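
A sketch of that learning-curve diagnostic using scikit-learn's learning_curve on placeholder data (our real feature matrix and revert labels would replace the random arrays). Train and cross-validation scores that converge to a low value suggest high bias; a persistent gap between them suggests high variance.

```
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import learning_curve

# Placeholder feature matrix and reverted/not-reverted labels.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 10)), rng.integers(0, 2, 1000)

sizes, train_scores, valid_scores = learning_curve(
    GradientBoostingClassifier(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy")

for n, tr, va in zip(sizes, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print(f"n={int(n):4d}  train acc={tr:.3f}  cv acc={va:.3f}")
```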

Do cluster analysis (unsupervised classification) and compare the results against our classes. See how our labels line up with the centroids of impartially chosen classes. For example, crude vandalism might stand out as its own cluster, yet its centroid might not match the reverted label well.
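
A sketch of that comparison: cluster edits without using the reverted label, then look at the reverted rate within each discovered cluster. The data, k=5, and the choice of k-means are placeholder assumptions; one cluster with a very high reverted rate would be consistent with crude vandalism forming its own group, while many mixed clusters would suggest the binary label cuts across natural groupings.

```
import numpy as np
from sklearn.cluster import KMeans

# Placeholder feature matrix and reverted labels.
rng = np.random.default_rng(0)
X, reverted = rng.normal(size=(1000, 10)), rng.integers(0, 2, 1000)

clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
for c in range(5):
    mask = clusters == c
    print(f"cluster {c}: {mask.sum():4d} edits, {reverted[mask].mean():.1%} reverted")
```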

Notes

[1] http://www.ncbi.nlm.nih.gov/pubmed/8329602

[2] http://users.ics.aalto.fi/harri/thesis/valpola_thesis/node33.html

[3] Dietterich, Thomas G., and Eun Bae Kong. Machine Learning Bias, Statistical Bias, and Statistical Variance of Decision Tree Algorithms. Technical report, Department of Computer Science, Oregon State University, 1995. http://www.cems.uwe.ac.uk/~irjohnso/coursenotes/uqc832/tr-bias.pdf

[4] https://followthedata.wordpress.com/2012/06/02/practical-advice-for-machine-learning-bias-variance/

Event Timeline

awight raised the priority of this task to Needs Triage.
awight updated the task description.
awight subscribed.
Halfak renamed this task from "Spike: Look for systemic biases in ORES and consider how to neutralize them" to "Spike: Methods for looking for systemic biases in ORES and consider how to neutralize them". Jul 24 2015, 4:46 PM
Halfak set Security to None.
Halfak updated the task description.
awight triaged this task as Medium priority. Aug 3 2015, 5:06 AM
awight updated the task description.

It looks like there's been a good deal of work here. @awight is this done? If so, would you schedule a meeting to present what you've learned so that we can discuss next steps?

@Halfak sure, let's chat about what I have so far. I'd like to come back to this again, but I agree there's enough here for a first iteration.

awight updated the task description.