
Unlabeled goodfaith observations are assumed "false" -- should be "true"
Closed, ResolvedPublic

Description

Looks like the issue is here:
https://github.com/wiki-ai/editquality/blob/master/editquality/utilities/fetch_labels.py#L100

Here's the label where we discovered the issue:

{
  "campaign_id": 4,
  "data": {
    "rev_id": 637600969
  },
  "id": 194738,
  "labels": [
    {
      "data": {
        "damaging": true,
        "goodfaith": null,
        "unsure": false
      },
      "timestamp": 1432144745.16389,
      "user_id": 42074979
    }
  ]
},
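
The label above has "goodfaith": null, and in Python `bool(None)` is `False`, so a plain truthiness test on that field would record the edit as bad faith. Below is a minimal sketch of the intended behaviour, not the code in fetch_labels.py; the helper name `extract_goodfaith` and its `default` parameter are hypothetical and used only for illustration.

```python
def extract_goodfaith(label_data, default=True):
    """Return the goodfaith label, falling back to `default` when unlabeled.

    Hypothetical helper for illustration; not part of editquality.
    """
    value = label_data.get("goodfaith")
    # Check for None explicitly: bool(None) is False, so a truthiness test
    # would silently turn an unlabeled observation into "not goodfaith".
    return default if value is None else bool(value)


# The "data" dict from the label shown above.
label_data = {"damaging": True, "goodfaith": None, "unsure": False}

naive = bool(label_data["goodfaith"])      # unlabeled treated as False
corrected = extract_goodfaith(label_data)  # unlabeled treated as True
```

Keeping the fallback as a parameter leaves the assumption visible at the call site, in case a different default is wanted for another label.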

Event Timeline


For English Wikipedia, assuming goodfaith is true when the actual label is null worked well. The AUC score rose from 0.909 to 0.928:

Model tuning report

  Revscoring version: 1.3.15
  Features: editquality.feature_lists.enwiki.goodfaith
  Date: 2017-07-26T16:26:43.086583
  Observations: 19463
  Labels: [false, true]
  Scoring: roc_auc
  Folds: 5

Top scoring configurations

| model | mean(scores) | std(scores) | params |
|:---|---:|---:|:---|
| GradientBoostingClassifier | 0.928 | 0.006 | learning_rate=0.01, n_estimators=700, max_depth=5, max_features="log2" |
| GradientBoostingClassifier | 0.927 | 0.008 | learning_rate=0.01, n_estimators=500, max_depth=5, max_features="log2" |
| GradientBoostingClassifier | 0.926 | 0.008 | learning_rate=0.01, n_estimators=500, max_depth=7, max_features="log2" |
| GradientBoostingClassifier | 0.926 | 0.008 | learning_rate=0.01, n_estimators=700, max_depth=7, max_features="log2" |
| GradientBoostingClassifier | 0.925 | 0.005 | learning_rate=0.01, n_estimators=700, max_depth=3, max_features="log2" |
| RandomForestClassifier | 0.925 | 0.006 | min_samples_leaf=5, max_features="log2", n_estimators=320, criterion="entropy" |
| GradientBoostingClassifier | 0.925 | 0.009 | learning_rate=0.01, n_estimators=300, max_depth=7, max_features="log2" |
| GradientBoostingClassifier | 0.925 | 0.007 | learning_rate=0.1, n_estimators=100, max_depth=3, max_features="log2" |
| GradientBoostingClassifier | 0.924 | 0.007 | learning_rate=0.01, n_estimators=300, max_depth=5, max_features="log2" |
| GradientBoostingClassifier | 0.924 | 0.004 | learning_rate=0.5, n_estimators=300, max_depth=1, max_features="log2" |

For Russian Wikipedia, the AUC hasn't really changed:

Model tuning report

  Revscoring version: 1.3.15
  Features: editquality.feature_lists.ruwiki.goodfaith
  Date: 2017-07-26T14:36:09.102391
  Observations: 19639
  Labels: [false, true]
  Scoring: roc_auc
  Folds: 5

Top scoring configurations

| model | mean(scores) | std(scores) | params |
|:---|---:|---:|:---|
| GradientBoostingClassifier | 0.935 | 0.007 | max_depth=7, n_estimators=700, learning_rate=0.01, max_features="log2" |
| RandomForestClassifier | 0.935 | 0.005 | criterion="entropy", n_estimators=640, min_samples_leaf=5, max_features="log2" |
| GradientBoostingClassifier | 0.935 | 0.006 | max_depth=7, n_estimators=500, learning_rate=0.01, max_features="log2" |
| RandomForestClassifier | 0.935 | 0.005 | criterion="entropy", n_estimators=640, min_samples_leaf=7, max_features="log2" |
| RandomForestClassifier | 0.934 | 0.005 | criterion="entropy", n_estimators=640, min_samples_leaf=13, max_features="log2" |
| RandomForestClassifier | 0.934 | 0.005 | criterion="entropy", n_estimators=320, min_samples_leaf=5, max_features="log2" |
| RandomForestClassifier | 0.934 | 0.005 | criterion="entropy", n_estimators=320, min_samples_leaf=13, max_features="log2" |
| RandomForestClassifier | 0.934 | 0.005 | criterion="entropy", n_estimators=320, min_samples_leaf=7, max_features="log2" |
| RandomForestClassifier | 0.934 | 0.005 | criterion="entropy", n_estimators=640, min_samples_leaf=3, max_features="log2" |
| GradientBoostingClassifier | 0.934 | 0.006 | max_depth=7, n_estimators=300, learning_rate=0.01, max_features="log2" |