Nulls appear in labeled data (merge_labels issue)
Closed, Resolved (Public)

Description

$ nice make models
cat datasets/enwiki.labeled_revisions.w_cache.20k_2015.json | \
revscoring cv_train \
        revscoring.scoring.models.GradientBoosting \
        editquality.feature_lists.enwiki.damaging \
        damaging \
        --version=0.4.0 \
        -p 'learning_rate=0.01' \
        -p 'max_depth=7' \
        -p 'max_features="log2"' \
        -p 'n_estimators=700' \
        --label-weight "true=10" \
        --pop-rate "true=0.034163555464634586" \
        --pop-rate "false=0.9658364445353654" \
        --center --scale > models/enwiki.damaging.gradient_boosting.model

Traceback (most recent call last):
  File "/srv/home/halfak/venv/3.5/bin/revscoring", line 11, in <module>
    sys.exit(main())
  File "/srv/home/halfak/venv/3.5/lib/python3.5/site-packages/revscoring/revscoring.py", line 51, in main
    module.main(sys.argv[2:])
  File "/srv/home/halfak/venv/3.5/lib/python3.5/site-packages/revscoring/utilities/cv_train.py", line 119, in main
    for ob in observations]
  File "/srv/home/halfak/venv/3.5/lib/python3.5/site-packages/revscoring/utilities/cv_train.py", line 119, in <listcomp>
    for ob in observations]
KeyError: 'damaging'
Makefile:856: recipe for target 'models/enwiki.damaging.gradient_boosting.model' failed
make: *** [models/enwiki.damaging.gradient_boosting.model] Error 1
make: *** Deleting file 'models/enwiki.damaging.gradient_boosting.model'
/srv/home/halfak/venv/3.5/lib/python3.5/site-packages
$ cat datasets/enwiki.labeled_revisions.w_cache.20k_2015.json | json2tsv damaging | sort | uniq -c
  18693 False
    104 null
    751 True
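
The same tally can be reproduced in Python, which also makes it possible to tell a missing key apart from an explicit null. This is a minimal sketch, assuming the dataset is one JSON object per line; `count_label_values` and the `<missing>` bucket are hypothetical names for illustration, not part of revscoring or editquality.

```python
import json
from collections import Counter


def count_label_values(lines, label_name):
    """Tally the values of `label_name` across JSON-lines observations.

    Hypothetical helper: observations that lack the key entirely are
    counted under "<missing>" so they are not confused with real values.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        ob = json.loads(line)
        if label_name not in ob:
            counts["<missing>"] += 1
        else:
            counts[repr(ob[label_name])] += 1
    return counts


# Small made-up sample; the last record mirrors the unlabeled shape
# that triggers the KeyError in cv_train.
sample = [
    '{"rev_id": 1, "damaging": false}',
    '{"rev_id": 2, "damaging": true}',
    '{"auto_labeled": false, "autolabel": {}, "rev_id": 652836891}',
]
counts = count_label_values(sample, "damaging")
print(dict(counts))
```

To run it against a real file, pass an open file handle as `lines`, since a handle iterates line by line.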

Event Timeline

Confirmed broken. The merge_labels utility lets this line from human_labeled, which carries no damaging label, through to labeled_revisions:

datasets/enwiki.human_labeled_revisions.20k_2015.json:{"auto_labeled": false, "autolabel": {}, "rev_id": 652836891}

I'll fix and write a test.
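
For illustration, here is roughly what such a fix and test might look like. This is a sketch under stated assumptions, not the actual merge_labels change: `drop_unlabeled` is a hypothetical helper, and the rule "an observation needs a non-null value for the label" is assumed from the KeyError above.

```python
def drop_unlabeled(observations, label_name):
    """Yield only observations that carry a non-null `label_name`.

    Hypothetical helper, not the real merge_labels code. The check is
    `is None` rather than truthiness so that a legitimate False label
    is kept.
    """
    for ob in observations:
        if ob.get(label_name) is None:
            continue  # key missing or explicitly null: skip
        yield ob


# Made-up observations; the last one mirrors the unlabeled record
# quoted in this task.
observations = [
    {"rev_id": 1, "damaging": False},
    {"rev_id": 2, "damaging": True},
    {"auto_labeled": False, "autolabel": {}, "rev_id": 652836891},
]
labeled = list(drop_unlabeled(observations, "damaging"))

# Minimal test: every surviving observation has the label.
assert all("damaging" in ob for ob in labeled)
print([ob["rev_id"] for ob in labeled])
```

Whether unlabeled rows should be dropped silently or reported as an error is a design choice that this sketch leaves open.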

It's an edge case that can only happen when no autolabeled file is given and only human-labeled data is passed to merge_labels. Maybe we should disallow this usage and write a separate tool for it?

Split this patch out so we can merge it ahead of the 2.2.2 update:
https://github.com/wiki-ai/editquality/pull/154

awight mentioned this in Unknown Object (Phame Post). May 2 2018, 6:41 PM