
Fix makefile entry for enwiktionary.rev_reverted.20k_2016.tsv
Closed, Resolved (Public)

Description

There is a file checked in for this target, but it looks very different from what the Makefile command generates (it has far more reverted edits). The Makefile command should reflect how the file was actually generated.

Event Timeline

See https://github.com/wiki-ai/editquality/blob/master/datasets/enwiktionary.rev_reverted.20k_2016.tsv

It looks like this file may have been generated by sampling from the 200k sample file. I'm going to try generating a dataset from that larger file.

$ cat datasets/enwiktionary.prelabeled_revisions.200k_2016.tsv | grep "reverted" | wc
    821    3284   22988
$ cat datasets/enwiktionary.rev_reverted.20k_2016.tsv | grep "True" | wc
    815    1630   11410

The counts are close (821 vs. 815 reverted rows), so that looks like a promising direction.

$ cat datasets/enwiktionary.rev_reverted.20k_2016.tsv | grep "True" | cut -f1 | sort | head
32446761
32446914
32447343
32447567
32448513
32451957
32452977
32462224
32466357
32468155

$ cat datasets/enwiktionary.prelabeled_revisions.200k_2016.tsv | grep "reverted" | cut -f1 | sort | head
32446761
32446914
32447343
32447567
32448513
32451957
32453530
32462299
32462964
32466357
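
The two listings above agree on most IDs but not all, and `head` shows only the first ten. A fuller check is to count how many reverted rev IDs appear in both files. This is a minimal sketch, assuming the rev ID is in the first tab-separated column (as the `cut -f1` calls above do); `shared_ids` is a hypothetical helper name, not something from the editquality repo.

```shell
# Hypothetical helper: count the rev IDs (column 1) that appear in both
# files, restricted to lines matching a label pattern in each file.
#   $1 / $2: first file and the pattern that marks its reverted rows
#   $3 / $4: second file and the pattern that marks its reverted rows
shared_ids() {
  tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 1
  # comm needs sorted input; sort -u also drops duplicate IDs.
  grep -- "$2" "$1" | cut -f1 | sort -u > "$tmp_a"
  grep -- "$4" "$3" | cut -f1 | sort -u > "$tmp_b"
  # comm -12 prints only the lines common to both files.
  comm -12 "$tmp_a" "$tmp_b" | wc -l
  rm -f "$tmp_a" "$tmp_b"
}
```

For the files in this task the call would be something like `shared_ids datasets/enwiktionary.rev_reverted.20k_2016.tsv True datasets/enwiktionary.prelabeled_revisions.200k_2016.tsv reverted`. A result close to 815 would support the idea that the 20k file was drawn from the 200k sample.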

OK. My plan is to run label_reverted on the 200k dataset and then do this:

(head -n1 datasets/enwiktionary.rev_reverted.200k_2016.tsv;
 (tail -n+2 datasets/enwiktionary.rev_reverted.200k_2016.tsv | \
  grep "False" | shuf -n 20000;
  tail -n+2 datasets/enwiktionary.rev_reverted.200k_2016.tsv | \
  grep "True") | shuf;) > \
 datasets/enwiktionary.rev_reverted.weighted.20k_2016.tsv
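
Once the weighted file exists, a quick sanity check is to tally the labels in it. This is a rough sketch, assuming each data row carries the literal text `True` or `False` (the same assumption the `grep` calls above make); `label_counts` is a hypothetical helper name.

```shell
# Hypothetical helper: skip the header row, then count the data rows
# that contain "True" and the data rows that contain "False".
label_counts() {
  tail -n +2 "$1" |
    awk '/True/ {t++} /False/ {f++} END {printf "True=%d False=%d\n", t, f}'
}
```

Running `label_counts datasets/enwiktionary.rev_reverted.weighted.20k_2016.tsv` should then report at most 20000 `False` rows, plus every `True` row from the 200k file.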