There's a file checked in for this dataset, but it looks very different from what the Makefile command generates: it has far more reverted edits. The Makefile command should reflect how the file was actually generated.
Description
Event Timeline
See https://github.com/wiki-ai/editquality/blob/master/datasets/enwiktionary.rev_reverted.20k_2016.tsv
It looks like this file may have been generated by sampling from the 200k sample file. I'm going to try generating a dataset from that larger file.
$ cat datasets/enwiktionary.prelabeled_revisions.200k_2016.tsv | grep "reverted" | wc
821 3284 22988
$ cat datasets/enwiktionary.rev_reverted.20k_2016.tsv | grep "True" | wc
815 1630 11410
Well, that looks like a promising direction.
$ cat datasets/enwiktionary.rev_reverted.20k_2016.tsv | grep "True" | cut -f1 | sort | head
32446761
32446914
32447343
32447567
32448513
32451957
32452977
32462224
32466357
32468155
$ cat datasets/enwiktionary.prelabeled_revisions.200k_2016.tsv | grep "reverted" | cut -f1 | sort | head
32446761
32446914
32447343
32447567
32448513
32451957
32453530
32462299
32462964
32466357
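The two `head` listings overlap heavily but not exactly, and they only show the first ten IDs from each file. A full comparison of the two ID sets could be sketched with `comm`, roughly as below. `compare_ids` is a hypothetical helper name, and the sketch assumes the revision ID is the first tab-separated column in both files, as the `cut -f1` calls above suggest.

```shell
# Hypothetical helper: compare the IDs (column 1) of the rows matching a
# pattern in two TSV files, and report how many are unique to each or shared.
# Usage: compare_ids FILE_A PATTERN_A FILE_B PATTERN_B
compare_ids() {
  ids_a=$(mktemp)
  ids_b=$(mktemp)
  # comm needs sorted input; sort -u also drops duplicate IDs.
  grep -- "$2" "$1" | cut -f1 | sort -u > "$ids_a"
  grep -- "$4" "$3" | cut -f1 | sort -u > "$ids_b"
  echo "only_in_first $(comm -23 "$ids_a" "$ids_b" | wc -l | tr -d ' ')"
  echo "only_in_second $(comm -13 "$ids_a" "$ids_b" | wc -l | tr -d ' ')"
  echo "in_both $(comm -12 "$ids_a" "$ids_b" | wc -l | tr -d ' ')"
  rm -f "$ids_a" "$ids_b"
}
```

For the files in this task, the call would be something like `compare_ids datasets/enwiktionary.rev_reverted.20k_2016.tsv "True" datasets/enwiktionary.prelabeled_revisions.200k_2016.tsv "reverted"`. A small `only_in_first` count would support the idea that the 20k file was sampled from the 200k file.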
OK. My plan is to run label_reverted on the 200k dataset and then do this:
(head -n1 datasets/enwiktionary.rev_reverted.200k_2016.tsv;
 (tail -n+2 datasets/enwiktionary.rev_reverted.200k_2016.tsv | \
    grep "False" | shuf -n 20000;
  tail -n+2 datasets/enwiktionary.rev_reverted.200k_2016.tsv | \
    grep "True") | shuf;) > datasets/enwiktionary.rev_reverted.weighted.20k_2016.tsv
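Once the weighted file exists, a quick sanity check is to count rows per label and confirm there are at most 20000 "False" rows alongside every "True" row. The sketch below uses a hypothetical helper, `label_counts`, and assumes the label is in the second tab-separated column. That layout is a guess based on the two-words-per-line `wc` output for the 20k file, so adjust the `cut -f2` if it is wrong.

```shell
# Hypothetical helper: count rows per label in a TSV, skipping the header line.
# Assumes the label lives in column 2 (an assumption about the file layout).
# Usage: label_counts FILE
label_counts() {
  tail -n+2 "$1" | cut -f2 | sort | uniq -c | awk '{print $2, $1}'
}
```

Running `label_counts datasets/enwiktionary.rev_reverted.weighted.20k_2016.tsv` would print one line per label with its row count. Unlike `grep "False"`, which matches the string anywhere on the line, this looks only at the label column.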