Some wikis are dominated by bots. These wikis tend to have few reverted edits to train on. We need a good strategy for gathering balanced sets of non-damaging and potentially damaging edits from Wikipedia.
editquality extract_balanced_sample <dump-file>... --host=<url> [--start=<date>] [--end=<date>] [--reverted-only]
This script should process edits to pages that fall between `start` and `end` and keep track of which revision IDs were reverted and which were not. In the end, it should output a TSV file with three columns: `rev_id`, `potentially_damaging`, and `reason`. The dataset should contain exactly the same number of potentially damaging and non-damaging rows. If `--reverted-only` is set, only //reverted edits// will be considered potentially damaging (useful for training `reverted` models). Otherwise, all edits from non-trusted users will be flagged as well. A rough sketch of the sampling logic appears below.
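The following is a minimal sketch of how the labeling and balancing could work, assuming the `mwxml` and `mwreverts` libraries for dump parsing and identity-revert detection. The function names, date handling, and sampling strategy here are illustrative only; the `--host` lookup needed to check whether an editor is a trusted user (the non-`--reverted-only` case) is omitted.

```python
# extract_balanced_sample_sketch.py -- illustrative only, not the real script.
import random
import sys

import mwreverts
import mwxml


def label_revisions(dump, start=None, end=None, radius=15):
    """Yield (rev_id, reverted?) for revisions whose timestamp falls
    between `start` and `end` (ISO 8601 strings, or None for no bound)."""
    for page in dump:
        checksum_revisions = []  # (sha1, metadata) pairs in page order
        in_range = set()         # rev_ids inside the requested date window
        for revision in page:
            ts = str(revision.timestamp)  # ISO strings compare lexicographically
            if (start is None or ts >= start) and (end is None or ts <= end):
                in_range.add(revision.id)
            checksum_revisions.append((revision.sha1, {"rev_id": revision.id}))

        # Identity-revert detection: an edit counts as reverted if a later
        # edit (within `radius` edits) restores an earlier content checksum.
        reverted_ids = set()
        for revert in mwreverts.detect(checksum_revisions, radius=radius):
            for meta in revert.reverteds:
                reverted_ids.add(meta["rev_id"])

        for rev_id in in_range:
            yield rev_id, rev_id in reverted_ids


def balanced_sample(labeled, seed=0):
    """Down-sample the majority class so both classes have equal counts."""
    damaging = [(r, True, "reverted edit") for r, reverted in labeled if reverted]
    ok = [(r, False, "not reverted") for r, reverted in labeled if not reverted]
    random.seed(seed)
    n = min(len(damaging), len(ok))
    return random.sample(damaging, n) + random.sample(ok, n)


if __name__ == "__main__":
    rows = []
    for path in sys.argv[1:]:
        with open(path, "rb") as f:  # plain XML assumed; decompression omitted
            dump = mwxml.Dump.from_file(f)
            rows.extend(label_revisions(dump,
                                        start="2016-01-01T00:00:00Z",
                                        end="2016-12-31T23:59:59Z"))

    print("rev_id\tpotentially_damaging\treason")
    for rev_id, damaging, reason in balanced_sample(rows):
        print("{0}\t{1}\t{2}".format(rev_id, damaging, reason))
```

Down-sampling the majority class is just one way to get an exactly balanced file; the real script could also over-sample or stratify by wiki or namespace, and the `reason` column would carry richer values (e.g. "edit by non-trusted user") once the `--host` user-group check is in place.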