Some wikis are dominated by bots. These wikis tend to have few reverted edits to train on. We need a good strategy for gathering balanced sets of non-damaging and potentially damaging edits from Wikipedia.
editquality extract_balanced_sample <dump-file>... --host=<url> [--start=<date>] [--end=<date>] [--trusted-groups=<groups>] [--trusted-edits=<num>] [--revert-radius=<revs>] [--revert-window=<hrs>] [--reverted-only] [--verbose] [--debug]
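A hypothetical invocation might look like the following. The dump file name, wiki host, dates, and group list are placeholders, and writing the TSV to standard output is an assumption about how the script would emit its results:

```
editquality extract_balanced_sample dumps/ptwiki-pages-meta-history.xml.bz2 \
    --host=https://pt.wikipedia.org \
    --start=2023-01-01 --end=2023-07-01 \
    --trusted-groups=bot,sysop \
    --reverted-only > ptwiki.balanced_sample.tsv
```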
This script should process edits to pages that fall between --start and --end and keep track of which revision IDs were reverted and which were not. In the end, it should output a TSV file with three columns: rev_id, potentially_damaging, and reason. The dataset should contain exactly the same number of potentially damaging and non-damaging rows. If --reverted-only is set, only reverted edits will be considered potentially damaging (useful for training reverted models). Otherwise, all edits from non-trusted users will be flagged as potentially damaging as well.
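A minimal sketch of the balancing and output step, assuming the revision IDs have already been bucketed while reading the dump. The function name, its parameters, and the reason strings are hypothetical; the sketch only illustrates the class balancing and the TSV layout described above:

```python
import csv
import random
import sys


def write_balanced_sample(reverted_ids, non_damaging_ids, non_trusted_ids=None,
                          reverted_only=True, out=sys.stdout, seed=None):
    """Hypothetical sketch: given pre-collected rev ID buckets, emit an equal
    number of potentially damaging and non-damaging rows as TSV."""
    random.seed(seed)

    # Potentially damaging rows: reverted edits, plus (when --reverted-only is
    # not set) edits made by non-trusted users.
    damaging = [(rev_id, True, "reverted") for rev_id in reverted_ids]
    if not reverted_only and non_trusted_ids:
        damaging += [(rev_id, True, "non-trusted user")
                     for rev_id in non_trusted_ids]

    non_damaging = [(rev_id, False, "not reverted")
                    for rev_id in non_damaging_ids]

    # Balance the two classes by down-sampling the larger one.
    n = min(len(damaging), len(non_damaging))
    sample = random.sample(damaging, n) + random.sample(non_damaging, n)
    random.shuffle(sample)

    writer = csv.writer(out, delimiter="\t")
    writer.writerow(["rev_id", "potentially_damaging", "reason"])
    for rev_id, flag, reason in sample:
        writer.writerow([rev_id, flag, reason])
```

Down-sampling the larger class is one simple way to guarantee the exact 50/50 split; the real script could just as well stop collecting one class once it matches the size of the other.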