See @aetilley's repo
From data2.tsv I get this (87 1s, 715 0s). Source code to reproduce is on gist.
The file data2.tsv has 19863 samples, but your clusters sum to only 802 samples. Let me look at the code you sent and get back to you.
That's because we only test on reverted edits, and the last column is the reverted status (not a feature). I made this mistake initially too :)
I had understood that we were interested in clustering edits generally, so I just dropped the last column. Aaron, which did you have in mind?
Responded in IRC. Do both! Cluster the entire set and also cluster just the damaging set and compare the difference.
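For anyone reproducing this, here is a minimal sketch of preparing both inputs from a tab-separated file whose last column is the reverted status. The helper names, the `"True"`/`"False"` encoding of that column, and the sample rows are assumptions for illustration; they are not taken from data2.tsv or from the code on gist, and the clustering step itself is left out.

```python
import csv
import io


def load_rows(handle):
    """Read tab-separated rows, skipping blank lines."""
    return [row for row in csv.reader(handle, delimiter="\t") if row]


def split_label(rows):
    """Separate the last column (assumed: reverted status) from the features."""
    features = [row[:-1] for row in rows]
    labels = [row[-1] for row in rows]
    return features, labels


def reverted_only(rows, reverted_value="True"):
    """Keep the feature columns of rows whose last column equals reverted_value."""
    return [row[:-1] for row in rows if row[-1] == reverted_value]


# Made-up sample standing in for data2.tsv (values are illustrative only).
sample = io.StringIO("1\t0.5\tTrue\n0\t0.1\tFalse\n1\t0.9\tTrue\n")
rows = load_rows(sample)

# Input for clustering the entire set: every row, label column dropped.
all_features, labels = split_label(rows)

# Input for clustering just the damaging set: reverted rows only.
damaging_features = reverted_only(rows)
```

Either way the reverted column stays out of the feature matrix, so both runs cluster on the same features and the results can be compared directly.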