- try to transform markup/HTML elements into plain text
- try to filter out comments that look like templates, maybe based on some measure of string similarity
- remove duplicates. I'm not sure why some comments appear more than once in the corpus
Do this both for a random sample of diffs and for diffs from users blocked for harassment.
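
The three cleaning steps above could be sketched roughly as follows. This is only a minimal illustration using the Python standard library, not a settled implementation: the function names, the `templates` list of known template texts, and the `0.8` similarity threshold are all assumptions, and the tag stripping covers HTML only, not wiki markup.

```python
import html
import re
from difflib import SequenceMatcher

TAG_RE = re.compile(r"<[^>]+>")   # naive HTML tag matcher (assumption: no wiki markup)
WS_RE = re.compile(r"\s+")


def to_plain_text(comment):
    """Strip HTML tags, decode entities, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", comment)
    return WS_RE.sub(" ", html.unescape(no_tags)).strip()


def looks_like_template(comment, templates, threshold=0.8):
    """True if the comment is close to any known template text.

    `templates` and `threshold` are hypothetical; they would need tuning
    against the real corpus.
    """
    return any(
        SequenceMatcher(None, comment.lower(), t.lower()).ratio() >= threshold
        for t in templates
    )


def clean_comments(comments, templates, threshold=0.8):
    """Convert to plain text, drop exact duplicates, drop template-like comments."""
    seen = set()
    cleaned = []
    for raw in comments:
        text = to_plain_text(raw)
        if not text or text in seen:
            continue  # empty after cleaning, or an exact duplicate
        seen.add(text)
        if looks_like_template(text, templates, threshold):
            continue
        cleaned.append(text)
    return cleaned
```

The same `clean_comments` call would be applied to each set of diffs separately, once for the random sample and once for the blocked-user diffs. Comparing every comment against every template is quadratic in the worst case, so a large corpus may need a cheaper similarity measure.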