Clean Up User Talk Diffs Datasets
Closed, Resolved (Public)

Description

  1. Try to convert markup and HTML elements into plain text.
  2. Try to filter out comments that look like templates, perhaps based on some measure of string similarity.
  3. Remove duplicates. I'm not sure why some comments appear more than once in the corpus.

Do this for both a random sample of diffs and for diffs from users blocked for harassment.
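
The three cleanup steps listed above could be sketched roughly as follows. This is only a minimal illustration, not the approach actually used for this task: the function names (`to_plain_text`, `looks_like_template`, `clean_comments`), the `known_templates` list, and the 0.8 similarity threshold are all hypothetical, and the regular expressions cover only the most common wikitext and HTML constructs. A dedicated wikitext parser such as mwparserfromhell would likely be more robust for step 1.

```python
import html
import re
from difflib import SequenceMatcher

# Crude patterns for common markup; assumptions, not a full wikitext grammar.
TAG_RE = re.compile(r"<[^>]+>")                      # HTML tags
TEMPLATE_RE = re.compile(r"\{\{[^{}]*\}\}")          # {{template}} calls
LINK_RE = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]]*)\]\]")  # [[target|label]]
QUOTE_RE = re.compile(r"'{2,}")                      # ''italic'' / '''bold'''
WS_RE = re.compile(r"\s+")


def to_plain_text(comment):
    """Step 1: strip HTML tags and basic wikitext, keeping the visible text."""
    text = TAG_RE.sub(" ", comment)
    text = html.unescape(text)           # decode entities such as &lt; after tags are gone
    text = TEMPLATE_RE.sub(" ", text)
    text = LINK_RE.sub(r"\1", text)      # keep only the link label
    text = QUOTE_RE.sub("", text)
    return WS_RE.sub(" ", text).strip()


def looks_like_template(text, known_templates, threshold=0.8):
    """Step 2: flag text that closely matches any known template message."""
    return any(
        SequenceMatcher(None, text.lower(), t.lower()).ratio() >= threshold
        for t in known_templates
    )


def clean_comments(comments, known_templates, threshold=0.8):
    """Apply steps 1-3: plain text, template filter, exact-duplicate removal."""
    seen = set()
    cleaned = []
    for raw in comments:
        text = to_plain_text(raw)
        if not text or text in seen:
            continue                     # step 3: skip empty or repeated comments
        seen.add(text)
        if looks_like_template(text, known_templates, threshold):
            continue
        cleaned.append(text)
    return cleaned
```

The same `clean_comments` call could then be run once on the random sample and once on the diffs from blocked users. Note that the pairwise similarity check is quadratic in the worst case, so a large corpus would probably need a cheaper pre-filter (for example, hashing or shingling) before comparing strings.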