So, we can do some relatively intelligent diffing between Python dicts (aka hash maps). We should implement a way to generate a word count distribution for a revision of a page.
E.g.
"This is a content. I have a content." --> { "this": 1, "is": 1, "a": 2, "content": 2 "i": 1, "have" 1 }
Using this, we can get a sense of how unusual a new contribution is.
"This is a content. I have a content." --> "This is a content. Content is this." { "content": 0, "is": 1, "this": 1, "a": -1, "have": -1, "i": -1 }
Using this representation of a word frequency diff, we can detect changes that add new words to the page (probably strongly associated with new meaning) by computing proportional additions and removals. Working with the diff above and comparing to the initial revision, we get the following addition and removal proportions.
additions: {
    "is": 1,     # +1/1 = increase of 100%
    "this": 1    # +1/1 = increase of 100%
}
removals: {
    "a": -0.5,   # -1/2 = decrease of 50%
    "have": -1,  # -1/1 = decrease of 100%
    "i": -1      # -1/1 = decrease of 100%
}
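One way to compute those proportions, again assuming the helpers above. Dividing by the old count breaks for words new to the page, so this sketch falls back to a denominator of 1 for them (an assumption that keeps the proportion finite while still flagging new words as large):

def proportional_changes(old, delta):
    # Split a word count delta into proportional additions and removals,
    # each relative to the word's count in the old revision.
    additions, removals = {}, {}
    for word, change in delta.items():
        if change == 0:
            continue
        proportion = change / max(old.get(word, 0), 1)
        (additions if change > 0 else removals)[word] = proportion
    return additions, removals

proportional_changes(old, word_count_delta(old, new))
# ({'is': 1.0, 'this': 1.0}, {'a': -0.5, 'have': -1.0, 'i': -1.0})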
I suspect that this will be particularly valuable for badwords. E.g. if one were to edit the article about a particular curse word (e.g. https://en.wikipedia.org/wiki/Shit), adding another instance of that curse word to the article would result in a minor proportional change, while adding a different curse word would result in a large proportional change.
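To illustrate with the sketch above (the counts here are made up): repeating a word already common on the page yields a small proportional addition, while a word new to the page yields the maximal one.

article = Counter({"shit": 20, "the": 50})     # hypothetical article word counts
proportional_changes(article, {"shit": 1})[0]  # {'shit': 0.05} -- minor change
proportional_changes(article, {"damn": 1})[0]  # {'damn': 1.0}  -- large change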