
Implement word frequency diff features
Closed, Resolved (Public)

Description

We can do some relatively intelligent diffing between Python dicts (aka hash maps), so we should implement a way to generate a word count distribution for a revision of a page.

E.g.

"This is a content.  I have a content." -->
{
  "this": 1,
  "is": 1,
  "a": 2,
  "content": 2,
  "i": 1,
  "have": 1
}
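
A minimal sketch of how such a word count distribution could be generated. The function name `word_frequencies` is hypothetical, and the tokenization rule (lowercase the text, treat each run of word characters as a word) is an assumption made for illustration, not necessarily what revscoring does.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return a dict mapping each lowercased word in `text` to its count.

    Assumption: a "word" is any run of characters matching \\w+ after
    lowercasing; a real tokenizer may split text differently.
    """
    words = re.findall(r"\w+", text.lower())
    return dict(Counter(words))


freqs = word_frequencies("This is a content.  I have a content.")
```

With this tokenization, `freqs` holds the same counts as the dict above.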

Using this, we can get a sense for how unusual a new contribution is.

"This is a content.  I have a content." --> "This is a content.  Content is this."
{
  "content": 0,
  "is": 1,
  "this": 1,
  "a": -1,
  "have": -1,
  "i": -1
}
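
The diff above can be sketched as a per-word subtraction over the union of both vocabularies. `frequency_diff` is a hypothetical name, and keeping zero-valued entries (such as "content") is a choice made here only to match the example.

```python
def frequency_diff(old, new):
    """Return new count minus old count for every word seen in either dict.

    Words missing from one side are treated as having a count of 0.
    Unchanged words are kept with a value of 0, as in the example above.
    """
    return {
        word: new.get(word, 0) - old.get(word, 0)
        for word in old.keys() | new.keys()
    }


# Counts for "This is a content.  I have a content."
old_freqs = {"this": 1, "is": 1, "a": 2, "content": 2, "i": 1, "have": 1}
# Counts for "This is a content.  Content is this."
new_freqs = {"this": 2, "is": 2, "a": 1, "content": 2}

diff = frequency_diff(old_freqs, new_freqs)
```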

Using this representation of a word frequency diff, we can detect changes that add new words to the page (probably strongly associated with new meaning), e.g. by measuring proportional additions and removals. Working with the diff above and dividing each change by the word's count in the initial revision, we get the following addition and removal proportions.

additions:
{
  "is": 1,  # +1/1 = increase of 100%
  "this": 1  # +1/1 = increase of 100%
}
removals:
{
  "a": -0.5, # -1/2 = decrease of 50%
  "have": -1, # -1/1 = decrease of 100%
  "i": -1 # -1/1 = decrease of 100%
}
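
A rough sketch of the proportional calculation, dividing each word's change by its count in the initial revision. The name `proportional_changes` is hypothetical. The task does not say how to handle a word that is absent from the initial revision; using a denominator of 1 in that case is an assumption made here to avoid dividing by zero.

```python
def proportional_changes(old, diff):
    """Split a frequency diff into proportional additions and removals.

    Each change is divided by the word's count in the old revision.
    Assumption: a word not present in `old` gets a denominator of 1.
    Words whose count did not change are left out of both dicts.
    """
    additions, removals = {}, {}
    for word, delta in diff.items():
        proportion = delta / old.get(word, 1)
        if delta > 0:
            additions[word] = proportion
        elif delta < 0:
            removals[word] = proportion
    return additions, removals


old_freqs = {"this": 1, "is": 1, "a": 2, "content": 2, "i": 1, "have": 1}
diff = {"content": 0, "is": 1, "this": 1, "a": -1, "have": -1, "i": -1}

additions, removals = proportional_changes(old_freqs, diff)
```

Under that assumption, a brand-new word added once scores 1.0 (100%), the same as doubling a word that appeared once before.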

I suspect that this will be particularly valuable for badwords. E.g. if one were to edit the article about a particular curse word (e.g. https://en.wikipedia.org/wiki/Shit), adding a new instance of that curse word to the article would result in a minor proportional change while adding a different curse word would result in a large proportional change.

Event Timeline

Halfak set the priority of this task to Needs Triage.
Halfak updated the task description.
Halfak moved this task to Active on the Machine-Learning-Team (Active Tasks) board.
Halfak added a subscriber: Halfak.
Halfak set Security to None.

I've got the changes for this wrapped up in a big pull request I am working on. I realized that it would be a pain to implement this NLP strategy in revscoring's current structure, so I'm including it with the work for T121005.

See https://github.com/wiki-ai/revscoring/blob/features_commons/revscoring/datasources/meta/frequencies.py