
Implement word frequency diff features
Closed, Resolved · Public

Description

We can do some relatively intelligent diffing between Python dicts (aka hash maps). We should implement a way to generate a word count distribution for a revision of a page.

E.g.

"This is a content.  I have a content." -->
{
  "this": 1,
  "is": 1,
  "a": 2,
  "content": 2,
  "i": 1,
  "have": 1
}
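
A minimal sketch of how such a distribution could be generated. The function name `word_frequencies` and the tokenization rule (lowercase the text, then treat each run of word characters as a word) are assumptions for illustration, not taken from revscoring.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercased word tokens in a revision's text."""
    # Assumption: a "word" is any run of letters, digits, or underscores.
    tokens = re.findall(r"\w+", text.lower())
    return dict(Counter(tokens))


freqs = word_frequencies("This is a content.  I have a content.")
print(freqs)
```

A real tokenizer would likely need to handle wiki markup, punctuation inside words, and non-Latin scripts; this sketch ignores all of that.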

Using this, we can get a sense for how unusual a new contribution is.

"This is a content.  I have a content." --> "This is a content.  Content is this."
{
  "content": 0,
  "is": 1,
  "this": 1,
  "a": -1,
  "have": -1,
  "i": -1
}
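
The diff above could be computed roughly as follows. This is a sketch under the assumption that a word missing from either revision counts as zero; `frequency_diff` is a hypothetical name.

```python
def frequency_diff(old, new):
    """Return (new count - old count) for every word seen in either table."""
    words = sorted(old.keys() | new.keys())
    # Assumption: a word absent from a table has a count of 0 there.
    return {word: new.get(word, 0) - old.get(word, 0) for word in words}


# Word counts for the two example revisions.
old = {"this": 1, "is": 1, "a": 2, "content": 2, "i": 1, "have": 1}
new = {"this": 2, "is": 2, "a": 1, "content": 2}

delta = frequency_diff(old, new)
print(delta)
```

Words whose count did not change (here "content") are kept with a value of 0, matching the example.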

Using this representation of a word frequency diff, we can detect changes that add new words to the page (probably strongly associated with new meaning), e.g. by measuring proportional additions and removals. Dividing each change in the diff above by that word's count in the initial revision gives the following addition and removal proportions.

additions:
{
  "is": 1,  # +1/1 = increase of 100%
  "this": 1  # +1/1 = increase of 100%
}
removals:
{
  "a": -0.5, # -1/2 = decrease of 50%
  "have": -1, # -1/1 = decrease of 100%
  "i": -1 # -1/1 = decrease of 100%
}
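
One way to derive these proportions, as a sketch. The name `proportional_changes` is hypothetical. The example never adds a word that was absent from the initial revision, so the fallback denominator of 1 used for that case is an assumption, not something the task specifies.

```python
def proportional_changes(old, delta):
    """Split a frequency diff into proportional additions and removals."""
    additions, removals = {}, {}
    for word, change in delta.items():
        if change == 0:
            continue  # unchanged words are neither additions nor removals
        # Assumption: a word absent from the old revision uses a denominator
        # of 1, which also avoids dividing by zero.
        denominator = max(old.get(word, 0), 1)
        target = additions if change > 0 else removals
        target[word] = change / denominator
    return additions, removals


old = {"this": 1, "is": 1, "a": 2, "content": 2, "i": 1, "have": 1}
delta = {"content": 0, "is": 1, "this": 1, "a": -1, "have": -1, "i": -1}

additions, removals = proportional_changes(old, delta)
print(additions)
print(removals)
```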

I suspect that this will be particularly valuable for badwords. E.g. if one were to edit the article about a particular curse word (e.g. https://en.wikipedia.org/wiki/Shit), adding a new instance of that curse word to the article would result in a minor proportional change while adding a different curse word would result in a large proportional change.
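
To make the badword case concrete, here is a toy calculation. The counts are made up, "damn" stands in for "a different curse word", and the denominator of 1 for a word that was not previously on the page is an assumption.

```python
# Made-up counts for an article that is about a particular curse word.
old = {"shit": 50, "word": 10}

# An edit that adds one more instance of that word and one different badword.
delta = {"shit": 1, "damn": 1}

# Assumption: a word that was not on the page before uses a denominator of 1.
proportions = {
    word: change / max(old.get(word, 0), 1)
    for word, change in delta.items()
}
print(proportions)
```

Under these assumptions the repeated word changes by 2% while the new badword changes by 100%, which is the contrast described above.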

Event Timeline

Halfak created this task. Dec 9 2015, 8:49 PM
Halfak raised the priority of this task from to Needs Triage.
Halfak updated the task description.
Halfak moved this task to Active on the Scoring-platform-team (Current) board.
Halfak added a subscriber: Halfak.
Restricted Application added subscribers: StudiesWorld, Aklapper. Dec 9 2015, 8:49 PM
Halfak updated the task description. Dec 9 2015, 8:50 PM
Halfak set Security to None.

I've got the changes for this wrapped up in a big pull request I am working on. I realized that it would be a pain to implement this NLP strategy in revscoring's current structure, so I'm including it with the work for T121005.

See https://github.com/wiki-ai/revscoring/blob/features_commons/revscoring/datasources/meta/frequencies.py

Halfak claimed this task. Dec 23 2015, 4:15 AM
Halfak added a project: revscoring.
Halfak closed this task as Resolved. Jan 21 2016, 3:43 PM