Implement word frequency diff features
Closed, ResolvedPublic

Assigned To

Authored By

	Halfak
	Dec 9 2015, 8:49 PM

Description

We can do some relatively intelligent diffing between Python dicts (aka hash maps). To take advantage of that, we should implement a way to generate a word count distribution for a revision of a page.

E.g.

"This is a content.  I have a content." -->
{
  "this": 1,
  "is": 1,
  "a": 2,
  "content": 2,
  "i": 1,
  "have": 1
}
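
The mapping above can be sketched in a few lines of Python. This is only an illustration, not the revscoring implementation: the function name `word_frequencies` and the tokenization rule (lowercase the text, then treat every run of word characters as a word) are assumptions made for this example.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count case-insensitive word occurrences in a piece of text."""
    # Assumption: a "word" is any run of word characters after lowercasing.
    # A real tokenizer would likely treat apostrophes, markup, etc. differently.
    words = re.findall(r"\w+", text.lower())
    return dict(Counter(words))


print(word_frequencies("This is a content.  I have a content."))
# {'this': 1, 'is': 1, 'a': 2, 'content': 2, 'i': 1, 'have': 1}
```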

Using this, we can get a sense for how unusual a new contribution is.

"This is a content.  I have a content." --> "This is a content.  Content is this."
{
  "content": 0,
  "is": 1,
  "this": 1,
  "a": -1,
  "have": -1,
  "i": -1
}
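
A diff like the one above could be computed by subtracting the old count from the new count for every word that appears in either revision. The sketch below is a hedged illustration; `word_frequencies` and `frequency_diff` are hypothetical names, and the simple regex tokenizer is an assumption rather than what revscoring does.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count case-insensitive word occurrences (assumed tokenizer: \\w+ runs)."""
    return dict(Counter(re.findall(r"\w+", text.lower())))


def frequency_diff(old, new):
    """Return new count minus old count for every word seen in either table."""
    delta = {}
    for word in sorted(old.keys() | new.keys()):
        # Words missing from one side count as zero there.
        delta[word] = new.get(word, 0) - old.get(word, 0)
    return delta


old = word_frequencies("This is a content.  I have a content.")
new = word_frequencies("This is a content.  Content is this.")
print(frequency_diff(old, new))
```

Unchanged words such as "content" are kept with a delta of 0, matching the example in the description.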

Using this representation of a word frequency diff, we can detect changes that add new words to the page (probably strongly associated with new meaning) by looking at proportional additions and removals. Dividing each change in the diff above by that word's count in the initial revision, we get the following addition and removal proportions.

additions:
{
  "is": 1,  # +1/1 = increase of 100%
  "this": 1  # +1/1 = increase of 100%
}
removals:
{
  "a": -0.5, # -1/2 = decrease of 50%
  "have": -1, # -1/1 = decrease of 100%
  "i": -1 # -1/1 = decrease of 100%
}
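
The proportions above come from dividing each non-zero delta by the word's count in the initial revision. A minimal sketch follows, with `proportional_changes` as a hypothetical name. The description does not say what to do for a word that was absent from the initial revision, so dividing by 1 in that case is purely an assumption of this example.

```python
def proportional_changes(old, delta):
    """Split a frequency diff into proportional additions and removals."""
    additions, removals = {}, {}
    for word, change in delta.items():
        if change == 0:
            continue  # unchanged words contribute nothing
        # Assumption: a word absent from the old revision is scaled against 1
        # to avoid dividing by zero; the task description leaves this open.
        proportion = change / max(old.get(word, 0), 1)
        if change > 0:
            additions[word] = proportion
        else:
            removals[word] = proportion
    return additions, removals


old = {"this": 1, "is": 1, "a": 2, "content": 2, "i": 1, "have": 1}
delta = {"content": 0, "is": 1, "this": 1, "a": -1, "have": -1, "i": -1}
additions, removals = proportional_changes(old, delta)
print(additions)  # {'is': 1.0, 'this': 1.0}
print(removals)   # {'a': -0.5, 'have': -1.0, 'i': -1.0}
```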

I suspect that this will be particularly valuable for badwords. E.g. if one were to edit the article about a particular curse word (e.g. https://en.wikipedia.org/wiki/Shit), adding a new instance of that curse word to the article would result in a minor proportional change while adding a different curse word would result in a large proportional change.
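
To make the badwords intuition concrete, here is a small sketch with made-up numbers. The counts, the placeholder words `badword_a` and `badword_b`, and the choice to scale a previously absent word against 1 are all assumptions for illustration only.

```python
def proportional_change(old_count, added):
    """Proportional increase from adding `added` instances of a word."""
    # Assumption: a word absent from the old revision is scaled against 1.
    return added / max(old_count, 1)


# Hypothetical counts for an article that is about "badword_a".
old_counts = {"badword_a": 50}

# One more instance of the word the article already uses heavily.
same_word = proportional_change(old_counts.get("badword_a", 0), 1)
# One instance of a different bad word that never appeared before.
other_word = proportional_change(old_counts.get("badword_b", 0), 1)

print(same_word)   # 0.02
print(other_word)  # 1.0
```

Under these assumed numbers the repeated word registers as a 2% change, while the new word registers as a 100% change.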

Related Objects

Status	Assigned	Task
Resolved	Halfak	T120138 [Epic] Explore disparate impacts of damage detection and goodfaith prediction on anons and newcomers.
Resolved	Halfak	T122269 [epic] revscoring 1.0.0
Resolved	Halfak	T121003 Implement word frequency diff features

Event Timeline

Halfak created this task. Dec 9 2015, 8:49 PM

Halfak raised the priority of this task from to Needs Triage.

Halfak updated the task description.

Halfak added a project: Machine-Learning-Team (Active Tasks).

Halfak moved this task to Parked on the Machine-Learning-Team (Active Tasks) board.

Halfak subscribed.

Restricted Application added subscribers: StudiesWorld, Aklapper. Dec 9 2015, 8:49 PM

Halfak updated the task description. Dec 9 2015, 8:50 PM

Halfak set Security to None.

Halfak moved this task from Parked to Backlog on the Machine-Learning-Team (Active Tasks) board. Dec 23 2015, 4:13 AM

Halfak added a parent task: T122269: [epic] revscoring 1.0.0.

I've got the changes for this wrapped up in a big pull request I am working on. I realized that it would be a pain to implement this NLP strategy in revscoring's current structure, so I'm including it with the work for T121005.

See https://github.com/wiki-ai/revscoring/blob/features_commons/revscoring/datasources/meta/frequencies.py

Halfak claimed this task. Dec 23 2015, 4:15 AM

Halfak added a project: revscoring.

Halfak moved this task from Backlog to Review on the Machine-Learning-Team (Active Tasks) board. Dec 29 2015, 3:41 PM

ToAruShiroiNeko removed a project: revscoring. Jan 1 2016, 2:36 PM

Halfak moved this task from Review to Completed on the Machine-Learning-Team (Active Tasks) board. Jan 15 2016, 5:57 PM

Halfak closed this task as Resolved. Jan 21 2016, 3:43 PM
