NLP (PCFG) work (September 28th, 2016)

(This post was copied from https://lists.wikimedia.org/pipermail/ai/2016-September/000098.html)

I've been looking at some recent work that used Probabilistic Context-free Grammars[1,2] to detect vandalism in Wikipedia. I wanted to send a quick message to share some progress.

I've built a Python library[3] that implements a really simple PCFG training and scoring strategy, and I've written a quick demo of how it works. In the demo below, I build a probabilistic grammar from the "I'm a Little Teapot" song[4]. Note how sentences that are not characteristic of the song score lower. Scores are log-scaled, so values closer to zero indicate sentences that fit the grammar better.

>>> sentences = [
...              "I am a little teapot",
...              "Here is my handle",
...              "Here is my spout",
...              "When I get all steamed up I just shout tip me over and pour me out",
...              "I am a very special pot",
...              "It is true",
...              "Here is an example of what I can do",
...              "I can turn my handle into a spout",
...              "Tip me over and pour me out"]
>>>
>>>
>>> teapot_grammar = TreeScorer.from_tree_bank(bllip_parse(s) for s in sentences)
>>>
>>> teapot_grammar.score(bllip_parse("Here is a little teapot"))
-9.392661928770137
>>> teapot_grammar.score(bllip_parse("It is my handle"))
-10.296301543090733
>>> teapot_grammar.score(bllip_parse("I am a spout"))
-10.40166205874856
>>> teapot_grammar.score(bllip_parse("Your teapot is gay"))
-12.96352974967269
>>> teapot_grammar.score(bllip_parse("Your mom's teapot is asldasnldansldal"))
-19.424997926026403
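
For readers curious about the idea behind the scores above, here is a minimal sketch of PCFG training and scoring in plain Python. It is not kasami's implementation: the tuple-based tree format, the function names (productions, train, score), and the fixed penalty for unseen productions are all assumptions made for illustration. The trees are written by hand rather than produced by a parser.

```python
import math
from collections import Counter

# Hypothetical tree format: (label, child, child, ...); a leaf is a plain string.
def productions(tree):
    """Yield (parent_label, child_labels) for every internal node."""
    label, *children = tree
    yield label, tuple(c if isinstance(c, str) else c[0] for c in children)
    for child in children:
        if not isinstance(child, str):
            yield from productions(child)

def train(trees):
    """Count productions and parent labels across a tree bank."""
    rule_counts, parent_counts = Counter(), Counter()
    for tree in trees:
        for parent, children in productions(tree):
            rule_counts[parent, children] += 1
            parent_counts[parent] += 1
    return rule_counts, parent_counts

def score(tree, rule_counts, parent_counts, unseen_logp=-10.0):
    """Sum log P(children | parent) over the tree's productions.

    unseen_logp is an arbitrary penalty (an assumption of this sketch)
    for productions that never appeared in training.
    """
    total = 0.0
    for parent, children in productions(tree):
        count = rule_counts[parent, children]
        if count:
            total += math.log(count / parent_counts[parent])
        else:
            total += unseen_logp
    return total

# A tiny hand-built tree bank standing in for parser output.
bank = [
    ("S", ("NP", "I"), ("VP", ("V", "am"), ("NP", "a", "teapot"))),
    ("S", ("NP", "I"), ("VP", ("V", "am"), ("NP", "a", "spout"))),
]
rule_counts, parent_counts = train(bank)

seen = score(bank[0], rule_counts, parent_counts)
odd = score(
    ("S", ("NP", "You"), ("VP", ("V", "am"), ("NP", "a", "teapot"))),
    rule_counts, parent_counts,
)
```

In this sketch a tree that reuses productions from the training bank (seen) scores closer to zero than one containing a production that was never observed (odd), which mirrors the pattern in the demo scores.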

This work is inspired by work that Arthur Tilley (@aetilley) did on our team last year[5]. The 'kasami' library represents a narrow slice of Arthur's work.

Next, I'm working on building out revscoring to implement some features
that use the scoring strategy on sentences modified in an edit. I'm hoping
that this type of feature engineering will allow us to catch edits that
make articles more/less notable. I'm also targeting spammy language and
insults.
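
As a rough illustration of what such a feature might look like, the sketch below finds the sentences an edit adds and summarizes their scores. None of this is revscoring's API: split_sentences, added_sentences, edit_features, and the toy scorer are hypothetical names, the sentence splitter is deliberately naive, and score_sentence stands in for whatever scoring function is available (for example, parsing a sentence and scoring its tree).

```python
import re

def split_sentences(text):
    """Naive sentence splitter; a real pipeline would use a proper tokenizer."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def added_sentences(old_text, new_text):
    """Sentences present after the edit but not before it."""
    before = set(split_sentences(old_text))
    return [s for s in split_sentences(new_text) if s not in before]

def edit_features(old_text, new_text, score_sentence):
    """Summarize the scores of sentences added by an edit.

    score_sentence is any callable mapping a sentence to a log-scaled
    score (a hypothetical stand-in for a grammar-based scorer).
    """
    scores = [score_sentence(s) for s in added_sentences(old_text, new_text)]
    if not scores:
        return {"added": 0, "min_score": 0.0, "mean_score": 0.0}
    return {
        "added": len(scores),
        "min_score": min(scores),
        "mean_score": sum(scores) / len(scores),
    }

def toy_scorer(sentence):
    """Demonstration only: longer sentences get lower scores."""
    return -float(len(sentence.split()))

old = "Here is my handle. Here is my spout."
new = "Here is my handle. Your teapot is bad. Here is my spout."
features = edit_features(old, new, toy_scorer)
```

The minimum score is included because a single badly scoring sentence may be more telling than the average across an edit; that choice is a guess, not a tested result.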

  1. https://en.wikipedia.org/wiki/Stochastic_context-free_grammar
  2. http://pub.cs.sunysb.edu/~rob/papers/acl11_vandal.pdf
  3. https://github.com/halfak/kasami
  4. https://en.wikipedia.org/wiki/I%27m_a_Little_Teapot
  5. https://github.com/aetilley/pcfg

-@Halfak

Written by Halfak on Jun 3 2017, 5:05 PM.
Principal Research Scientist