
[Epic] Implement PCFG features for editquality and draftquality
Open, LowestPublic

Description

See https://en.wikipedia.org/wiki/Stochastic_context-free_grammar

Implement a feature that, given a set of sentences, produces a likelihood ratio representing how likely each sentence is to have been generated from a particular subset of a corpus (e.g. "vandalism", "spam", "featured", "attack", etc.).

Scoring library: https://github.com/halfak/kasami
Sentence models: https://github.com/wiki-ai/wikigrammars
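
Roughly, the feature boils down to a log-likelihood ratio between two sentence models. A minimal sketch, assuming one model fit on the target corpus (e.g. "spam") and one on a background corpus (e.g. "featured"); the model objects and their log_probability method are placeholders here, not the actual kasami API:

def pcfg_likelihood_ratio(sentence, target_model, background_model):
    """Log-likelihood ratio of a sentence under a target corpus model
    (e.g. "spam") vs. a background model (e.g. "featured").

    Both models are assumed to expose a hypothetical
    log_probability(sentence) method; positive values mean the sentence
    looks more like the target corpus than the background."""
    return (target_model.log_probability(sentence) -
            background_model.log_probability(sentence))

# Hypothetical usage over a revision's added sentences:
#   feature_value = max(pcfg_likelihood_ratio(s, spam_model, featured_model)
#                       for s in added_sentences)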

Event Timeline

I've been working from https://github.com/halfak/pcfg

I've been thinking about file formats. I have *a lot* of concerns. E.g. the files use spaces as delimiters, so any spaces in literals will need to be escaped. Also, you need to process the entire file to determine which symbols are literals and which are variables. Worse, what if a literal is the same as a symbol? That's totally undefined!

E.g.

S NP VP 1.0
VP VT NP 0.8333333333333334
VP VI ADV 0.16666666666666666
NP DET NN 0.45454545454545453
NN cat 0.6
NP Fluffy 0.2727272727272727
VT loves 0.4
ADV soundly 1.0
NP Fido 0.2727272727272727
VT is 0.6
VI sleeps 1.0
DET a 0.8
DET Every 0.2
NN dog 0.4

What if we had processed a sentence like "Fluffy loves NN"? Now "NN" is both a symbol and a literal!? AHHH!
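
To make the ambiguity concrete: a loader for this format has to read the whole file before it can classify right-hand-side tokens, and any token that happens to match a left-hand-side symbol silently becomes a variable. A minimal sketch of such a loader (hypothetical, just to illustrate the problem):

def load_space_delimited(lines):
    """Naive loader for the space-delimited rule format above (hypothetical)."""
    rules = []
    variables = set()
    for line in lines:
        source, *rest = line.split()   # spaces inside literals would break here
        proba = float(rest[-1])
        rules.append((source, rest[:-1], proba))
        variables.add(source)

    # Second pass: a right-hand-side token can only be classified after the
    # whole file has been read, and a literal that matches a symbol name
    # (e.g. a sentence containing the word "NN") is silently misread as a
    # variable -- exactly the ambiguity described above.
    classified = []
    for source, targets, proba in rules:
        classified.append((source,
                           [(t, "variable" if t in variables else "literal")
                            for t in targets],
                           proba))
    return classified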

So, I propose changing the format to use JSON, with explicit keys for what is a variable and what is a literal.

{"conj": {"source": "S", "targets": ["NP", "VP"]}, "proba": 1.0}
{"conj": {"source": "VP", "targets": ["VT", "NP"]}, "proba": 0.8333333333333334}
{"conj": {"source": "VP", "targets": ["VI", "ADV"]}, "proba": 0.16666666666666666}
{"conj": {"source": "NP", "targets": ["DET", "NN"]}, "proba": 0.45454545454545453}
{"disj": {"source": "NN", "literal": "cat"}, "proba": 0.6}
{"disj": {"source": "NP", "literal": "Fluffy"}, "proba": 0.2727272727272727}
{"disj": {"source": "VT", "literal": "loves"}, "proba": 0.4}
{"disj": {"source": "ADV", "literal": "soundly"}, "proba": 1.0}
{"disj": {"source": "NP", "literal": "Fido"}, "proba": 0.2727272727272727}
{"disj": {"source": "VT", "literal": "is"}, "proba": 0.6}
{"disj": {"source": "VI", "literal": "sleeps"}, "proba": 1.0}
{"disj": {"source": "DET", "literal": "a"}, "proba": 0.8}
{"disj": {"source": "DET", "literal": "Every"}, "proba": 0.2}
{"disj": {"source": "NN", "literal": "dog"}, "proba": 0.4}

Here, "conj" represents a conjunction that produces new symbols (aka "non-terminal") and "disj" represents a disjunction that produces a literal (aka "terminal")

We could further extend this into a document format that records a format version, the start symbol, and an explicit list of the variables used, which would help with validation.

{
  "version": "0.0.1",
  "start_symbol": "S",
  "variables": ["ADV", "NN", "NP", "S", "VI", "VP", "VT", "DET"],
  "rules": [
    {"conj": {"source": "S", "targets": ["NP", "VP"]}, "proba": 1.0}
    {"conj": {"source": "VP", "targets": ["VT", "NP"]}, "proba": 0.8333333333333334}
    {"conj": {"source": "VP", "targets": ["VI", "ADV"]}, "proba": 0.16666666666666666}
    {"conj": {"source": "NP", "targets": ["DET", "NN"]}, "proba": 0.45454545454545453}
    {"disj": {"source": "NN", "literal": "cat"}, "proba": 0.6}
    {"disj": {"source": "NP", "literal": "Fluffy"}, "proba": 0.2727272727272727}
    {"disj": {"source": "VT", "literal": "loves"}, "proba": 0.4}
    {"disj": {"source": "ADV", "literal": "soundly"}, "proba": 1.0}
    {"disj": {"source": "NP", "literal": "Fido"}, "proba": 0.2727272727272727}
    {"disj": {"source": "VT", "literal": "is"}, "proba": 0.6}
    {"disj": {"source": "VI", "literal": "sleeps"}, "proba": 1.0}
    {"disj": {"source": "DET", "literal": "a"}, "proba": 0.8}
    {"disj": {"source": "DET", "literal": "Every"}, "proba": 0.2}
    {"disj": {"source": "NN", "literal": "dog"}, "proba": 0.4}
  ]
}
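
One nice property of the document format is that it can be validated up front. A rough sketch of the kind of checks I have in mind (the exact checks are an assumption, not implemented anywhere yet):

from math import isclose

def validate_grammar(doc):
    """Sanity checks enabled by the document format sketched above."""
    variables = set(doc["variables"])
    assert doc["start_symbol"] in variables, "unknown start symbol"

    totals = {}
    for rule in doc["rules"]:
        body = rule.get("conj") or rule.get("disj")
        assert body["source"] in variables, \
            "rule source {0} is not a declared variable".format(body["source"])
        if "conj" in rule:
            assert all(t in variables for t in body["targets"]), \
                "conj targets must be declared variables"
        totals[body["source"]] = totals.get(body["source"], 0.0) + rule["proba"]

    # Each variable's outgoing rule probabilities should sum to ~1.0.
    for source, total in totals.items():
        assert isclose(total, 1.0, rel_tol=1e-6), \
            "probabilities for {0} sum to {1}".format(source, total)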

Some notes from "pintoch" in Research:

if you want pre-trained models for stochastic CFG parsers, you can use (for instance) these: http://nlp.stanford.edu/software/lex-parser.shtml

I've been thinking about how to represent parse trees. The standard format (Penn Treebank-style bracketing, as far as I can tell) looks like this:

(S (NP (DET Every) (NN cat)) (VP (VT loves) (NP (DET a) (NN dog))))
(S (NP Fido) (VP (VT is) (NP (DET a) (NN cat))))
(S (NP Fido) (VP (VT is) (NP (DET a) (NN dog))))
(S (NP Fluffy) (VP (VT is) (NP (DET a) (NN cat))))
(S (NP Fido) (VP (VT loves) (NP Fluffy)))
(S (NP Fluffy) (VP (VI sleeps) (ADV soundly)))

The JSON format I came up with looks like this (with pretty formatting):

{"source": "S", "produces": {
  "symbols": [
    {"source": "NP", "produces": {
      "symbols": [
        {"source": "DET", "produces": {"literal": "Every"}}, 
        {"source": "NN", "produces": {"literal": "cat"}}
      ]}
    }, 
    {"source": "VP", "produces": {
      "symbols": [
        {"source": "VT", "produces": {"literal": "Every"}}, 
        {"source": "NP", "produces": {"symbols": [
          {"source": "DET", "produces": {"literal": "a"}}, 
          {"source": "NN", "produces": {"literal": "dog"}}
        ]}}
      ]}
    }
  ]}
}
{"source": "S", "produces": {
  "symbols": [
    {"source": "NP", "produces": {"literal": "Fido"}}, 
    {"source": "VP", "produces": {
      "symbols": [
        {"source": "VT", "produces": {"literal": "is"}}, 
        {"source": "DET", "produces": {"literal": "a"}}, 
        {"source": "NN", "produces": {"literal": "cat"}}
      ]}
    }
  ]}
}
{"source": "S", "produces": {
  "symbols": [
    {"source": "NP", "produces": {"literal": "Fido"}}, 
    {"source": "VP", "produces": {
      "symbols": [
        {"source": "VT", "produces": {"literal": "is"}}, 
        {"source": "DET", "produces": {"literal": "a"}}, 
        {"source": "NN", "produces": {"literal": "dog"}}
      ]}
    }
  ]}
}
{"source": "S", "produces": {
  "symbols": [
    {"source": "NP", "produces": {"literal": "Fluffy"}}, 
    {"source": "VP", "produces": {
      "symbols": [
        {"source": "VT", "produces": {"literal": "is"}}, 
        {"source": "DET", "produces": {"literal": "a"}}, 
        {"source": "NN", "produces": {"literal": "cat"}}
      ]}
    }
  ]}
}
{"source": "S", "produces": {
  "symbols": [
    {"source": "NP", "produces": {"literal": "Fido"}}, 
    {"source": "VP", "produces": {
      "symbols": [
        {"source": "VT", "produces": {"literal": "loves"}}, 
        {"source": "NP", "produces": {"literal": "Fluffy"}}
      ]}
    }
  ]}
}
{"source": "S", "produces": {
  "symbols": [
    {"source": "NP", "produces": {"literal": "Fluffy"}}, 
    {"source": "VP", "produces": {
      "symbols": [
        {"source": "VI", "produces": {"literal": "sleeps"}}, 
        {"source": "ADV", "produces": {"literal": "soundly"}}
      ]}
    }
  ]}
}

The JSON is far more verbose!
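
Converting from the bracketed format to the JSON structure is mechanical, though. A rough sketch of a converter, assuming literals never contain whitespace or parentheses (true for the toy examples above):

def parse_bracketed(text):
    """Parse one "(S (NP Fido) ...)" style tree into the nested JSON
    structure above. Assumes tokens contain no whitespace or parens."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1                      # skip "("
        source = tokens[pos]
        pos += 1
        if tokens[pos] != "(":        # terminal, e.g. (NN cat)
            literal = tokens[pos]
            pos += 2                  # skip the literal and ")"
            return {"source": source, "produces": {"literal": literal}}
        symbols = []
        while tokens[pos] == "(":     # non-terminal children
            symbols.append(read_node())
        pos += 1                      # skip ")"
        return {"source": source, "produces": {"symbols": symbols}}

    return read_node()

# parse_bracketed("(S (NP Fido) (VP (VT loves) (NP Fluffy)))")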

I looked into sentence parsers and AFAICT, the state of the art is BAD.

I installed bllipparser, and got some segmentation faults:

(3.5) [halfak@graphite: ~/projects/bllip-parser]
$ python
Python 3.5.1+ (default, Mar 30 2016, 22:46:26) 
[GCC 5.3.1 20160330] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from bllipparser import RerankingParser
>>> rp = RerankingParser.fetch_and_load('WSJ-PTB3', verbose=True)
Model directory: /home/halfak/.local/share/bllipparser/WSJ-PTB3
Model directory already exists, not reinstalling
>>> rp.parse(["I", "am", "a", "little", "teapot"])[0]
ScoredParse('(S1 (S (NP (PRP I)) (VP (VBP am) (NP (DT a) (JJ little) (NN teapot)))))', parser_score=-64.30434900543281, reranker_score=-16.740114175058775)
>>> rp.parse(["I", "am", "a", "little", "teapot", ".", " ", "What", "?"])[0]
## preterms = ((PRP i) (VBP am) (DT a) (JJ little) (NN teapot) (. .) (VP vbz (NP (WP what))) (. ?))
## tp = (S1 (S (S (NP (PRP i)) (VP (VBP am) (NP (DT a) (JJ little) (NN teapot))) (. .)) (VP vbz (NP (WP what))) (. ?)))
Segmentation fault (core dumped)

Now I'm looking into spaCy, and it's not clear that it can even produce the kind of trees we want.

>>> from spacy.en import English
>>> from spacy import parts_of_speech as pos
>>> parse = English()

>>> doc = parse("Every cat loves a dog")
>>> print("\n".join(
...   "\t".join(str(v) for v in 
...             (str(t), t.left_edge.i, t.right_edge.i, 
...              doc.vocab.strings[t.tag], doc.vocab.strings[t.dep])
...   ) 
...   for t in doc
... ))
Every	0	0	DT	det
cat	0	1	NN	nsubj
loves	0	4	VBZ	ROOT
a	3	3	DT	det
dog	3	4	NN	dobj

This should parse into some nice CNF tree, but I can't see how to do that here without applying some external rule.

(ROOT (nsubj (DET Every) (NOUN cat)) (VERB loves) (dobj (DET a) (NOUN dog)))

Note that ROOT is a ternary production, ROOT --> nsubj VERB dobj. Hmm...

Here's another way to look at that:

  • ROOT
    • nsubj
      • DT Every
      • NN cat
    • VBZ loves
    • dobj
      • DT a
      • NN dog
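
If we did want strict CNF, ternary and wider productions like this one could be binarized by introducing intermediate symbols. A quick sketch of that transformation (not something any of the code above does):

def binarize(source, targets):
    """Rewrite a production like ROOT -> nsubj VBZ dobj as a chain of binary
    productions by introducing intermediate symbols."""
    rules = []
    while len(targets) > 2:
        intermediate = "{0}|{1}".format(source, "_".join(targets[1:]))
        rules.append((source, [targets[0], intermediate]))
        source, targets = intermediate, targets[1:]
    rules.append((source, list(targets)))
    return rules

# binarize("ROOT", ["nsubj", "VBZ", "dobj"])
#   -> [("ROOT", ["nsubj", "ROOT|VBZ_dobj"]),
#       ("ROOT|VBZ_dobj", ["VBZ", "dobj"])]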

I have successfully forced spaCy to produce a production tree and I've shown that the parsing is generally performant for English.

>>> from spacy.en import English
>>> from spacy import parts_of_speech as pos
>>> parse = English()

>>> 
>>> class Production:
...   def __init__(self, source, produces):
...     self.source = source
...     self.produces = produces
...   def __str__(self):
...     return "({0} {1})".format(self.source, " ".join(map(str, self.produces)))
... 
>>> def treeify(doc):
...   for t in doc:
...     if t.head is t:
...       root = t
...   return treeify_at(root, doc)
... 
>>> def treeify_at(token, doc):
...   if token.left_edge == token.right_edge:
...     return Production(doc.vocab.strings[token.tag], [str(token)])
...   else:
...     sorted_i_trees = \
...       sorted([(child.i, treeify_at(child, doc)) for child in token.children] +
...              [(token.i, Production(doc.vocab.strings[token.tag], [str(token)]))])
...     return Production(doc.vocab.strings[token.dep],
...                       [tree for _, tree in sorted_i_trees])
... 
>>> str(treeify(parse("Every cat loves a dog")))
'(ROOT (nsubj (DT Every) (NN cat)) (VBZ loves) (dobj (DT a) (NN dog)))'
>>> 
>>> import time
>>> start = time.time();_ = sum(treeify(parse("Every cat loves a dog")) == 0 for i in range(1000));time.time() - start
0.23185133934020996

I've been thinking that our PCFG might not need to care about non-CNF productions. This parse is useful and we could learn some probabilities with it:

>>> print_tree(treeify(parse("In computer science the Cocke–Younger–Kasami algorithm alternatively called CYK or CKY is a parsing algorithm for context-free grammars named after its inventors John Cocke Daniel Younger and Tadao Kasami")))
* ROOT
  * prep
    * IN
      * 'In'
    * pobj
      * NN
        * 'computer'
      * NN
        * 'science'
      * acl
        * nsubj
          * DT
            * 'the'
          * NNP
            * 'Cocke–Younger–Kasami'
          * NN
            * 'algorithm'
        * RB
          * 'alternatively'
        * VBD
          * 'called'
        * oprd
          * NNP
            * 'CYK'
          * CC
            * 'or'
          * NNP
            * 'CKY'
  * VBZ
    * 'is'
  * nsubj
    * DT
      * 'a'
    * VBG
      * 'parsing'
    * NN
      * 'algorithm'
    * prep
      * IN
        * 'for'
      * pobj
        * amod
          * NN
            * 'context'
          * HYPH
            * '-'
          * JJ
            * 'free'
        * NNS
          * 'grammars'
        * acl
          * VBN
            * 'named'
          * prep
            * IN
              * 'after'
            * pobj
              * PRP$
                * 'its'
              * NNS
                * 'inventors'
  * dep
    * compound
      * NNP
        * 'John'
      * NNP
        * 'Cocke'
    * NNP
      * 'Daniel'
    * NNP
      * 'Younger'
    * CC
      * 'and'
    * conj
      * NNP
        * 'Tadao'
      * NNP
        * 'Kasami'

I've cleaned up a lot. See https://github.com/halfak/pcfg, specifically this commit.

Here's a demo I managed to work out that does parsing, can re-parse a formatted tree, and can produce tab-formatted trees too.

$ python
Python 3.5.1+ (default, Mar 30 2016, 22:46:26) 
[GCC 5.3.1 20160330] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from bllipparser import RerankingParser
>>> from pcfg import nodes
>>> # Loading WSJ-PTB3 treebank...
... rp = RerankingParser.fetch_and_load('WSJ-PTB3')
>>> 
>>> # Parsing sentence...
... bllip_tree = rp.parse("I am a little teapot.")[0].ptb_parse
>>> def format_blliptree(tree):
...     if len(tree.subtrees()) == 0:
...         return "({0} {1})".format(tree.label, repr(tree.token))
...     else:
...         return "({0} {1})".format(
...             tree.label, " ".join(format_blliptree(sub_tree)
...                                  for sub_tree in tree.subtrees()))
... 
>>> line = format_blliptree(bllip_tree)
>>> # bllipparser's parsed line:
... str(line)
(S1 (S (NP (PRP 'I')) (VP (VBP 'am') (NP (DT 'a') (JJ 'little') (NN 'teapot'))) (. '.')))
>>> 
>>> tree = nodes.parse(line)
>>> # our parsed line:
... print(str(tree))
(S1 (S (NP (PRP 'I')) (VP (VBP 'am') (NP (DT 'a') (JJ 'little') (NN 'teapot'))) (. '.')))
>>> # Tree format
... print(tree.format(depth=1))
	(S1
		(S
			(NP
				(PRP 'I')
			)
			(VP
				(VBP 'am')
				(NP
					(DT 'a')
					(JJ 'little')
					(NN 'teapot')
				)
			)
			(. '.')
		)
	)
>>> 
>>> # our productions
>>> for production in tree:
...     print("\t" + str(production))
... 
	(S1 S)
	(S NP VP .)
	(NP PRP)
	(PRP 'I')
	(VP VBP NP)
	(VBP 'am')
	(NP DT JJ NN)
	(DT 'a')
	(JJ 'little')
	(NN 'teapot')
	(. '.')

I think we're a couple hours of work away from being able to train a PCFG on a sequence of trees/productions.
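
The training step itself is essentially relative-frequency estimation over production counts. A minimal sketch, assuming each tree iterates over (source, targets) production tuples like the ones printed above:

from collections import Counter, defaultdict

def train_pcfg(trees):
    """Estimate rule probabilities by relative frequency over the productions
    observed in a sequence of parse trees."""
    counts = Counter()
    source_totals = defaultdict(int)
    for tree in trees:
        for source, targets in tree:
            counts[(source, tuple(targets))] += 1
            source_totals[source] += 1
    return {(source, targets): count / source_totals[source]
            for (source, targets), count in counts.items()}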

Halfak renamed this task from Implement PCFG features to Implement PCFG features for editquality. Oct 13 2016, 1:30 PM
Halfak renamed this task from Implement PCFG features for editquality to [Epic] Implement PCFG features for editquality. Nov 10 2016, 3:56 PM
Halfak raised the priority of this task from Low to High.
Halfak moved this task from Parked to Non-Epic on the Machine-Learning-Team (Active Tasks) board.
Halfak added a project: Epic.
Halfak renamed this task from [Epic] Implement PCFG features for editquality to [Epic] Implement PCFG features for editquality and draftquality. Nov 28 2016, 9:36 PM
Restricted Application added a subscriber: jhsoby-WMNO.
Halfak removed Halfak as the assignee of this task. Apr 2 2019, 9:38 PM
Halfak lowered the priority of this task from High to Lowest.
Halfak moved this task from Ideas to Epic on the Machine-Learning-Team board.