
Implement sentences datasources
Closed, Resolved (Public)

Description

Implement revision.sentences, revision.parent.sentences and revision.diff.sentences.

revision.diff.sentences will contain two lists:

  • old sentences
  • new sentences

Normalization should make the sentences parse-able by bllip and spaCy.
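
One way the two lists for revision.diff.sentences could be derived is sketched below. It assumes that "old sentences" means sentences present only in the parent revision and "new sentences" means sentences present only in the current revision; the task does not spell this out. The function name diff_sentences is hypothetical, and the standard library's difflib stands in for whatever diff algorithm is actually used.

```python
from difflib import SequenceMatcher


def diff_sentences(parent_sentences, current_sentences):
    """Return (old, new): sentences dropped from the parent and sentences added.

    Hypothetical helper, not part of revscoring or deltas.
    """
    matcher = SequenceMatcher(
        a=parent_sentences, b=current_sentences, autojunk=False
    )
    old, new = [], []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        # "replace" contributes to both lists; "equal" contributes to neither.
        if tag in ("delete", "replace"):
            old.extend(parent_sentences[a1:a2])
        if tag in ("insert", "replace"):
            new.extend(current_sentences[b1:b2])
    return old, new


parent = ["A.", "B.", "C."]
current = ["A.", "B2.", "C.", "D."]
old_sentences, new_sentences = diff_sentences(parent, current)
```

Comparing whole sentences as opaque items keeps the diff cheap, at the cost of treating a one-word edit as a full sentence replacement.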

Event Timeline

OK. So I've been looking at getting clean sentences out of Wikipedia articles. I get a lot of nonsense around tables. It would be nice to have a good way to just parse them out before processing sentences.

When bllip needs to parse a non-sentence, it takes a long time.

>>> start = time.time();bllip_rrp.parse('{| class="infobox" style="width: 280px;"  |-  |  {|')[0];print(time.time() - start)
ScoredParse("(S1 (S (-LRB- -LCB-) (NP (NNP |)) (VP (VBZ class=) (NP (`` ``) (NP (NN infobox)) ('' '') (CC style=) (`` ``) (NP (NP (NN width)) (: :) (NP (NN 280px)) (: ;)) ('' '') (NP (NP (JJ |-) (NN |)) (-LRB- -LCB-) (NP (NNP |)))))))", parser_score=-366.92503034039254, reranker_score=-108.16146420553363)
4.158325672149658

Note that the parse above took 4.16 seconds. I couldn't find a good way to limit bllip so that it only considers parses that are somewhat likely. But we could put a timeout on the function call and stop waiting if parsing takes longer than 0.1 seconds or something like that.
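
A minimal sketch of that timeout idea, using only the standard library. The names call_with_timeout, slow_parse and fast_parse are hypothetical stand-ins; nothing here calls bllip. Note the main limitation: a thread cannot be killed, so on a timeout the caller stops waiting but the parse keeps running in the background. Reclaiming the CPU would need the parser to run in a separate process that can be terminated.

```python
import threading
import time


def call_with_timeout(func, args=(), timeout=0.1, default=None):
    """Return func(*args), or `default` if it has not finished after `timeout` seconds."""
    result = {}

    def target():
        try:
            result["value"] = func(*args)
        except Exception as error:  # re-raised in the caller below
            result["error"] = error

    # Daemon thread: an abandoned call will not block interpreter exit.
    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout)
    if worker.is_alive():
        return default  # gave up waiting; the worker is still running
    if "error" in result:
        raise result["error"]
    return result["value"]


def slow_parse(text):
    # Hypothetical stand-in for a parser that struggles with non-sentences.
    time.sleep(0.5)
    return "(S1 ...)"


def fast_parse(text):
    # Hypothetical stand-in for a parse that returns quickly.
    return "(ROOT ...)"


timed_out = call_with_timeout(slow_parse, ("{| class=infobox",), timeout=0.05)
finished = call_with_timeout(fast_parse, ("A real sentence.",), timeout=0.05)
```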

With spaCy, the parse of non-sentences is fast.

>>> start = time.time();print(spacy_norm.normalize_tree(parse('{| class="infobox" style="width: 280px;"  |-  |  {|')));print(time.time() - start)
(ROOT (nmod (XX '|-') (SP ' ')) (nmod (XX '|') (SP ' ')) (-LRB- '{') (NN '|'))
0.003621816635131836

But it also looks like the trees get really weird. I'll need to look into that too.

OK. I've managed to build a simple sub-parser for handling tables. Essentially, I'll parse an entire table into one giant "sentence". This works well with mwparserfromhell's strip_code(), which will then remove the whole table. Unfortunately, strip_code() gets confused by image links.

E.g.

>>> import mwparserfromhell as mwp
>>> mwp.parse("[[File:Foo|thumb|alt=A drawing of a fly from facing up|derp]]").strip_code()
'thumb|alt=A drawing of a fly from facing up|derp'

We might want to do our own normalizations to handle links like this.
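
A rough sketch of one such normalization: drop [[File:...]] and [[Image:...]] links before any further stripping. This is plain Python with no mwparserfromhell calls, and the name drop_file_links is hypothetical. It tracks bracket depth so that links nested inside a caption do not end the file link early. It does not handle localized namespace names or other media prefixes.

```python
import re

# Start of a file/image link, e.g. "[[File:" or "[[ image :"
FILE_LINK_START = re.compile(r"\[\[\s*(?:File|Image)\s*:", re.IGNORECASE)


def drop_file_links(wikitext):
    """Remove file/image links, including any links nested in their captions."""
    out = []
    pos = 0
    while True:
        match = FILE_LINK_START.search(wikitext, pos)
        if match is None:
            out.append(wikitext[pos:])
            break
        out.append(wikitext[pos:match.start()])
        # Walk forward until the opening "[[" is balanced by a "]]".
        depth = 0
        i = match.start()
        while i < len(wikitext):
            if wikitext.startswith("[[", i):
                depth += 1
                i += 2
            elif wikitext.startswith("]]", i):
                depth -= 1
                i += 2
                if depth == 0:
                    break
            else:
                i += 1
        # An unclosed file link swallows the rest of the text.
        pos = i
    return "".join(out)


plain = drop_file_links(
    "[[File:Foo|thumb|alt=A drawing of a fly from facing up|derp]]"
)
nested = drop_file_links("A [[File:Foo|thumb|a [[fly]] drawing]] B")
```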

OK. I've implemented some better sentence parsing in deltas. See https://github.com/halfak/deltas/commit/85fe0b2bce4a6fbe7b945a1776ce9798b0951132

I think that, next, we'll want to run this on one of the proposed PCFG sources to try to get some sentences extracted.

I have a working scorer model trained on Featured Article sentences.

$ ./utility score models/enwiki.fa_sentence.model 'Don Henley lends his vocals shadowing lead singer Steven Tyler in parts of this song.'
2016-11-10 09:35:06,647 INFO:wikigrammar.utilities.score -- Loading model...
-136.25691142784373
$ ./utility score models/enwiki.fa_sentence.model 'Pentare is a world-class provider of pool supplies, filters, and pump equipment.'
2016-11-10 09:37:35,289 INFO:wikigrammar.utilities.score -- Loading model...
-123.57435948374528

I'll need to do some data analysis to see where we are and are not successfully differentiating good from bad sentences. In the example above, a sentence from a featured article scores worse (-136.26) than a very POV/advertisey sentence (-123.57).

I did some work yesterday to check on memory usage and the story isn't very good. Here are some notes I took in IRC.

[16:28:47] <halfak> OK...  So the PCFG pattern seems to be struggling with a major constraint
[16:29:06] <halfak> Loading spacy's parser into memory requires 1.7 GB
[16:29:17] <halfak> That would double our memory footprint
[16:44:55] <halfak> And it looks like my FA article PCFG is taking up an additional 1.5GB of memory. 
[16:45:10] <halfak> I wonder if we could trim that down by collapsing all proper nouns.
[16:58:37] <halfak> OK just wrote some code for that and I'm building a new model.  I think we'll want to do something for numbers too, but that'll need to wait for tomorrow. 
[16:58:38] <halfak> o/

Well. I built the new model that collapses all proper nouns and that brought us down from 3.2GB RES to 3.0GB RES. I'm looking at numbers next.
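
For illustration, here is a sketch of what that collapsing step amounts to, judging from the (NNP 'NNP') and (CD 'CD') productions that show up in the frequency table further down. It works on flat (tag, token) pairs rather than real parse trees, and the names COLLAPSED_TAGS and collapse are hypothetical, not taken from the actual code.

```python
from collections import Counter

# Tags whose tokens are replaced by the tag itself (assumed from the
# (NNP 'NNP') and (CD 'CD') productions in the frequency table).
COLLAPSED_TAGS = {"NNP", "CD"}


def collapse(tagged_tokens):
    """Replace the token of every collapsed tag with the tag name."""
    return [
        (tag, tag) if tag in COLLAPSED_TAGS else (tag, token)
        for tag, token in tagged_tokens
    ]


tagged = [
    ("NNP", "Don"), ("NNP", "Henley"), ("VBZ", "lends"),
    ("NNP", "Steven"), ("NNP", "Tyler"), ("CD", "3"),
]
before = Counter(tagged)           # 6 distinct terminal productions
after = Counter(collapse(tagged))  # 3 distinct terminal productions
```

Every distinct proper noun or number would otherwise get its own terminal production, so collapsing mainly shrinks the long tail of rarely seen entries.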

It's weird that the pickled file sizes are so small, but the in-memory space is so large.

$ ls -alh
total 344M
drwxrwxr-x 2 halfak halfak 4.0K Nov 21 16:58 .
drwxrwxr-x 6 halfak halfak 4.0K Nov  7 10:26 ..
-rw-rw-r-- 1 halfak halfak 181M Nov  7 11:42 enwiki.fa_sentence.model
-rw-rw-r-- 1 halfak halfak 164M Nov 21 17:53 enwiki.fa_sentence.normalized.model

So, it looks like collapsing numbers didn't really matter that much, and I decided to dig into the production frequencies instead.

>>> sum(1 for k, v in prod_freqs if v > 0)
1563150
>>> sum(1 for k, v in prod_freqs if v > 1)
424316
>>> sum(1 for k, v in prod_freqs if v > 2)
272907
>>> sum(1 for k, v in prod_freqs if v > 3)
209849
>>> sum(1 for k, v in prod_freqs if v > 4)
174048
>>> sum(1 for k, v in prod_freqs if v > 5)
150498
>>> sum(1 for k, v in prod_freqs if v > 6)
133783

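It looks like we can drop our data size from about 1.56m productions to about 273k, a factor of roughly 5.7, if we only allow productions that happen 3 or more times. A full order of magnitude would need a cutoff of 6 or more occurrences (150,498 productions).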
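
The same comparison can be scripted so that each cutoff also reports how much of the observed data survives, not just how many productions do. This is a sketch on toy numbers: survivors_by_threshold and toy_prod_freq are made up for illustration, and it assumes the frequencies live in a dict of production to count.

```python
def survivors_by_threshold(prod_freq, thresholds):
    """Map each cutoff to (productions kept, share of observations kept)."""
    total_obs = sum(prod_freq.values())
    report = {}
    for min_freq in thresholds:
        kept = {p: f for p, f in prod_freq.items() if f >= min_freq}
        report[min_freq] = (len(kept), sum(kept.values()) / total_obs)
    return report


# Toy frequencies, not real model data.
toy_prod_freq = {
    "(DT 'the')": 50,
    "(IN 'of')": 30,
    "(NN 'fly')": 2,
    "(NN 'infobox')": 1,
    "(JJ 'advertisey')": 1,
}
report = survivors_by_threshold(toy_prod_freq, [1, 3])
```

In this toy case a cutoff of 3 drops three of the five productions but keeps about 95% of the observations, which is the pattern to look for before trimming the real model.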

Let's look at the most common productions.

>>> print("\n".join(str(prod) + "\t" + str(freq) for prod, freq in prod_freqs[-20:]))
(`` '"')	176059
(IN 'for')	181485
(POS "'s")	205679
(DT 'The')	216007
(pobj DT NN)	229549
(prep IN NNP)	240849
(IN 'to')	265867
(HYPH '-')	275290
(TO 'to')	282689
(VBD 'was')	283549
(DT 'a')	449022
(IN 'in')	539483
(CC 'and')	705146
(IN 'of')	837665
(CD 'CD')	978336
(. '.')	1099003
(, ',')	1517796
(DT 'the')	1550613
(prep IN pobj)	2149169
(NNP 'NNP')	3755058

OK. Well, not too much of a surprise there. Note that we have a huge 3.76m observations of our collapsed proper nouns (NNP) and about 1m (978k) observations of our collapsed quantities (CD).

>>> model.score("I am from a featured article.")
-43.74413929993044
>>> model.score("I am from a butt article.")
-46.12711286071866
>>> model.score("Biology is a natural science concerned with the study of life and living organisms, including their structure, function, growth, evolution, distribution, identification and taxonomy.")
-194.34473241301927
>>> model.score("Biology is a natural science concerned with the study of life and living organisms, including their structure, stupid butts, growth, evolution, distribution, identification and taxonomy.")
-211.3557184263918
>>> model.score("I’ve gotten to know our country so well — tremendous potential.")
-102.08910035889473
>>> # ---------------- Trimming the freq count -----------------------------
... 
>>> model.scorer.prod_freq = {prod: freq for prod, freq in model.scorer.prod_freq.items() if freq >= 3}
>>> len(model.scorer.prod_freq)
272907
>>> model.score("I am from a featured article.")
-43.74413929993044
>>> model.score("I am from a butt article.")
-46.12711286071866
>>> model.score("Biology is a natural science concerned with the study of life and living organisms, including their structure, function, growth, evolution, distribution, identification and taxonomy.")
-194.34473241301927
>>> model.score("Biology is a natural science concerned with the study of life and living organisms, including their structure, stupid butts, growth, evolution, distribution, identification and taxonomy.")
-211.3557184263918
>>> model.score("I’ve gotten to know our country so well — tremendous potential.")
-102.08910035889473

OK. So I expected minor changes to the scores, but seeing no change at all was surprising.

I had a minor breakthrough. I just want to take note of it quick because I've got to leave the computer soon.

So, if you take the log_proba and divide it by the number of productions in a sentence, you get a score that makes *way* more sense. So, now I'm extracting log_proba and productions for all of the sentences in the featured article set. Next time I sit down, I'll be plotting distributions of log_proba and log_proba/productions.
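
A sketch of why the division helps, on made-up numbers. The raw log_proba is a sum of one negative term per production, so longer sentences score lower no matter how well-formed they are; dividing by the production count gives an average per production. The function name per_production_score and the figures below are hypothetical, not taken from the model.

```python
def per_production_score(log_proba, n_productions):
    """Average log-probability per production in a sentence."""
    if n_productions <= 0:
        raise ValueError("a sentence needs at least one production")
    return log_proba / n_productions


# Made-up values: a short sentence and a long one.
short_raw, short_n = -40.0, 10
long_raw, long_n = -200.0, 60

short_norm = per_production_score(short_raw, short_n)  # -4.0
long_norm = per_production_score(long_raw, long_n)     # about -3.33
```

On the raw scores the long sentence looks five times worse; per production it actually comes out slightly ahead of the short one.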

Halfak renamed this task from "Implement sentences datasources & experiment with normalization." to "Implement sentences datasources". Nov 28 2016, 9:45 PM
Halfak moved this task from Parked to Review on the Machine-Learning-Team (Active Tasks) board.
Halfak removed a project: draftquality-modeling.

Sentence datasources in PR here: https://github.com/wiki-ai/revscoring/pull/291