Implement revision.sentences, revision.parent.sentences, and revision.diff.sentences.
revision.diff.sentences will contain two lists:
- old sentences
- new sentences
Normalization should make the sentences parseable by bllip and spaCy.
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Halfak | T148038 [Epic] Build draft quality model (spam, vandalism, attack, or OK)
Open | | None | T144636 [Epic] Implement PCFG features for editquality and draftquality
Resolved | | Halfak | T151819 Analyze differentiation of FA, Spam, Vandalism, and Attack models/sentences
Resolved | | Halfak | T148037 Generate PCFG sentence models
Resolved | | Halfak | T148034 Sentence bank for vandalism
Resolved | | Halfak | T148035 Sentence bank for personal attacks
Resolved | | Halfak | T148033 Sentence bank for Featured Articles
Resolved | | Halfak | T148032 Sentence bank for spam
Resolved | | Halfak | T148867 Implement sentences datasources
OK. So I've been looking at getting clean sentences out of Wikipedia articles. I get a lot of nonsense around tables. It would be nice to have a good way to just parse them out before processing sentences.
When bllip needs to parse a non-sentence, it takes a long time.
```
>>> start = time.time(); bllip_rrp.parse('{| class="infobox" style="width: 280px;" |- | {|')[0]; print(time.time() - start)
ScoredParse("(S1 (S (-LRB- -LCB-) (NP (NNP |)) (VP (VBZ class=) (NP (`` ``) (NP (NN infobox)) ('' '') (CC style=) (`` ``) (NP (NP (NN width)) (: :) (NP (NN 280px)) (: ;)) ('' '') (NP (NP (JJ |-) (NN |)) (-LRB- -LCB-) (NP (NNP |)))))))", parser_score=-366.92503034039254, reranker_score=-108.16146420553363)
4.158325672149658
```
(Note that the above took 4.16 seconds.) I couldn't find a good way to limit bllip so that it only considers parses that are somewhat likely, but we could put a timeout on the function call and give up if parsing takes longer than 0.1 seconds or so.
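A rough sketch of that timeout idea (the helper name and the 0.1 s cutoff are illustrative, and note the caveat in the docstring):

```
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def parse_with_timeout(parser, text, timeout=0.1):
    """Give up waiting on a parse after `timeout` seconds.

    Caveat: this only stops the caller from waiting; the worker thread keeps
    parsing in the background, so a stream of pathological inputs still burns CPU.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(parser.parse, text)
    try:
        return future.result(timeout=timeout)
    except TimeoutError:
        return None  # treat as "not a real sentence"
    finally:
        executor.shutdown(wait=False)

# e.g. parse_with_timeout(bllip_rrp, '{| class="infobox" ...') would return None
```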
With spaCy, the parse of non-sentences is fast.
```
>>> start = time.time(); print(spacy_norm.normalize_tree(parse('{| class="infobox" style="width: 280px;" |- | {|'))); print(time.time() - start)
(ROOT (nmod (XX '|-') (SP ' ')) (nmod (XX '|') (SP ' ')) (-LRB- '{') (NN '|'))
0.003621816635131836
```
But it also looks like the trees get really weird. I'll need to look into that too.
OK. I've managed to build a simple sub-parser for handling tables. Essentially, I parse an entire table into one giant "sentence". This helps when using mwparserfromhell's strip_code(), which will remove the whole table. Unfortunately, the parser gets confused by image links.
E.g.
```
>>> import mwparserfromhell as mwp
>>> mwp.parse("[[File:Foo|thumb|alt=A drawing of a fly from facing up|derp]]").strip_code()
'thumb|alt=A drawing of a fly from facing up|derp'
```
We might want to do our own normalizations to handle links like this.
Looks like there's no easy solution here. See https://github.com/earwig/mwparserfromhell/issues/169
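As a stopgap, here's the kind of normalization I have in mind: strip File:/Image: links with a regex before handing the text to strip_code(). This is just a sketch (the regex and function name are illustrative, and it doesn't handle nested brackets inside captions), not the actual revscoring code.

```
import re
import mwparserfromhell as mwp

# Hypothetical pre-normalization: drop [[File:...]] / [[Image:...]] links entirely
# so strip_code() never sees them. Naive: stops at the first ']' so it misses
# captions that contain nested [[links]].
FILE_LINK_RE = re.compile(r"\[\[(?:File|Image):[^\]]*\]\]", re.IGNORECASE)

def strip_wikitext(text):
    text = FILE_LINK_RE.sub("", text)
    return mwp.parse(text).strip_code()

# >>> strip_wikitext("A fly.[[File:Foo|thumb|alt=A drawing of a fly|derp]] It buzzes.")
# 'A fly. It buzzes.'
```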
OK. I've implemented some better sentence parsing in deltas. See https://github.com/halfak/deltas/commit/85fe0b2bce4a6fbe7b945a1776ce9798b0951132
I think that, next, we'll want to run this on one of the proposed PCFG sources to try to get some sentences extracted.
WIP for sentence extraction in revscoring: https://github.com/wiki-ai/revscoring/pull/291
I have a working scorer model trained on Featured Article sentences.
```
$ ./utility score models/enwiki.fa_sentence.model 'Don Henley lends his vocals shadowing lead singer Steven Tyler in parts of this song.'
2016-11-10 09:35:06,647 INFO:wikigrammar.utilities.score -- Loading model...
-136.25691142784373
$ ./utility score models/enwiki.fa_sentence.model 'Pentare is a world-class provider of pool supplies, filters, and pump equipment.'
2016-11-10 09:37:35,289 INFO:wikigrammar.utilities.score -- Loading model...
-123.57435948374528
```
I'll need to do some data analysis to see where we are and are not successfully differentiating good from bad sentences. In the example above, a sentence from a featured article scores worse than a very POV/advertisey sentence.
I did some work yesterday to check on memory usage and the story isn't very good. Here are some notes I took in IRC.
```
[16:28:47] <halfak> OK... So the PCGF pattern seems to be struggling with a major constraint
[16:29:06] <halfak> Loading spacy's parser into memory requires 1.7 GB
[16:29:17] <halfak> That would double our memory footprint
[16:44:55] <halfak> And it looks like my FA article PCFG is taking up an additional 1.5GB of memory.
[16:45:10] <halfak> I wonder if we could trim that down by collapsing all proper nouns.
[16:58:37] <halfak> OK just wrote some code for that and I'm building a new model. I think we'll want to do something for numbers too, but that'll need to wait for tomorrow.
[16:58:38] <halfak> o/
```
Well. I built the new model that collapses all proper nouns and that brought us down from 3.2GB RES to 3.0GB RES. I'm looking at numbers next.
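For reference, the collapsing idea amounts to replacing every proper-noun (and, next, number) terminal with its POS tag before counting productions. A minimal sketch, assuming sentences are available as (tag, token) pairs; this is not the actual revscoring/wikigrammar code:

```
from collections import Counter

COLLAPSE_TAGS = {"NNP", "NNPS", "CD"}  # proper nouns and cardinal numbers

def collapse(tag, token):
    # 'Obama', 'Paris', '1984', ... all map onto one terminal per tag, so the
    # model stores a single (NNP 'NNP') production instead of millions.
    return tag if tag in COLLAPSE_TAGS else token

def terminal_production_counts(tagged_sentences):
    freqs = Counter()
    for sentence in tagged_sentences:
        for tag, token in sentence:
            freqs[(tag, collapse(tag, token))] += 1
    return freqs
```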
It's weird that the pickled file sizes are so small, but the in-memory space is so large.
```
$ ls -alh
total 344M
drwxrwxr-x 2 halfak halfak 4.0K Nov 21 16:58 .
drwxrwxr-x 6 halfak halfak 4.0K Nov  7 10:26 ..
-rw-rw-r-- 1 halfak halfak 181M Nov  7 11:42 enwiki.fa_sentence.model
-rw-rw-r-- 1 halfak halfak 164M Nov 21 17:53 enwiki.fa_sentence.normalized.model
```
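If we want to pin down where that overhead comes from, something like pympler's asizeof can measure the deep in-memory size of the production-frequency table (a sketch; it assumes the model is loaded as in the REPL session further down):

```
from pympler import asizeof

# Deep (recursive) in-memory size of the production-frequency dict, in MB.
# CPython tuples/strings/ints carry per-object overhead that the pickle
# doesn't pay, which is likely why a ~180MB file loads into >1GB of RES.
print(asizeof.asizeof(model.scorer.prod_freq) / 1024 ** 2, "MB")
```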
So, it looks like collapsing numbers didn't really matter that much. So, I decided to dig into the production frequencies.
```
>>> sum(1 for k, v in prod_freqs if v > 0)
1563150
>>> sum(1 for k, v in prod_freqs if v > 1)
424316
>>> sum(1 for k, v in prod_freqs if v > 2)
272907
>>> sum(1 for k, v in prod_freqs if v > 3)
209849
>>> sum(1 for k, v in prod_freqs if v > 4)
174048
>>> sum(1 for k, v in prod_freqs if v > 5)
150498
>>> sum(1 for k, v in prod_freqs if v > 6)
133783
```
It looks like we can cut the number of productions by nearly a factor of six (from ~1.56M to ~273k) if we only keep productions that occur 3 or more times.
Let's look at the most common productions.
>>> print("\n".join(str(prod) + "\t" + str(freq) for prod, freq in prod_freqs[-20:])) (`` '"') 176059 (IN 'for') 181485 (POS "'s") 205679 (DT 'The') 216007 (pobj DT NN) 229549 (prep IN NNP) 240849 (IN 'to') 265867 (HYPH '-') 275290 (TO 'to') 282689 (VBD 'was') 283549 (DT 'a') 449022 (IN 'in') 539483 (CC 'and') 705146 (IN 'of') 837665 (CD 'CD') 978336 (. '.') 1099003 (, ',') 1517796 (DT 'the') 1550613 (prep IN pobj) 2149169 (NNP 'NNP') 3755058
OK. Well, not too much of a surprise there. Note that we have a huge 3.7M observations of our collapsed proper nouns (NNP) and about 1M observations of our collapsed quantities (CD).
>>> model.score("I am from a featured article.") -43.74413929993044 >>> model.score("I am from a butt article.") -46.12711286071866 >>> model.score("Biology is a natural science concerned with the study of life and living organisms, including their structure, function, growth, evolution, distribution, identification and taxonomy.") -194.34473241301927 >>> model.score("Biology is a natural science concerned with the study of life and living organisms, including their structure, stupid butts, growth, evolution, distribution, identification and taxonomy.") -211.3557184263918 >>> model.score("I’ve gotten to know our country so well — tremendous potential.") -102.08910035889473 >>> # ---------------- Trimming the freq count ----------------------------- ... >>> model.scorer.prod_freq = {prod: freq for prod, freq in model.scorer.prod_freq.items() if freq >= 3} >>> len(model.scorer.prod_freq) 272907 >>> model.score("I am from a featured article.") -43.74413929993044 >>> model.score("I am from a butt article.") -46.12711286071866 >>> model.score("Biology is a natural science concerned with the study of life and living organisms, including their structure, function, growth, evolution, distribution, identification and taxonomy.") -194.34473241301927 >>> model.score("Biology is a natural science concerned with the study of life and living organisms, including their structure, stupid butts, growth, evolution, distribution, identification and taxonomy.") -211.3557184263918 >>> model.score("I’ve gotten to know our country so well — tremendous potential.") -102.08910035889473
OK. So I expected minor changes to the scores, but seeing no change at all was surprising.
I had a minor breakthrough. I just want to take note of it quick because I've got to leave the computer soon.
So, if you take the log_proba and divide it by the number of productions in a sentence, you get a score that makes *way* more sense. So, now I'm extracting log_proba and productions for all of the sentences in the featured article set. Next time I sit down, I'll be plotting distributions of log_proba and log_proba/productions.
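To make that concrete, the normalization and the planned plots amount to roughly this. It's only a sketch: it assumes (log_proba, productions) pairs have already been extracted for each sentence, and the names are illustrative rather than the actual wikigrammar code.

```
import matplotlib.pyplot as plt

def per_production_scores(pairs):
    # Mean log-probability per production, so long sentences aren't
    # penalized just for having more productions.
    return [log_proba / n_productions for log_proba, n_productions in pairs]

def plot_score_distributions(pairs):
    raw = [log_proba for log_proba, _ in pairs]
    normalized = per_production_scores(pairs)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.hist(raw, bins=50)
    ax1.set_title("log_proba")
    ax2.hist(normalized, bins=50)
    ax2.set_title("log_proba / productions")
    plt.show()
```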