This task is done when there's a Python library that implements something that can score a sentence by its likelihood of appearing in a corpus.
|Open||None||T144636 [Epic] Implement PCFG features for editquality and draftquality|
|Resolved||Halfak||T146335 Implement a basic scoring strategy for PCFGs|
I'd like to see if @aetilley has time to review the score() method here: https://github.com/halfak/kasami/blob/master/kasami/tree_scorer.py#L29
Here's a copy-paste of the relevant lines of code:
    probas = [self.prod_freq.get(prod, 0.5) / self.source_freq.get(prod.source, 1)
              for prod in tree]
    return sum(log(proba) for proba in probas)
Essentially, each value in probas is the frequency of the production divided by the frequency of its source. If the production has not been seen before, it is given a frequency of 0.5 (so that we avoid zero probabilities). Similarly, if the source has not been seen before, it is given a frequency of 1. The sum of log(proba) is returned, rather than the product of the probabilities, so that we don't get into floating-point precision issues. You can always convert back to raw likelihood by using exp().
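As a minimal sketch of the same arithmetic outside the class: the toy counts and the (source, produced) tuple representation below are assumptions for illustration only, not kasami's actual data structures.

```python
from math import exp, log

# Hypothetical toy counts. A production is modelled here as a
# (source, produced_symbols) tuple; this is an assumption, not kasami's format.
prod_freq = {("S", ("NP", "VP")): 8, ("NP", ("the", "cat")): 2}
source_freq = {"S": 8, "NP": 10}


def score(productions):
    """Return the sum of log(production frequency / source frequency)."""
    total = 0.0
    for source, produced in productions:
        # Unseen productions count as 0.5; unseen sources count as 1.
        proba = prod_freq.get((source, produced), 0.5) / source_freq.get(source, 1)
        total += log(proba)
    return total


seen = [("S", ("NP", "VP")), ("NP", ("the", "cat"))]
unseen = [("S", ("NP", "VP")), ("NP", ("a", "dog"))]

print(score(seen), exp(score(seen)))      # likelihood is 8/8 * 2/10
print(score(unseen), exp(score(unseen)))  # likelihood is 8/8 * 0.5/10
```

The tree with an unseen production still gets a finite score, just a lower one than the tree built only from seen productions.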
>>> 1 * 10 ** -19 + 1
1.0
>>> 1 * 10 ** -18 + 1
1.0
>>> 1 * 10 ** -17 + 1
1.0
>>> 1 * 10 ** -16 + 1
1.0
>>> 1 * 10 ** -15 + 1
1.000000000000001
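A related illustration of why the log sum is safer than the raw product. This is only a sketch: the per-production probability of 1e-5 and the count of 100 productions are made-up values, not measurements from any corpus.

```python
from math import log

# Hypothetical values: 100 productions, each with probability 1e-5.
probas = [1e-5] * 100

# Multiplying the raw probabilities underflows: 1e-500 is far below the
# smallest positive double, so the product collapses to 0.0.
product = 1.0
for proba in probas:
    product *= proba

# Summing the logs keeps the result comfortably representable.
log_sum = sum(log(proba) for proba in probas)

print(product)
print(log_sum)
```

With the raw product, every sufficiently long sentence would end up with the same likelihood of 0.0, whereas the log sums remain comparable.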