Implement a basic scoring strategy for PCFGs
Closed, Resolved, Public

Description

This task is done when there's a Python library that can score a sentence by its likelihood of appearing in a corpus.

Event Timeline

See https://github.com/halfak/kasami. I produced this largely by reviewing and simplifying the code in https://github.com/aetilley/pcfg. Most of my notes are in T144636.

I'd like to see if @aetilley has time to review the score() method here: https://github.com/halfak/kasami/blob/master/kasami/tree_scorer.py#L29

Here's a copy-paste of the relevant lines of code:

probas = [self.prod_freq.get(prod, 0.5) /
          self.source_freq.get(prod.source, 1)
          for prod in tree]
return sum(log(proba) for proba in probas)

Essentially, each entry in probas is the frequency of the production divided by the frequency of its source. If the production has not been seen before, it is given a frequency of 0.5, so that we avoid zero probabilities (and therefore log(0)). Similarly, if the source has not been seen before, it is given a frequency of 1, which avoids dividing by zero. The sum of log(proba) is returned instead of the product of the probabilities, because multiplying many small probabilities quickly runs into floating-point precision issues. You can convert back to a raw likelihood by using exp().
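To make the scoring rule above concrete, here is a minimal, self-contained sketch. It is not kasami's actual API: the `Production` tuple, the `from_productions` helper, and the toy training data are all hypothetical, and a "tree" is assumed to be simply an iterable of productions. Only the body of `score()` follows the snippet quoted above.

```python
from collections import Counter, namedtuple
from math import exp, log

# Hypothetical stand-in for a grammar production: source -> produces.
Production = namedtuple("Production", ["source", "produces"])


class TreeScorer:
    """Illustrative scorer; the names mirror the quoted snippet only."""

    def __init__(self, prod_freq, source_freq):
        self.prod_freq = prod_freq      # production -> count
        self.source_freq = source_freq  # source symbol -> count

    @classmethod
    def from_productions(cls, productions):
        # Assumed training step: count productions and their sources.
        productions = list(productions)
        prod_freq = Counter(productions)
        source_freq = Counter(p.source for p in productions)
        return cls(prod_freq, source_freq)

    def score(self, tree):
        # Unseen productions count as 0.5, unseen sources as 1.
        probas = [self.prod_freq.get(prod, 0.5) /
                  self.source_freq.get(prod.source, 1)
                  for prod in tree]
        return sum(log(proba) for proba in probas)


# Toy "corpus": three productions from S, one from NP.
training = [
    Production("S", ("NP", "VP")),
    Production("S", ("NP", "VP")),
    Production("S", ("VP",)),
    Production("NP", ("the", "cat")),
]
scorer = TreeScorer.from_productions(training)

seen = [Production("S", ("NP", "VP")), Production("NP", ("the", "cat"))]
unseen = [Production("S", ("NP", "VP")), Production("X", ("y",))]

seen_score = scorer.score(seen)      # log(2/3) + log(1/1)
unseen_score = scorer.score(unseen)  # log(2/3) + log(0.5/1)
```

In this toy setup the tree containing an unseen production scores lower than the fully seen one, and `exp(seen_score)` recovers the raw likelihood of 2/3.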

>>> 1 * 10 ** -19 + 1
1.0
>>> 1 * 10 ** -18 + 1
1.0
>>> 1 * 10 ** -17 + 1
1.0
>>> 1 * 10 ** -16 + 1
1.0
>>> 1 * 10 ** -15 + 1
1.000000000000001
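
The session above shows small values being absorbed in floating-point addition. A related problem affects products of probabilities: with enough small factors, the product underflows to zero, while the sum of logs stays representable. The following is a rough sketch with made-up probabilities (100 productions at 1e-5 each), not values taken from any real corpus.

```python
from math import exp, log

# Hypothetical per-production probabilities for a long sentence.
probas = [1e-5] * 100

# Multiplying raw probabilities: the true value, 1e-500, is far below
# the smallest positive double, so the product collapses to 0.0.
product = 1.0
for p in probas:
    product *= p

# Summing logs keeps the score in a comfortable range (about -1151.3).
log_score = sum(log(p) for p in probas)

# Converting back with exp() underflows as well, so scores this small
# are best compared in log space.
raw = exp(log_score)
```

Here `product` and `raw` are both exactly 0.0, while `log_score` still distinguishes this sentence from one that is more or less likely.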