
Implement a basic scoring strategy for PCFGs
Closed, Resolved · Public

Description

This task is done when there's a Python library that implements something that can score a sentence by its likelihood of appearing in a corpus.

Event Timeline

Halfak created this task. Sep 21 2016, 10:35 PM
Restricted Application added a subscriber: Aklapper. Sep 21 2016, 10:35 PM
Halfak updated the task description. Sep 21 2016, 10:48 PM

I produced https://github.com/halfak/kasami largely by reviewing and simplifying code found in https://github.com/aetilley/pcfg. Most of my notes are in T144636.

I'd like to see if @aetilley has time to review the score() method here: https://github.com/halfak/kasami/blob/master/kasami/tree_scorer.py#L29

Here's a copy-paste of the relevant lines of code:

probas = [self.prod_freq.get(prod, 0.5) /
          self.source_freq.get(prod.source, 1)
          for prod in tree]
return sum(log(proba) for proba in probas)

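Essentially, each entry in probas is the frequency of the production divided by the frequency of its source. If the production has not been seen before, it is given a frequency of 0.5 (so that we avoid zero probabilities, which log() cannot handle). Similarly, if the source has not been seen before, it is given a frequency of 1 (so that the division is still defined). The sum of log(proba) is returned, rather than the product of the probabilities, so that we don't run into floating-point precision issues. You can convert back to a raw likelihood by using exp(). The interpreter session at the end of this comment shows how quickly floats run out of precision.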
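To make the fallback values concrete, here is a minimal, self-contained sketch of the same calculation. It is not the kasami code: the Production tuple, the SketchScorer class, and the toy grammar are hypothetical stand-ins, and a "tree" is assumed to be nothing more than an iterable of productions. Only the body of score() is taken from the lines quoted above.

```python
from collections import Counter, namedtuple
from math import log

# Hypothetical stand-in for a grammar production such as NP -> the N.
Production = namedtuple("Production", ["source", "produces"])


class SketchScorer:
    """Illustrative scorer built from a list of observed productions."""

    def __init__(self, productions):
        productions = list(productions)
        # How often each full production was observed.
        self.prod_freq = Counter(productions)
        # How often each left-hand side (source) was observed.
        self.source_freq = Counter(p.source for p in productions)

    def score(self, tree):
        # Unseen production -> frequency 0.5; unseen source -> frequency 1.
        probas = [self.prod_freq.get(prod, 0.5) /
                  self.source_freq.get(prod.source, 1)
                  for prod in tree]
        # Sum of logs instead of a product of probabilities.
        return sum(log(proba) for proba in probas)


# Toy "corpus" of observed productions (made up for illustration).
seen = [
    Production("S", ("NP", "VP")),
    Production("NP", ("the", "N")),
    Production("NP", ("a", "N")),
    Production("VP", ("V", "NP")),
]
scorer = SketchScorer(seen)

known = [Production("S", ("NP", "VP")), Production("NP", ("the", "N"))]
unseen = [Production("S", ("NP", "VP")), Production("NP", ("an", "N"))]

known_score = scorer.score(known)    # log(1/1) + log(1/2)
unseen_score = scorer.score(unseen)  # log(1/1) + log(0.5/2)
print(known_score, unseen_score)
```

In this toy setup the tree containing the unseen production scores lower than the tree made only of seen productions, but it still gets a finite score instead of raising an error on log(0).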

>>> 1 * 10 ** -19 + 1
1.0
>>> 1 * 10 ** -18 + 1
1.0
>>> 1 * 10 ** -17 + 1
1.0
>>> 1 * 10 ** -16 + 1
1.0
>>> 1 * 10 ** -15 + 1
1.000000000000001
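
A related float problem is underflow when many small probabilities are multiplied together. The sketch below compares a raw product with a sum of logs. The per-production probability and the tree size are made-up numbers chosen only to trigger the effect; they do not come from kasami or any real corpus.

```python
from math import log

proba = 1e-5          # assumed probability of each production
n_productions = 100   # assumed number of productions in a tree

# Raw likelihood: multiply the probabilities together.
product = 1.0
for _ in range(n_productions):
    product *= proba

# Log-likelihood: add the logs together.
log_score = sum(log(proba) for _ in range(n_productions))

print(product)    # the product has underflowed to zero
print(log_score)  # the log score is still an ordinary finite float
```

Note that calling exp() on a score this low underflows in the same way, so very unlikely trees are best compared in log space.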

Halfak claimed this task. Sep 21 2016, 11:25 PM
Halfak moved this task from Active to Review on the Scoring-platform-team (Current) board.
Halfak closed this task as Resolved. Sep 28 2016, 9:40 PM