Consider Additional Unknown n-gram Penalty
Closed, Resolved · Public

Description

While looking at maximum returned languages and results ratio (T149321) I accidentally found that penalizing unknown n-grams by more than the current model size could lead to better results. Investigate this more thoroughly over a wider range of options.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Full details on MediaWiki.

Summary:

  • A per-wiki unknown n-gram penalty makes a small difference in overall F0.5 score for most corpora.
  • The unknown n-gram penalty can't be optimized across all the corpora; they each want their own value.
    • The "proportional penalty" isn't terribly useful, either.
  • It's not worth pursuing, but it's worth keeping in our bag of tricks should it be useful in the future.
  • It's only a few lines of code, so I'm not going to clutter up TextCat with more stuff we aren't going to use.
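For readers who want to see what such a penalty looks like, here is a minimal sketch of rank-based ("out-of-place") n-gram scoring with an optional extra unknown n-gram penalty. It is not TextCat's actual code: the function names `ngram_ranks` and `out_of_place` and the parameter `extra_penalty` are hypothetical, the tokenization is simplified to raw character n-grams, and it assumes the baseline cost of an unknown n-gram is the model size. Lower scores mean a better match.

```python
from collections import Counter


def ngram_ranks(text, max_n=3, max_ngrams=None):
    """Map each character n-gram (n = 1..max_n) to its frequency rank.

    Rank 0 is the most frequent n-gram. If max_ngrams is given, only the
    top max_ngrams entries are kept (this plays the role of the model size).
    """
    counts = Counter(
        text[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(text) - n + 1)
    )
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return {gram: rank for rank, gram in enumerate(ordered[:max_ngrams])}


def out_of_place(sample_ranks, model_ranks, model_size, extra_penalty=0):
    """Sum of rank differences between a sample and a language model.

    An n-gram missing from the model costs model_size + extra_penalty.
    extra_penalty=0 is the assumed baseline; a positive value is the
    hypothetical additional unknown n-gram penalty.
    """
    unknown_cost = model_size + extra_penalty
    total = 0
    for gram, rank in sample_ranks.items():
        if gram in model_ranks:
            total += abs(rank - model_ranks[gram])
        else:
            total += unknown_cost
    return total
```

To compare languages, score the same sample against each language's model with the same `extra_penalty` and pick the lowest total. Raising the penalty hurts models that are missing many of the sample's n-grams more than models that merely rank them differently.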
TJones renamed this task from Add Additional Unknown n-gram Penalty to Consider Additional Unknown n-gram Penalty. · Jan 18 2017, 9:20 PM

@TJones, thanks for the explanation you gave me yesterday.
I consider this task done. Should a new one be created to actually implement this idea in TextCat?

@TJones, thanks for the explanation you gave me yesterday.

Very welcome! Thanks very much for looking over it.

I consider this task done. Should a new one be created to actually implement this idea in TextCat?

Because it didn't have much useful effect (so we aren't going to use it) and the code is fairly trivial, I decided not to implement it. If anyone thinks it would be nice to have anyway, I can implement it. I could also test it against the old production baseline just to see how useful it is without all the other improvements, which have pushed accuracy up so high that it's hard to make much more progress in many cases.

If I do implement it, I can do it on this task, as I've done with the others. (This is the first one that didn't show enough improvement to implement and use.)

debt subscribed.

We decided not to implement this because it didn't have enough useful effect; see @TJones' note above. Closing as resolved since work was done on this ticket.