
Bucketing & Bonuses for TextCat
Closed, Resolved · Public

Description

Look at ways to put languages into buckets for scoring bonuses or penalties.

The three main mathematical approaches are scaling against the lowest score, scaling against the maximum possible score, and applying a per-bucket multiplier.

The main approaches to bucketing are boosting the host language (a priori the most likely) and penalizing poorly performing languages (e.g., infrequent languages that are often confused with the host language).
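
The three scoring adjustments can be sketched roughly as follows. This is only an illustration, not the TextCat implementation: it assumes scores are costs where lower is better, and every name here (`boost_vs_lowest`, `boost_vs_max`, `apply_bucket_multipliers`, `max_score`, the `boost` fraction) is hypothetical. The task does not specify the exact formulas, so each function is one plausible reading of the approach it is named after.

```python
from typing import Dict, Set

# Language code -> score. ASSUMPTION: scores are costs, so lower is better.
Scores = Dict[str, float]


def boost_vs_lowest(scores: Scores, boosted: Set[str], boost: float) -> Scores:
    """Reduce boosted languages' scores by a fraction of the lowest score seen."""
    step = boost * min(scores.values())
    return {lang: (s - step if lang in boosted else s) for lang, s in scores.items()}


def boost_vs_max(scores: Scores, boosted: Set[str], boost: float,
                 max_score: float) -> Scores:
    """Reduce boosted languages' scores by a fraction of the maximum possible score.

    max_score is a hypothetical upper bound supplied by the caller.
    """
    step = boost * max_score
    return {lang: (s - step if lang in boosted else s) for lang, s in scores.items()}


def apply_bucket_multipliers(scores: Scores, multipliers: Dict[str, float]) -> Scores:
    """Multiply each score by its bucket's factor.

    A factor below 1 acts as a bonus, above 1 as a penalty; languages
    without an entry are left unchanged.
    """
    return {lang: s * multipliers.get(lang, 1.0) for lang, s in scores.items()}


def best_language(scores: Scores) -> str:
    """Return the language with the lowest (best) score."""
    return min(scores, key=scores.get)


if __name__ == "__main__":
    raw = {"en": 1000.0, "fr": 1050.0, "de": 1200.0}  # made-up example scores
    print(best_language(raw))
    print(best_language(boost_vs_lowest(raw, {"fr"}, 0.1)))
    print(best_language(boost_vs_max(raw, {"fr"}, 0.05, max_score=2000.0)))
    print(best_language(apply_bucket_multipliers(raw, {"fr": 0.9, "de": 1.1})))
```

In this sketch a host-language boost and a penalty for a frequently confused language are both expressed through the same mechanisms: a boost lowers the cost and a penalty raises it. The multiplier form handles both in a single table.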

Also implement for Perl and PHP versions of TextCat.

Event Timeline

TJones created this task. Oct 27 2016, 3:56 PM
Restricted Application added a project: Discovery-Search. Oct 27 2016, 3:56 PM
Restricted Application added a subscriber: Aklapper.
debt moved this task from needs triage to Up Next on the Discovery-Search board. Oct 27 2016, 8:37 PM

Change 331399 had a related patch set uploaded (by Tjones):
Add ability to boost liklihood of specified languages

https://gerrit.wikimedia.org/r/331399

Write up and analysis is done.

  • Most of the languages that were turned off in the early days for producing too many false positives can now be turned back on, so we should!
  • Enabling the basic bonus feature, and enabling it for the top 2 languages for each wiki, further improves performance. We should do that!
  • We still need to see how this feature interacts with the other potential features under consideration, but there's only one left: the unknown n-gram penalty (T151230).

Change 331399 merged by jenkins-bot:
Add ability to boost liklihood of specified languages

https://gerrit.wikimedia.org/r/331399

Deskana closed this task as Resolved. Feb 10 2017, 5:27 PM