TextCat Improvement Deployment
Closed, Resolved, Public

Description

Once all the subtasks are complete:

  • Commit all updated Perl and PHP versions of TextCat (done)
  • If any “confidence” scores are worth reporting, update the relevant schema and DB tables to report and store them (may need help from Mikhail and Erik) (none found so far; further investigation is tracked in T149323 and T155670)
  • Update PHP TextCat in MediaWiki (may need help from Stas) (done)
  • Make necessary changes to the TextCat config in Cirrus to optimize existing languages (done)

TJones created this task. Oct 27 2016, 3:58 PM
Restricted Application added a project: Discovery-Search. Oct 27 2016, 3:58 PM
Restricted Application added a subscriber: Aklapper.
debt moved this task from Needs triage to Up Next on the Discovery-Search board. Oct 27 2016, 8:37 PM

Parameter settings are listed and explained in the Final Summary & Recommendations for TextCat Improvements.

  • Models: We should use 9K n-gram models in both LM-query/ and LM/.
  • Maximum Returned Languages and Results Ratio: The maximum returned languages allowed should be 1, and the results ratio should be 1.06.
  • Minimum Input Length and Max Proportion of Max Score: The minimum input length allowed should be 3, and the max proportion of max score should be 0.85.
  • Languages (per wiki): Below are the optimized languages to consider for each wiki; the two most common languages (the first two in each list) should be boosted by 14% (i.e., 0.14).
    • dewiki: German, English, Latin, Italian, Spanish, French, Chinese, Polish, Vietnamese, Greek, Russian, Arabic, Hindi, Thai, Korean, Japanese
    • enwiki: English, Chinese, Spanish, Arabic, German, Persian, French, Indonesian, Polish, Russian, Vietnamese, Italian, Japanese, Portuguese, Czech, Bengali, Croatian, Hebrew, Norwegian, Afrikaans, Icelandic, Tagalog, Thai, Hungarian, Irish, Korean, Ukrainian, Urdu, Hindi, Greek, Telugu, Georgian
    • eswiki: Spanish, English, Latin, Russian, Chinese, Portuguese, Italian, French, German, Arabic, Japanese
    • frwiki: French, English, Arabic, Portuguese, German, Spanish, Russian, Chinese, Dutch, Polish, Italian, Thai, Swedish, Latin, Icelandic, Armenian, Hungarian, Breton, Greek, Hebrew, Korean
    • itwiki: Italian, English, German, Russian, Arabic, Chinese, Polish, Greek, Korean
    • jawiki: Japanese, English, Chinese, Korean, German, Arabic, Hebrew
    • nlwiki: Dutch, English, French, German, Spanish, Latin, Chinese, Polish, Arabic, Vietnamese, Portuguese, Burmese, Korean, Croatian, Danish, Czech, Greek, Hebrew, Japanese, Russian
    • ptwiki: Portuguese, English, Tagalog, Russian, French, Hebrew, Arabic, Chinese, Korean, Greek
    • ruwiki: Russian, English, Ukrainian, German, Georgian, Armenian, Latvian, Japanese, Finnish, Spanish, Arabic, Hebrew, Chinese
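
To make the interaction of these settings concrete, here is a minimal Python sketch of how they could combine when choosing a language. It is not the TextCat or CirrusSearch code: the class, field, and function names are hypothetical, the scores are made up, and it assumes that lower scores are better, that the boost shrinks a boosted language's score by 14%, and that a query with too many close candidates yields no result at all.

```python
from dataclasses import dataclass, field


@dataclass
class TextCatSettings:
    """Hypothetical container for the recommended values listed above."""
    max_returned_languages: int = 1
    results_ratio: float = 1.06
    min_input_length: int = 3
    max_proportion: float = 0.85
    boost: float = 0.14
    # Languages to consider, in list order; the first two get the boost.
    languages: list = field(default_factory=list)


def pick_languages(query, raw_scores, worst_possible, settings):
    """Return the accepted languages, or [] if the query is too short,
    matches nothing well enough, or is too ambiguous.

    Assumption: scores are distances, so lower means a better match.
    """
    # Simplification: count non-whitespace characters only.
    if len("".join(query.split())) < settings.min_input_length:
        return []

    boosted = set(settings.languages[:2])
    scores = {}
    for lang in settings.languages:
        if lang not in raw_scores:
            continue
        score = raw_scores[lang]
        if lang in boosted:
            score *= 1 - settings.boost  # a 14% boost lowers the distance
        scores[lang] = score
    if not scores:
        return []

    best = min(scores.values())
    # Reject if even the best match is too close to the worst possible score.
    if best > settings.max_proportion * worst_possible:
        return []

    # Keep every language whose score is within results_ratio of the best.
    close = sorted(
        (lang for lang, s in scores.items() if s <= best * settings.results_ratio),
        key=scores.get,
    )
    if len(close) > settings.max_returned_languages:
        return []  # too many near-ties: treat as ambiguous
    return close


# Illustrative subset of the enwiki list; all scores below are invented.
enwiki = TextCatSettings(languages=["English", "Chinese", "Spanish", "French"])

# French is slightly ahead before boosting, but the boost tips it to English.
print(pick_languages("some query", {"English": 500, "French": 480, "Spanish": 700}, 1000, enwiki))
# → ['English']

# French is far ahead, so it wins despite the English boost.
print(pick_languages("some query", {"English": 500, "French": 300, "Spanish": 700}, 1000, enwiki))
# → ['French']
```

With these invented numbers, the first query would have been ambiguous without the boost (English and French within 6% of each other), which is one way the boost can favor a wiki's most common languages on borderline input.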

Perl and PHP versions of TextCat are up-to-date.

Unfortunately, no useful confidence measures were found, though tasks still exist to investigate further: T155670 & T149323.

Change 334728 had a related patch set uploaded (by Tjones):
Deploy TextCat Improvements

https://gerrit.wikimedia.org/r/334728

Change 334729 had a related patch set uploaded (by Tjones):
Deploy TextCat Improvements

https://gerrit.wikimedia.org/r/334729

Change 335043 had a related patch set uploaded (by DCausse):
Bump textcat version to 1.2.0

https://gerrit.wikimedia.org/r/335043

Change 335043 merged by jenkins-bot:
Bump textcat version to 1.2.0

https://gerrit.wikimedia.org/r/335043

Change 334728 merged by jenkins-bot:
Deploy TextCat Improvements

https://gerrit.wikimedia.org/r/334728

TJones edited the task description. Jan 31 2017, 6:07 PM
TJones edited the task description.

Change 334729 merged by jenkins-bot:
Deploy TextCat Improvements

https://gerrit.wikimedia.org/r/334729

Mentioned in SAL (#wikimedia-operations) [2017-02-08T00:16:27Z] <thcipriani@tin> Synchronized wmf-config: SWAT: [[gerrit:334729|Deploy TextCat Improvements]] T149324 T142140 (duration: 00m 45s)

TJones edited the task description. Feb 8 2017, 4:21 PM
TJones added a comment. Feb 8 2017, 4:25 PM

It's live and working. And now, among other things, French detection is enabled on enwiki! The new config errs heavily in favor of English over French on enwiki, so very short and not overly distinctive French queries will not be sent to frwiki, but really obvious stuff will.

Deskana closed this task as "Resolved". Feb 10 2017, 5:22 PM
Deskana added a subscriber: Deskana.

Nice! Glad to see this go out.