TextCat Improvement Deployment
Closed, ResolvedPublic


Once all the subtasks are complete:

  • Commit all updated Perl and PHP versions of TextCat (done)
  • If any “confidence” scores are worth reporting, update the relevant schema and DB tables to report and store them (may need help from Mikhail and Erik) (none found so far; future work delegated to T149323 and T155670)
  • Update PHP TextCat in MediaWiki (may need help from Stas) (done)
  • Make necessary changes to the TextCat config in Cirrus to optimize existing languages (done)

Event Timeline

Restricted Application added a subscriber: Aklapper.

Parameter settings are listed and explained in the Final Summary & Recommendations for TextCat Improvements.

  • Models: We should use 9K n-gram models in both LM-query/ and LM/.
  • Maximum Returned Languages and Results Ratio: The maximum returned languages allowed should be 1, and the results ratio should be 1.06.
  • Minimum Input Length and Max Proportion of Max Score: The minimum input length allowed should be 3, and the max proportion of max score should be 0.85.
  • Languages (per wiki): Below are the optimized languages to consider for each wiki; the two most common languages (the first two in each list) should be boosted by 14% (i.e., 0.14).
    • dewiki: German, English, Latin, Italian, Spanish, French, Chinese, Polish, Vietnamese, Greek, Russian, Arabic, Hindi, Thai, Korean, Japanese
    • enwiki: English, Chinese, Spanish, Arabic, German, Persian, French, Indonesian, Polish, Russian, Vietnamese, Italian, Japanese, Portuguese, Czech, Bengali, Croatian, Hebrew, Norwegian, Afrikaans, Icelandic, Tagalog, Thai, Hungarian, Irish, Korean, Ukrainian, Urdu, Hindi, Greek, Telugu, Georgian
    • eswiki: Spanish, English, Latin, Russian, Chinese, Portuguese, Italian, French, German, Arabic, Japanese
    • frwiki: French, English, Arabic, Portuguese, German, Spanish, Russian, Chinese, Dutch, Polish, Italian, Thai, Swedish, Latin, Icelandic, Armenian, Hungarian, Breton, Greek, Hebrew, Korean
    • itwiki: Italian, English, German, Russian, Arabic, Chinese, Polish, Greek, Korean
    • jawiki: Japanese, English, Chinese, Korean, German, Arabic, Hebrew
    • nlwiki: Dutch, English, French, German, Spanish, Latin, Chinese, Polish, Arabic, Vietnamese, Portuguese, Burmese, Korean, Croatian, Danish, Czech, Greek, Hebrew, Japanese, Russian
    • ptwiki: Portuguese, English, Tagalog, Russian, French, Hebrew, Arabic, Chinese, Korean, Greek
    • ruwiki: Russian, English, Ukrainian, German, Georgian, Armenian, Latvian, Japanese, Finnish, Spanish, Arabic, Hebrew, Chinese
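
As a rough illustration of how the settings above might interact, here is a minimal Python sketch. It is not the TextCat or CirrusSearch code: the function `select_languages`, its argument names, and the order of the checks are all hypothetical. It assumes that per-language scores are distances (lower is better), that a boost of 0.14 multiplies a boosted language's score by 0.86, that input length is counted in characters, and that the caller supplies the worst possible score as `max_score`.

```python
from typing import Dict, List


def select_languages(
    query: str,
    scores: Dict[str, float],      # hypothetical: language -> distance, lower is better
    max_score: float,              # hypothetical: worst possible distance for this query
    boosted: List[str],            # the two most common languages for the wiki
    boost: float = 0.14,
    max_returned: int = 1,
    results_ratio: float = 1.06,
    min_input_length: int = 3,
    max_proportion: float = 0.85,
) -> List[str]:
    """Apply the thresholds to precomputed scores; return [] when unsure."""
    # Minimum input length: queries that are too short are not classified.
    if len(query) < min_input_length or not scores:
        return []

    # Boost: assumed here to shrink the distance of boosted languages by 14%.
    adjusted = {
        lang: score * (1 - boost) if lang in boosted else score
        for lang, score in scores.items()
    }

    # Max proportion of max score: reject if even the best match is poor.
    best = min(adjusted.values())
    if best > max_proportion * max_score:
        return []

    # Results ratio: keep every language within 6% of the best score.
    candidates = sorted(
        (lang for lang, score in adjusted.items() if score <= best * results_ratio),
        key=lambda lang: adjusted[lang],
    )

    # Maximum returned languages: too many close candidates means "ambiguous".
    if len(candidates) > max_returned:
        return []
    return candidates
```

With made-up scores for an enwiki-style setup where English and Chinese are boosted, a French score only slightly better than English is treated as ambiguous, while a clearly better French score is returned:

```python
boosted = ["en", "zh"]
print(select_languages("bonjour", {"en": 1000, "fr": 900}, 2000, boosted))
print(select_languages("bonjour", {"en": 1000, "fr": 700}, 2000, boosted))
```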

Perl and PHP versions of TextCat are up-to-date.

Unfortunately, no useful confidence measures were found, though tasks still exist to investigate further: T155670 & T149323.

Change 334728 had a related patch set uploaded (by Tjones):
Deploy TextCat Improvements

Change 334729 had a related patch set uploaded (by Tjones):
Deploy TextCat Improvements

Change 335043 had a related patch set uploaded (by DCausse):
Bump textcat version to 1.2.0

Change 335043 merged by jenkins-bot:
Bump textcat version to 1.2.0

Change 334728 merged by jenkins-bot:
Deploy TextCat Improvements

TJones updated the task description.

Change 334729 merged by jenkins-bot:
Deploy TextCat Improvements

Mentioned in SAL (#wikimedia-operations) [2017-02-08T00:16:27Z] <thcipriani@tin> Synchronized wmf-config: SWAT: [[gerrit:334729|Deploy TextCat Improvements]] T149324 T142140 (duration: 00m 45s)

It's live and working. And now, among other things, French detection is enabled on enwiki! The new config errs heavily in favor of English over French on enwiki, so French queries that are very short or not especially distinctive will not be sent to frwiki, but obviously French ones will.

Deskana subscribed.

Nice! Glad to see this go out.