
TextCat Improvement Deployment
Closed, Resolved (Public)


Once all the subtasks are complete:

  • Commit all updated Perl and PHP versions of TextCat (done)
  • If any “confidence” scores are worth reporting, update relevant schema and DB tables to report and store scores (may need help from Mikhail and Erik) (none found so far; future work delegated to T149323 and T155670)
  • Update PHP TextCat in MediaWiki (may need help from Stas) (done)
  • Make necessary changes to the TextCat config in Cirrus to optimize existing languages (done)


Related Gerrit Patches:
operations/mediawiki-config : master : Deploy TextCat Improvements
mediawiki/extensions/CirrusSearch : master : Deploy TextCat Improvements
mediawiki/vendor : master : Bump textcat version to 1.2.0

Event Timeline

TJones created this task. Oct 27 2016, 3:58 PM
Restricted Application added a project: Discovery-Search. Oct 27 2016, 3:58 PM
Restricted Application added a subscriber: Aklapper.
debt moved this task from needs triage to Up Next on the Discovery-Search board. Oct 27 2016, 8:37 PM

Parameter settings are listed and explained in the Final Summary & Recommendations for TextCat Improvements.

  • Models: We should use 9K n-gram models in both LM-query/ and LM/.
  • Maximum Returned Languages and Results Ratio: The maximum returned languages allowed should be 1, and the results ratio should be 1.06.
  • Minimum Input Length and Max Proportion of Max Score: The minimum input length allowed should be 3, and the max proportion of max score should be 0.85.
  • Languages (per wiki): Below are the optimized languages to consider for each wiki; the two most common languages (the first two in each list) should be boosted by 14% (i.e., 0.14).
    • dewiki: German, English, Latin, Italian, Spanish, French, Chinese, Polish, Vietnamese, Greek, Russian, Arabic, Hindi, Thai, Korean, Japanese
    • enwiki: English, Chinese, Spanish, Arabic, German, Persian, French, Indonesian, Polish, Russian, Vietnamese, Italian, Japanese, Portuguese, Czech, Bengali, Croatian, Hebrew, Norwegian, Afrikaans, Icelandic, Tagalog, Thai, Hungarian, Irish, Korean, Ukrainian, Urdu, Hindi, Greek, Telugu, Georgian
    • eswiki: Spanish, English, Latin, Russian, Chinese, Portuguese, Italian, French, German, Arabic, Japanese
    • frwiki: French, English, Arabic, Portuguese, German, Spanish, Russian, Chinese, Dutch, Polish, Italian, Thai, Swedish, Latin, Icelandic, Armenian, Hungarian, Breton, Greek, Hebrew, Korean
    • itwiki: Italian, English, German, Russian, Arabic, Chinese, Polish, Greek, Korean
    • jawiki: Japanese, English, Chinese, Korean, German, Arabic, Hebrew
    • nlwiki: Dutch, English, French, German, Spanish, Latin, Chinese, Polish, Arabic, Vietnamese, Portuguese, Burmese, Korean, Croatian, Danish, Czech, Greek, Hebrew, Japanese, Russian
    • ptwiki: Portuguese, English, Tagalog, Russian, French, Hebrew, Arabic, Chinese, Korean, Greek
    • ruwiki: Russian, English, Ukrainian, German, Georgian, Armenian, Latvian, Japanese, Finnish, Spanish, Arabic, Hebrew, Chinese
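
To make the interaction between these settings concrete, here is a minimal Python sketch of how they might combine when choosing a language. It is an illustration under stated assumptions, not the actual Perl or PHP TextCat code: the names `TextCatSettings` and `pick_language` are hypothetical, scores are assumed to be distances where lower is better, the boost is assumed to be a proportional reduction of that distance, and the order of the checks is a guess. The language list is the itwiki list above; the score values are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TextCatSettings:
    """Hypothetical container for the recommended settings."""
    languages: List[str] = field(default_factory=list)
    max_returned_languages: int = 1
    results_ratio: float = 1.06
    min_input_length: int = 3
    max_proportion: float = 0.85   # max proportion of max score
    boost: float = 0.14            # applied to the first two languages

    @property
    def boosted(self) -> List[str]:
        # The two most common languages are the first two in each list.
        return self.languages[:2]


def pick_language(query: str, scores: Dict[str, float],
                  max_score: float, cfg: TextCatSettings) -> Optional[str]:
    """Return one language, or None if the result is rejected.

    scores maps language -> distance (assumption: lower is better);
    max_score is the worst possible distance for this query.
    """
    if len(query) < cfg.min_input_length:
        return None  # input too short to classify
    # Assumption: a boost shrinks the distance of boosted languages.
    adjusted = {
        lang: s * (1 - cfg.boost) if lang in cfg.boosted else s
        for lang, s in scores.items() if lang in cfg.languages
    }
    if not adjusted:
        return None
    best = min(adjusted.values())
    if best > cfg.max_proportion * max_score:
        return None  # even the best match is too poor
    # Every language within results_ratio of the best is a candidate.
    candidates = [l for l, s in adjusted.items()
                  if s <= best * cfg.results_ratio]
    if len(candidates) > cfg.max_returned_languages:
        return None  # too ambiguous
    return min(candidates, key=adjusted.get)


itwiki = TextCatSettings(languages=[
    "Italian", "English", "German", "Russian", "Arabic",
    "Chinese", "Polish", "Greek", "Korean",
])

# Made-up scores: German is slightly closer before the boost,
# but the 14% boost on Italian tips the result.
print(pick_language("ciao mondo", {"Italian": 1000.0, "German": 950.0},
                    2000.0, itwiki))  # prints "Italian"
```

With a maximum of one returned language, any query whose top two adjusted scores fall within the 1.06 ratio of each other yields no result in this sketch, which is one plausible reading of why the boost and the ratio together favor the wiki's main languages.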

Perl and PHP versions of TextCat are up-to-date.

Unfortunately, no useful confidence measures were found, though tasks still exist to investigate further: T155670 & T149323.

Change 334728 had a related patch set uploaded (by Tjones):
Deploy TextCat Improvements

Change 334729 had a related patch set uploaded (by Tjones):
Deploy TextCat Improvements

Change 335043 had a related patch set uploaded (by DCausse):
Bump textcat version to 1.2.0

Change 335043 merged by jenkins-bot:
Bump textcat version to 1.2.0

Change 334728 merged by jenkins-bot:
Deploy TextCat Improvements

TJones updated the task description. Jan 31 2017, 6:07 PM
TJones updated the task description.

Change 334729 merged by jenkins-bot:
Deploy TextCat Improvements

Mentioned in SAL (#wikimedia-operations) [2017-02-08T00:16:27Z] <thcipriani@tin> Synchronized wmf-config: SWAT: [[gerrit:334729|Deploy TextCat Improvements]] T149324 T142140 (duration: 00m 45s)

TJones updated the task description. Feb 8 2017, 4:21 PM
TJones added a comment. Feb 8 2017, 4:25 PM

It's live and working. And now, among other things, French detection is enabled on enwiki! The new config errs heavily in favor of English over French on enwiki, so very short and not overly distinctive French queries will not be sent to frwiki, but really obvious stuff will.

Deskana closed this task as Resolved. Feb 10 2017, 5:22 PM
Deskana added a subscriber: Deskana.

Nice! Glad to see this go out.