Recent work (subtasks of T140289) has shown that we can get better performance out of TextCat with larger language models. We currently have 5K models (configured to use only 3K n-grams) in production. 9K models seem to be the best option, but we can instead deploy the 10K models I've been using for testing and development.
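For context on what "model size" means here: TextCat builds a ranked profile of the most frequent character n-grams per language, and the model size (3K, 5K, 9K, 10K) is how many top-ranked n-grams the model keeps. Classification picks the language whose profile minimizes the "out-of-place" rank distance to the document's profile. The production implementations are in PHP and Perl; the following is only a minimal illustrative sketch of the technique in Python, with hypothetical function names, not the deployed code.

```python
from collections import Counter

def ngram_profile(text, max_n=5, model_size=3000):
    """Rank character n-grams (lengths 1..max_n) by frequency and keep
    the top model_size, mapping each n-gram to its rank."""
    counts = Counter()
    for word in text.split():
        padded = f"_{word}_"  # mark word boundaries, as TextCat does
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common()]
    return {gram: rank for rank, gram in enumerate(ranked[:model_size])}

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language
    model incur the maximum penalty (the model's size)."""
    penalty = len(lang_profile)
    return sum(abs(rank - lang_profile.get(gram, penalty))
               for gram, rank in doc_profile.items())

def detect(text, models, model_size=3000):
    """Return the language whose profile is closest to the text's."""
    doc = ngram_profile(text, model_size=model_size)
    return min(models, key=lambda lang: out_of_place(doc, models[lang]))
```

A larger model keeps more of the n-gram tail, so rarer but distinctive sequences still contribute to the distance instead of all collapsing into the same out-of-model penalty; that is the intuition behind bigger models performing better on short texts.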
Related Gerrit Patches:
| Repository / Branch | Subject |
| wikimedia/textcat : master | Update PHP TextCat Models to 10K n-grams |
Related Tasks:
| Status | Assignee | Task |
| Open | None | T118278 EPIC: Improve Language Identification for use in Cirrus Search |
| Resolved | TJones | T140289 Investigate Improvements and Confidence Measures for TextCat Language Detection |
| Resolved | TJones | T149324 TextCat Improvement Deployment |
| Resolved | TJones | T155672 Deploy 10K models for TextCat (PHP & Perl) |