Recent work (subtasks of T140289) has shown that we can get better performance out of TextCat with larger language models. We currently have 5K n-gram models (configured to be used as 3K models) in production. 9K models seem to be the best option, but we can deploy the 10K models I've been using for testing and development.
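To make the model sizes concrete, here is a minimal Python sketch. It assumes a TextCat-style model is a list of character n-grams ranked by frequency, so that "a 5K model used as 3K" means only the top-ranked 3K entries are consulted. The function names `build_model` and `use_as` are hypothetical and are not part of the TextCat API.

```python
from collections import Counter


def build_model(text, max_ngrams, max_n=5):
    """Return the max_ngrams most frequent character n-grams, most frequent first.

    Hypothetical helper for illustration only; not the TextCat implementation.
    """
    counts = Counter()
    for word in text.split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Rank by descending frequency, breaking ties alphabetically
    # so that the ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [ngram for ngram, _ in ranked[:max_ngrams]]


def use_as(model, size):
    """Treat a larger ranked model as a smaller one by keeping its top entries."""
    return model[:size]
```

Under this assumption, truncating a larger model gives the same n-gram list as building the smaller model directly, which is why a single larger file on disk can serve several configured sizes.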
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
Update PHP TextCat Models to 10K n-grams | wikimedia/textcat | master | +1 M -79
Status | Assigned | Task
---|---|---
Open | None | T118278 [EPIC] Improve Language Identification for use in Cirrus Search
Resolved | TJones | T140289 Investigate Improvements and Confidence Measures for TextCat Language Detection
Resolved | TJones | T149324 TextCat Improvement Deployment
Resolved | TJones | T155672 Deploy 10K models for TextCat (PHP & Perl)
Event Timeline
Change 333683 had a related patch set uploaded (by Tjones):
Update PHP TextCat Models to 10K n-grams
This has been merged into the TextCat repository and will go out to production this week along with other deployments.