
Retrain Chinese query-based language ID models
Open, Medium, Public


As part of the investigation into T174116, we discovered that the Chinese query-based language models contain many long strings of dashes and periods, which should be removed.
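
One possible way to filter such strings out of the query data before retraining is sketched below. This is only an illustration, not the actual TextCat training pipeline: the function names, the threshold of three characters for a "long" run, and the choice to replace each run with a single space are all assumptions.

```python
import re

# Assumption: a "long" run is 3 or more consecutive dashes and/or periods.
MIN_RUN = 3
PUNCT_RUN = re.compile(r"[-.]{%d,}" % MIN_RUN)


def strip_punct_runs(query: str) -> str:
    """Replace each long run of dashes/periods with one space, then tidy whitespace."""
    cleaned = PUNCT_RUN.sub(" ", query)
    # split()/join collapses repeated whitespace and trims both ends.
    return " ".join(cleaned.split())


def clean_queries(queries):
    """Yield cleaned queries, dropping any that become empty (hypothetical helper)."""
    for query in queries:
        cleaned = strip_punct_runs(query)
        if cleaned:
            yield cleaned
```

Short punctuation such as the single period in `a.b` is left alone, and a query consisting only of dashes or periods is dropped entirely. Whether a run should become a space or be deleted outright may matter for the n-gram counts, so that choice would need checking against the real training setup.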

Once the models are updated, they should be tested against the TextCat regression test sets to make sure there is no big change in overall performance, against the punctuation-heavy sample from T174116, and against the specific known queries (... being the main one) to make sure everything works as expected.