Can't work on all of them at once, so continue down the list. See parent task T121541.
Dropping Indonesian because we're working from a new volume-based list from the search metrics dashboard.
English is done, and it came out similar to the previous ZRR-based corpus (which also included API calls and had no anti-bot precautions).
Portuguese is done. Portuguese typos often look a lot like Spanish typos! Nonetheless, ptwiki's low-performing queries are mostly in Portuguese (>90%), so accuracy is very high (>95%).
Russian is done. About 77% of poor-performing ruwiki queries are in Russian, with a sizable amount in English (>10%) and Ukrainian (<5%), and a moderately long tail of other languages. Overall accuracy is good (>90%), despite not having models for a fair number of languages in the long tail.
Japanese is done. It's mostly Japanese (big surprise!), with a dollop of English, and a bit of Chinese. Unfortunately, the Chinese model gets too many false positives on Japanese queries, so we have to disable it. (Maybe that TextCat Confidence thing would help.)
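To illustrate why disabling a language model can raise overall accuracy, here is a minimal Python sketch. Everything in it is an assumption for illustration: the helper names are hypothetical, the counts are made up (not the real jawiki numbers), and relabeling to a single fallback language is a simplification of what a real identifier would do once a model is removed from its candidate set.

```python
def accuracy(pairs):
    """Fraction of (actual, predicted) pairs where the labels match."""
    if not pairs:
        return 0.0
    return sum(1 for actual, predicted in pairs if actual == predicted) / len(pairs)


def false_positives(pairs, lang):
    """Count queries labelled `lang` that are actually some other language."""
    return sum(1 for actual, predicted in pairs
               if predicted == lang and actual != lang)


def disable_language(pairs, lang, fallback):
    """Simulate dropping `lang` from the candidate set.

    Simplification: every query that was labelled `lang` is relabelled
    as `fallback`, instead of being re-scored against the other models.
    """
    return [(actual, fallback if predicted == lang else predicted)
            for actual, predicted in pairs]


# Made-up sample of (actual, predicted) labels, NOT real query data.
sample = (
    [("ja", "ja")] * 80    # Japanese identified correctly
    + [("ja", "zh")] * 10  # Japanese mistaken for Chinese
    + [("en", "en")] * 7   # English identified correctly
    + [("zh", "zh")] * 3   # Chinese identified correctly
)

before = accuracy(sample)
zh_false_positives = false_positives(sample, "zh")
after = accuracy(disable_language(sample, "zh", fallback="ja"))
```

With these invented counts, the Chinese false positives outnumber the real Chinese queries, so dropping the Chinese model loses a few correct answers but gains more than it loses.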