
Deploy recommended languages for Russian, Japanese, and Portuguese
Closed, ResolvedPublic

Description

Now that the analysis is done, these need to get deployed. (The earlier languages were first deployed as part of the A/B test, which was then enabled for everyone; these just need to be deployed directly.)

Languages to enable, by wiki (see the sketch after the list):

  • ptwiki: pt, en, ru, he, ar, zh, ko, el
  • ruwiki: ru, en, uk, ka, hy, ja, ar, he, zh
  • jawiki: ja, en, ru, ko, ar, he
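
For illustration, the per-wiki allowlists could be written as a simple mapping. This is a hedged sketch in Python, not the actual wmf-config change (that is the Gerrit patch referenced below); the variable and function names here are hypothetical.

```
# Hypothetical representation of the allowlists above; the real
# deployment is a wmf-config change, and these names are made up.
TEXTCAT_LANGUAGES = {
    "ptwiki": ["pt", "en", "ru", "he", "ar", "zh", "ko", "el"],
    "ruwiki": ["ru", "en", "uk", "ka", "hy", "ja", "ar", "he", "zh"],
    "jawiki": ["ja", "en", "ru", "ko", "ar", "he"],
}

def allowed_languages(wiki: str) -> list[str]:
    """Candidate languages TextCat may report for a given wiki."""
    return TEXTCAT_LANGUAGES.get(wiki, [])
```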

The languages for enwiki don't need to change—the new analysis came up with a slightly different list, but it's close enough that consistency is better for now.

Event Timeline

Change 304328 had a related patch set uploaded (by Tjones):
Enable Language ID for Russian, Japanese, Portuguese Wikipedias

https://gerrit.wikimedia.org/r/304328

Scheduled for deployment in today's evening SWAT.

Change 304328 merged by jenkins-bot:
Enable Language ID for Russian, Japanese, Portuguese Wikipedias

https://gerrit.wikimedia.org/r/304328

Why not "zh" in ja? Political issues?

Per the linked analysis: https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Optimization_for_ptwiki_ruwiki_and_jawiki#Japanese_Results

zh makes up somewhere around 4.31% +/- 1.72% of poorly performing queries.
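
As a back-of-the-envelope check on that margin: a ±1.72% half-width around 4.31% is consistent with a 95% binomial confidence interval over a sample of roughly 500 queries. This assumes a standard Wald interval at 95% confidence; the actual sample size and interval type are not stated here.

```
import math

# Back-of-the-envelope check: what sample size makes a 95% Wald
# interval around p = 4.31% come out to roughly +/- 1.72%?
p = 0.0431        # observed proportion of zh among poorly performing queries
margin = 0.0172   # reported half-width
z = 1.96          # 95% confidence

n = p * (1 - p) * (z / margin) ** 2
print(f"implied sample size: ~{n:.0f} queries")  # ~536
```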

Detection of Chinese had a recall of 91.3% but a precision of 26.6%. Basically, while it detected most of the Chinese text, it also flagged about three times as many non-Chinese query strings as Chinese, giving very poor precision. The precision and recall for the overall set improve when Chinese is removed from the detection.
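
To make the "3x" concrete: precision of 26.6% means that for every true Chinese detection there are about 2.76 false ones. Here is a quick illustration with made-up counts chosen to reproduce the reported figures (the counts themselves are hypothetical):

```
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP/(TP+FP); recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical counts scaled to match the reported zh figures:
# recall ~91.3%, precision ~26.6%.
tp, fp, fn = 913, 2520, 87   # 2520/913 ~ 2.76 false positives per true positive
precision, recall = precision_recall(tp, fp, fn)
print(f"precision={precision:.1%} recall={recall:.1%}")
# precision=26.6% recall=91.3%
```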

In addition to what @EBernhardson said, the reason Chinese performs poorly relative to Japanese is that language detection is generally harder on shorter strings, like queries, and with this particular implementation it's also harder on writing systems with more characters: the smallish statistics set of only 3,000 n-grams doesn't come close to covering all Chinese characters. The model actually covers fewer than 3,000 distinct characters, because the n-grams include one- to five-character sequences.
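
For readers unfamiliar with the approach: TextCat-style identification builds a ranked profile of a language's most frequent character n-grams and scores input by rank distance. A minimal sketch, simplified from the Cavnar-Trenkle scheme and not the production implementation, shows why a 3,000-entry profile covers few distinct Chinese characters:

```
from collections import Counter

PROFILE_SIZE = 3000  # the "smallish statistics set" described above

def build_profile(text: str, size: int = PROFILE_SIZE) -> list[str]:
    """Rank the most frequent 1- to 5-character n-grams in a text.
    Single characters compete with longer sequences for the same
    3,000 slots, so far fewer than 3,000 distinct Chinese characters
    can appear in a Chinese profile."""
    counts = Counter()
    for n in range(1, 6):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return [ng for ng, _ in counts.most_common(size)]

def out_of_place(query_profile: list[str], lang_profile: list[str]) -> int:
    """Cavnar-Trenkle 'out-of-place' distance: lower is a better match.
    N-grams absent from the language profile get the maximum penalty,
    which is what hurts short queries in large writing systems."""
    lang_rank = {ng: r for r, ng in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    dist = 0
    for q_rank, ng in enumerate(query_profile):
        dist += abs(q_rank - lang_rank[ng]) if ng in lang_rank else max_penalty
    return dist
```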

I've got some ideas for improving confidence beyond what is more or less a first-past-the-post scoring system now. See T140289. I've also been thinking about models that could be more effective for Chinese while using the same framework we have now, but I have to develop and test out the ideas.
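
As a rough illustration of the difference: the current approach returns whichever language has the best (lowest) score outright, while a confidence-aware variant might abstain when the runner-up is too close. This is a hypothetical sketch of that general idea, not the proposal in T140289; the margin heuristic and threshold are made up.

```
def best_language(distances: dict[str, int]) -> str:
    """Current behavior, roughly: the lowest distance wins outright,
    no matter how narrow the margin."""
    return min(distances, key=distances.get)

def confident_language(distances: dict[str, int],
                       min_margin: float = 0.05) -> str | None:
    """Hypothetical margin check: abstain when the runner-up is too
    close to the winner (margin measured relative to the best score)."""
    ranked = sorted(distances.items(), key=lambda kv: kv[1])
    if len(ranked) < 2:
        return ranked[0][0] if ranked else None
    (best, best_d), (_, second_d) = ranked[0], ranked[1]
    if second_d - best_d >= min_margin * max(best_d, 1):
        return best
    return None  # too close to call; report no detection
```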

This was pushed out to production during the week of Aug 18, 2016.