
Deploy recommended languages for Russian, Japanese, and Portuguese
Closed, ResolvedPublic

Description

Now that the analysis is done, these need to get deployed. (The earlier languages were first deployed as part of the A/B test, which was then enabled for everyone; these just need to be deployed directly.)

Languages to enable, by wiki (see the sketch after the list):

  • ptwiki: pt, en, ru, he, ar, zh, ko, el
  • ruwiki: ru, en, uk, ka, hy, ja, ar, he, zh
  • jawiki: ja, en, ru, ko, ar, he
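
For illustration, the per-wiki allowlists could be written as a simple mapping. This is a hedged sketch in Python, not the actual wmf-config change (that is the Gerrit patch referenced below); the variable and function names here are hypothetical.

```
# Hypothetical representation of the allowlists above; the real
# deployment is a wmf-config change, and these names are made up.
TEXTCAT_LANGUAGES = {
    "ptwiki": ["pt", "en", "ru", "he", "ar", "zh", "ko", "el"],
    "ruwiki": ["ru", "en", "uk", "ka", "hy", "ja", "ar", "he", "zh"],
    "jawiki": ["ja", "en", "ru", "ko", "ar", "he"],
}

def allowed_languages(wiki: str) -> list[str]:
    """Candidate languages TextCat may report for a given wiki."""
    return TEXTCAT_LANGUAGES.get(wiki, [])
```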

The languages for enwiki don't need to change—the new analysis came up with a slightly different list, but it's close enough that consistency is better for now.

Event Timeline

Change 304328 had a related patch set uploaded (by Tjones):
Enable Language ID for Russian, Japanese, Portuguese Wikipedias

https://gerrit.wikimedia.org/r/304328

Scheduled for deployment in today's evening SWAT.

Change 304328 merged by jenkins-bot:
Enable Language ID for Russian, Japanese, Portuguese Wikipedias

https://gerrit.wikimedia.org/r/304328

Why not "zh" in ja? Political issues?

Per the linked analysis: https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Optimization_for_ptwiki_ruwiki_and_jawiki#Japanese_Results

zh makes up somewhere around 4.31% +/- 1.72% of poorly performing queries.
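
As a back-of-the-envelope check on that margin: a ±1.72% half-width around 4.31% is consistent with a 95% binomial confidence interval over a sample of roughly 500 queries. This assumes a standard Wald interval at 95% confidence; the actual sample size and interval type are not stated here.

```
import math

# Back-of-the-envelope check: what sample size makes a 95% Wald
# interval around p = 4.31% come out to roughly +/- 1.72%?
p = 0.0431        # observed proportion of zh among poorly performing queries
margin = 0.0172   # reported half-width
z = 1.96          # 95% confidence

n = p * (1 - p) * (z / margin) ** 2
print(f"implied sample size: ~{n:.0f} queries")  # ~536
```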

Detection of Chinese had a recall of 91.3% but a precision of 26.6%. Basically, while it detected most of the Chinese text, it also flagged about three times as many non-Chinese query strings as Chinese, giving very poor precision. The precision and recall for the overall set improve when Chinese is removed from the detection.
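
To make the "3x" concrete: precision of 26.6% means that for every true Chinese detection there are about 2.76 false ones. Here is a quick illustration with made-up counts chosen to reproduce the reported figures (the counts themselves are hypothetical):

```
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP/(TP+FP); recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical counts scaled to match the reported zh figures:
# recall ~91.3%, precision ~26.6%.
tp, fp, fn = 913, 2520, 87   # 2520/913 ~ 2.76 false positives per true positive
precision, recall = precision_recall(tp, fp, fn)
print(f"precision={precision:.1%} recall={recall:.1%}")
# precision=26.6% recall=91.3%
```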

In addition to what @EBernhardson said, the reason Chinese performs poorly relative to Japanese is that language detection is generally harder on shorter strings, like queries, and with this particular implementation it's also harder on writing systems with more characters: the smallish statistics set of only 3,000 n-grams doesn't come close to covering all Chinese characters. The model actually covers fewer than 3,000 distinct characters, because the n-grams include one- to five-character sequences.
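
For readers unfamiliar with the approach: TextCat-style identification builds a ranked profile of a language's most frequent character n-grams and scores input by rank distance. A minimal sketch, simplified from the Cavnar-Trenkle scheme and not the production implementation, shows why a 3,000-entry profile covers few distinct Chinese characters:

```
from collections import Counter

PROFILE_SIZE = 3000  # the "smallish statistics set" described above

def build_profile(text: str, size: int = PROFILE_SIZE) -> list[str]:
    """Rank the most frequent 1- to 5-character n-grams in a text.
    Single characters compete with longer sequences for the same
    3,000 slots, so far fewer than 3,000 distinct Chinese characters
    can appear in a Chinese profile."""
    counts = Counter()
    for n in range(1, 6):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return [ng for ng, _ in counts.most_common(size)]

def out_of_place(query_profile: list[str], lang_profile: list[str]) -> int:
    """Cavnar-Trenkle 'out-of-place' distance: lower is a better match.
    N-grams absent from the language profile get the maximum penalty,
    which is what hurts short queries in large writing systems."""
    lang_rank = {ng: r for r, ng in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    dist = 0
    for q_rank, ng in enumerate(query_profile):
        dist += abs(q_rank - lang_rank[ng]) if ng in lang_rank else max_penalty
    return dist
```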

I've got some ideas for improving confidence beyond what is more or less a first-past-the-post scoring system now. See T140289. I've also been thinking about models that could be more effective for Chinese while using the same framework we have now, but I have to develop and test out the ideas.
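
As a rough illustration of the difference: the current approach returns whichever language has the best (lowest) score outright, while a confidence-aware variant might abstain when the runner-up is too close. This is a hypothetical sketch of that general idea, not the proposal in T140289; the margin heuristic and threshold are made up.

```
def best_language(distances: dict[str, int]) -> str:
    """Current behavior, roughly: the lowest distance wins outright,
    no matter how narrow the margin."""
    return min(distances, key=distances.get)

def confident_language(distances: dict[str, int],
                       min_margin: float = 0.05) -> str | None:
    """Hypothetical margin check: abstain when the runner-up is too
    close to the winner (margin measured relative to the best score)."""
    ranked = sorted(distances.items(), key=lambda kv: kv[1])
    if len(ranked) < 2:
        return ranked[0][0] if ranked else None
    (best, best_d), (_, second_d) = ranked[0], ranked[1]
    if second_d - best_d >= min_margin * max(best_d, 1):
        return best
    return None  # too close to call; report no detection
```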

This was pushed out to production during the week of Aug 18, 2016.