T121543 Do an A/B Tests on Other Wikis with TextCat for Language Identification

Subject	Repo	Branch	Lines +/-
Textcat search satisfaction subtest for multiple wikis	mediawiki/extensions/WikimediaEvents	master	+48 -7
Textcat search satisfaction subtest for multiple wikis	mediawiki/extensions/WikimediaEvents	wmf/1.28.0-wmf.6	+48 -7
Add implementation for TextCat language detection	mediawiki/extensions/CirrusSearch	master	+181 -69

Status	Assigned	Task
Open	None	T118278 [EPIC] Improve Language Identification for use in Cirrus Search
Resolved	EBernhardson	T121543 Do an A/B Tests on Other Wikis with TextCat for Language Identification
Resolved	debt	T121541 Create Properly Weighted Language Identification Evaluation Sets for Top N Other Wikis
Resolved	TJones	T121539 Create Balanced Language Identification Evaluation Set for Top N Wikis by Query Volume
Resolved	TJones	T132466 Lang ID Eval Sets for Italian, German, Spanish, and French
Resolved	TJones	T134431 Re-Optimize Italian, German, Spanish, and French TextCat Languages by Recall
Resolved	TJones	T138315 Lang ID Eval Sets for English, Russian, Japanese, Portuguese
Resolved	TJones	T142413 Deploy recommended languages for Russian, Japanese, and Portuguese
Resolved	debt	T143355 request translations for 'showing results from'
Resolved	Anikethfoss	T145926 [[MediaWiki:Search-interwiki-results-acewiki/fi]] typo: "Acehnese" instead of "Achinese"
Resolved	TJones	T142140 Lang ID Eval Set for Dutch
Resolved	debt	T143354 ask for translations for 'showing results from' (Polish, Dutch, Arabic and Chinese)
Resolved	Smalyshev	T121538 Convert TextCat to PHP Library for Language Identification in Cirrus Search
Resolved	TJones	T123537 Generate wikitext-based and query-based language models for TextCat
Resolved	TJones	T123651 Decide which set of separators we have to use for TextCat ngrams
Resolved	• dpatrick	T123558 Security review for TextCat library
Resolved	EBernhardson	T137163 Part Deux: TextCat A/B test for Language Identification - specification

TJones created this task. Dec 15 2015, 5:50 PM

TJones raised the priority of this task to Needs Triage.

TJones updated the task description.

TJones subscribed.

Restricted Application added a project: Discovery-ARCHIVED. Dec 15 2015, 5:50 PM

Restricted Application added subscribers: StudiesWorld, Aklapper.

TJones added a subtask: T121541: Create Properly Weighted Language Identification Evaluation Sets for Top N Other Wikis. Dec 15 2015, 5:50 PM

TJones added a parent task: T118278: [EPIC] Improve Language Identification for use in Cirrus Search. Dec 15 2015, 5:56 PM

TJones added a subtask: T121538: Convert TextCat to PHP Library for Language Identification in Cirrus Search. Dec 22 2015, 5:22 PM

• Deskana added a project: Discovery-Search (Current work). Dec 22 2015, 6:23 PM

• Deskana set Security to None.

• Deskana moved this task from Inbox to Multilingual and cross-project on the CirrusSearch board. Dec 31 2015, 12:28 AM

Liuxinyu970226 subscribed. Dec 31 2015, 9:26 AM

• Deskana triaged this task as Medium priority. Jan 14 2016, 5:44 PM

• Deskana moved this task from Needs triage to On Sprint Board on the Discovery-ARCHIVED board.

• Deskana subscribed.

Smalyshev closed subtask T121538: Convert TextCat to PHP Library for Language Identification in Cirrus Search as Resolved. Jan 28 2016, 6:03 PM

Change 260164 had a related patch set uploaded (by Smalyshev):
Add implementation for TextCat language detection

https://gerrit.wikimedia.org/r/260164

gerritbot added a project: Patch-For-Review. Feb 9 2016, 11:19 PM

Change 260164 merged by jenkins-bot:
Add implementation for TextCat language detection

https://gerrit.wikimedia.org/r/260164

EBernhardson removed a project: Discovery-Search (Current work). Feb 16 2016, 11:12 PM

• ksmith moved this task from On Sprint Board to Search on the Discovery-ARCHIVED board. Feb 16 2016, 11:24 PM

ReleaseTaggerBot added a project: MW-1.27-release (WMF-deploy-2016-03-01_(1.27.0-wmf.15)). Feb 17 2016, 10:02 PM

• Deskana added a project: Discovery-Search. Apr 12 2016, 10:18 PM

• Deskana moved this task from needs triage to Up Next on the Discovery-Search board.

@TJones I was looking at these tasks, and wondering if the blockers here are really blockers for running an extra A/B test. @EBernhardson and I think they may not be, and that based on your work so far we could run a test right now, but we don't know this stuff as well as you do, so we'd like to ask you. Thoughts?

Let's bump up the priority somewhat.

In T121543#2324750, @Deskana wrote:

@TJones I was looking at these tasks, and wondering if the blockers here are really blockers for running an extra A/B test. @EBernhardson and I think they may not be, and that based on your work so far we could run a test right now, but we don't know this stuff as well as you do, so we'd like to ask you. Thoughts?

They are and they aren't—what a helpful answer!

The tasks are really too general, and at the earliest stage I divided everything into English and not-English until we figured out whether it made sense to pursue language ID in general.

The specific blocking tasks do need to be done, but not for all languages at once. For French, Spanish, Italian, and German Wikipedias, we aren't blocked by T121541 specifically, but by the subtask T132466, which is in "needs review", but is basically done.

The language lists for each of those wikis are available in T132466, and that's enough to run the A/B tests in parallel with the test we've run for enwiki.

There's still the question of recall-focus vs precision-focus: see T134431, which is in "needs review" but is basically done, and T136034, which is still to do. Even so, we can do all the A/B tests with the same precision-focus we've had so far and get a better idea of how well this can work.
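
For anyone following along who wants the precision-focus vs recall-focus distinction spelled out, here is a minimal sketch of how the two numbers could be computed per language from a hand-labeled evaluation set. It is only an illustration: the function name `per_language_scores`, the label format, and the convention that `None` means "the identifier declined to guess" are assumptions made for this example, not taken from TextCat, CirrusSearch, or the linked tasks.

```python
from collections import Counter


def per_language_scores(gold, predicted):
    """Return {lang: (precision, recall)} for paired gold/predicted labels.

    Hypothetical helper for illustration only. A predicted label of None
    is assumed to mean the identifier made no guess for that query.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    true_pos = Counter()    # correct guesses, keyed by language
    pred_count = Counter()  # how often each language was guessed
    gold_count = Counter()  # how often each language really occurs

    for g, p in zip(gold, predicted):
        gold_count[g] += 1
        if p is not None:
            pred_count[p] += 1
            if p == g:
                true_pos[g] += 1

    scores = {}
    for lang in set(gold_count) | set(pred_count):
        # Precision: of the queries labeled `lang`, how many really were.
        precision = true_pos[lang] / pred_count[lang] if pred_count[lang] else 0.0
        # Recall: of the queries really in `lang`, how many were found.
        recall = true_pos[lang] / gold_count[lang] if gold_count[lang] else 0.0
        scores[lang] = (precision, recall)
    return scores


# Made-up labels, not data from any of the evaluation sets.
gold = ["fr", "fr", "es", "de"]
predicted = ["fr", None, "fr", "de"]
scores = per_language_scores(gold, predicted)
```

A precision-focused setup prefers to make no guess rather than a wrong one, which raises the first number at the cost of the second; a recall-focused setup accepts more wrong guesses in order to catch more of the queries that really are in another language.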

debt edited projects, added Discovery-Search (Current work); removed Discovery-Search. May 31 2016, 10:12 PM

EBernhardson removed a project: Patch-For-Review. May 31 2016, 10:16 PM

debt added a subtask: T137163: Part Deux: TextCat A/B test for Language Identification - specification. Jun 6 2016, 10:56 PM

EBernhardson claimed this task. Jun 9 2016, 10:05 PM

EBernhardson moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board.

EBernhardson moved this task from not in use - please delete to Needs review on the Discovery-Search (Current work) board. Jun 16 2016, 4:49 PM

Change 293432 had a related patch set uploaded (by EBernhardson):
Textcat search satisfaction subtest for multiple wikis

https://gerrit.wikimedia.org/r/293432

gerritbot added a project: Patch-For-Review. Jun 16 2016, 6:04 PM

Change 293432 merged by jenkins-bot:
Textcat search satisfaction subtest for multiple wikis

https://gerrit.wikimedia.org/r/293432

Change 294773 had a related patch set uploaded (by EBernhardson):
Textcat search satisfaction subtest for multiple wikis

https://gerrit.wikimedia.org/r/294773

Change 294773 merged by jenkins-bot:
Textcat search satisfaction subtest for multiple wikis

https://gerrit.wikimedia.org/r/294773

EBernhardson moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board. Jun 20 2016, 8:42 PM

• Deskana closed subtask T137163: Part Deux: TextCat A/B test for Language Identification - specification as Resolved. Jul 7 2016, 7:20 PM

debt closed this task as Resolved. Jul 21 2016, 4:02 PM

Liuxinyu970226 unsubscribed. Jul 22 2016, 12:42 AM

debt closed subtask T121541: Create Properly Weighted Language Identification Evaluation Sets for Top N Other Wikis as Resolved. Apr 5 2017, 2:51 PM

Do an A/B Tests on Other Wikis with TextCat for Language Identification
Closed, Resolved · Public