Generate wikitext-based and query-based language models for TextCat
Closed, Resolved, Public

Assigned To

Authored By

	Smalyshev
	Jan 13 2016, 7:10 PM

Description

We need to generate two sets of language models for TextCat: one based on wikitext for the top 50 languages (by number of speakers), and one based on queries for the top 50 languages (by query logs, or the same list as above?).
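
For illustration only, here is a minimal sketch of how a TextCat-style model could be built from a text sample, assuming the usual n-gram ranking approach: split the text into words, pad each word with underscores, count character n-grams of length 1 to 5, and keep the most frequent ones. The function name `build_ngram_model`, its parameters, the cutoff of 3000, and the separator pattern are hypothetical and are not taken from the TextCat library; the actual separator set is decided in T123651.

```python
import re
from collections import Counter


def build_ngram_model(text, max_n=5, max_ngrams=3000, separators=r"[\s\d]+"):
    """Return (ngram, count) pairs for `text`, most frequent first.

    All names and defaults here are illustrative assumptions,
    not the TextCat library's API.
    """
    counts = Counter()
    for word in re.split(separators, text):
        if not word:
            continue  # skip empty strings produced by leading/trailing separators
        padded = f"_{word}_"  # mark word boundaries with underscores
        for n in range(1, max_n + 1):
            # slide a window of width n over the padded word
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # sort by descending count; break ties alphabetically so output is deterministic
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:max_ngrams]


if __name__ == "__main__":
    for ngram, count in build_ngram_model("aa aa b")[:5]:
        print(ngram, count)
```

Under this sketch, the wikitext-based and query-based sets would come from running the same procedure once per language over each kind of source text.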

Related Objects

Status	Assigned	Task
Resolved	EBernhardson	T137158 Compile and then resolve issues with TextCat A/B test data
Declined	mpopov	T134320 Analyse results of TextCat A/B test
Resolved	EBernhardson	T130321 Disable Schema:Search, since it's outdated and redundant
Resolved	mpopov	T129564 Switch Desktop data collection for dashboards to use TestSearchSatisfaction2 instead of Search schema
Resolved	EBernhardson	T134319 Turn off TextCat A/B test on the English Wikipedia on or after May 23
Resolved	debt	T134318 Verify data pipeline for TextCat A/B test on English Wikipedia
Open	None	T118278 [EPIC] Improve Language Identification for use in Cirrus Search
Resolved	EBernhardson	T121542 Write and deploy an A/B Test on enwiki using TextCat for Language Identification
Resolved	EBernhardson	T124844 Add textcat to mediawiki vendor libs
Resolved	EBernhardson	T121543 Do an A/B Tests on Other Wikis with TextCat for Language Identification
Resolved	debt	T121541 Create Properly Weighted Language Identification Evaluation Sets for Top N Other Wikis
Resolved	TJones	T121539 Create Balanced Language Identification Evaluation Set for Top N Wikis by Query Volume
Resolved	TJones	T132466 Lang ID Eval Sets for Italian, German, Spanish, and French
Resolved	TJones	T134431 Re-Optimize Italian, German, Spanish, and French TextCat Languages by Recall
Resolved	TJones	T138315 Lang ID Eval Sets for English, Russian, Japanese, Portuguese
Resolved	TJones	T142413 Deploy recommended languages for Russian, Japanese, and Portuguese
Resolved	debt	T143355 request translations for 'showing results from'
Resolved	Anikethfoss	T145926 [[MediaWiki:Search-interwiki-results-acewiki/fi]] typo: "Acehnese" instead of "Achinese"
Resolved	TJones	T142140 Lang ID Eval Set for Dutch
Resolved	debt	T143354 ask for translations for 'showing results from' (Polish, Dutch, Arabic and Chinese)
Resolved	Smalyshev	T121538 Convert TextCat to PHP Library for Language Identification in Cirrus Search
Resolved	TJones	T123537 Generate wikitext-based and query-based language models for TextCat
Resolved	TJones	T123651 Decide which set of separators we have to use for TextCat ngrams
Resolved	• dpatrick	T123558 Security review for TextCat library
Resolved	EBernhardson	T137163 Part Deux: TextCat A/B test for Language Identification - specification

Event Timeline

Smalyshev created this task. Jan 13 2016, 7:10 PM

Smalyshev assigned this task to TJones.

Smalyshev raised the priority of this task to Medium.

Smalyshev updated the task description.

Smalyshev added a project: Discovery-ARCHIVED.

Smalyshev subscribed.

Restricted Application added a subscriber: Aklapper. Jan 13 2016, 7:10 PM

Smalyshev moved this task from Needs triage to Search on the Discovery-ARCHIVED board. Jan 13 2016, 7:11 PM

Smalyshev added a parent task: T121538: Convert TextCat to PHP Library for Language Identification in Cirrus Search. Jan 13 2016, 7:56 PM

Smalyshev added a project: Discovery-Search (Current work).

Smalyshev set Security to None.

• ksmith mentioned this in T121542: Write and deploy an A/B Test on enwiki using TextCat for Language Identification. Jan 14 2016, 5:36 PM

• Deskana moved this task from Search to On Sprint Board on the Discovery-ARCHIVED board. Jan 14 2016, 5:41 PM

Smalyshev added a subtask: T123651: Decide which set of separators we have to use for TextCat ngrams. Jan 21 2016, 11:36 PM

Smalyshev mentioned this in T123651: Decide which set of separators we have to use for TextCat ngrams.

Smalyshev mentioned this in T123558: Security review for TextCat library. Jan 22 2016, 8:16 PM

This is at least related to T121545, though some of the details are different.

Smalyshev closed subtask T123651: Decide which set of separators we have to use for TextCat ngrams as Resolved. Jan 26 2016, 11:07 PM

Smalyshev added a parent task: T121542: Write and deploy an A/B Test on enwiki using TextCat for Language Identification. Jan 26 2016, 11:14 PM

Smalyshev mentioned this in T121538: Convert TextCat to PHP Library for Language Identification in Cirrus Search. Jan 28 2016, 6:13 PM

Smalyshev added a parent task: T124844: Add textcat to mediawiki vendor libs. Jan 28 2016, 7:10 PM

TJones moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board. Jan 28 2016, 7:40 PM

TJones moved this task from not in use - please delete to Needs review on the Discovery-Search (Current work) board. Feb 2 2016, 5:53 PM

This is very similar to T121545. I think they should be merged.

At the moment, I've created models based on lightly cleaned-up wikitext, but I haven't evaluated them yet. They have also been committed and submitted for review.

The mentioned patch is merged, so I'm calling this complete. It could also just be merged as a duplicate, as Trey suggested.

EBernhardson moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board. Feb 3 2016, 12:41 AM

Smalyshev closed this task as Resolved. Feb 3 2016, 1:08 AM
