Decide which set of separators we have to use for TextCat ngrams
Closed, Resolved · Public

Assigned To

Authored By

	Smalyshev
	Jan 14 2016, 6:49 PM

Description

The Perl version of TextCat uses only a simple set of non-word characters to separate words - '0-9\s'. However, in natural text more characters can separate words or appear at the end of a word, such as parentheses, commas, periods, etc. We need to test whether it makes sense to amend that regular expression, or maybe to use some special regexp syntax like \b or \W, to improve model quality.
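To make the question concrete, here is a minimal Python sketch (not the TextCat code itself) of how the choice of separator class changes the n-grams that end up in a model. The function name `ngram_counts`, the sample text, and the "extended" class `0-9\s.,()` are all hypothetical and chosen only for illustration. The underscore padding and the 1-to-5 n-gram range are assumptions based on the usual description of TextCat-style models.

```python
import re
from collections import Counter


def ngram_counts(text, separator_class, max_n=5):
    """Count character n-grams of the words in `text`.

    `separator_class` is the body of a regex character class, e.g. r"0-9\s".
    Each word is padded with "_" on both sides before n-grams are taken
    (an assumed convention, not confirmed from the TextCat source).
    """
    splitter = re.compile("[" + separator_class + "]+")
    counts = Counter()
    for word in splitter.split(text):
        if not word:
            continue  # skip empty strings produced at the text edges
        padded = "_" + word + "_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return counts


text = "The cat (a tabby), sat. 42 times"

# Minimal set from the task description: digits and whitespace only.
minimal = ngram_counts(text, r"0-9\s")

# Hypothetical extended set: also split on period, comma, and parentheses.
extended = ngram_counts(text, r"0-9\s.,()")

# N-grams that exist only because punctuation stayed attached to words.
punctuation_only = sorted(g for g in minimal if g not in extended)
```

With the minimal set, "sat." and "(a" are kept as words, so n-grams such as "t._" and "(a" enter the model; with the extended set they disappear and "cat" and "sat" both contribute the same word-final n-gram "at_". Whether that helps or hurts identification accuracy is exactly what needs to be tested per character.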

Related Objects

Status	Assigned	Task
Resolved	EBernhardson	T137158 Compile and then resolve issues with TextCat A/B test data
Declined	mpopov	T134320 Analyse results of TextCat A/B test
Resolved	EBernhardson	T130321 Disable Schema:Search, since it's outdated and redundant
Resolved	mpopov	T129564 Switch Desktop data collection for dashboards to use TestSearchSatisfaction2 instead of Search schema
Resolved	EBernhardson	T134319 Turn off TextCat A/B test on the English Wikipedia on or after May 23
Resolved	debt	T134318 Verify data pipeline for TextCat A/B test on English Wikipedia
Open	None	T118278 [EPIC] Improve Language Identification for use in Cirrus Search
Resolved	EBernhardson	T121542 Write and deploy an A/B Test on enwiki using TextCat for Language Identification
Resolved	EBernhardson	T124844 Add textcat to mediawiki vendor libs
Resolved	EBernhardson	T121543 Do an A/B Tests on Other Wikis with TextCat for Language Identification
Resolved	debt	T121541 Create Properly Weighted Language Identification Evaluation Sets for Top N Other Wikis
Resolved	TJones	T121539 Create Balanced Language Identification Evaluation Set for Top N Wikis by Query Volume
Resolved	TJones	T132466 Lang ID Eval Sets for Italian, German, Spanish, and French
Resolved	TJones	T134431 Re-Optimize Italian, German, Spanish, and French TextCat Languages by Recall
Resolved	TJones	T138315 Lang ID Eval Sets for English, Russian, Japanese, Portuguese
Resolved	TJones	T142413 Deploy recommended languages for Russian, Japanese, and Portuguese
Resolved	debt	T143355 request translations for 'showing results from'
Resolved	Anikethfoss	T145926 [[MediaWiki:Search-interwiki-results-acewiki/fi]] typo: "Acehnese" instead of "Achinese"
Resolved	TJones	T142140 Lang ID Eval Set for Dutch
Resolved	debt	T143354 ask for translations for 'showing results from' (Polish, Dutch, Arabic and Chinese)
Resolved	Smalyshev	T121538 Convert TextCat to PHP Library for Language Identification in Cirrus Search
Resolved	TJones	T123537 Generate wikitext-based and query-based language models for TextCat
Resolved	TJones	T123651 Decide which set of separators we have to use for TextCat ngrams
Resolved	dpatrick	T123558 Security review for TextCat library
Resolved	EBernhardson	T137163 Part Deux: TextCat A/B test for Language Identification - specification

Event Timeline

Smalyshev created this task. Jan 14 2016, 6:49 PM

Smalyshev assigned this task to TJones.

Smalyshev raised the priority of this task from to Medium.

Smalyshev updated the task description.

Smalyshev added projects: Discovery-ARCHIVED, Discovery-Search (Current work).

Smalyshev subscribed.

Restricted Application added a subscriber: Aklapper. Jan 14 2016, 6:49 PM

Smalyshev moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board. Jan 14 2016, 6:49 PM

Smalyshev moved this task from Needs triage to Search on the Discovery-ARCHIVED board. Jan 15 2016, 12:55 AM

@TJones @Smalyshev Does this block anything specific, or is this more along the lines of a general improvement? It would be good to know, so that we can determine whether it can be done after we actually get an A/B test out that uses TextCat.

Deskana moved this task from Search to On Sprint Board on the Discovery-ARCHIVED board. Jan 21 2016, 11:24 PM

We can keep using the minimal set of separators, but the result would (might) be sub-optimal then. So, I think it is useful to do this before we do T123537. Though strictly speaking we could do without it if we had to.

In T123651#1954031, @Smalyshev wrote:

We can keep using the minimal set of separators, but the result would (might) be sub-optimal then. So, I think it is useful to do this before we do T123537. Though strictly speaking we could do without it if we had to.

Thanks. Marking this as blocking T123537 for now makes sense. We can revisit and maybe remove that requirement if this ends up taking significantly longer than expected.

I've tested the various punctuation options. Long story short, parens hurt, but periods help. More detail:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_with_TextCat#Additional_Non-Word_Characters

I'm going to commit the change to the PHP version of TextCat.

TJones moved this task from not in use - please delete to Needs review on the Discovery-Search (Current work) board. Jan 22 2016, 10:33 PM

Looks great, thanks!

In T123651#1957519, @TJones wrote:

I've tested the various punctuation options. Long story short, parens hurt, but periods help. More detail:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_with_TextCat#Additional_Non-Word_Characters

I'm going to commit the change to the PHP version of TextCat.

Great! Can you link to that commit here, so that once it's merged we can close this task?

I got a little over-excited on Friday and got ahead of myself. While the test is done, the language models need to be regenerated with the new separator. I've just done that, but now I'm wrestling with Gerrit. Will link as soon as I can.

Changes committed and awaiting review: https://gerrit.wikimedia.org/r/266437

Merged. (Thanks for the review, Stas.)

Smalyshev closed this task as Resolved. Jan 26 2016, 11:07 PM

Deskana moved this task from Needs Reporting to Resolved on the Discovery-Search (Current work) board. Jan 28 2016, 6:09 PM
