Convert TextCat to PHP Library for Language Identification in Cirrus Search
Closed, Resolved · Public

Authored By

	TJones
	Dec 15 2015, 5:42 PM

Description

TextCat currently seems more promising than the baseline ES Plugin we've been using. It's in very old Perl and should be converted to PHP.

Stas has already started working on converting TextCat to PHP for use in Cirrus Search (available on GitHub: https://github.com/smalyshev/textcat), and he and Erik have been brainstorming on ways of making it more efficient, too. It needs some testing (e.g., Unicode compatibility) and comparison to the Perl version (i.e., same results on building models and running on test queries).

Rough estimate: < 1 week

Related Objects

Status	Assigned	Task
Resolved	EBernhardson	T137158 Compile and then resolve issues with TextCat A/B test data
Declined	mpopov	T134320 Analyse results of TextCat A/B test
Resolved	EBernhardson	T130321 Disable Schema:Search, since it's outdated and redundant
Resolved	mpopov	T129564 Switch Desktop data collection for dashboards to use TestSearchSatisfaction2 instead of Search schema
Resolved	EBernhardson	T134319 Turn off TextCat A/B test on the English Wikipedia on or after May 23
Resolved	debt	T134318 Verify data pipeline for TextCat A/B test on English Wikipedia
Open	None	T118278 [EPIC] Improve Language Identification for use in Cirrus Search
Resolved	EBernhardson	T121542 Write and deploy an A/B Test on enwiki using TextCat for Language Identification
Resolved	• dcausse	T121540 Investigate Updating Cybozu / ES Plugin for Language Identification
Resolved	EBernhardson	T124844 Add textcat to mediawiki vendor libs
Resolved	mpopov	T132706 Validate click events in TestSearchSatisfaction2
Resolved	EBernhardson	T121543 Do an A/B Tests on Other Wikis with TextCat for Language Identification
Resolved	Smalyshev	T121538 Convert TextCat to PHP Library for Language Identification in Cirrus Search
Resolved	TJones	T123537 Generate wikitext-based and query-based language models for TextCat
Resolved	TJones	T123651 Decide which set of separators we have to use for TextCat ngrams
Resolved	• dpatrick	T123558 Security review for TextCat library

Event Timeline

TJones created this task. Dec 15 2015, 5:42 PM

TJones assigned this task to Smalyshev.

TJones raised the priority of this task to High.

TJones updated the task description.

TJones added a project: CirrusSearch.

TJones added subscribers: TJones, EBernhardson.

Restricted Application added a project: Discovery-ARCHIVED. Dec 15 2015, 5:42 PM

Restricted Application added a subscriber: Aklapper.

TJones mentioned this in T121542: Write and deploy an A/B Test on enwiki using TextCat for Language Identification. Dec 15 2015, 5:48 PM

TJones added a parent task: T121542: Write and deploy an A/B Test on enwiki using TextCat for Language Identification.

TJones added a parent task: T118278: [EPIC] Improve Language Identification for use in Cirrus Search. Dec 15 2015, 5:56 PM

TJones added a parent task: T121543: Do an A/B Tests on Other Wikis with TextCat for Language Identification. Dec 22 2015, 5:22 PM

Smalyshev added a project: Discovery-Search (Current work). Dec 22 2015, 5:25 PM

Smalyshev set Security to None.

Smalyshev moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board.

We might be able to use this for T90523: Detect LTR/RTL directionality on a per-post basis when it's saved as well.

@Mattflaschen if the library finds wider use it'd be even better. I'm planning to add it to wikimedia gerrit and to mediawiki/vendors.

@Mattflaschen: I'm not sure if this would be a great approach for RTL detection. In particular, we're training models based on queries, and seeing some improvement over general language models. That suggests they could perform worse on general language. Also, LTR/RTL detection generally happens at the script level, not the language level (you don't need to distinguish Arabic from Farsi to know it should be RTL).

Seems like you could count individual characters (most are unambiguously RTL or LTR) and choose RTL/LTR based on the preponderance of the text—so that, for example, one Hebrew word in a long string of French is LTR, but one word of Spanish in a long string of Arabic is RTL. And you don't need to detect Hebrew, French, Spanish, or Arabic as languages to figure that out.

But that's just off the top of my head, and your mileage may vary.
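A minimal sketch of that character-counting idea, in Python. This is an illustration only, not code from TextCat or from T90523: the function name `guess_direction`, the choice of which bidi classes count as "strong", and the tie-breaking default are all assumptions made for the example. It relies on the standard library's `unicodedata.bidirectional`, which returns the Unicode bidi class of a character.

```python
import unicodedata

# Assumption for this sketch: only strongly directional characters vote.
# "R" (e.g. Hebrew) and "AL" (Arabic letters) count as RTL; "L" counts as LTR.
# Digits, spaces and punctuation have weak or neutral classes and are ignored.
RTL_CLASSES = {"R", "AL"}
LTR_CLASSES = {"L"}


def guess_direction(text, default="ltr"):
    """Return "rtl" or "ltr" based on which strong class is more frequent.

    `default` is returned when the counts tie, including when the text
    contains no strongly directional characters at all.
    """
    rtl = ltr = 0
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in RTL_CLASSES:
            rtl += 1
        elif bidi_class in LTR_CLASSES:
            ltr += 1
    if rtl == ltr:
        return default
    return "rtl" if rtl > ltr else "ltr"


# One Hebrew word in a French sentence: Latin letters dominate.
print(guess_direction("Le mot שלום est hébreu"))   # ltr
# One Spanish word in an Arabic sentence: Arabic letters dominate.
print(guess_direction("هذه كلمة hola بالعربية"))    # rtl
```

Note that no language is identified anywhere in the sketch; it only looks at per-character directionality, which is the point made above.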

• Deskana moved this task from Inbox to Multilingual and cross-project on the CirrusSearch board. Dec 31 2015, 12:28 AM

• Deskana moved this task from Needs triage to On Sprint Board on the Discovery-ARCHIVED board. Dec 31 2015, 5:20 AM

Notes on a detailed discussion on IRC about TextCat and Lang ID here: https://phabricator.wikimedia.org/T118278#1919183

• Mattflaschen-WMF added a subscriber: Mooeypoo. Jan 7 2016, 7:13 AM

Smalyshev added a subtask: T123537: Generate wikitext-based and query-based language models for TextCat. Jan 13 2016, 7:56 PM

Smalyshev added a subtask: T123558: Security review for TextCat library. Jan 14 2016, 6:50 PM

Smalyshev moved this task from not in use - please delete to Needs review on the Discovery-Search (Current work) board.

Just to double check, the 'needs review' part of this ticket is waiting on security review?

Yes, for this one it's security review. Once it's done I imagine we'll also need a patch to add it to mediawiki vendors repo.

Smalyshev added a parent task: T124844: Add textcat to mediawiki vendor libs. Jan 26 2016, 11:12 PM

Smalyshev moved this task from Needs review to not in use - please delete on the Discovery-Search (Current work) board. Jan 26 2016, 11:15 PM

Isn't this task, as it's written, complete? We converted TextCat to a PHP library and it's now available in the repository wikimedia/textcat. I think we should close this. @Smalyshev @TJones, thoughts?

I'd defer to @Smalyshev, but it looks like the security review (T123558) went fine, so the only open question is whether the patch to add it to mediawiki vendors repo is covered by this task, or a different one.

Well, we wanted to also update the models, but technically it already has *some* models, so I guess we can close it.

Smalyshev closed this task as Resolved. Jan 28 2016, 6:03 PM

In T121538#1975902, @TJones wrote:

I'd defer to @Smalyshev, but it looks like the security review (T123558) went fine, so the only open question is whether the patch to add it to mediawiki vendors repo is covered by this task, or a different one.

That is tracked by T124844, so I think we're good here.

In T121538#1976038, @Smalyshev wrote:

Well, we wanted to also update the models, but technically it already has *some* models, so I guess we can close it.

Do you want to create a separate task for that, or is it such a quick thing that it doesn't need one?

• Deskana moved this task from not in use - please delete to Needs Reporting on the Discovery-Search (Current work) board. Jan 28 2016, 6:09 PM

• Deskana moved this task from Needs Reporting to Resolved on the Discovery-Search (Current work) board.

We have a task for it, T123537

In T121538#1978585, @Smalyshev wrote:

We have a task for it, T123537

Aha, fantastic. Thanks!

FWIW, the security review is typically considered enough to have it added to the production vendors repository. We should, as noted in the review, ensure it pulls from Gerrit and not GitHub.

• Mattflaschen-WMF unsubscribed. Feb 2 2016, 2:46 AM

Smalyshev closed subtask T123537: Generate wikitext-based and query-based language models for TextCat as Resolved. Feb 3 2016, 1:08 AM

Smalyshev closed subtask T123558: Security review for TextCat library as Resolved. Feb 11 2016, 7:34 PM
