Convert TextCat to PHP Library for Language Identification in Cirrus Search
Closed, Resolved (Public)

Description

TextCat currently seems more promising than the baseline ES Plugin we've been using. It's in very old Perl and should be converted to PHP.

Stas has already started converting TextCat to PHP for use in Cirrus Search (available on GitHub: https://github.com/smalyshev/textcat), and he and Erik have been brainstorming ways of making it more efficient, too. It needs some testing (e.g., Unicode compatibility) and comparison to the Perl version (i.e., same results when building models and when running on test queries).

Rough estimate: < 1 week

Event Timeline

TJones assigned this task to Smalyshev.
TJones raised the priority of this task to High.
TJones updated the task description.
TJones added a project: CirrusSearch.
TJones added subscribers: TJones, EBernhardson.
Restricted Application added a subscriber: Aklapper.

@Mattflaschen if the library finds wider use it'd be even better. I'm planning to add it to wikimedia gerrit and to mediawiki/vendors.

@Mattflaschen: I'm not sure this would be a great approach for RTL detection. In particular, we're training models based on queries, and seeing some improvement over general language models. That suggests they could perform worse on general language. Also, LTR/RTL detection generally happens at the script level, not the language level (you don't need to distinguish Arabic from Farsi to know it should be RTL).

Seems like you could count individual characters (most are unambiguously RTL or LTR) and choose RTL/LTR based on the preponderance of the text—so that, for example, one Hebrew word in a long string of French is LTR, but one word of Spanish in a long string of Arabic is RTL. And you don't need to detect Hebrew, French, Spanish, or Arabic as languages to figure that out.

But that's just off the top of my head, and your mileage may vary.
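
A minimal sketch of the character-counting idea above, in Python for brevity (it is not part of TextCat or CirrusSearch). It assumes the strong Unicode bidi classes reported by the standard-library `unicodedata.bidirectional()` are a good enough signal; the function name `guess_direction` and the `default` tie-breaker are hypothetical choices made for illustration.

```python
import unicodedata

# Bidi classes (as returned by unicodedata.bidirectional) treated as "strong".
RTL_CLASSES = {"R", "AL"}  # right-to-left letters, e.g. Hebrew ("R") and Arabic ("AL")
LTR_CLASSES = {"L"}        # left-to-right letters, e.g. Latin, Greek, Cyrillic


def guess_direction(text, default="ltr"):
    """Return 'rtl' or 'ltr' by majority vote over strongly directional characters."""
    rtl = ltr = 0
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in RTL_CLASSES:
            rtl += 1
        elif bidi in LTR_CLASSES:
            ltr += 1
        # Digits, whitespace, punctuation, and combining marks are weak or
        # neutral, so they do not vote.
    if rtl == ltr:
        # Assumption: ties (including text with no strong characters at all)
        # fall back to the caller-supplied default.
        return default
    return "rtl" if rtl > ltr else "ltr"
```

With this sketch, a single Hebrew word inside a French sentence is outvoted and the string comes out as LTR, while a single Spanish word inside an Arabic sentence comes out as RTL, and no language identification is involved. A simple majority is only one possible rule; weighting by words, or deferring to the first strong character as the Unicode Bidirectional Algorithm does for paragraph direction, would be reasonable alternatives.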

Just to double check, the 'needs review' part of this ticket is waiting on security review?

Yes, for this one it's security review. Once it's done I imagine we'll also need a patch to add it to mediawiki vendors repo.

Isn't this task, as it's written, complete? We converted TextCat to a PHP library and it's now available in the repository wikimedia/textcat. I think we should close this. @Smalyshev @TJones, thoughts?

I'd defer to @Smalyshev, but it looks like the security review (T123558) went fine, so the only open question is whether the patch to add it to mediawiki vendors repo is covered by this task, or a different one.

Well, we wanted to also update the models, but technically it already has *some* models, so I guess we can close it.

> I'd defer to @Smalyshev, but it looks like the security review (T123558) went fine, so the only open question is whether the patch to add it to mediawiki vendors repo is covered by this task, or a different one.

That is tracked by T124844, so I think we're good here.

> Well, we wanted to also update the models, but technically it already has *some* models, so I guess we can close it.

Do you want to create a separate task for that, or is it such a quick thing that it doesn't need one?

We have a task for it, T123537

Aha, fantastic. Thanks!

FWIW, a security review is typically considered enough to have a library added to the production vendors repository. As noted in the review, we should ensure it pulls from Gerrit and not GitHub.