Decide which set of separators we have to use for TextCat ngrams
Closed, Resolved (Public)

Description

The Perl version of TextCat uses only a simple set of non-word characters to separate words: '0-9\s'. However, in natural text other characters also separate words and appear at the ends of words, such as parentheses, commas, and periods. We need to test whether it makes sense to extend that regular expression, or to use special regexp syntax such as \b or \W, to improve model quality.
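
To make the question concrete, here is a minimal Python sketch (not the Perl or PHP TextCat code) of how the separator set changes the n-grams a model sees. The function name `ngrams`, the `_` word padding, the 1 to 5 n-gram lengths, and the extended separator set `0-9\s.,()` are assumptions made for illustration only; they are not the set chosen in this task.

```python
import re
from collections import Counter


def ngrams(text, separators=r"0-9\s", max_n=5):
    """Count character n-grams (lengths 1..max_n) for each word in text.

    `separators` is the body of a regex character class; the default
    mirrors the '0-9\\s' set quoted in the task description.
    Padding each word with '_' is an assumption of this sketch.
    """
    splitter = re.compile(f"[{separators}]+")
    counts = Counter()
    for word in splitter.split(text):
        if not word:
            continue  # skip empty strings produced at the text boundaries
        padded = f"_{word}_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return counts


sample = "Hello, world (again). 42 times."

# Baseline: punctuation stays glued to words, e.g. "Hello," and "(again)."
baseline = ngrams(sample)

# Hypothetical extended set: also split on period, comma, and parentheses.
extended = ngrams(sample, separators=r"0-9\s.,()")

print(",_" in baseline, ",_" in extended)
print(baseline["o_"], extended["o_"])
```

With the baseline set, punctuation ends up inside word-boundary n-grams such as `,_`, so the word-final n-gram `o_` from "Hello" never appears. Whether removing a given character helps or hurts language detection is an empirical question, which is what this task sets out to test.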

Related Objects

Status    Subtype  Assigned      Task
Resolved           EBernhardson
Declined           mpopov
Resolved           EBernhardson
Resolved           mpopov
Resolved           EBernhardson
Resolved           debt
Open               None
Resolved           EBernhardson
Resolved           EBernhardson
Resolved           EBernhardson
Resolved           debt
Resolved           TJones
Resolved           TJones
Resolved           TJones
Resolved           TJones
Resolved           TJones
Resolved           debt
Resolved           Anikethfoss
Resolved           TJones
Resolved           debt
Resolved           Smalyshev
Resolved           TJones
Resolved           TJones
Resolved           dpatrick
Resolved           EBernhardson

Event Timeline

Smalyshev assigned this task to TJones.
Smalyshev raised the priority of this task from to Medium.
Smalyshev updated the task description.
Smalyshev subscribed.

@TJones @Smalyshev Does this block anything specific, or is this more along the lines of a general improvement? It would be good to know, so that we can determine whether it can be done after we actually get an A/B test out that uses TextCat.

We can keep using the minimal set of separators, but the result might then be sub-optimal. So I think it is useful to do this before we do T123537, though strictly speaking we could do without it if we had to.

Thanks. Marking this as blocking T123537 for now makes sense. We can revisit and maybe remove that requirement if this ends up taking significantly longer than expected.

I've tested the various punctuation options. Long story short, parens hurt, but periods help. More detail:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_with_TextCat#Additional_Non-Word_Characters

I'm going to commit the change to the PHP version of TextCat.

Great! Can you link to that commit here, so that once it's merged we can close this task?

I got a little over-excited on Friday and got ahead of myself. While the test is done, the language models need to be regenerated with the new separator. I've just done that, but now I'm wrestling with Gerrit. Will link as soon as I can.