The Perl version of TextCat uses only a simple set of non-word characters to separate words: '0-9\s'. In natural text, however, other characters can also separate words or appear at the end of a word, such as parentheses, commas, and periods. We need to test whether it makes sense to amend that regular expression, or to use special regexp syntax such as \b or \W, to improve model quality.
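To make the difference concrete, here is a minimal Python sketch (not the Perl or PHP TextCat code) that splits the same sample on three candidate separator patterns. Only the "minimal" pattern mirrors the '0-9\s' set quoted above; the "with_punct" and "non_word" patterns, the `tokenize` helper, and the sample sentence are assumptions made up for illustration.

```python
import re

# Candidate separator patterns. Only "minimal" mirrors the '0-9\s' set from
# the task description; the other two are hypothetical alternatives.
SEPARATORS = {
    "minimal": r"[0-9\s]+",
    "with_punct": r"[0-9\s.,()]+",
    "non_word": r"\W+",
}


def tokenize(text, pattern):
    """Split text on the separator pattern and drop empty tokens."""
    return [tok for tok in re.split(pattern, text) if tok]


sample = "Hello (world), this is 1 test."
tokens = {name: tokenize(sample, pat) for name, pat in SEPARATORS.items()}

for name, toks in tokens.items():
    print(f"{name:>10}: {toks}")
```

With the minimal set, punctuation stays glued to the words ("(world)," and "test."), so it ends up in the n-grams. Note also that \W alone does not split on digits, because digits count as word characters; it would have to be combined with 0-9 to keep the current behaviour for numbers.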
Description
Event Timeline
@TJones @Smalyshev Does this block anything specific, or is this more along the lines of a general improvement? It would be good to know, so that we can determine whether it can be done after we actually get an A/B test out that uses TextCat.
We can keep using the minimal set of separators, but the result might then be sub-optimal. So I think it is useful to do this before we do T123537, though strictly speaking we could do without it if we had to.
Thanks. Marking this as blocking T123537 for now makes sense. We can revisit and maybe remove that requirement if this ends up taking significantly longer than expected.
I've tested the various punctuation options. Long story short, parens hurt, but periods help. More detail:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_with_TextCat#Additional_Non-Word_Characters
I'm going to commit the change to the PHP version of TextCat.
Great! Can you link to that commit here, so that once it's merged we can close this task?
I got a little over-excited on Friday and got ahead of myself. While the test is done, the language models need to be regenerated with the new separator. I've just done that, but now I'm wrestling with Gerrit. Will link as soon as I can.
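For anyone wondering why the models have to be regenerated: the models are built from character n-grams of the words produced by the split, so changing the separator changes which n-grams exist at all. A rough Python sketch of the idea, assuming underscore-padded words and n-grams of length 1 to 5 in the usual TextCat style; `ngram_counts` and its parameters are hypothetical names, not the actual TextCat code.

```python
import re
from collections import Counter


def ngram_counts(text, separator, max_n=5):
    """Count character n-grams (length 1..max_n) of underscore-padded words.

    Hypothetical helper for illustration; the padding and n-gram lengths
    are assumptions, not taken from the TextCat source.
    """
    counts = Counter()
    for word in re.split(separator, text):
        if not word:
            continue  # re.split can yield empty strings at the edges
        padded = f"_{word}_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return counts


text = "end. end."
old = ngram_counts(text, r"[0-9\s]+")    # period stays attached to the word
new = ngram_counts(text, r"[0-9\s.]+")   # period treated as a separator

print(old["._"], new["._"])  # word-final n-gram containing the period
print(old["d_"], new["d_"])  # word-final n-gram without it
```

A model trained with the old separator ranks n-grams like "._" that never occur once the period is a separator, so old models and new tokenization would not line up.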