Fix [a-ž] not to include pipe ("|") and uppercase letters
Closed, Declined · Public

Description

On cswiki, if I want to search for Czech words, I have to write something like this (assume I am searching for a heading):

insource:/== *[a-ž0-9 ]+ *==/

The expected result of [a-ž] would be the whole lowercase alphabet with diacritics included (and the uppercase one for [A-Ž]):
aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž

But it includes uppercase characters and pipe (|) too:
aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ|

[A-Ž] behaves even worse and matches:
aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzžAÁBCČDĎEÉĚFGHIÍJKLMNŇOÓPQRŘSŠTŤUÚŮVWXYÝZŽ|[]{}

Event Timeline

Dvorapa created this task. Jan 27 2017, 8:16 AM
Dvorapa updated the task description.
TTO added a subscriber: TTO. Edited Jan 27 2017, 8:40 AM

The way the regular expression language works is that [a-ž] matches EVERY Unicode character between a and ž, which includes a significant amount of punctuation, as well as many interesting non-Czech letters like ŋ and Ħ. This is the way the regular expression language is defined; the regular expression language couldn't be localised, as this would create problematic inconsistency between wikis.
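To make the point about ranges concrete, here is a minimal Python sketch that lists the code points between a (U+0061) and ž (U+017E). It only illustrates how a range is defined by code point order; it does not use the CirrusSearch or Lucene regex engine, and the variable names are made up for this example.

```python
import unicodedata

# A regex range [x-y] covers every code point from x to y inclusive.
lo, hi = ord("a"), ord("\u017e")  # "a" is U+0061, "ž" is U+017E
in_range = [chr(cp) for cp in range(lo, hi + 1)]

# The lowercase Czech alphabet from the task description (normalised to NFC
# so that each letter is a single code point).
czech_lower = unicodedata.normalize(
    "NFC", "aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž"
)

# Everything the range matches that is not a lowercase Czech letter.
extras = [ch for ch in in_range if ch not in czech_lower]

print(len(in_range), "code points in the range")
print("pipe included:", "|" in in_range)
print("uppercase Á included:", "Á" in in_range)
print("ŋ and Ħ included:", "ŋ" in in_range, "Ħ" in in_range)
print("plain uppercase A included:", "A" in in_range)
```

Plain ASCII uppercase letters sort before a, so they fall outside [a-ž], while accented uppercase letters such as Á sit between z and ž and are therefore inside it.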

Unfortunately there doesn't seem to be a good way to match all lowercase letters, since the regular expression engine CirrusSearch uses doesn't seem to support the \p{Ll} syntax to match all characters which are specified as lowercase letters in the Unicode Standard.
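For readers unfamiliar with \p{Ll}: it selects characters whose Unicode general category is "Ll" (lowercase letter). The sketch below imitates that filter with Python's unicodedata module, purely as an illustration of what such a class would select; it says nothing about what the search engine itself supports, and the helper name is hypothetical.

```python
import unicodedata

def is_lowercase_letter(ch: str) -> bool:
    """Return True if ch has Unicode general category Ll (lowercase letter)."""
    return unicodedata.category(ch) == "Ll"

# Same range as [a-ž]: U+0061 through U+017E.
in_range = [chr(cp) for cp in range(ord("a"), ord("\u017e") + 1)]

# Keep only the characters that a \p{Ll}-style class would accept.
lower_only = [ch for ch in in_range if is_lowercase_letter(ch)]

print("".join(lower_only))
```

Note that this still keeps non-Czech lowercase letters such as ŋ, so even a lowercase class would be broader than the Czech alphabet.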

I have to agree with @TTO here: making [a-ž] match only the lowercase letters of the Czech alphabet goes against how regex character ranges work.
I'd suggest changing the task title and description to request more convenient classes like the one mentioned in the previous comment. Unfortunately, the regex engine used by Cirrus is based on Lucene, so we would have to convince them to include such classes.

I don't see a perfect and convenient syntax that would match your needs. The only option I see for the moment is to explicitly list all the chars you want: [a-záč...] (use the ASCII char range a-z plus your additional Czech letters).
Sometimes it's more convenient to work the other way around: instead of including chars, you exclude the ones you don't want with [^...], which may give a better approximation: insource:/== [^{}#|]+ ==/. I'm not sure it's really useful in your case.
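The two approaches can be compared side by side in a small Python sketch. This uses Python's re module, whose syntax differs in places from the Lucene-based engine, so treat it only as an illustration of the include versus exclude idea; the pattern names and the sample headings are assumptions made for this example.

```python
import re

# Approach 1: list the wanted characters explicitly
# (ASCII a-z plus the Czech lowercase letters with diacritics).
CZECH_LOWER = "a-záčďéěíňóřšťúůýž"
include = re.compile(rf"== *[{CZECH_LOWER}0-9 ]+ *==")

# Approach 2: exclude the characters that should not appear in a heading.
exclude = re.compile(r"== [^{}#|]+ ==")

samples = ["== úvod ==", "== Úvod ==", "== a|b ==", "== {{šablona}} =="]
for text in samples:
    print(
        repr(text),
        "include:", bool(include.fullmatch(text)),
        "exclude:", bool(exclude.fullmatch(text)),
    )
```

The explicit list rejects anything outside the chosen alphabet, including uppercase letters, while the negated class accepts any heading text that avoids the listed markup characters.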

Or maybe you could be more explicit about why you would like to have such regex classes. Maybe we could work on more dedicated Cirrus keywords such as insection:word. Unfortunately, I have to confess that getting such regex classes included in the Lucene regex engine is unlikely to happen.

The problem with insource:/== *[a-z0-9ěščřžýáíéúůóďťň ]+ *==/ is that it produces a timeout warning.

@dcausse I don't use this only for headings, so I don't think insection would be useful enough to justify maintaining its code. Furthermore, intitle and hastemplate are currently not working properly on cswiki, so I don't think new keywords would fare any better.

It produces a timeout because the number of pages that potentially match is huge, but we're still returning partial results. Having a dedicated character class would not make the regex faster, it would only make it more convenient. The class would still have to be expanded in memory resulting in exactly the same regex.
Concerning intitle and hastemplate, I'm sorry about that, but I was not aware of specific problems on cswiki. Could you point me to an existing ticket that describes the issue (or create one)? Thanks.

Dvorapa added a comment. Edited Jan 27 2017, 9:58 AM

@dcausse Okay, thank you for your help. I understand this is more an issue of basic regex principles, which we cannot change, so you can close this task as Declined. I hope there will be some localized search options in the future (when servers are super fast, and maybe one day there will be something better than regex).

You can see T155292 and T156460

dcausse closed this task as Declined. Jan 27 2017, 10:02 AM

Thanks for the pointers, I'll take a look.