Fix [a-ž] not to include pipe ("|") and uppercase letters
Closed, Declined; Public

Assigned To

None

Authored By

	Dvorapa
	Jan 27 2017, 8:16 AM

Description

On cswiki, if I want to search for any Czech words, I have to use a pattern like this (assume I want to search for a heading):

insource:/== *[a-ž0-9 ]+ *==/

Expected output of [a-ž] would be the whole lowercase alphabet with diacritics included (for [A-Ž] uppercase):
aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž

But it includes uppercase characters and pipe (|) too:
aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ|

[A-Ž] behaves even worse; it matches:
aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzžAÁBCČDĎEÉĚFGHIÍJKLMNŇOÓPQRŘSŠTŤUÚŮVWXYÝZŽ|[]{}

Related Objects

Mentioned Here: T155292: CirrusSearch: hastemplate does not work properly
T156460: CirrusSearch: intitle does not work properly

Event Timeline

Dvorapa created this task. Jan 27 2017, 8:16 AM

Dvorapa updated the task description.

The way the regular expression language works is that [a-ž] matches EVERY Unicode character between a and ž, which includes a significant amount of punctuation, as well as many interesting non-Czech letters like ŋ and Ħ. This is the way the regular expression language is defined; the regular expression language couldn't be localised, as this would create problematic inconsistency between wikis.
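To make the code-point behaviour concrete, here is a minimal sketch using Python's re module. This is an assumption for illustration only: CirrusSearch uses a Lucene-based engine, not Python, but a character range is resolved by code-point order in the same way. The helper name in_range is hypothetical.

```python
import re

# Range endpoints by code point: 'a' is U+0061, 'ž' is U+017E.
LOW, HIGH = ord("a"), ord("ž")


def in_range(ch: str) -> bool:
    """Return True if ch lies inside the code-point range [a-ž]."""
    return LOW <= ord(ch) <= HIGH


# The same range written as a regex character class.
pattern = re.compile(r"[a-ž]")

# A few characters from the discussion, plus two that fall outside the range.
for ch in ["č", "|", "{", "Á", "Ž", "ŋ", "Ħ", "A", "["]:
    matched = pattern.fullmatch(ch) is not None
    print(f"U+{ord(ch):04X} {ch!r}: in range={in_range(ch)}, regex match={matched}")
```

The pipe and the accented uppercase letters sit between the two endpoints, so the range picks them up; plain ASCII uppercase letters and the square brackets sit below 'a' and are only included once the range starts at 'A'.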

Unfortunately there doesn't seem to be a good way to match all lowercase letters, since the regular expression engine CirrusSearch uses doesn't seem to support the \p{Ll} syntax to match all characters which are specified as lowercase letters in the Unicode Standard.
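For reference, the set that \p{Ll} would select inside this range can be listed with Python's unicodedata module. This is only a sketch of what the class means, not something CirrusSearch can run; the function name lowercase_letters_in_range is hypothetical.

```python
import unicodedata


def lowercase_letters_in_range(first: str, last: str) -> str:
    """Collect every character between first and last (inclusive, by code
    point) whose Unicode general category is Ll (lowercase letter)."""
    return "".join(
        chr(cp)
        for cp in range(ord(first), ord(last) + 1)
        if unicodedata.category(chr(cp)) == "Ll"
    )


letters = lowercase_letters_in_range("a", "ž")
print(letters)
```

Note that the result drops punctuation and uppercase letters but still contains lowercase letters that are not Czech, such as ŋ, so even \p{Ll} would not be an exact Czech alphabet class.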

I have to agree with @TTO here: making [a-ž] match only the lowercase letters of the Czech alphabet goes against the principle of a regex character range.
I'd suggest changing the task title and description to request the inclusion of more convenient classes, as suggested in the previous comment. Unfortunately the regex engine used by Cirrus is based on Lucene, so we would have to convince them to include such classes.

I don't see a perfect and convenient syntax that would match your needs. The only option I see for the moment is to explicitly list all the chars you want: [a-záč...] (use the ASCII char range a-z plus your additional Czech letters).
Sometimes it's more convenient to work the other way around: instead of including chars, you exclude the ones you don't want by using [^...], which may give a better approximation: insource:/== [^{}#|]+ ==/. I'm not sure it's really useful in your case.
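The two approaches can be compared side by side in a small sketch. It uses Python's re module as a stand-in (an assumption; the Lucene syntax used by insource: differs in details), and the sample headings and the names CZECH_LOWER, heading_explicit and heading_negated are made up for illustration.

```python
import re

# Explicit class: ASCII a-z plus the Czech lowercase letters with diacritics.
CZECH_LOWER = "a-záčďéěíňóřšťúůýž"

# Heading made only of Czech lowercase letters, digits and spaces.
heading_explicit = re.compile(rf"== *[{CZECH_LOWER}0-9 ]+ *==")

# Negated class: anything except braces, hash and pipe.
heading_negated = re.compile(r"== [^{}#|]+ ==")

samples = ["== úvodní část ==", "== Úvod ==", "== a|b ==", "== {{šablona}} =="]

for text in samples:
    explicit = heading_explicit.search(text) is not None
    negated = heading_negated.search(text) is not None
    print(f"{text!r}: explicit={explicit}, negated={negated}")
```

The explicit class is strict about case and diacritics, while the negated class accepts any heading text that avoids the excluded characters, including uppercase letters.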

Or maybe you could be more explicit about why you would like to have such regex classes. Maybe we could work on more dedicated Cirrus keywords such as insection:word. Unfortunately I have to confess that including such regex classes in the Lucene regex is unlikely to happen.

The problem with insource:/== *[a-z0-9ěščřžýáíéúůóďťň ]+ *==/ is that it produces a timeout warning.

@dcausse I don't use this only for headings, and I don't think insection would be useful enough to justify maintaining its code. Furthermore, intitle and hastemplate are currently not working properly on cswiki, so I don't think new keywords would fare any better.

It produces a timeout because the number of pages that potentially match is huge, but we still return partial results. A dedicated character class would not make the regex faster, only more convenient: the class would still have to be expanded in memory, resulting in exactly the same regex.
Concerning intitle and hastemplate, I'm sorry, but I was not aware of specific problems on cswiki. Could you point me to an existing ticket that describes the issue (or create one)? Thanks.

@dcausse Okay, thank you for your help. I understand this is more an issue of basic regex principles, which we cannot change, so you can close this task as Declined. I hope there will be some localized search options in the future (when servers are super-fast, and maybe one day there will be something better than regex).

See T155292 and T156460.

Thanks for the pointers, I'll take a look.

matej_suchanek unsubscribed. Jan 27 2017, 11:35 AM
