Page MenuHomePhabricator

Proposed query enhancement: regexps and Type:Text search
Closed, DeclinedPublic

Description

Author: solbrig.harold

Description:
Patch for proposed query search enhancement

The attached patch is one that we have been using locally. It implements the following

  1. It makes _text fields queryable (which they aren’t at the moment).
  2. It changes wild card searches to case insensitive. (If this has backwards compatibility issues, we can get more clever, but we need case insensitive regular expressions)
  3. It adds the ability to do regular expressions (mySQL REGEXP) in the wild card.

To do wild card, you first have to enable wild card searches ($smwgQComparators = ‘<|>|!|~’;)
Then, if the match pattern starts with ‘/’ and ends with either ‘/’ or ‘/i’ (e.g. [[text::~/abc*defg*/i]] then the REGEXP comparison is used in the SQL call. The "i" on the end indicates case insensitive search.

The regex thing thing is only so neat at the moment, as many of the regular expression characters get caught elsewhere in the parser, so we will want to do a bit more work – either we need to identify replacement characters for the standard RE characters of we need to borrow from the DPL2 approach and use special characters for "|", "[", and "{".

There is one issue that I need some help with – the only way that I’ve found to do a case insensitive search is to add “COLLATE utf8_general_ci “ immediately following the LIKE or REGEXP phrase. The problem, however, is that this has to be “ COLLATE latin1_general_ci “ if the database was built as latin character set or UTF8 binary(?) and utf8 otherwise. Lines 373 and 563 have these phrases.


Version: unspecified
Severity: enhancement

Attached:

Details

Reference
bz21314

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 10:56 PM
bzimport set Reference to bz21314.

Both regular expressions and pattern matching on text blobs is generally assumed to lead to decreased performance. Did you do any checks regarding this? Does your extension allow for arbitrarily complex regular expressions (besides the syntactic constraints that you mentioned)?

solbrig.harold wrote:

(In reply to comment #1)

Both regular expressions and pattern matching on text blobs is generally

assumed to lead to decreased performance. Did you do any checks regarding this?

No - we needed this capability and, in our circumstances, it performed quite well. Note that this is only an issue, however, *if* you search text blobs - there is (almost) no change to the behavior if you don't. All we've done is given the user an option that they didn't previously have. I state "almost" because the fix, as submitted, does a case insensitive search on both "traditional" wild cards and regular expressions. This decision could be reversed, however, if there is any concern

Does your extension allow for arbitrarily complex regular expressions (besides the syntactic constraints that you mentioned)?

The extension passes everything between the slashes to the MySQL REGEX operator (http://dev.mysql.com/doc/refman/5.1/en/regexp.html), so the capability is whatever is allowed here. This is obviously less than ideal because of the database dependency, but, at the moment it seemed like the only viable option.

Of some concern is the fact that this may well be a security issue, as it is probably possible to inject SQL into the match text. Note, however, that this situation existed prior to our patch - the wild card search had exactly the same issues. If this is of concern, we would be happy to do what is necessary to lock both searches down.

*Bulk BZ Change: +Patch to open bugs with patches attached that are missing the keyword*

sumanah wrote:

Marking "need-review" basically because it sounds like there's an unfinished code review conversation here. If I'm wrong, please feel free to replace with "reviewed."

Late comment: For general deployment, case insensitivity needs its own request indicator. A case sensitive ~*Mark* is a better wildcard search for people named Mark than the case insensitive version.

Aklapper subscribed.

The Semantic MediaWiki developers requested in https://phabricator.wikimedia.org/T64114 to move their task tracking to https://github.com/SemanticMediaWiki/SemanticMediaWiki/issues and to close remaining tasks in Wikimedia Phabricator. If you still face the problem reported in this task in a supported version of SMW, please feel free to transfer your report to https://github.com/SemanticMediaWiki/SemanticMediaWiki/issues . We are sorry for the inconvenience.