Maniphest T190648

Make ccnorm friendlier to regex patterns
Closed, Duplicate
Public

Assigned To

None

Authored By

	Huji
	Mar 25 2018, 2:37 PM

Description

Combining ccnorm with regex patterns can become difficult.

Easy case: If both patterns "abc" and "abd" are of interest, one would think that a rule like added_lines rlike 'ab[cd]' would be ideal, and you can even combine this with ccnorm to also find patterns like "ãbc":

ccnorm(added_lines) rlike ccnorm('ab[cd]')

Difficult case: If many patterns are of interest, one would naturally want to use the | character in the regex pattern (e.g. 'abc|xyz|opq'), but this does not combine well with ccnorm. For instance, if added_lines is equal to "abc", the following rule will *not* match:

ccnorm(added_lines) rlike ccnorm('abc|xyz|opq')

The reason is that Equivset [[ https://github.com/wikimedia/Equivset/blob/master/data/equivset.in#L66 | normalizes the pipe character to 1]], so the alternation operator is lost before the pattern ever reaches the regex engine.
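
To make the failure concrete, here is a rough Python sketch. It is not AbuseFilter code and does not use the real Equivset data: TOY_EQUIVSET and toy_ccnorm are hypothetical stand-ins with only two entries (the | to 1 rule linked above, plus an assumed ã to A mapping), and the uppercasing step is likewise an assumption made for illustration.

```python
import re

# Toy stand-in for the Equivset table (NOT the real data).
# "|" -> "1" mirrors the rule linked above; "\u00e3" ("ã") -> "A" is assumed.
TOY_EQUIVSET = {"|": "1", "\u00e3": "A"}


def toy_ccnorm(text: str) -> str:
    """Hypothetical approximation of ccnorm: table lookup, then uppercase."""
    return "".join(TOY_EQUIVSET.get(ch, ch).upper() for ch in text)


# Easy case: the character class survives normalization, so this still matches.
easy = re.search(toy_ccnorm("ab[cd]"), toy_ccnorm("\u00e3bc"))

# Difficult case: every "|" in the pattern becomes "1", so the pattern is now
# one long literal instead of three alternatives, and "abc" no longer matches.
broken_pattern = toy_ccnorm("abc|xyz|opq")
hard = re.search(broken_pattern, toy_ccnorm("abc"))

print(broken_pattern)    # the pipes are gone
print(easy is not None)  # easy case matches
print(hard is not None)  # difficult case does not
```

In this sketch the normalized subject is still "ABC", but the normalized pattern no longer contains any alternation, which is the behaviour described above.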

This alternative works:

match := ccnorm('abc') + '|' + ccnorm('xyz') + '|' + ccnorm('opq');
(
ccnorm(added_lines) rlike match
)

But this is unnecessarily long.

Removing the Equivset rule that converts | to 1 may be an option, but it is not ideal because it may impact non-regex use cases of such normalization. A better alternative might be to modify ccnorm to take two parameters, where the second parameter is an optional Boolean to specify whether the first parameter is a regex pattern or not. If set to true, then ccnorm would make sure that regex-specific characters (such as |) are not passed through Equivset::normalize().
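
A minimal Python sketch of what the two-parameter idea might look like, under stated assumptions. Nothing here reflects how AbuseFilter or Equivset is actually implemented: TOY_EQUIVSET, REGEX_META, toy_ccnorm and the is_regex parameter are all hypothetical names, the table has only two illustrative entries, and the set of characters treated as regex-specific is a guess at what such a flag would need to protect.

```python
import re

# Toy stand-in for the Equivset table (NOT the real data).
TOY_EQUIVSET = {"|": "1", "\u00e3": "A"}  # "\u00e3" is "ã"

# Assumed set of regex-specific characters to leave untouched.
REGEX_META = set("|()[]{}.*+?^$\\")


def toy_ccnorm(text: str, is_regex: bool = False) -> str:
    """Hypothetical ccnorm with an optional flag that protects regex syntax."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if is_regex and ch == "\\" and i + 1 < len(text):
            # Copy escape sequences such as \d or \| verbatim.
            out.append(text[i:i + 2])
            i += 2
            continue
        if is_regex and ch in REGEX_META:
            # Regex metacharacters bypass normalization entirely.
            out.append(ch)
        else:
            out.append(TOY_EQUIVSET.get(ch, ch).upper())
        i += 1
    return "".join(out)


pattern = toy_ccnorm("abc|xyz|opq", is_regex=True)
print(pattern)
print(re.search(pattern, toy_ccnorm("\u00e3bc")) is not None)
```

With the flag unset, the function behaves as before, so non-regex callers would be unaffected. A real implementation would need more care than this sketch shows, for example around digits inside quantifiers such as {2,3} if the normalization table maps digits to letters.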

This issue has shown up in the following ways so far: