Combining ccnorm with regex patterns can become difficult.
Easy case: If both patterns "abc" and "abd" are of interest, a rule like added_lines rlike 'ab[cd]' is the natural choice, and you can combine it with ccnorm to also match variants such as "ãbc":
ccnorm(added_lines) rlike ccnorm('ab[cd]')
Difficult case: If many patterns are of interest, one would naturally want to use the | character in the regex pattern (e.g. 'abc|xyz|opq'), but this does not combine well with ccnorm. For instance, if added_lines is equal to "abc", the following rule will *not* match:
ccnorm(added_lines) rlike ccnorm('abc|xyz|opq')
The reason is that Equivset [[ https://github.com/wikimedia/Equivset/blob/master/data/equivset.in#L66 | normalizes the pipe character to 1]], so the normalized pattern no longer contains any alternation.
This alternative works:
match := ccnorm('abc') + '|' + ccnorm('xyz') + '|' + ccnorm('opq'); ( ccnorm(added_lines) rlike match )
But this is unnecessarily long.
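The failure and the workaround can be reproduced outside AbuseFilter with a small Python sketch. It is only an illustration: `toy_ccnorm`, `rlike` and `TOY_EQUIV` are hypothetical stand-ins, and the table holds just two entries (the `|` to `1` mapping described above, plus an assumed `ã` to `a` mapping), not the real Equivset data.

```python
import re

# Toy stand-in for Equivset::normalize(); NOT the real Equivset table.
# "|" -> "1" is the mapping discussed in this task; "ã" -> "a" is an
# assumed entry used only to mimic confusable-character folding.
TOY_EQUIV = {"|": "1", "\u00e3": "a"}


def toy_ccnorm(text: str) -> str:
    """Replace every character that has an entry in the toy table."""
    return "".join(TOY_EQUIV.get(ch, ch) for ch in text)


def rlike(subject: str, pattern: str) -> bool:
    """Rough equivalent of AbuseFilter's rlike: unanchored regex search."""
    return re.search(pattern, subject) is not None


# Normalizing the whole pattern turns the alternation into literal "1"s.
broken = toy_ccnorm("abc|xyz|opq")

# Normalizing each alternative separately keeps "|" as a regex operator.
workaround = "|".join(toy_ccnorm(p) for p in ["abc", "xyz", "opq"])

print(repr(broken))
print(rlike(toy_ccnorm("abc"), broken))
print(rlike(toy_ccnorm("\u00e3bc"), workaround))
```

In this sketch the pattern normalized as a whole can only match the literal string it has become, while the joined pattern still matches each alternative, including the confusable variant.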
Removing the Equivset rule that converts | to 1 may be an option, but it is not ideal because it may impact non-regex use cases of such normalization. A better alternative might be to modify ccnorm to take two parameters, where the second parameter is an optional Boolean to specify whether the first parameter is a regex pattern or not. If set to true, then ccnorm would make sure that regex-specific characters (such as |) are not passed through Equivset::normalize().
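As a rough sketch of what the two-parameter idea could look like, the Python below models a normalizer with an optional flag that leaves regex syntax untouched. Everything here is hypothetical: `toy_ccnorm`, `is_regex`, `REGEX_META` and the three-entry `TOY_EQUIV` table are illustrative names and data, not the AbuseFilter or Equivset implementation. The `b` to `B` entry in particular is an assumption, added only to show how an escape such as \b could be altered by normalization.

```python
import re

# Toy stand-in for the Equivset table; only "|" -> "1" comes from this task.
# The other two entries are assumptions made for the demonstration.
TOY_EQUIV = {"|": "1", "\u00e3": "a", "b": "B"}

# Characters treated as regex syntax when is_regex is True (assumed set).
REGEX_META = set("|()[]{}.*+?^$")


def toy_ccnorm(text: str, is_regex: bool = False) -> str:
    """Normalize text; in regex mode, skip metacharacters and escapes."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if is_regex and ch == "\\" and i + 1 < len(text):
            # Copy a backslash escape such as \b verbatim.
            out.append(text[i:i + 2])
            i += 2
            continue
        if is_regex and ch in REGEX_META:
            out.append(ch)  # keep regex operators as they are
        else:
            out.append(TOY_EQUIV.get(ch, ch))
        i += 1
    return "".join(out)


pattern = toy_ccnorm(r"\babc|xyz|opq", is_regex=True)
subject = toy_ccnorm("\u00e3bc")
print(pattern, bool(re.search(pattern, subject)))
```

This sketch only handles single-character escapes; longer escapes (for example hexadecimal ones) and any distinction between literal and syntactic characters inside a character class would need a real regex tokenizer.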
So far, this issue has shown up in the following ways:
- passing \b through ccnorm will break the regex pattern
- passing | through ccnorm will break the regex pattern