
Make ccnorm friendlier to regex patterns
Closed, Duplicate (Public)

Description

Combining ccnorm with regex patterns can become difficult.

Easy case: If both patterns "abc" and "abd" are of interest, a rule like added_lines rlike 'ab[cd]' is the natural choice, and you can even combine it with ccnorm to also find variants like "ãbc":

ccnorm(added_lines) rlike ccnorm('ab[cd]')

Difficult case: If many patterns are of interest, one would naturally want to use the | character in the regex pattern (e.g. 'abc|xyz|opq'), but this does not combine well with ccnorm. For instance, if added_lines is equal to "abc", the following rule will *not* match:

ccnorm(added_lines) rlike ccnorm('abc|xyz|opq')

The reason is that Equivset [[ https://github.com/wikimedia/Equivset/blob/master/data/equivset.in#L66 | normalizes the pipe character to 1]], so the | in the normalized pattern no longer acts as the alternation operator.
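To make the failure concrete, here is a minimal Python sketch. `toy_ccnorm` and `TOY_EQUIVSET` are hypothetical stand-ins, not the real AbuseFilter or Equivset code; the table contains only the pipe mapping mentioned in this task plus an illustrative "ã" to "a" entry.

```python
import re

# Toy stand-in for the Equivset table (assumption: only two entries,
# chosen to mirror the examples in this task).
TOY_EQUIVSET = {"ã": "a", "|": "1"}


def toy_ccnorm(text):
    """Replace each character by its toy equivalent, if it has one."""
    return "".join(TOY_EQUIVSET.get(ch, ch) for ch in text)


# Normalizing the whole pattern also rewrites the alternation operator.
pattern = toy_ccnorm("abc|xyz|opq")
subject = toy_ccnorm("abc")

# The pattern is now one long literal, so "abc" alone cannot match it.
matched = re.search(pattern, subject) is not None
```

Under these assumptions the pattern contains no alternation at all, which is why a subject of "abc" is not matched.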

This alternative works:

match := ccnorm('abc') + '|' + ccnorm('xyz') + '|' + ccnorm('opq');
(
ccnorm(added_lines) rlike match
)

But this is unnecessarily long.
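The workaround above can be sketched in Python as follows. This is an illustration under stated assumptions: `toy_ccnorm`, `TOY_EQUIVSET`, and `normalized_alternation` are hypothetical names, and the two-entry table is a toy, not the real Equivset data.

```python
import re

# Toy stand-in for the Equivset table (assumption, not the real data).
TOY_EQUIVSET = {"ã": "a", "|": "1"}


def toy_ccnorm(text):
    """Replace each character by its toy equivalent, if it has one."""
    return "".join(TOY_EQUIVSET.get(ch, ch) for ch in text)


def normalized_alternation(alternatives):
    """Normalize each alternative on its own, then join with a literal '|'.

    The pipe is added after normalization, so it is never rewritten.
    """
    return "|".join(toy_ccnorm(alt) for alt in alternatives)


match = normalized_alternation(["abc", "xyz", "opq"])

hit = re.search(match, toy_ccnorm("ãbc")) is not None
miss = re.search(match, toy_ccnorm("def")) is not None
```

The helper does the same thing as the concatenation in the rule: each alternative is normalized separately and the pipes are inserted afterwards.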

Removing the Equivset rule that converts | to 1 may be an option, but it is not ideal because it may impact non-regex use cases of such normalization. A better alternative might be to modify ccnorm to take two parameters, where the second parameter is an optional Boolean to specify whether the first parameter is a regex pattern or not. If set to true, then ccnorm would make sure that regex-specific characters (such as |) are not passed through Equivset::normalize().
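A rough Python sketch of what such a regex-aware mode might look like is given below. Everything here is an assumption made for illustration: `toy_ccnorm`, its `is_regex` parameter, `TOY_EQUIVSET`, and `REGEX_META` are hypothetical, and the toy normalizer upper-cases its input only so that the effect on an escape such as \b is visible. It is not the AbuseFilter implementation.

```python
import re

# Toy stand-in for the Equivset table (assumption, not the real data).
TOY_EQUIVSET = {"ã": "a", "|": "1"}

# Characters treated as regex syntax in this sketch (assumed, not exhaustive).
REGEX_META = set("|()[]{}.*+?^$")


def toy_normalize_char(ch):
    """Toy normalization: table lookup, then upper-casing (an assumption)."""
    return TOY_EQUIVSET.get(ch, ch).upper()


def toy_ccnorm(text, is_regex=False):
    """Normalize text; if is_regex is true, leave regex syntax untouched."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if is_regex and ch == "\\" and i + 1 < len(text):
            # Keep a backslash escape such as \b or \| exactly as written.
            out.append(text[i:i + 2])
            i += 2
        elif is_regex and ch in REGEX_META:
            # Keep metacharacters such as | so they retain their meaning.
            out.append(ch)
            i += 1
        else:
            out.append(toy_normalize_char(ch))
            i += 1
    return "".join(out)


plain = toy_ccnorm(r"\babc|xyz")
aware = toy_ccnorm(r"\b(abc|xyz)\b", is_regex=True)
found = re.search(aware, toy_ccnorm("say ãbc now")) is not None
```

This sketch only preserves two-character escapes and single metacharacters. Longer escapes and the contents of quantifiers such as {1,3} would still pass through the normalizer, so a real implementation would need a more careful tokenizer.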

So far, this issue has shown up in the following ways:

  • passing \b through ccnorm will break the regex pattern
  • passing | through ccnorm will break the regex pattern

Event Timeline

This is not exactly a duplicate; the part about | can be addressed using the proposed "matches_any", but the part about \b is a separate issue.

@Huji Right, not exactly a dupe, but probably better to address everything at the same time. Updating the other task's description.