
Previewed regex differs from actual replacement (as one is PHP's handling, and the other is the database's)
Open, Low, Public

Description

The preview of the replacement to be made by the following
regex does not match what is actually replaced.

   Original text: Document Number=(POL|PRO) [0-9]+\.([0-9]+)
Replacement text: $2

The original texts are of the form:

Document Number=POL 1.23
Document Number=PRO 23.5
Document Number=PRO 2.9

and so on. The idea is to be left with:

Document Number=23
Document Number=5
Document Number=9

i.e. strip what are actually document types and manual section
numbers, and leave only what's after the dot. The regex I gave
above, when previewed with these original values, only highlights:

Document Number=POL 1.2
Document Number=PRO 23.5
Document Number=PRO 2.9

That is, it is not being greedy about the numbers after the dot,
but it is being so with the numbers before it. I tried removing
the nongreedy-ing 'U' in

$targetStr = "/$target/U";

in SpecialReplaceText.php extractContext(), and it correctly highlighted
the whole thing. But I haven't really read through the ramifications
of doing that.
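
The highlighting described above can be reproduced outside the extension. The following is a minimal sketch, assuming only PHP's standard preg_match(); the helper name matchedText() is hypothetical and not part of SpecialReplaceText.php. Under the 'U' modifier every quantifier matches as little as possible: the first [0-9]+ still has to grow until it reaches the literal dot, while the last [0-9]+ has nothing after it and stops at one digit.

```php
<?php
// Pattern and sample values are taken from the report above.
$target = 'Document Number=(POL|PRO) [0-9]+\.([0-9]+)';
$samples = [
    'Document Number=POL 1.23',
    'Document Number=PRO 23.5',
    'Document Number=PRO 2.9',
];

// Hypothetical helper (not part of the extension): returns the text
// matched by $target, with or without the PCRE "U" (ungreedy) modifier.
function matchedText( string $target, string $text, bool $ungreedy ): ?string {
    $pattern = '/' . $target . '/' . ( $ungreedy ? 'U' : '' );
    return preg_match( $pattern, $text, $m ) === 1 ? $m[0] : null;
}

foreach ( $samples as $text ) {
    printf(
        "with U: %-26s without U: %s\n",
        matchedText( $target, $text, true ),
        matchedText( $target, $text, false )
    );
}
```

Only the first sample differs between the two columns, because it is the only one with more than one digit after the dot.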

Anyway, the point is that when I actually *run* the replacement
(with unmodified code) the correct, greedy, replacement *is* made!
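
One possible explanation, offered as a sketch rather than a confirmed diagnosis: if the replacement is also done by preg_replace() with the 'U' modifier, the result for this particular replacement string can look greedy by coincidence. The short match ends after the first digit, that digit is re-inserted by $2, and the unmatched trailing digits are left in place behind it. The second replacement string, '[$2]', is a hypothetical one chosen only to make the difference visible.

```php
<?php
// Pattern and sample value are taken from the report above.
$target = 'Document Number=(POL|PRO) [0-9]+\.([0-9]+)';
$text = 'Document Number=POL 1.23';

// Replacement from the report: both variants produce the same string.
$ungreedy = preg_replace( '/' . $target . '/U', '$2', $text );
$greedy = preg_replace( '/' . $target . '/', '$2', $text );

// Hypothetical replacement that exposes where the match really ended.
$ungreedyMarked = preg_replace( '/' . $target . '/U', '[$2]', $text );
$greedyMarked = preg_replace( '/' . $target . '/', '[$2]', $text );

var_dump( $ungreedy, $greedy );             // both "23"
var_dump( $ungreedyMarked, $greedyMarked ); // "[2]3" and "[23]"
```

Note that the output is the bare digits, since $2 replaces the whole match, including the "Document Number=" prefix.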

I hope this all makes sense. :-) Thanks!


Version: REL1_19-branch
Severity: normal

Details

Reference
bz38944

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 12:50 AM
bzimport set Reference to bz38944.
bzimport added a subscriber: Unknown Object (MLST).

I'm not surprised that there are differences between the two, since one uses PHP's regex handling, and the other uses MySQL's (or whatever database system is being used) - actually, the surprising thing is that the two work as similarly as they do. It would probably take a lot of work to get the two to match each other more closely.

Well, I guess it's much faster this way, and one just needs to know to construct regexes that are compatible with both PHP and the DBMS. But I don't think what I'm seeing here is an incompatibility between the regex engines, because the lines are being found correctly and just highlighted wrongly.

The process seems to be as follows....

To preview, in SpecialReplaceText.php:

  1. Use the DB's regexp to find the pages -- regexCond(): "$column $op " . $dbr->addQuotes( $regex );
  2. Then find the lines in each page that match -- preg_match_all("/$target/", $text, $matches, PREG_OFFSET_CAPTURE);
  3. Then, for each matching line, highlight the result -- $targetStr = "/$target/U"; preg_replace( $targetStr, '<span class="searchmatch">\0</span>', $snippet);
  4. Then create the job, saving the page name, regex, etc.

Then to replace, ReplaceTextJob.php:

  1. For each job, create new text -- preg_replace( '/'.$target_str.'/U', $replacement_str, $article_text, -1, $num_matches );
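
Preview steps 2 and 3 above can be sketched in isolation to show where the mismatch arises. This is a simplified illustration, not the extension's code: the page text and the way the snippet is cut out are assumptions, and only the two preg_* calls follow the lines quoted above.

```php
<?php
// Pattern from the report; the surrounding page text is made up.
$target = 'Document Number=(POL|PRO) [0-9]+\.([0-9]+)';
$text = "Intro line\nDocument Number=POL 1.23\nOutro line";

// Step 2: locate the matches, without the "U" modifier.
preg_match_all( "/$target/", $text, $matches, PREG_OFFSET_CAPTURE );
list( $found, $offset ) = $matches[0][0];

// Assumption: the snippet is exactly the matched text. The real
// extractContext() also keeps some surrounding context.
$snippet = substr( $text, $offset, strlen( $found ) );

// Step 3: highlight inside the snippet, this time with "U".
$highlighted = preg_replace(
    "/$target/U",
    '<span class="searchmatch">\0</span>',
    $snippet
);

echo $found, "\n";       // Document Number=POL 1.23
echo $highlighted, "\n"; // the closing </span> lands before the final "3"
```

Step 2 finds the full "1.23", but step 3 closes the highlight after "1.2", which matches the preview behaviour described in the report.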

So is it just a matter of removing the Ungreedy modifiers? That fixes the highlighting problem that I'm seeing, but I'm sure other people know better than I about what else that would break!
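
As a rough illustration of what dropping the modifier could change for existing users, here is a sketch with a hypothetical wikitext pattern (not taken from the extension). With 'U', a plain .* stops at the first possible end; without it, the same pattern runs to the last one. 'U' also inverts the meaning of an explicit '?', so .*? becomes greedy under it.

```php
<?php
// Hypothetical page text and patterns, for illustration only.
$text = 'See [[Foo]] and [[Bar]].';
$plain = '\[\[.*\]\]';
$lazy = '\[\[.*?\]\]';

preg_match( '/' . $plain . '/U', $text, $plainWithU );     // [[Foo]]
preg_match( '/' . $plain . '/', $text, $plainWithoutU );   // [[Foo]] and [[Bar]]
preg_match( '/' . $lazy . '/U', $text, $lazyWithU );       // [[Foo]] and [[Bar]]
preg_match( '/' . $lazy . '/', $text, $lazyWithoutU );     // [[Foo]]

foreach ( [ $plainWithU, $plainWithoutU, $lazyWithU, $lazyWithoutU ] as $m ) {
    echo $m[0], "\n";
}
```

So anyone who currently relies on .* being short would need to write .*? after such a change, which is the kind of breakage worth checking before removing the modifier.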

Thanks for taking the time to look at this. :-)