Page MenuHomePhabricator

MagicWord.php uses incorrect regex and logic for "$1" variable capturing
Closed, ResolvedPublic

Description

Author: rowan.collins

Description:
MagicWord.php contains code for using "$1" in magic word definitions to capture
a variable - as currently used by "$1px", which sets the rescaled width of an
image. However, on investigation this seems to work only by a couple of lucky
coincidences: the regex will not match correctly if there is more than one
synonym, and the part returned is the whole string not the variable part (e.g.
"100px" not "100"; PHP just happens to treat this as "100" if you ask it to turn
it into an integer). After much gnashing of teeth and trial-and-error, I've come
up with an implementation that works as originally intended.

My patch will follow shortly.


Version: 1.4.x
Severity: normal

Details

Reference
bz688

Revisions and Commits

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 7:00 PM
bzimport set Reference to bz688.

rowan.collins wrote:

patch to fix broken regexes

OK, here's my fix, diffed against the newest HEAD I could access on
cvs.defau.lt, which included what looked like a special-case pseudo-fix for the
same problem. I spent hours figuring out the best regexes for this, so I think
this should be a more complete fix than that.

Note that the patch applies cleanly, but I haven't actually tested it for a
while, because I broke my test-install a few days ago. I see no reason for it
to have become broken in the intervening time, however.

This patch also disables the use of Title:legalChars() as a class for variable
arguments, because I don't see why this should be the case for every
conceivable use of magic word variables. If this check is necessary for a
particular case, it should surely happen elsewhere - either before the
MagicWord is checked, or after the variable has been returned.

attachment MagicWord-fix.diff.2 ignored as obsolete

rowan.collins wrote:

OK, that was an oversight on my part: my version creates a regex of the form
"^foo$|^bor$" for the startToEndRegex; we could easily do the same for
RegexStart. Alternatively, we could probably use "(?: ... )" rather than "( ...
)" to group the alternatives but avoid creating an unnecessary extra captured
match. I'll make an updated patch that does that.

The main reason for creating the patch, however, remains (as far as I can see):
if there are multiple synonyms for a magic word with a $1 capture, only the
first will work correctly. This is because each synonym creates an extra
captured value, hence the use of array_values(array_filter()) to remove empty
matches.

e.g. in the regex "/foo=(.*?)|bar=(.*?)/" there are 2 different captures

  • matching against the string "foo=100" would produce an array with

$m[0]="foo=100" and $m[1]="100" (as expected)

  • but matching against "bar=100" would produce $m[0]="foo=100", $m[1]="",

$m[2]="100", since the first set of brackets, after "foo=" was never matched

rowan.collins wrote:

patch that also deals with regexStart appropriately

With the "(?: ... )" approach, we don't even need to do the str_replace() more
than once, since mVariableStartToEndRegex is just mVariableRegex with anchors
placed around it.

Again, I'm afraid I can't test this at the moment, but I'm pretty sure this
will fix *all* the bugs.

attachment MagicWord-fix.diff.3 ignored as obsolete

wmahan_04 wrote:

(In reply to comment #4)

With the "(?: ... )" approach, we don't even need to do the str_replace() more
than once, since mVariableStartToEndRegex is just mVariableRegex with anchors
placed around it.

$this->mRegex = "/{$this->mBaseRegex}/{$case}";
...
$this->mVariableRegex = str_replace( '\$1', '(.*?)', $this->mRegex );
$this->mVariableStartToEndRegex = "/^(?:{$this->mVariableRegex})$/{$case}";

Mistakes like this love to creep into untested patches. ;)

Also, I don't think any magic word contains a character not allowed in a
title, so using "." rather than Title::legalChars() is asking for trouble
for no good reason.

rowan.collins wrote:

(In reply to comment #5)

$this->mRegex = "/{$this->mBaseRegex}/{$case}";
...
$this->mVariableRegex = str_replace( '\$1', '(.*?)', $this->mRegex );
$this->mVariableStartToEndRegex = "/^(?:{$this->mVariableRegex})$/{$case}";

Mistakes like this love to creep into untested patches. ;)

Ah, foo! I really must get that test install fixed, mustn't I. *sigh* Reading
the code properly would have helped, too. Sorry about that. Of course, arguably,
we should do the substitution right back at mBaseRegex anyway, since we never
actually want to match a literal "$1". I won't post another untested patch,
though, because who knows what I'll break next! :-/

Also, I don't think any magic word contains a character not allowed in a
title, so using "." rather than Title::legalChars() is asking for trouble
for no good reason.

The (.*?) / ([$variableclass]*?) doesn't apply to the magic words themselves, it
applies to what $1 is recognised as. At the moment, there's only one magic word
that *uses* $1, which is "$1px", but surely the point of the system is to be
reusable, not defined for one specific instance. Confining the variable to
Title::legalChars() just seemed to me a bit arbitrary: the one use we have so
far should actually only contain ([0-9x]*?) - more specifically, the variable
should match "/^([0-9]*)(?:x([0-9]*))?$/" - but quite rightly the code for
checking that comes *after* the call to MagicWord.

Of course, currently, we have nothing that breaks because of this either, but
for instance a full implementation for bug 539 would need variables that could
contain anything an internal or external link might contain, including things
like '#', '+', etc (it was that bug that first got me hacking MagicWord in the
first place). As far as I can see, a call to legalChars() gains us nothing,
since it's far simpler to match (.*?) and check for validity on a case-by-case
basis.

rowan.collins wrote:

fully tested patch

OK, here at last is a properly tested patch that fixes the behaviour properly,
using "(?:)" and actually guaranteeing that matched variables are returned. The
most controversial change is eliminating $variable_class, but like I say, I
really don't see why this check belongs here, rather than being made more
thoroughly and specifically later in processing of each MagicWord.

I even ran maintenance/parserTests.php, and only got unrelated FAILs. That's
not to say I haven't made any more stupid mistakes, but I can't think what
they'd be. And if I haven't, I would appreciate it if someone could commit this
for me; thanks.

Attached:

Patch applied to CVS HEAD.

This fixes the cyrillic form of 'px' for eg [[image:foo|150пкс]] in Russian and some other
localizations.

epriestley added a commit: Unknown Object (Diffusion Commit).Mar 4 2015, 8:20 AM