Bogus entries in externallinks table due to unescaping of &%=+
Closed, Resolved (Public)

Description

Consider this URL:

http://example.com/index.php?foo=bar%26baz%3Dquux%2Bquux

It has one parameter, foo, with the value "bar&baz=quux+quux". Place this in an article and the externallinks table will contain this URL instead:

http://example.com/index.php?foo=bar&baz=quux+quux

This has *two* parameters, foo with the value "bar" and baz with the value "quux quux".
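For illustration, the difference can be checked with any standards-following query-string parser. The sketch below uses Python's standard library rather than anything in MediaWiki; the helper name query_params is made up for this example.

```python
from urllib.parse import urlsplit, parse_qs

# URL as written in the article, and the URL that ends up in externallinks.
original = "http://example.com/index.php?foo=bar%26baz%3Dquux%2Bquux"
stored = "http://example.com/index.php?foo=bar&baz=quux+quux"


def query_params(url):
    """Return the decoded query parameters of a URL as a dict of lists."""
    return parse_qs(urlsplit(url).query)


print(query_params(original))  # {'foo': ['bar&baz=quux+quux']}
print(query_params(stored))    # {'foo': ['bar'], 'baz': ['quux quux']}
```

The two URLs decode to different parameter sets, so they cannot be treated as equivalent forms of the same link.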

Then try this URL:

http://example.com/index.php?foo=%25xx

The value of foo is "%xx". But put it into an article, and externallinks will contain this URL instead:

http://example.com/index.php?foo=%xx

That's not even a valid URL, since "%xx" is not a valid percent-escape.

The problem lies in Parser::replaceUnusualEscapesCallback, which unescapes %25, %26, %2B, and %3D even though the resulting characters (%, &, +, and =) all have special meaning in a URL when unescaped. I see a similar-sounding problem was reported in bug 4781, which was closed as "fixed" with no reference to the revision in which it was fixed. Bug 40267 also touched on this issue, but the real problems described here appear to have been overlooked, since the reporter there focused on the unescaping of various safe characters rather than only these unsafe ones.
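As a rough sketch of one safer behavior (not the actual MediaWiki code, and not necessarily what any eventual patch does), a normalizer could decode a percent-escape only when the result is an RFC 3986 "unreserved" character and leave every other escape in place. The function name unescape_unreserved_only and the choice to uppercase the hex digits of retained escapes are assumptions made for this example.

```python
import re

# RFC 3986 "unreserved" characters: always safe to show unescaped.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def unescape_unreserved_only(url):
    """Decode %XX only when it encodes an unreserved character.

    Escapes for %, &, +, = and everything else outside the unreserved
    set are kept, with their hex digits uppercased for consistency.
    """
    def replace(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        return "%" + match.group(1).upper()

    # re.sub scans left to right without rescanning its own output,
    # so "%2541" stays "%2541" instead of being decoded twice.
    return _ESCAPE.sub(replace, url)


print(unescape_unreserved_only("http://example.com/index.php?foo=%25xx"))
print(unescape_unreserved_only("http://example.com/%7Euser"))
```

Under this rule both example URLs from this report would be stored unchanged, while a harmless escape such as %7E would still be normalized to "~".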


Version: 1.23.0
Severity: normal

bzimport added a project: MediaWiki-Parser. (Via Conduit, Nov 22 2014, 2:41 AM)
bzimport added a subscriber: Unknown Object (MLST).
bzimport set Reference to bz57909.
Anomie created this task. (Via Legacy, Dec 3 2013, 3:06 AM)
Anomie added a comment. (Via Conduit, Dec 3 2013, 3:10 AM)

So the question I have is: Can we just change replaceUnusualEscapesCallback (leaving externallinks inconsistent until all these pages happen to be reparsed), or should we try to figure out which pages are affected and run a maintenance script of some sort over them, or is externallinks supposed to contain such broken entries?

MZMcBride added a comment. (Via Conduit, Dec 3 2013, 3:33 AM)

(In reply to comment #1)

> So the question I have is: Can we just change replaceUnusualEscapesCallback
> (leaving externallinks inconsistent until all these pages happen to be
> reparsed), or should we try to figure out which pages are affected and run a
> maintenance script of some sort over them, or is externallinks supposed to
> contain such broken entries?

You could null edit all the pages. :-)

gerritbot added a comment. (Via Conduit, Aug 8 2014, 11:09 AM)

Change 152889 had a related patch set uploaded by Anomie:
Improve Parser::replaceUnusualEscapes

https://gerrit.wikimedia.org/r/152889

gerritbot added a comment. (Via Conduit, Sep 16 2014, 11:07 PM)

Change 152889 merged by jenkins-bot:
Improve/rename Parser::replaceUnusualEscapes

https://gerrit.wikimedia.org/r/152889
