Page MenuHomePhabricator

"&params" URL parameter (used in a link parameter in [[File]] markup) incorrectly parsed as "¶ms" (%C2%B6ms)
Closed, ResolvedPublic

Description

[[File:Erioll world.svg|15px|alt=Welt-Icon|link=//tools.wmflabs.org/geohack/geohack.php?pagename=Hirtenberg&language=de&params=47_N_16_E]]

creates a link

https://tools.wmflabs.org/geohack/geohack.php?pagename=Hirtenberg&language=de%C2%B6ms=47_N_16_E

where

&params=

is replaced by

¶ms=

It looks like &params is replaced, where only ¶ms should be.

The behavior is new, I would expect new since last version update.

Related Objects

Mentioned In
T212173: breaking change: link= <no-value> in icons in image descriptions is broken
T204477: Create punjabi.wikimedia.org for Punjabi Wikimedians User Group
T205546: Create Wiktionary Cantonese
T205710: Create Wikinews Limburgish
T206777: Create Wikipedia Shan
T207286: Time profiling: Replace millisecond granularity timers with microsecond granularity timers
T208470: Parsoid should nowiki-escape '}' in a table cell or insert a whitespace character, as appropriate
T187142: Deduplicate template styles in Parsoid
T184755: Consider not removing multiple blank lines/white space between paragraphs
T210437: Sanitizer::stripAllTags shouldn't expand legacy "semicolon-less" HTML5 entities
Mentioned Here
rGPAR18a98afcd12c: Don't normalize mw-empty-elt empty paragraphs that come from source
T184755: Consider not removing multiple blank lines/white space between paragraphs
T187142: Deduplicate template styles in Parsoid
T204477: Create punjabi.wikimedia.org for Punjabi Wikimedians User Group
T205546: Create Wiktionary Cantonese
T205710: Create Wikinews Limburgish
T206777: Create Wikipedia Shan
T207286: Time profiling: Replace millisecond granularity timers with microsecond granularity timers
T208470: Parsoid should nowiki-escape '}' in a table cell or insert a whitespace character, as appropriate
T210437: Sanitizer::stripAllTags shouldn't expand legacy "semicolon-less" HTML5 entities
P7844 Generate regexp for semicolon-less HTML entities
rMWad80f0bca27c: Handle <nowiki> and other markup consistently in image link/alt options

Event Timeline

Herzi.Pinki updated the task description. (Show Details)
Herzi.Pinki updated the task description. (Show Details)

@Aklapper, seems to be an easy to fix problem, something like the typical off by 1 (you can hang me of course, if I'm completely wrong)
can you please schedule it to the correct project?

@Herzi.Pinki: That's already the case; see the "Tags" section in the right pane.

Aklapper renamed this task from Parser fails to parse file link (with link parameter) to "&params" URL parameter (used in a link parameter in [[File]] markup) incorrectly parsed as "¶ms" (%C2%B6ms).Nov 19 2018, 11:07 AM
Aklapper raised the priority of this task from High to Needs Triage.
Aklapper added a project: Regression.

Not sure whether the scope of this regression seems clear to everyone … at a minimum, it affects all pages on German Wikipedia that use Vorlage:Coordinate with an argument of "ICON2" for the parameter text -- all of those GeoHack links are now broken.

Not even a triage yet?

@Ebab: I don't know how often the Parsing team goes through new parser-related tasks... Please see https://www.mediawiki.org/wiki/Parsing for contact info.

Currently we have about 1500 articles in the German WP affected by this problem (https://de.wikipedia.org/w/index.php?search=insource%3A%22ICON2%22&title=Spezial%3ASuche&go=Artikel) with usually multiple occurrences of the problem. Any change in such a page will invalidate the cache and later affect all readers. If I do a change (e.g. a typo), I do not re-check all the coordinate links.

Similar:

But that's only WP:de.

@Aklapper, is this the wrong tool to file problem reports? IMHO this is a strange behaviour (I'm lucky that I could isolate it to the minimum) and it is interesting behaviour, as this could also effect other features: some string replacement goes wrong. I will ask at https://www.mediawiki.org/wiki/Parsing, but there is not that much traffic. Any hint on how to escalate? I opened a hint at https://www.mediawiki.org/wiki/Talk:Parsing.

@Ebab: I don't know how often the Parsing team goes through new parser-related tasks... Please see https://www.mediawiki.org/wiki/Parsing for contact info.

@Aklapper, thanks so much for your reply. I am a software developer of 20+ years, but entirely new to the workings of Wikipedia, and so not sure what to make of your suggestion.

The bug in question is a newly introduced regression that, according to the query posted by @Herzi.Pinki, breaks at a minimum 1,500 articles, and possibly uncounted more.

Which is the procedure to report/escalate that fact? Any hints much appreciated.

workaround as suggested in https://www.mediawiki.org/wiki/Topic:Up2kw5v8m8q3v477

reordered url parameters to move params to the first position, which changes &params to ?params. No wrong replacement any more.

Issue is still open (parser must work correctly).

This comment was removed by Ebab.
Legoktm triaged this task as Unbreak Now! priority.Nov 24 2018, 10:13 AM
Legoktm subscribed.

Sorry, this bug was not triaged properly, it should have been unbreak now!, given that this is a regression that shouldn't have happened and needs to be reverted.

ad80f0bca27c2b0905b2b137977586bfab80db34 is the first bad commit
commit ad80f0bca27c2b0905b2b137977586bfab80db34
Author: C. Scott Ananian <cscott@cscott.net>
Date: Mon Oct 15 16:39:19 2018 -0400

Handle <nowiki> and other markup consistently in image link/alt options

Use Parser::stripAltText() consistently to handle link and alt options
in both Parser::makeImage() and Parser::renderImageGallery().  This
ensures that link option text can use <nowiki> to escape problematic
text so that (for example) the following works:

```
[[File:Foobar.jpg|link=<nowiki>a''b''c</nowiki>|alt=<nowiki>a''b''c</nowiki>]]
<gallery>
File:Foobar.jpg|link=<nowiki>a''b''c</nowiki>|alt=<nowiki>a''b''c</nowiki>
</gallery>
```

Previously the handling of the link option in
Parser::renderImageGallery() used a bespoke `strip_tags` invocation
which didn't replace <nowiki>s, and the handling of the link option in
Parser::makeImage() didn't strip tags at all, nor did it replace
<nowiki>s.  For example, in Parser::makeImage() double quotes in
titles would be converted to embedded `<i>` tags before being passed
to Parser::parseLinkParameter(), with predictably poor results.

Tests added to confirm behavior of alt/link with HTML-escaped
entities and <nowiki>s exposed a bug in Remex: T207088.  Tests
will fail on PHP 7.0 until that is fixed.

Bug: T206940
Depends-On: Ide67bba20f771868c0e119cb2874464dcf1d758a
Change-Id: Ife4c0edaa85e0cb294c5d4c1e31d5d7d828d9df4

cc @cscott

ssastry subscribed.

@cscott: Unless this is a quick fix, the simplest solution for now would be to revert that patch.

Let me take a quick look. Apologies to German WP, this was a long holiday weekend in the US which probably delayed attention to this problem.

Confirmed the bug exists in core (but not in <gallery>) and in Parsoid (both in native links and <gallery>). &para[A-Za-z0-9=] should never be entity decoded, according to https://www.w3.org/TR/html5/syntax.html#character-reference-state ; investigating what's going on here.

EDIT: The special handling of semicolon-less &para only occurs in attribute values. A better workaround would be to write the URL as ?pagename=foo&amp;params=bar which is technically the correct way to write text containing a literal ampersand in both HTML and wikitext.

Change 475801 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/core@master] Protect legacy URL parameter syntax in link and alt options

https://gerrit.wikimedia.org/r/475801

Change 475821 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Uniformly use "wikitext entities" not "HTML5 entities"

https://gerrit.wikimedia.org/r/475821

Script to generate regexp matching all semicolon-less HTML entities is at P7844; this was used in the patch linked above.

Change 475801 merged by jenkins-bot:
[mediawiki/core@master] Protect legacy URL parameter syntax in link and alt options

https://gerrit.wikimedia.org/r/475801

Change 475821 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Uniformly use "wikitext entities" not "HTML5 entities"

https://gerrit.wikimedia.org/r/475821

Change 476145 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/core@wmf/1.33.0-wmf.6] Protect legacy URL parameter syntax in link and alt options

https://gerrit.wikimedia.org/r/476145

Change 476145 merged by jenkins-bot:
[mediawiki/core@wmf/1.33.0-wmf.6] Protect legacy URL parameter syntax in link and alt options

https://gerrit.wikimedia.org/r/476145

@cscott Unfortunately, I wasn't able to confirm that the patch was fixing the issue so we rolled it back.

19:42 < ebernhardson> arlolra: your core patch is now on mwdebug1001
19:45 < arlolra> hmm
19:46 < arlolra> i'm not getting what I would expect from that
19:46 < arlolra> curl https://www.mediawiki.org/wiki/User:Arlolra/sandbox -H 
                 'X-Wikimedia-Debug: backend=mwdebug1001.eqiad.wmnet' | grep mw-parser-out
19:47 < ebernhardson> arlolra:  `grep T210437 
                      /srv/mediawiki/php-1.33.0-wmf.6/includes/parser/Parser.php` on 
                      mwdebug1001 claims the update is in place :S
19:47 <+stashbot> T210437: Sanitizer::stripAllTags shouldn't expand legacy "semicolon-less" 
                  HTML5 entities - https://phabricator.wikimedia.org/T210437
19:49 < arlolra> hmm
19:49 < arlolra> ok, let it be
19:49 < arlolra> i'll notify cscott that there's work left to do

The output from the grep there was,

<div id="mw-content-text" lang="en" dir="ltr" class="mw-content-ltr"><div class="mw-parser-output"><p><a href="//tools.wmflabs.org/geohack/geohack.php?pagename=Hirtenberg&amp;language=de¶ms=47_N_16_E"><img alt="Welt-Icon" src="//upload.wikimedia.org/wikipedia/commons/thumb/9/9a/Erioll_world.svg/15px-Erioll_world.svg.png" width="15" height="15" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/9/9a/Erioll_world.svg/23px-Erioll_world.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/9/9a/Erioll_world.svg/30px-Erioll_world.svg.png 2x" data-file-width="256" data-file-height="256" /></a>

which still contains the ¶ms. If I misunderstood how to confirm this, please excuse me.

The oldid for the page I was rendering is https://www.mediawiki.org/w/index.php?title=User:Arlolra/sandbox&oldid=2986673

https://en.wikipedia.org/wiki/User:Cscott/T209236 is my test case, but I agree: even after purging the page I'm still seeing a &para; on the figure link. But that makes sense 'cuz you rolled it back...

I think you forgot to purge the parser cache when you were testing?

I think you forgot to purge the parser cache when you were testing?

I did :(

Mentioned in SAL (#wikimedia-operations) [2018-11-28T01:29:33Z] <ebernhardson> ebernhardson@deploy1001 Synchronized php-1.33.0-wmf.6/includes/parser/Parser.php: SWAT: T209236 Protect legacy URL parameter syntax in link and alt options (duration: 00m 51s)

The oldid for the page I was rendering is https://www.mediawiki.org/w/index.php?title=User:Arlolra/sandbox&oldid=2986673

After @cscott purged the page, all is well again in the world

The group 0 wikis should now have the fix and the rest will ride out with the train this week.

I think this bug is fixed everywhere now. Resolving; reopen if you find a problem that can't be fixed by purging parsercache.