"&params" URL parameter (used in a link parameter in [[File]] markup) incorrectly parsed as "¶ms" (%C2%B6ms)
Closed, ResolvedPublic

Description

[[File:Erioll world.svg|15px|alt=Welt-Icon|link=//tools.wmflabs.org/geohack/geohack.php?pagename=Hirtenberg&language=de&params=47_N_16_E]]

creates a link

https://tools.wmflabs.org/geohack/geohack.php?pagename=Hirtenberg&language=de%C2%B6ms=47_N_16_E

where

&params=

is replaced by

¶ms=

It looks like &params is replaced, where only ¶ms should be.

The behavior is new, I would expect new since last version update.

Related Objects

Mentioned In
T212173: breaking change: link= <no-value> in icons in image descriptions is broken
T204477: Create punjabi.wikimedia.org for Punjabi Wikimedians User Group
T205546: Create Wiktionary Cantonese
T205710: Create Wikinews Limburgish
T206777: Create Wikipedia Shan
T207286: Time profiling: Replace millisecond granularity timers with microsecond granularity timers
T208470: Parsoid should nowiki-escape '}' in a table cell or insert a whitespace character, as appropriate
T187142: Deduplicate template styles in Parsoid
T184755: Consider not removing multiple blank lines/white space between paragraphs
T210437: Sanitizer::stripAllTags shouldn't expand legacy "semicolon-less" HTML5 entities
Mentioned Here
rGPAR18a98afcd12c: Don't normalize mw-empty-elt empty paragraphs that come from source
T184755: Consider not removing multiple blank lines/white space between paragraphs
T187142: Deduplicate template styles in Parsoid
T204477: Create punjabi.wikimedia.org for Punjabi Wikimedians User Group
T205546: Create Wiktionary Cantonese
T205710: Create Wikinews Limburgish
T206777: Create Wikipedia Shan
T207286: Time profiling: Replace millisecond granularity timers with microsecond granularity timers
T208470: Parsoid should nowiki-escape '}' in a table cell or insert a whitespace character, as appropriate
T210437: Sanitizer::stripAllTags shouldn't expand legacy "semicolon-less" HTML5 entities
P7844 Generate regexp for semicolon-less HTML entities
rMWad80f0bca27c: Handle <nowiki> and other markup consistently in image link/alt options
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptNov 11 2018, 3:21 PM
Herzi.Pinki updated the task description. (Show Details)Nov 11 2018, 3:22 PM
Herzi.Pinki updated the task description. (Show Details)
Herzi.Pinki updated the task description. (Show Details)
35292374 removed a subscriber: 35292374.Nov 18 2018, 2:34 PM

@Aklapper, seems to be an easy to fix problem, something like the typical off by 1 (you can hang me of course, if I'm completely wrong)
can you please schedule it to the correct project?

@Herzi.Pinki: That's already the case; see the "Tags" section in the right pane.

Herzi.Pinki triaged this task as High priority.Mon, Nov 19, 10:59 AM
Aklapper renamed this task from Parser fails to parse file link (with link parameter) to "&params" URL parameter (used in a link parameter in [[File]] markup) incorrectly parsed as "¶ms" (%C2%B6ms).Mon, Nov 19, 11:07 AM
Aklapper raised the priority of this task from High to Needs Triage.
Aklapper added a project: Regression.
Thgoiter added a subscriber: Thgoiter.
Ebab added a subscriber: Ebab.Thu, Nov 22, 11:07 AM
Pruem added a subscriber: Pruem.Thu, Nov 22, 7:28 PM
Ebab added a comment.EditedFri, Nov 23, 3:07 PM

Not sure whether the scope of this regression seems clear to everyone … at a minimum, it affects all pages on German Wikipedia that use Vorlage:Coordinate with an argument of "ICON2" for the parameter text -- all of those GeoHack links are now broken.

Not even a triage yet?

@Ebab: I don't know how often the Parsing team goes through new parser-related tasks... Please see https://www.mediawiki.org/wiki/Parsing for contact info.

Currently we have about 1500 articles in the German WP affected by this problem (https://de.wikipedia.org/w/index.php?search=insource%3A%22ICON2%22&title=Spezial%3ASuche&go=Artikel) with usually multiple occurrences of the problem. Any change in such a page will invalidate the cache and later affect all readers. If I do a change (e.g. a typo), I do not re-check all the coordinate links.

Similar:

But that's only WP:de.

@Aklapper, is this the wrong tool to file problem reports? IMHO this is a strange behaviour (I'm lucky that I could isolate it to the minimum) and it is interesting behaviour, as this could also effect other features: some string replacement goes wrong. I will ask at https://www.mediawiki.org/wiki/Parsing, but there is not that much traffic. Any hint on how to escalate? I opened a hint at https://www.mediawiki.org/wiki/Talk:Parsing.

Ebab added a comment.EditedFri, Nov 23, 8:15 PM

@Ebab: I don't know how often the Parsing team goes through new parser-related tasks... Please see https://www.mediawiki.org/wiki/Parsing for contact info.

@Aklapper, thanks so much for your reply. I am a software developer of 20+ years, but entirely new to the workings of Wikipedia, and so not sure what to make of your suggestion.

The bug in question is a newly introduced regression that, according to the query posted by @Herzi.Pinki, breaks at a minimum 1,500 articles, and possibly uncounted more.

Which is the procedure to report/escalate that fact? Any hints much appreciated.

workaround as suggested in https://www.mediawiki.org/wiki/Topic:Up2kw5v8m8q3v477

reordered url parameters to move params to the first position, which changes &params to ?params. No wrong replacement any more.

Issue is still open (parser must work correctly).

Ebab added a comment.Fri, Nov 23, 10:56 PM
This comment was removed by Ebab.
Legoktm triaged this task as Unbreak Now! priority.Sat, Nov 24, 10:13 AM
Legoktm added a subscriber: Legoktm.

Sorry, this bug was not triaged properly, it should have been unbreak now!, given that this is a regression that shouldn't have happened and needs to be reverted.

Restricted Application added subscribers: Liuxinyu970226, TerraCodes. · View Herald TranscriptSat, Nov 24, 10:13 AM

ad80f0bca27c2b0905b2b137977586bfab80db34 is the first bad commit
commit ad80f0bca27c2b0905b2b137977586bfab80db34
Author: C. Scott Ananian <cscott@cscott.net>
Date: Mon Oct 15 16:39:19 2018 -0400

Handle <nowiki> and other markup consistently in image link/alt options

Use Parser::stripAltText() consistently to handle link and alt options
in both Parser::makeImage() and Parser::renderImageGallery().  This
ensures that link option text can use <nowiki> to escape problematic
text so that (for example) the following works:

```
[[File:Foobar.jpg|link=<nowiki>a''b''c</nowiki>|alt=<nowiki>a''b''c</nowiki>]]
<gallery>
File:Foobar.jpg|link=<nowiki>a''b''c</nowiki>|alt=<nowiki>a''b''c</nowiki>
</gallery>
```

Previously the handling of the link option in
Parser::renderImageGallery() used a bespoke `strip_tags` invocation
which didn't replace <nowiki>s, and the handling of the link option in
Parser::makeImage() didn't strip tags at all, nor did it replace
<nowiki>s.  For example, in Parser::makeImage() double quotes in
titles would be converted to embedded `<i>` tags before being passed
to Parser::parseLinkParameter(), with predictably poor results.

Tests added to confirm behavior of alt/link with HTML-escaped
entities and <nowiki>s exposed a bug in Remex: T207088.  Tests
will fail on PHP 7.0 until that is fixed.

Bug: T206940
Depends-On: Ide67bba20f771868c0e119cb2874464dcf1d758a
Change-Id: Ife4c0edaa85e0cb294c5d4c1e31d5d7d828d9df4

cc @cscott

ssastry assigned this task to cscott.Mon, Nov 26, 2:23 PM
ssastry added a subscriber: ssastry.

@cscott: Unless this is a quick fix, the simplest solution for now would be to revert that patch.

Let me take a quick look. Apologies to German WP, this was a long holiday weekend in the US which probably delayed attention to this problem.

cscott added a comment.EditedMon, Nov 26, 4:56 PM

Confirmed the bug exists in core (but not in <gallery>) and in Parsoid (both in native links and <gallery>). &para[A-Za-z0-9=] should never be entity decoded, according to https://www.w3.org/TR/html5/syntax.html#character-reference-state ; investigating what's going on here.

EDIT: The special handling of semicolon-less &para only occurs in attribute values. A better workaround would be to write the URL as ?pagename=foo&amp;params=bar which is technically the correct way to write text containing a literal ampersand in both HTML and wikitext.

Change 475801 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/core@master] Protect legacy URL parameter syntax in link and alt options

https://gerrit.wikimedia.org/r/475801

Change 475821 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Uniformly use "wikitext entities" not "HTML5 entities"

https://gerrit.wikimedia.org/r/475821

Script to generate regexp matching all semicolon-less HTML entities is at P7844; this was used in the patch linked above.

Change 475801 merged by jenkins-bot:
[mediawiki/core@master] Protect legacy URL parameter syntax in link and alt options

https://gerrit.wikimedia.org/r/475801

Change 475821 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Uniformly use "wikitext entities" not "HTML5 entities"

https://gerrit.wikimedia.org/r/475821

Change 476145 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/core@wmf/1.33.0-wmf.6] Protect legacy URL parameter syntax in link and alt options

https://gerrit.wikimedia.org/r/476145

Change 476145 merged by jenkins-bot:
[mediawiki/core@wmf/1.33.0-wmf.6] Protect legacy URL parameter syntax in link and alt options

https://gerrit.wikimedia.org/r/476145

@cscott Unfortunately, I wasn't able to confirm that the patch was fixing the issue so we rolled it back.

19:42 < ebernhardson> arlolra: your core patch is now on mwdebug1001
19:45 < arlolra> hmm
19:46 < arlolra> i'm not getting what I would expect from that
19:46 < arlolra> curl https://www.mediawiki.org/wiki/User:Arlolra/sandbox -H 
                 'X-Wikimedia-Debug: backend=mwdebug1001.eqiad.wmnet' | grep mw-parser-out
19:47 < ebernhardson> arlolra:  `grep T210437 
                      /srv/mediawiki/php-1.33.0-wmf.6/includes/parser/Parser.php` on 
                      mwdebug1001 claims the update is in place :S
19:47 <+stashbot> T210437: Sanitizer::stripAllTags shouldn't expand legacy "semicolon-less" 
                  HTML5 entities - https://phabricator.wikimedia.org/T210437
19:49 < arlolra> hmm
19:49 < arlolra> ok, let it be
19:49 < arlolra> i'll notify cscott that there's work left to do

The output from the grep there was,

<div id="mw-content-text" lang="en" dir="ltr" class="mw-content-ltr"><div class="mw-parser-output"><p><a href="//tools.wmflabs.org/geohack/geohack.php?pagename=Hirtenberg&amp;language=de¶ms=47_N_16_E"><img alt="Welt-Icon" src="//upload.wikimedia.org/wikipedia/commons/thumb/9/9a/Erioll_world.svg/15px-Erioll_world.svg.png" width="15" height="15" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/9/9a/Erioll_world.svg/23px-Erioll_world.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/9/9a/Erioll_world.svg/30px-Erioll_world.svg.png 2x" data-file-width="256" data-file-height="256" /></a>

which still contains the ¶ms. If I misunderstood how to confirm this, please excuse me.

The oldid for the page I was rendering is https://www.mediawiki.org/w/index.php?title=User:Arlolra/sandbox&oldid=2986673

cscott added a comment.EditedWed, Nov 28, 1:14 AM

https://en.wikipedia.org/wiki/User:Cscott/T209236 is my test case, but I agree: even after purging the page I'm still seeing a &para; on the figure link. But that makes sense 'cuz you rolled it back...

I think you forgot to purge the parser cache when you were testing?

I think you forgot to purge the parser cache when you were testing?

I did :(

Mentioned in SAL (#wikimedia-operations) [2018-11-28T01:29:33Z] <ebernhardson> ebernhardson@deploy1001 Synchronized php-1.33.0-wmf.6/includes/parser/Parser.php: SWAT: T209236 Protect legacy URL parameter syntax in link and alt options (duration: 00m 51s)

The oldid for the page I was rendering is https://www.mediawiki.org/w/index.php?title=User:Arlolra/sandbox&oldid=2986673

After @cscott purged the page, all is well again in the world

The group 0 wikis should now have the fix and the rest will ride out with the train this week.

greg added a subscriber: greg.Tue, Dec 4, 5:26 AM

Status?

cscott closed this task as Resolved.Tue, Dec 4, 4:18 PM

I think this bug is fixed everywhere now. Resolving; reopen if you find a problem that can't be fixed by purging parsercache.