Page MenuHomePhabricator

URL encoded values using fallback 8-bit encoding (invalid UTF-8) cause mediawiki.Uri to crash
Closed, ResolvedPublic

Description

If someone tries to load this link: https://en.wikipedia.org/wiki/Special:ContentTranslation?from=en&to=es&campaign=article-recommendation&page=Apollo%96Soyuz_Test_Project it seems to hang

However, just replacing the %96 lets the page load: https://en.wikipedia.org/wiki/Special:ContentTranslation?from=en&to=es&campaign=article-recommendation&page=Apollo–Soyuz_Test_Project

diff:

-Apollo%96Soyuz_Test_Project
+Apollo–Soyuz_Test_Project

Other test URIs (throw error on load)

More test URIs:

Logstash

Error:
https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2020.09.01/clienterror/?id=AXRLFHENMQ_08tQaelA_

Stack trace:

 	at Uri https://he.m.wikipedia.org/wiki/%D7%93%D7%91%D7%A9_%D7%9E%D7%90%D7%A0%D7%95%D7%A7%D7%94#%D7%9E%D7%90%D%D7%A57%A4%D7%99%D7%99%D7%A0%D7%99%D7%9D:254:484
at isInternal https://he.m.wikipedia.org/wiki/%D7%93%D7%91%D7%A9_%D7%9E%D7%90%D7%A0%D7%95%D7%A7%D7%94#%D7%9E%D7%90%D%D7%A57%A4%D7%99%D7%99%D7%A0%D7%99%D7%9D:127:796
at newFromUri https://he.m.wikipedia.org/wiki/%D7%93%D7%91%D7%A9_%D7%9E%D7%90%D7%A0%D7%95%D7%A7%D7%94#%D7%9E%D7%90%D%D7%A57%A4%D7%99%D7%99%D7%A0%D7%99%D7%9D:128:113
at isUserUri https://he.m.wikipedia.org/wiki/%D7%93%D7%91%D7%A9_%D7%9E%D7%90%D7%A0%D7%95%D7%A7%D7%94#%D7%9E%D7%90%D%D7%A57%A4%D7%99%D7%99%D7%A0%D7%99%D7%9D:110:151
at https://he.m.wikipedia.org/wiki/%D7%93%D7%91%D7%A9_%D7%9E%D7%90%D7%A0%D7%95%D7%A7%D7%94#%D7%9E%D7%90%D%D7%A57%A4%D7%99%D7%99%D7%A0%D7%99%D7%9D:111:733
at https://he.m.wikipedia.org/w/load.php?lang=he&modules=jquery&skin=minerva&version=tqh7e:40:526
at grep https://he.m.wikipedia.org/w/load.php?lang=he&modules=jquery&skin=minerva&version=tqh7e:5:756
at filter https://he.m.wikipedia.org/w/load.php?lang=he&modules=jquery&skin=minerva&version=tqh7e:41:514
at https://he.m.wikipedia.org/wiki/%D7%93%D7%91%D7%A9_%D7%9E%D7%90%D7%A0%D7%95%D7%A7%D7%94#%D7%9E%D7%90%D%D7%A57%A4%D7%99%D7%99%D7%A0%D7%99%D7%9D:111:696
at mightThrow https://he.m.wikipedia.org/w/load.php?lang=he&modules=jquery&skin=minerva&version=tqh7e:49:154
at https://he.m.wikipedia.org/w/load.php?lang=he&modules=jquery&skin=minerva&version=tqh7e:49:818

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Amire80 triaged this task as Medium priority.Jul 18 2015, 10:47 PM
Amire80 set Security to None.
Amire80 added a subscriber: Amire80.

Does this happen naturally by using the ContentTranslation UI, or did you type the link yourself?

If it happens through the UI, can you please tell me what did you click to arrive at it?

It happens when I construct a link dynamically and allow the user to click on it. I ran into this when building this demo for the Wikimania hackathon: http://recommend.wmflabs.org/

NOTE: that's just a mockup and it won't be stable over the coming days but it's basically just a direct link to CX.

(sorry for the late reply)

Amire80 lowered the priority of this task from Medium to Low.Oct 2 2015, 3:19 PM
Amire80 moved this task from CX6 to CX7 on the ContentTranslation board.
Nikerabbit added a subscriber: Nikerabbit.

This is failing all over the modules with similar backtrace from each:

Exception in module-execute in module ext.visualEditor.targetLoader:
17:56:44.237 load.php?debug=false&lang=en&modules=startup&only=scripts&skin=vector:90 URIError: URI malformed
    at decodeURIComponent (<anonymous>)
    at Function.Uri.decode (<anonymous>:71:989)
    at <anonymous>:72:427
    at String.replace (<anonymous>)
    at Uri.parse (<anonymous>:72:294)
    at new Uri (<anonymous>:71:36)
    at <anonymous>:70:1574
    at new Uri (<anonymous>:70:1693)
    at <anonymous>:621:3688
    at mw.loader.implement.visualeditor-autosave-modified-prompt-accept (<anonymous>:627:485)

I am not sure what is the expected behavior here, but I do know various places are not expecting this error.

The issue is not with how the code is registered or delivered, the trace comes from from ext.visualEditor.targetLoader. Tagging accordingly.

Krinkle raised the priority of this task from Low to Needs Triage.Aug 20 2018, 3:27 PM
matmarex added a subscriber: matmarex.

Percent-decoding Apollo%96Soyuz_Test_Project gives a byte sequence that is not valid in UTF-8. The value 0x96 (10010110) can only appear as the 2nd/3rd/3th byte in a multibyte sequence (https://en.wikipedia.org/wiki/UTF-8#Description). This is why the errors appear.

Just running new mw.Uri() on the affected page, or on any page with a URL like that (e.g. https://en.wikipedia.org/wiki/Special:BlankPage?page=Apollo%96Soyuz_Test_Project), gives the same error, regardless of VisualEditor or ContentTranslation.

The URL given here is definitely malformed. But mw.Uri probably shouldn't fail like this.

Thanks @matmarex. It seems that PHP and/or Varnish do support this invalid sequence. The article can be viewed via https://en.wikipedia.org/wiki/Apollo%96Soyuz_Test_Project.

MediaWiki provides the canonical url as https://en.wikipedia.org/wiki/Apollo%E2%80%93Soyuz_Test_Project, which works fine.
Manually encoding it in JS, results in the same encodeURIComponent('Apollo–Soyuz_Test_Project') === "Apollo%E2%80%93Soyuz_Test_Project".

I think we should reject or redirect these in MediaWiki core. While we could handle this client-side on a case-by-case basis, I don't think that scales well.

Thanks @matmarex. It seems that PHP and/or Varnish do support this invalid sequence. The article can be viewed via https://en.wikipedia.org/wiki/Apollo%96Soyuz_Test_Project.

I think this is only thanks to some legacy encoding configuration for English Wikipedia. I am not sure what config controls it (apparently not $wgLegacyEncoding, since that says it only applies to contents of pages?).

0x96 corresponds to in the Windows-1252 encoding: https://en.wikipedia.org/wiki/Windows-1252#Character_set. The same character in Unicode is U+2013 EN DASH.

The same URL does not load an article by the same title on other Wikipedias. Compare:

I tested and http://localhost/wiki/%96 also loads the page titled on my local wiki. I don't have $wgLegacyEncoding set. I suspect this is some Apache feature where it guesses the encoding and transforms it before passing the params to our PHP code. No idea where it is configured.

Never mind, I found where MediaWiki converts the encoding. In WebRequest::getGPCVal() we call Language::checkTitleEncoding() (using content language), which attempts to interpret invalid UTF-8 as text encoded in $fallback8bitEncoding (which is defined per-language). For English, it's 'windows-1252', while for Polish, it's 'iso-8859-2', which is why my example works differently on Polish Wikipedia.

mediawiki.Uri should also support this. But there might not be a way to deal with different encodings in JS. It might require defining the mappings from fallback encoding to UTF-8 in our code.

matmarex renamed this task from URL encoded values to the page parameter cause the CX page to not load to URL encoded values using fallback 8-bit encoding (invalid UTF-8) cause mediawiki.Uri to crash.Aug 20 2018, 4:37 PM

I looked into this for about an hour. I'm not planning to work on implementing it, but hopefully this will be helpful to anyone who wants to:

  • Web Encoding API https://encoding.spec.whatwg.org/ is the future standard way to do encoding conversion in JavaScript, but it is not implemented in all of the browsers we support (e.g. IE 11)
  • If we want to implement a fallback, this project looks excellent: https://github.com/inexorabletash/text-encoding. In particular, we can easily adapt it to remove the encodings not used by any of MediaWiki's languages, and to create a RL module that will only load the definition for the one encoding we need (corresponding to a wiki's content language)

We use two values for $fallback8bitEncoding that are apparently not supported by these, but they seem to be roughly equivalent to supported encodings (or subsets of them? not sure about the differences):

  • zh-hans: 'windows-936' → 'gbk'
  • zh-hant: 'windows-950' → 'big5'
Krinkle added a project: patch-welcome.

Patch welcome, but from our team's perspective we'll instead focus on T103379.

Krinkle changed the task status from Open to Stalled.Apr 12 2020, 10:37 PM

Change 624232 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/extensions/MobileFrontend@master] Edit link should fallback to old href value for invalid UTF-8 links

https://gerrit.wikimedia.org/r/624232

@matmarex although there is a long term fix, I don't think hanging is great here and I don't think the task is stalled from that perspective.

Given mw.Uri can throw an exception it should be handled and dealt with more gracefully than hanging the page. I've submitted a patch for doing that in MobileFrontend which is also impacted by this bug:

Change 624232 merged by jenkins-bot:
[mediawiki/extensions/MobileFrontend@master] Edit link should fallback to old href value for invalid UTF-8 links

https://gerrit.wikimedia.org/r/624232

Change 624264 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/extensions/WikimediaEvents@master] Search satisfaction schema is required to handle Uri exceptions

https://gerrit.wikimedia.org/r/624264

Change 624266 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/extensions/VisualEditor@master] Disable ArticleTargetLoader on invalid UTF-8 URIs

https://gerrit.wikimedia.org/r/624266

Jdlrobson changed the task status from Stalled to Open.Sep 3 2020, 8:50 PM

The issue with the above page was due to https://de.wikipedia.org/wiki/MediaWiki:Gadget-revisionjumper.js not a usage of mw.Uri (I fixed that gadget) ;
as well as issues inside https://en.wikipedia.org/w/extensions/WikimediaEvents/modules/ext.wikimediaEvents/searchSatisfaction.js and https://en.wikipedia.org/w/extensions/VisualEditor/modules/ve-mw/preinit/ve.init.mw.ArticleTargetLoader.js

It seems strange that these gadgets and this schema is running on this page but I'll leave that for another bug.

Jdlrobson raised the priority of this task from Low to Medium.Sep 8 2020, 11:21 PM
Jdlrobson added a project: MediaViewer.
Jdlrobson updated the task description. (Show Details)

Adding multimedia viewer as the https://ca.wikipedia.org/wiki/Llaunes_de_sopa_Campbell%27s#/media/Fitxer:Campbells_Soup_Cans_MOMA_reduced_80%.jpg example seems to be triggering a lot of these.

Change 624266 merged by jenkins-bot:
[mediawiki/extensions/VisualEditor@master] Disable ArticleTargetLoader on invalid UTF-8 URIs

https://gerrit.wikimedia.org/r/624266

Change 624264 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Search satisfaction schema is required to handle Uri exceptions

https://gerrit.wikimedia.org/r/624264

711 instances of this in the last 24hr
I'm expecting a big reduction in this due to fixes in T261799. After those changes go live I'll update this task with remaining steps so we can close it out.

Jdlrobson claimed this task.

The bulk of these errors are fixed.

I've opened new tasks for the remaining errors:
T268058, T268059, T268060, T268062, T268063, T268064, T268065, T268067