Page MenuHomePhabricator

Advanced search stuck showing three dots: "defaultUri is undefined" (Swedish wikis)
Closed, ResolvedPublic

Description

Steps to reproduce:

  1. Go to https://sv.wikipedia.org/w/index.php?search=test%20test&title=Special%3AS%F6k

Expected result: The advanced search tools are displayed.

Actual result:
Only the three "loading" dots are displayed.

When adding &debug=true, I get the following errors in the browser's console:

TypeError: defaultUri is undefined; can't access its "clone" property[Learn More] Uri.js:217:5

Uri https://sv.wikipedia.org/w/resources/src/mediawiki.Uri/Uri.js?97558:217
isPageCreation https://sv.wikipedia.org/w/extensions/ContentTranslation/modules/campaigns/ext.cx.campaigns.contributionsmenu.js?c2828:13
getTranslationsItem https://sv.wikipedia.org/w/extensions/ContentTranslation/modules/campaigns/ext.cx.campaigns.contributionsmenu.js?c2828:30
attachMenu https://sv.wikipedia.org/w/extensions/ContentTranslation/modules/campaigns/ext.cx.campaigns.contributionsmenu.js?c2828:73
<anonymous> https://sv.wikipedia.org/w/extensions/ContentTranslation/modules/campaigns/ext.cx.campaigns.contributionsmenu.js?c2828:111
mightThrow jQuery
URIError: malformed URI sequence[Learn More] Uri.js:271:11

decode https://sv.wikipedia.org/w/resources/src/mediawiki.Uri/Uri.js?97558:271
parse https://sv.wikipedia.org/w/resources/src/mediawiki.Uri/Uri.js?97558:304
RegExpGetFunctionalReplacement self-hosted:4957
RegExpGlobalReplaceOptFunc self-hosted:5049
Symbol.replace self-hosted:4837
replace self-hosted:5587
parse https://sv.wikipedia.org/w/resources/src/mediawiki.Uri/Uri.js?97558:300
Uri https://sv.wikipedia.org/w/resources/src/mediawiki.Uri/Uri.js?97558:193
<anonymous> https://sv.wikipedia.org/w/extensions/WikimediaEvents/modules/all/ext.wikimediaEvents.searchSatisfaction.js?6f9f8:31
<anonymous> https://sv.wikipedia.org/w/extensions/WikimediaEvents/modules/all/ext.wikimediaEvents.searchSatisfaction.js?6f9f8:22

Tested browsers: Firefox, Chrome

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Background on the URL:

  • It is invalid UTF-8 because of the %F6
  • It is not produced using normal search forms

Steps to create this URL:

  1. Use Firefox
  2. Right-click a search the search field -> "Add a Keyword for this search..."
  3. Enter a keyword (e.g. "wiki") and press "Save"
  4. Search by writing "wiki test test" in the location bar

Now you have the URL I provided.


This is the main way I search wikis, so this is a hassle for me.

Change 477264 had a related patch set uploaded (by Thiemo Kreuz (WMDE); owner: Thiemo Kreuz (WMDE)):
[mediawiki/core@master] Add missing try-catch to decodeURIComponent() call

https://gerrit.wikimedia.org/r/477264

thiemowmde added subscribers: WMDE-Fisch, thiemowmde.

I was very curious and found this is actually a core bug in the mw.Uri module. Unfortunately I wasn't able to find the correct Phabricator tag yet.

T106244 is the same issue from a different perspective.

I'm not sure about just wrapping it in a try…catch. I suppose this might be a practical solution (especially necessary if such URLs can be produced by legitimate means like like the Firefox thing above). But the ideal solution would be to parse the percent-encoding with the right 8-bit character encoding (as outlined on T106244). Otherwise I fear that we will run into issues with other code that uses mw.Uri and doesn't handle this case.

Short answer: It's impossible to "guess" the "right 8-bit character encoding". It would be wrong to assume anything.

There might be a %… hex sequence in the input string that's valid in hundreds of encodings. How should an algorithm know which to pick? Blindly guessing? Why? Based on what information? Even worse, there might be a %… sequence that's not valid in any encoding, and must fail. What should the code do then? I mean, sure, we could document the fact that calling new Uri() might throw an exception, and expect all callers to wrap anything they do with the Uri class in try-catch. But why?

I keep running into what is essentially a design issue in the original JavaScript specification since probably 15 years. I always fixed it the same trivial way as shown in https://gerrit.wikimedia.org/r/477264, and never experienced any issues afterwards. I find it quite insane to realize this is known for 3 years (!), discussed to the ground, but nobody had the guts to do something about it. :'-( Can we please, please not let us stop making things better by (literally!) fear?

TL;DR: https://en.wikipedia.org/wiki/Perfect_is_the_enemy_of_good

Other industry examples:

It seems reasonable to handle server-side. Have we considered that?

As for mw.Uri specifically, I suppose it should be documented that when calling Uri.decode() directly, that it may throw for invalid input. It does not make sense in my opinion for all callers to try/catch such call. If they are given safe input, what will they do in case of a catch? That is genuine application failure, I'm not sure individual components can or should handle with that.

At some higher layer, any UI should catch exceptions before they reach the global handler and present an error of some kind, but not directly tied to decode(), unless they are expecting arbitrary user input.

Responding to query parameters on an authoritative and valid page view is not arbitrary user input I think, given it has passed through the server-side first. What does WebRequest::getVal() do?

T106244 remains relevant and I'm not sure how or if we should handle them separately for search.

I don't see what value a single, arbitrary "industry" example has. The user input is 100% wrong. There is no "right" way to make sense of it. Every "industry leader" I try does something different. GitHub does something different than https://www.google.com/search?q=a%01b, Google does something different than https://www.facebook.com/search/top/?q=a%01b, Facebook does something different than https://twitter.com/search?q=a%01b, …

I'm not going to waste more time on this. I will just stop using the broken Uri class then.

Change 478607 had a related patch set uploaded (by Thiemo Kreuz (WMDE); owner: Thiemo Kreuz (WMDE)):
[mediawiki/extensions/AdvancedSearch@master] Stop using broken Uri class from MediaWiki core

https://gerrit.wikimedia.org/r/478607

Short answer: It's impossible to "guess" the "right 8-but character encoding". It would be wrong to assume anything.

This is not true. In the context of MediaWiki, there is in fact exactly one correct way to handle these, which is the same way as our PHP code handles them, and it is described on T106244.

I find it quite insane to realize this is known for 3 years (!), discussed to the ground, but nobody had the guts to do something about it. :'-( Can we please, please not let us stop making things better by (literally!) fear?

I find it very reasonable to prefer the current failure mode (exception is thrown, nothing works) to the failure mode of your proposed patch (everything seems to work, but later on e.g. VisualEditor tries to save the page with incorrectly decoded title, causing the page to be duplicated under a broken title, or possibly API to finally produce an error and lose user's work, or some other unimaginable unpleasant things happen).

While it might be possible to make an educated guess based on heuristics that might turn out to be true or not in the individual case (e.g. just assuming Windows-1251 will most certainly be correct in the vast majority of situations), no code can know what I meant when I manually type …?title=%86%96%96. It might even be EBCDIC and decode as "foo" then.

I see a few possibilities to deal with unknown encodings:

  • Ditch the query string entirely, just to be sure. User satisfaction: zero.
  • Pass it through without doing anything else. This is relevant in situations where a user manually typed …?title=100%_more and actually expects it to survive exactly as given. And this is what my patch implements.
  • Apply something like .replace( /%[\da-f]{2}/gi, ' ' ) to – at least – remove hex sequences and replace them with a space or some (possibly invalid) placeholder character.
  • Try to slap a Windows-1251 decoder on the string, but be prepared to fall back to some of the possibilities above in case it still fails. E.g. …?title=100%_more can never be decoded.
  • Implement an incredibly complex algorithm to guess the encoding, without any guarantee it can ever be correct in all situations. Oh, and even then …?title=100%_more won't work without having a fallback in place.

There is no world in which it is a good idea to let the empty (!) constructor new Uri() throw an exception and expect devs using this class to accept this behavior.

[…] or some other unimaginable unpleasant things happen).

Fear. This is just fear. I'm not going to accept this as a driving force for any decision we make.

Using any heuristic other than the one used by existing MediaWiki PHP code is incorrect. That heuristic is well-defined (using a specific encoding based on wiki content language) and I described on T106244 how it should be implemented in JavaScript. I feel weird having to link that task for the third time with no one acknowledging this.

At a glance – forgive me if the numbers are off – mediawiki.Uri seems to be used by 32 different extensions in our version control, 26 of them deployed on Wikimedia wikis. Call me old-fashioned, but I think it is wholly appropriate to have some – not fear, but respect – when making subtly breaking changes in such areas.

For the record, I have not suggested that we do our best to try to make this broken input work. If I said something to that extent, my apologies for confusing you.

Invalid urls should imho be rejected. Either with a 40x response from the server, or perhaps with a 30x redirect to something else. For example, it could redirect to a variant with whatever url segment is affected stripped (e.g. strip all query params), or if path is affected, direct to Special:BadTitle.

Once the page is rendered, the client should not be responsible for this. Any such input passed to mw.Uri by client-side code (where the input is not location.href), should deal with the exception themselves as it is their own input. When the input is location.href, it is the servers responsibility. If it wasn't valid, it should not have responded the way it did.

I'm aware that doing this server-side may currently be complicated due to the absence of a proper url-based router in MediaWiki PHP. This means that we'll either have to apply it generically to everything, or have some hacky way to disable the validation for endpoints that wish to accept it or handle it in a different way.

Change 478607 merged by jenkins-bot:
[mediawiki/extensions/AdvancedSearch@master] Stop using broken Uri class from MediaWiki core

https://gerrit.wikimedia.org/r/478607

More general bug about mediawiki.Uri is T106244.