
Parsoid uses non-canonical URL encoding in <link> in <head>
Open, Medium, Public

Description

Compare:

: and $ are reserved characters, so URLs which differ in how these characters are encoded are not considered equal. The web server will consider them equivalent, so this is not a big deal in practice, but it will split Varnish and other caches, and may confuse semantic web applications.

Event Timeline

Pchelolo added subscribers: BBlack, Pchelolo.

: and $ are reserved characters, so URLs which differ in how these characters are encoded are not considered equal. The web server will consider them equivalent, so this is not a big deal in practice, but it will split Varnish and other caches, and may confuse semantic web applications.

As far as I know, Varnish has treated URI-encoded and non-URI-encoded URIs equally for the REST API since T127370 and T127387; however, I think that's specific to the REST API.

Here the values in the link tag are actually coming from Parsoid, so I guess Parsoid could duplicate the MW-specific URI encoding logic and use that for /wiki requests. However, as I understand it, Varnish already has logic to normalize the URI-encoded, non-URI-encoded and MW-specific-URI-encoded variants, so we might consider applying that logic to /wiki requests as well.
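To make the normalization idea concrete, here is a minimal Python sketch. It is not the actual Varnish logic, which this thread doesn't show. The function name `normalize_wiki_path` and the set of characters left unescaped are assumptions for illustration; the point is only that decoding and then re-encoding with one fixed rule collapses the variants onto a single cache key.

```python
from urllib.parse import quote, unquote

# Assumption: characters left literal in the canonical form. Chosen for
# illustration; not taken from the Varnish or MediaWiki configuration.
LITERAL_CHARS = "/:$"

def normalize_wiki_path(path: str) -> str:
    """Decode every percent-escape, then re-encode with one fixed rule."""
    return quote(unquote(path), safe=LITERAL_CHARS)

variants = [
    "/wiki/Manual:$wgEnableAPI",
    "/wiki/Manual%3A%24wgEnableAPI",
    "/wiki/Manual%3a$wgEnableAPI",
]

# As raw strings the variants are three distinct cache keys...
assert len(set(variants)) == 3
# ...but after normalization they collapse to one.
assert {normalize_wiki_path(v) for v in variants} == {"/wiki/Manual:$wgEnableAPI"}
```

A real normalizer would have to be more careful than this with escapes such as %2F or %25, where blindly decoding changes the meaning of the path.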

@BBlack probably has a much better understanding of this.

Ideally this should be fixed in Parsoid. There are minor side effects other than Varnish cache splitting, such as client-side cache, or visited URL coloring.

ssastry renamed this task from RESTBase uses non-canonical URL encoding to Parsoid uses non-canonical URL encoding. Nov 6 2017, 5:00 PM
ssastry raised the priority of this task from Medium to High.
ssastry moved this task from Needs Triage to Next Up on the Parsoid board.
LGoto lowered the priority of this task from High to Medium. Mar 20 2020, 4:26 PM
LGoto moved this task from Backlog to Needs Investigation on the Parsoid board.

Needs re-investigation now that we've ported to PHP; who knows how we're encoding URLs now. Most likely we're using the same mechanism that core uses.

Parsoid/PHP still generates <link rel="dc:isVersionOf" href="//www.mediawiki.org/wiki/Manual%3A%24wgEnableAPI"/>.
That appears to be standard PHP urlencode:

$ psysh
>>> urlencode('Foo:$Bar')
=> "Foo%3A%24Bar"
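
For comparison, a rough Python equivalent of the session above. `urllib.parse.quote_plus` produces the same output as PHP's `urlencode` for this input. `encode_title` is a hypothetical helper that leaves `:` and `$` literal, which appears to be the canonical form this task has in mind; its set of unescaped characters is an assumption here, not a copy of MediaWiki core's routine.

```python
from urllib.parse import quote, quote_plus

def encode_title(title: str) -> str:
    # Hypothetical helper: percent-encode everything except unreserved
    # characters plus ':', '$' and '/' (an assumed, illustrative set).
    return quote(title, safe=":$/")

print(quote_plus("Foo:$Bar"))    # Foo%3A%24Bar  (same as the urlencode output above)
print(encode_title("Foo:$Bar"))  # Foo:$Bar
```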

If we were to change the URL encoding we'd have to be very careful, since there are certainly cases where we can get tripped up: e.g., File:Foo.jpg doesn't parse as a relative link, while File%3AFoo.jpg does. I think we typically use ./Foo.jpg now, and:

$ echo '[[Ke$ha]]' | php bin/parse.php 
<a rel="mw:WikiLink" href="./Ke$ha" title="Ke$ha" data-parsoid='{"stx":"simple","a":{"href":"./Ke$ha"},"sa":{"href":"Ke$ha"},"dsr":[0,9,2,2]}' class="mw-redirect">Ke$ha</a>
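
The relative-link pitfall mentioned above can be reproduced with any generic URL resolver. A small sketch using Python's `urllib.parse.urljoin` (the base URL and the `resolve` helper are made up for illustration): a bare `File:Foo.jpg` is read as a URL with scheme `file`, whereas the percent-encoded and `./`-prefixed forms resolve against the base.

```python
from urllib.parse import urljoin, urlsplit

# Assumed base URL, for illustration only.
BASE = "https://www.mediawiki.org/wiki/Main_Page"

def resolve(href: str) -> str:
    """Resolve href against BASE the way a generic URL resolver would."""
    return urljoin(BASE, href)

# "File" is parsed as a URL scheme, so the href is not treated as relative.
print(urlsplit("File:Foo.jpg").scheme)  # file
print(resolve("File:Foo.jpg"))          # File:Foo.jpg

# Encoding the colon, or prefixing "./", keeps the href relative.
print(resolve("File%3AFoo.jpg"))        # https://www.mediawiki.org/wiki/File%3AFoo.jpg
print(resolve("./File:Foo.jpg"))        # https://www.mediawiki.org/wiki/File:Foo.jpg
```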

So is the issue just the encoding of the canonical links in our <head>? We could probably use the same encoding routine we use for titles...

cscott renamed this task from Parsoid uses non-canonical URL encoding to Parsoid uses non-canonical URL encoding in <link> in <head>. Apr 10 2020, 3:57 PM
cscott moved this task from Needs Investigation to Missing Functionality on the Parsoid board.