Page MenuHomePhabricator

Links with plus signs are not urlencoded
Open, MediumPublic

Description

When an article contains links to titles with plus signs, the plus signs are not escaped in the href attribute of the link.

For example, the article [[Sort]] contains a link to [[Sort (C++)]]. When this content is rendered by mobileview, the link is correctly emitted as /Sort_(C%2B%2B), whereas with Parsoid it is emitted as /Sort_(C++)

Event Timeline

ssastry triaged this task as Medium priority.May 27 2016, 11:13 PM

@ssastry and @GWicke Do you think this is a bug or works as intended?

Basically, MW API returns the links encoded, Parsoid does not.
Example from [[Sort]]:
Parsoid: "/wiki/Sort_(C++)"
MW: "/wiki/Sort_(C%2B%2B)"

FWIW, looks like browsers don't have issues with it. If I enter en.wikipedia.org/wiki/Sort_(C++) in the address bar it automatically redirects to en.wikipedia.org/wiki/Sort_(C%2B%2B). Looks like we have configured a 301 redirect for this. But it does make a difference for the API.

For the downstream bug I added a patch to work around this but it's a bit ugly to special case this depending on whether the link comes from Parsoid or MW API. I'm OK with this is it's the right thing to do, but as I said above I'm not sure what the right thing here is. If you plan on changing current behavior please let us know. It would also require a new content format version since that would break backwards compatibility.

But it does make a difference for the API.

I don't think it does:

Both return the same content / render.

I would normally expect them to also share the same cache entry in varnish, as we are normalizing percent encodes there to avoid fragmentation. However, some manual testing shows that in this case the entry is duplicated (separate counters in x-cache). It's possible that + is not one of the chars we ended up normalizing. This is something we should investigate, as it could potentially affect purging.

Edit: [It does indeed look like we didn't get to + yet](https://github.com/wikimedia/operations-puppet/blob/5cbf6cadaa2af6e04a473d2a6b22d1d5b38dc0e2/templates/varnish/normalize_path.inc.vcl.erb#L40-L49). This is essentially T127387#2086805, where I listed general non-delimiter decoding as a TODO.

So, to summarize: The HTML encoding is okay (in line with HTML5), but normalization in Varnish is still incomplete & needs to be completed to cover all non-delimiters.

The problem for the app is that since it can also handle content from MW API it assumes that the links are encoded, and therefore will decode the links, which makes the links invalid. + becomes a space after the decoding step. The same problem probably exist for any client moving from MW API to Parsoid content.
Anyways I've got a patch to workaround the situation in the app in code review. I guess we have to have this special casing as long as the app supports content from both MW API and Parsoid.

Parsoid still encodes links, but focuses on chars that actually need to be encoded in HTML5. Example:

wikitext: [[100%]]
html:     <a rel="mw:WikiLink" href="./100%25" title="100%" id="mwAg">100%</a>

@GWicke Thank you for the heads-up. The patch https://gerrit.wikimedia.org/r/#/c/302745/2 is going to fix this on the MCS side. I've added @ssastry as a reviewer, mainly because I want to make him aware that I've copied some Parsoid Utils.js code

See _sanitizeTitleURI and _decodeURI in https://gerrit.wikimedia.org/r/#/c/302745/2/lib/transforms.js. I basically added the + to this function's regular expression since that was one we wanted to fix in T136223.

Note that it does much more to make the links appear more like in MediaWiki core, to make the links shareable with the site and the transition for a client from the old MW API to MCS API easier:

  • Replaces ./ with /wiki/
  • Removes page title if it's the same as the current title so the client doesn't think this is a link to a different page
  • Anchorencodes fragments
  • Encodes title while preserving slashes and some other special characters

Thanks. I lost track of this. But, as I noted on https://gerrit.wikimedia.org/r/#/c/302745/2 .. this is probably relevant to any client that inspects URLs and makes assumptions about what the URL is going to looks like. We'll have to discuss how many of these should translate over to Parsoid code and where changes are warranted in M/W output. I presume backward compatibility requirements will push Parsoid to make the changes, but at this point, VE, CX, and Flow's usage would have to be factored in as well.

@ssastry Thanks for considering moving some of this to Parsoid. Please keep us informed about the progress of this, as it will require further testing in MCS whenever the link format changes. Probably a small spec version bump would be needed when Parsoid does change it.

Yes, when we make any of these breaking changes in Parsoid, we'll update the spec and bump the version number.

I don't think you should be "decoding" + symbols in URLs to spaces normally; that is only done for query string decoding. https://en.wikipedia.org/wiki/Percent-encoding has more details. This might also have changed when we ported Parsoid to PHP...

I believe Parsoid is correct, and the MCS code which does the "percent decode" is incorrect. The spec for percent decode is here https://url.spec.whatwg.org/#percent-decode and it should leave plus signs alone. It's only "query decoding" which mangles plus signs, that's not what should be applied to href attributes.

@ssastry and @cscott since MCS is going to be decommissioned and Parsoid is following the proper standard, I'm inclined to decline this task. Any objections?