Inconsistent hrefs between Parser and Parsoid HTML (Parsoid doesn't percent-encode ' as %27)
Closed, Declined | Public | 1 Estimated Story Points

Description

For many years, browsers have been able to show visited links in a different color. Usually unvisited links are blue and visited links are something like purple.

It works this way also in Wikipedia, but when editing an article using VisualEditor, this is not always respected. I suspect that it is like this with articles that include an apostrophe (') in the title, but there may be other reasons.

By itself this issue is by no means high-priority, but it may point at an inconsistency between how the links are handled in VE and in the rendered page, and this inconsistency may cause other issues.

Event Timeline

@Amire80 : Link to any example, where it can be reproduced?

Which browser / OS?

OSX El Capitan, Firefox 45.

Demo page: https://en.wikipedia.org/wiki/User:Amire80/T134125

  1. Go to the page.
  2. Click both links.
  3. Go to the page again. They appear in the "visited" color.
  4. Edit the page in VE. "ASCII" appears as visited, and "Fisherman's Wharf, San Francisco" appears as non-visited.
Danny_B raised the priority of this task from Low to Medium. May 1 2016, 7:54 PM

Page view has href="/wiki/Fisherman%27s_Wharf,_San_Francisco"
Page edit has href="https://en.wikipedia.org/wiki/Fisherman's_Wharf,_San_Francisco"

Such links are not considered equal, thus the latter is rendered as unvisited. After visiting the latter link, it renders properly as visited in edit mode.

The issue is the %27 vs. ' in the link.
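To make the difference concrete, here is a minimal sketch in Python (not anything MediaWiki or the browser actually runs). It takes the two hrefs quoted above, resolves them against the demo page URL, and compares them first as plain strings and then after percent-decoding the path. The helper names `resolve` and `same_resource` are made up for illustration, and decoding the whole path is a simplification that would be lossy for titles containing an encoded slash.

```python
from urllib.parse import urljoin, urlsplit, unquote

# URL of the demo page; relative hrefs are resolved against it.
PAGE_URL = "https://en.wikipedia.org/wiki/User:Amire80/T134125"

# The two hrefs reported in this task.
view_href = "/wiki/Fisherman%27s_Wharf,_San_Francisco"
edit_href = "https://en.wikipedia.org/wiki/Fisherman's_Wharf,_San_Francisco"


def resolve(href: str) -> str:
    """Turn a possibly relative href into an absolute URL."""
    return urljoin(PAGE_URL, href)


def same_resource(a: str, b: str) -> bool:
    """Compare two hrefs after resolving them and percent-decoding the path.

    Hypothetical helper for illustration only: a browser's visited-link
    check is assumed here to be a stricter, string-level comparison.
    """
    pa, pb = urlsplit(resolve(a)), urlsplit(resolve(b))
    return (pa.scheme, pa.netloc, unquote(pa.path)) == (
        pb.scheme,
        pb.netloc,
        unquote(pb.path),
    )


# As plain strings the absolute URLs differ (%27 vs. ').
print(resolve(view_href) == resolve(edit_href))
# After decoding, both point at the same page.
print(same_resource(view_href, edit_href))
```

A gadget that keys on the href string would see two different links here, even though both address the same article.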

Confirmed in Firefox 46 on Windows 7. This needs to be checked in other browsers to see whether it is a browser-specific issue.

@Jdforrester-WMF: I'm considering retitling this task to something like "Inconsistent hrefs between view and visual edit mode", regardless of whether it turns out to be a Firefox-only issue, for the sake of consistency (besides, the edit-mode href does not use a protocol-relative URL, unlike the others). IMO both hrefs should be exactly the same, since various gadgets may rely on them (e.g. for link styling).

Jdforrester-WMF renamed this task from Some links in VisualEditor are not shown as visited by the browser to Inconsistent hrefs between Parser and Parsoid HTML. May 2 2016, 4:38 PM
Jdforrester-WMF lowered the priority of this task from Medium to Lowest.
Jdforrester-WMF set the point value for this task to 1.
Jdforrester-WMF moved this task from To Triage to Freezer on the VisualEditor board.
matmarex added a subscriber: matmarex.

This is not a VisualEditor issue; the problem is in the Parsoid output. Rather than opening VE, you can also test with these two pages:

It seems that Parsoid doesn't percent-encode ' as %27. (Which is valid, it's just needlessly different from the PHP Parser.)
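As a rough illustration of how two encoders can both be valid and still disagree, the sketch below uses Python's `urllib.parse.quote` with and without the apostrophe in its `safe` set. The function `encode_title` and the `ASSUMED_SAFE` set are hypothetical and chosen only for this example; they are not the actual character lists used by the PHP Parser or by Parsoid.

```python
from urllib.parse import quote

# Characters left unescaped in this sketch. This set is an assumption made
# for the example, not the list MediaWiki or Parsoid really uses.
ASSUMED_SAFE = "/,:"


def encode_title(title: str, keep_apostrophe: bool) -> str:
    """Build a /wiki/ href, optionally leaving ' unescaped (hypothetical helper)."""
    safe = ASSUMED_SAFE + ("'" if keep_apostrophe else "")
    return "/wiki/" + quote(title, safe=safe)


title = "Fisherman's_Wharf,_San_Francisco"

# Encoder that escapes the apostrophe, like the page-view href in this task.
print(encode_title(title, keep_apostrophe=False))
# Encoder that leaves the apostrophe alone, like the edit-mode href.
print(encode_title(title, keep_apostrophe=True))
```

Both results decode to the same title; the only difference is whether the apostrophe is in the encoder's set of unescaped characters.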

matmarex renamed this task from Inconsistent hrefs between Parser and Parsoid HTML to Inconsistent hrefs between Parser and Parsoid HTML (Parsoid doesn't percent-encode ' as %27). Aug 31 2018, 4:18 PM

I don't consider this a serious issue worth fixing. We want to move towards reducing unnecessary encoding. Making Parsoid match the current output would be a step away from that. Let me know if there are strong reasons to fix Parsoid. Maybe, if anything, we could consider tweaking the PHP parser output instead.

MediaWiki's canonical URLs (in <link rel="canonical">) also percent-encode apostrophes. I think both of these eventually end up calling wfUrlencode(). That has some rationale for various decisions (probably worth checking if Parsoid behavior is the same for other special characters), but when it comes to the apostrophe, all it has to say is that it "seems kind of scary".

I don't think either behavior is better, but it would make sense for Parsoid to generate the same URLs as MediaWiki itself.

ssastry added a subscriber: cscott.

MediaWiki's canonical URLs (in <link rel="canonical">) also percent-encode apostrophes. I think both of these eventually end up calling wfUrlencode(). That has some rationale for various decisions (probably worth checking if Parsoid behavior is the same for other special characters), but when it comes to the apostrophe, all it has to say is that it "seems kind of scary".

Can you point me to this comment?

I don't think either behavior is better, but it would make sense for Parsoid to generate the same URLs as MediaWiki itself.

I don't think this difference is a big breaking issue. When we move everything to use Parsoid output, this will be a non-issue. Until then, it is a temporary annoyance in perhaps some browsers, especially since the two URLs are equivalent and, as @cscott noted, browsers shouldn't be treating them as different links.

Can you point me to this comment?

In the doc comment for wfUrlencode: [https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/GlobalFunctions.php$332]

* But + is not safe because it's used to indicate a space; &= are only safe in
* paths and not in queries (and we don't distinguish here); ' seems kind of
* scary; and urlencode() doesn't touch -_. to begin with.  Plus, although /
* is reserved, we don't care.  So the list we unescape is:

(The comment originally comes from rMW6b1a9d4e4ea9: Unescape more "safe" characters when producing URLs, for added prettiness., which doesn't explain much.)


As a side note, when I blamed this comment trying to find where it came from, I also found that in the past we've switched back-and-forth between encoding and not encoding it:

Reading those commits and associated tasks, it seems that we discovered that Firefox and old Opera will canonicalize URLs with an apostrophe to one of the forms, and get stuck in a redirect loop if MediaWiki tries to canonicalize into the opposite form. Chrome instead canonicalized the tilde with the same effect. (Since then we just stopped doing HTTP redirects to canonicalize the percent-encoding in URLs.)
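The redirect loop described above can be sketched as a small simulation. Everything here is a simplified assumption: `browser_normalize` and `server_canonical` are hypothetical stand-ins, and the direction each side prefers (browser decodes %27, server encodes ') is picked arbitrarily, since the comment only says the browser settles on "one of the forms".

```python
from typing import Optional


def browser_normalize(url: str) -> str:
    """Assumed browser behavior: always request the form with a literal '."""
    return url.replace("%27", "'")


def server_canonical(url: str) -> str:
    """Assumed server behavior: the canonical form has ' encoded as %27."""
    return url.replace("'", "%27")


def follow(url: str, max_hops: int = 5) -> Optional[int]:
    """Return the number of redirects followed, or None if the hop limit is hit."""
    hops = 0
    while hops < max_hops:
        requested = browser_normalize(url)
        target = server_canonical(requested)
        if target == requested:
            # The server is happy with the requested form; no redirect.
            return hops
        # The server redirects to its canonical form, which the browser
        # rewrites again on the next request.
        url = target
        hops += 1
    return None


# No apostrophe: nothing to disagree about, so no redirect happens.
print(follow("/wiki/ASCII"))
# With an apostrophe the two sides never agree and the hop limit is reached.
print(follow("/wiki/Fisherman's_Wharf,_San_Francisco"))
```

Dropping the server-side redirect, as the comment says eventually happened, removes the disagreement: the server simply serves whichever form is requested.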