Page MenuHomePhabricator

Parsoid doesn't recognize interwiki shortcuts in the href attribute
Closed, ResolvedPublic0 Estimated Story Points

Description

Serializing HTML like <a href="de:Foo" rel="mw:WikiLink"> should result in wikitext that round-trips back to the same HTML, or at least to something that is an mw:WikiLink. Instead, the following happens:

$ echo "Foo <a href='de:Bar' rel='mw:WikiLink'>Bar</a> Baz" | node bin/parse.js --html2wt
Foo [[de:Bar|Bar]] Baz

$ echo "Foo [[de:Bar|Bar]] Baz" | node bin/parse.js
[...]
<p data-parsoid='{"dsr":[0,22,0,0]}'>Foo <link rel="mw:PageProp/Language" href="//de.wikipedia.org/wiki/Bar" data-parsoid='[...]'/> Baz</p>

The correct serialization of <a href="de:Foo" rel="mw:WikiLink"> would be [[:de:Foo]]:

$ echo "Foo [[:de:Bar|Bar]] Baz" | node bin/parse.js
[...]
<p data-parsoid='{"dsr":[0,23,0,0]}'>Foo <a rel="mw:ExtLink" href="//de.wikipedia.org/wiki/Bar" title="de:Bar" data-parsoid='[...]'>Bar</a> Baz</p>

Parsoid does perform colon-prefixing correctly for links to category pages:

$ echo "Foo <a href='Category:Foo' rel='mw:WikiLink'>Bar</a> Baz" | node bin/parse.js --html2wt
Foo [[:Category:Foo|Bar]] Baz

Event Timeline

[subbu@earth bin] echo "Foo <a href='de:Bar' rel='mw:WikiLink'>Bar</a> Baz" | node parse.js --html2wt
Foo [[de:Bar|Bar]] Baz

[subbu@earth bin] echo "Foo <a href='//de.wikipedia.org/wiki/Bar' rel='mw:WikiLink'>Bar</a> Baz" | node parse.js --html2wt
Foo [[:de:Bar|Bar]] Baz

Parsoid uses an interwiki matcher regexp derived from the config that uses the full url vs the de:Bar style used here. I didn't know that form of url in hrefs till now.

Parsoid uses an interwiki matcher regexp derived from the config that uses the full url vs the de:Bar style used here. I didn't know that form of url in hrefs till now.

Whether that's a Parsoid bug or a VE bug is debatable. VE lets you make "internal" links to titles like de:Foo (although it'll convert them to De:Foo) , which render as links in VE, but render as nothing once you save (because they get converted to language links), which is very confusing for the user. I decided to file it against Parsoid because I could find a dirty html2html round-trip, but I could see the argument that VE should be doing something here as well.

I am more inclined to say VE bug since the href doesn't resolve in the browser. I created a local file with that link tag (while setting the right base href), href="/Foo" resolves in the browser but href="de:Foo" doesn't. We could update our regexp and link serialization code to recognize this form, but, it seems like the better solution would be for VE to provide a resolved URL instead of a interwiki ref. to a title.

ssastry renamed this task from Parsoid serializes interwiki links to interlanguage links to VE should provide valid urls instead of interwiki shortcuts in the href attribute.Jun 7 2016, 4:54 PM
ssastry moved this task from Needs Triage to Non-Parsing-Team Tasks on the Parsoid board.

I disagree. This is user input. Demanding that end users write URLs when they want to make an interwiki link is not a reasonable outcome, surely?

ssastry renamed this task from VE should provide valid urls instead of interwiki shortcuts in the href attribute to Parsoid doesn't recognize interwiki shortcuts in the href attribute.May 2 2017, 10:31 PM
ssastry claimed this task.
ssastry triaged this task as Medium priority.

Change 351559 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] WIP: T136653: Handle interwiki shortcuts

https://gerrit.wikimedia.org/r/351559

Well, it shouldn't be href="de:Foo" because that ought to have de parsed as a URL protocol. But I could be convinced that href="./de:Foo" is a legit shorthand; you could imagine the PHP side doing a redirect if you actually hit https://en.wikipedia.org/wiki/de:Foo for instance.

you could imagine the PHP side doing a redirect if you actually hit https://en.wikipedia.org/wiki/de:Foo for instance.

In fact it does :)

(This is only done for interwikis with the "Forward" option (see list on https://en.wikipedia.org/wiki/Special:Interwiki).)

In that case, I have no problems. Although we should be careful about precedence, just in case somebody out there sets up File: as an interwiki link. Actual namespaces should take precedence over the "interwiki hack"... unless the forward option is set? What happens PHP side then?

Ok, so, it looks like there are a bunch of things to be resolved here wrt. these shortcuts. There are 3 possibly inputs:

  1. ./de:Foo
  2. de:Foo
  3. https://en.wikipedia.org/wiki/de:Foo

cscott is suggesting we only support 1. and 3. but not 2. since it has the possibility of accidentally covering up a protocol in scenarios where an interwiki prefix matches a protocol (ex: news). This means VE has to ensure that it adds the necessary "./" prefix.

And, then there is the forwarding option that matmarex noted above. So, if the forwarding option is not set, then it seems we shouldn't even be supporting these shortcuts. But, that means VE has to know about it which makes it a mess. So, do we then treat these as user errors?

Separately, there is the html2html issue in Parsoid for scenarios where the shortcut is invalid (either because the forwarding option is not set or because the href format is invalid). Right now, Parsoid will happily emit it an extlink for it which won't then html2html ...

[subbu@earth parsoid] echo "<a href='arxiv:Foo'>Foo</a>" | parse.js --html2html --normalize
<p>[arxiv:Foo Foo]</p>

This last bit is an edge case but it needs a reasonable solution nevertheless

Ok, so, it looks like there are a bunch of things to be resolved here wrt. these shortcuts. There are 3 possibly inputs:

  1. ./de:Foo
  2. de:Foo
  3. https://en.wikipedia.org/wiki/de:Foo

cscott is suggesting we only support 1. and 3. but not 2. since it has the possibility of accidentally covering up a protocol in scenarios where an interwiki prefix matches a protocol (ex: news).

That seems odd, as I don't think use case 1 is ever encountered in the wild.

This means VE has to ensure that it adds the necessary "./" prefix.

That'd be very meh.

VE should be escaping colon characters in titles, or else the URL in the href is not valid. Either ./File:Foo.jpg or File%3AFoo.jpg, your choice.

Assuming that's done (and I was almost certain it was already standard practice for VE), the use enters "de:Foo" and VE gives Parsoid "./de:Foo". Case 2 should never happen.

cf "not valid" -> step (5)(1) from https://url.spec.whatwg.org/#scheme-state linked from https://html.spec.whatwg.org/multipage/semantics.html#following-hyperlinks-2 step 5 from https://html.spec.whatwg.org/multipage/semantics.html#the-a-element "activation behavior" step 4.

Experimentally, if you type "de:project" the href on the VE canvas is https://en.wikipedia.org/wiki/De%3Aproject which is fine (case 3), and then VE munges this to <a href="De%3Aproject" rel="mw:WikiLink">project</a> before passing it to Parsoid, which is also fine (case 2 with proper URL encoding).

So VE is already URL encoding the colons. We should probably be careful how/where we decode the URL encoding in Parsoid so that URL schemes are not confused with mediawiki prefixes.

Change 351559 merged by jenkins-bot:
[mediawiki/services/parsoid@master] T136653: Handle interwiki shortcuts

https://gerrit.wikimedia.org/r/351559

The interwiki shortcuts patch has been deployed, but it still doesn't work as expected because I think VE isn't adding the "./" prefix in hrefs. So, leaving this open.

[subbu@earth deploy] echo '<a rel='mw:WikiLink' href="en:Hampi" title="it:Luna">boo</a>' | parse.js --html2wt --prefix itwiki
[[en:Hampi|boo]]

[subbu@earth deploy] echo '<a rel='mw:WikiLink' href="./en:Hampi" title="it:Luna">boo</a>' | parse.js --html2wt --prefix itwiki
[[:en:Hampi|boo]]

My tests show that if I add the 'en:Hampi' target in the link form, I get the language link output rather than the interwiki output that indicates that VE isn't adding the "./" prefix.

See https://phabricator.wikimedia.org/T136653#3232588 -- href="en:Hampi" is still wrong. VE would generate href="en%3AHampi" for this case, which is correct.

EDIT: and to clarify a bit -- note that https://es.wikipedia.org/wiki/en%3AHampi is interpreted on the PHP side as a redirect. So there is no possibility of confusion with an actual article named Hampi in namespace en. So Parsoid cutting out a step in the redirect and making href="https://es.wikipedia.org/wiki/en%3AHampi" equivalent to href="https://en.wikipedia.org/wiki/Hampi" seems unobjectionable.

  • href="news:Foo" is an external UUCP link: [news:Foo]
  • href="news%3AFoo" or href="./news:Foo is an interwiki link: [[:news:Foo]]

Change 356312 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] T136653: Handle VE-style interwiki shortcuts

https://gerrit.wikimedia.org/r/356312

Change 356312 merged by jenkins-bot:
[mediawiki/services/parsoid@master] T136653: Handle VE-style interwiki shortcuts

https://gerrit.wikimedia.org/r/356312

After another bunch of deployed fixes, confirmed that this works in VE.