html -> wt: Parsoid sometimes trips up on | chars in hrefs
Closed, ResolvedPublic

Description

On this HTML below, Parsoid falls through to the invalid href code path in the link handler and serializes it to just the link text.

<a href="https://stats.wikimedia.org/v2/#/fr.wikipedia.org/reading/page-views-by-country/normal|map|2-Year~2016060100~2018071100|~total">9</a>
<a href="https://stats.wikimedia.org/v2/#/fr.wikipedia.org/reading/page-views-by-country/normal%7Cmap%7C2-Year~2016060100~2018071100%7C~total">10</a>

See T199718#4431580

ssastry created this task.Jul 18 2018, 5:00 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 18 2018, 5:00 PM
ssastry triaged this task as Normal priority.Jul 18 2018, 5:00 PM
ssastry moved this task from Backlog to html2wt on the Parsoid board.Jul 23 2018, 10:16 AM

There | char protection is fine because this is trying to serialized as a wikilink.

The interwiki matcher has,

{
  "prefix": "stats",
  "local": "",
  "url": "https://stats.wikimedia.org/$1"
},

which the href matches.

Maybe the right thing is to add (v2\/)? to https://github.com/wikimedia/parsoid/blob/master/lib/html2wt/LinkHandler.js#L197
but that doesn't seem very general.

Maybe there needs to be a hardcoded exception for statswiki which doesn't seem to be a wiki.

I don't think this is specific to this interwiki. You get the same problem with e.g. this:

<a href="https://www.google.com/search?q=a|b">x</a>

…which serializes to x instead of [https://www.google.com/search?q=a|b x] (I guess it tries to serialize to [[google:a|b|x]], but that's invalid).

{
  "prefix": "google",
  "url": "https://www.google.com/search?q=$1",
  "protorel": ""
},

is in the interwiki map, so that is also trying to serialize to the wikilink form and then bailing.

We can add another clause to https://github.com/wikimedia/parsoid/blob/master/lib/html2wt/LinkHandler.js#L190-L200
so that it tests for | and falls back to extlink if it's present. Or, percent encode the | since interwiki targeta are just pattern matched into the url where the $1 occurs.

Or, percent encode the | since interwiki targeta are just pattern matched into the url where the $1 occurs.

I just tried and this does not actually work! :o Wikitext syntax like [[google:a%7cb]] or [[stats:a%7cb]] does not produce a link. I am guessing that we still validate this "title" against the usual rules? In that case, links with < >, [ ], or other characters invalid in page titles probably cause the same issue as |.

Arlolra assigned this task to Sbailey.Jul 24 2018, 8:51 PM

Change 447728 had a related patch set uploaded (by Sbailey; owner: Sbailey):
[mediawiki/services/parsoid@master] html -> wt: Parsoid sometimes trips up on | chars in hrefs

https://gerrit.wikimedia.org/r/447728

I am guessing that we still validate this "title" against the usual rules?

Yeah, it gets passed to Title::newFromText() which applies the valid title regexp, so we should bail more generally on env.isValidLinkTarget

<a href="https://stats.wikimedia.org/v2/#/fr.wikipedia.org/reading/page-views-by-country/normal|map|2-Year~2016060100~2018071100|~total">9</a>
<a href="https://stats.wikimedia.org/v2/#/fr.wikipedia.org/reading/page-views-by-country/normal%7Cmap%7C2-Year~2016060100~2018071100%7C~total">10</a>

What's fun here is that the bar occurs in fragment identifier, so these are two distinct cases. The former should fall back to extlink syntax because the bar would divide the link content in the wrong place. The latter can serialize to the wikilink form, since the it wouldn't be part of the link target.

[[stats:v2/#/fr.wikipedia.org/reading/page-views-by-country/normal%7Cmap%7C2-Year~2016060100~2018071100%7C~total|10]]

Change 447728 merged by jenkins-bot:
[mediawiki/services/parsoid@master] html -> wt: Parsoid sometimes trips up on | chars in hrefs

https://gerrit.wikimedia.org/r/447728

Arlolra closed this task as Resolved.Aug 7 2018, 11:35 PM

Change 451198 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Remove unnecessary pattern from interwiki checks

https://gerrit.wikimedia.org/r/451198

Change 451198 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Remove unnecessary pattern from interwiki checks

https://gerrit.wikimedia.org/r/451198