Can't create links to hash fragments with square brackets
Closed, Resolved (Public)

Description

  1. Open VE
  2. Create a link
  3. Paste Wiktionary:Beer_parlour#[on_hold]_Temporary_accounts_will_be_rolled_out_soon, as one might reasonably do if copying the URL from https://en.wiktionary.org/wiki/Wiktionary:Beer_parlour#[on_hold]_Temporary_accounts_will_be_rolled_out_soon
  4. Save the page

Observed:
The square brackets are not encoded, so the link is invalid and the raw wikitext is rendered:
[[Wiktionary:Beer_parlour#[on_hold]_Temporary_accounts_will_be_rolled_out_soon]]
Expected:
A link is rendered

Event Timeline

VE produces the following HTML:

<p><a href="./Wiktionary:Beer_parlour#[on_hold]_Temporary_accounts_will_be_rolled_out_soon" rel="mw:WikiLink">Wiktionary:Beer_parlour#[on_hold]_Temporary_accounts_will_be_rolled_out_soon</a></p>

which converts to

[[Wiktionary:Beer_parlour#[on_hold]_Temporary_accounts_will_be_rolled_out_soon]]

which converts back to

<p>[[Wiktionary:Beer_parlour#[on_hold]_Temporary_accounts_will_be_rolled_out_soon]]</p>

I'm not sure if we are supposed to support [ in a hash fragment. An HTML5 validator claims it isn't valid, but we output it in various places in MW, for example in Vector:

<a class="vector-toc-link" href="#[on_hold]_Temporary_accounts_will_be_rolled_out_soon">

and DiscussionTools

Latest comment: <a href="#c--sche-20250916214200-[on_hold]_Temporary_accounts_will_be_rolled_out_soon">11 hours ago</a>

I'm not sure if we are supposed to support [ in a hash fragment. An HTML5 validator claims it isn't valid, but we output it in various places in MW [...]

Reading this piqued my curiosity :) I'm not 100% sure if this is the definitive answer or not, but https://developer.mozilla.org/en-US/docs/Web/URI/Reference/Fragment links to https://www.rfc-editor.org/rfc/rfc3986.html#section-3.5, which says that the syntax for hash fragments is fragment = *( pchar / "/" / "?" ) (which can be expanded using Appendix A of that RFC). From the expansion done in this Stack Overflow answer (which at first glance appears to correctly match the RFC's syntax definitions), it would appear that square brackets are indeed not supposed to be included in a hash fragment, IIUC.
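
For anyone who wants to check this mechanically, here is a minimal sketch that expands the RFC 3986 fragment production into a regular expression. The names `FRAGMENT_RE` and `is_rfc3986_fragment` are made up for illustration; the character sets are taken from Appendix A of the RFC.

```python
import re

# RFC 3986, Appendix A:
#   fragment   = *( pchar / "/" / "?" )
#   pchar      = unreserved / pct-encoded / sub-delims / ":" / "@"
#   unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
#   sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="
PCHAR = rf"(?:[{UNRESERVED}{SUB_DELIMS}:@]|%[0-9A-Fa-f]{{2}})"
FRAGMENT_RE = re.compile(rf"(?:{PCHAR}|[/?])*")


def is_rfc3986_fragment(fragment: str) -> bool:
    """Return True if `fragment` (without the leading '#') matches RFC 3986."""
    return FRAGMENT_RE.fullmatch(fragment) is not None
```

Under this grammar, `is_rfc3986_fragment("[on_hold]_Temporary_accounts_will_be_rolled_out_soon")` is false, while the percent-encoded form starting with `%5Bon_hold%5D` is accepted.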

https://www.w3.org/TR/2011/WD-html5-20110525/urls.html#parsing-urls adds some characters to RFC 3986, specifically listing U+005B .. U+005E, where 5B and 5D are the two square brackets, so it appears they are allowed.

For our current purposes the HTML5 spec is probably more relevant, which links out to https://url.spec.whatwg.org/#url-fragment-string.

The URL code points are ASCII alphanumeric, U+0021 (!), U+0024 ($), U+0026 (&), U+0027 ('), U+0028 LEFT PARENTHESIS, U+0029 RIGHT PARENTHESIS, U+002A (*), U+002B (+), U+002C (,), U+002D (-), U+002E (.), U+002F (/), U+003A (:), U+003B (;), U+003D (=), U+003F (?), U+0040 (@), U+005F (_), U+007E (~), and code points in the range U+00A0 to U+10FFFD, inclusive, excluding surrogates and noncharacters.
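
The list quoted above can be turned into a small membership check. This is only a sketch of that one definition: `is_url_code_point` and `non_url_code_points` are hypothetical helper names, and the code does not model percent-encoded bytes or what browsers actually accept when parsing a fragment.

```python
def is_url_code_point(ch: str) -> bool:
    """Check one character against the 'URL code points' list quoted above."""
    cp = ord(ch)
    if ch.isascii() and ch.isalnum():
        return True
    if ch in "!$&'()*+,-./:;=?@_~":
        return True
    if 0xA0 <= cp <= 0x10FFFD:
        surrogate = 0xD800 <= cp <= 0xDFFF
        # Noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane.
        nonchar = 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
        return not (surrogate or nonchar)
    return False


def non_url_code_points(fragment: str) -> list[str]:
    """Return the characters of `fragment` that are not in the quoted list."""
    return [ch for ch in fragment if not is_url_code_point(ch)]
```

For the fragment in this task, `non_url_code_points("[on_hold]_Temporary_accounts_will_be_rolled_out_soon")` returns `['[', ']']`, so the brackets fall outside that list even though browsers resolve such fragments in practice.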

I believe HTML5 has greatly liberalized the rules for ID fragments, and Sanitizer::escapeIdInternal() seems to agree with me. So I would assume that <a class="vector-toc-link" href="#[on_hold]_Temporary_accounts_will_be_rolled_out_soon"> is indeed valid (experimentally, browsers treat it as valid and resolve it correctly), and that this is Parsoid's fault for turning it into invalid [[...]] syntax. (I suspect that if we generated extlink [ ... ] or autolink https://.... syntax for it we'd do that correctly, but that's worth checking as well.)

Ah, interesting! Thanks for the links, all, and apologies for apparently being wrong about this :]

Change #1189255 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/services/parsoid@master] Ensure wikilink fragments are appropriately escaped

https://gerrit.wikimedia.org/r/1189255

Turns out we had a variant of this same problem six years ago: T199926: html -> wt: Parsoid sometimes trips up on | chars in hrefs

The solution in the patch above is a more general fix which ought to address | and [ and a number of other nasties.

I also checked extlink and autolink serialization and they were, in fact, already correct. It was just wikilink serialization which had this corner case for fragments.
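
To illustrate the general idea (this is not the code from the patch above, which lives in Parsoid and may escape differently), here is a rough sketch that percent-encodes wikitext-significant characters in the fragment of a wikilink target. The function name and the exact character set are assumptions made for the example, and it assumes the wikitext parser decodes percent-escapes in link targets.

```python
# Hypothetical set of characters that would break [[...]] syntax if left raw.
WIKITEXT_SPECIAL = "[]|{}<>"
_ESCAPES = {ord(ch): f"%{ord(ch):02X}" for ch in WIKITEXT_SPECIAL}


def escape_wikilink_fragment(target: str) -> str:
    """Percent-encode wikitext-special characters after the first '#'."""
    page, sep, fragment = target.partition("#")
    # Targets without a fragment are returned unchanged (sep and fragment are empty).
    return page + sep + fragment.translate(_ESCAPES)


target = "Wiktionary:Beer_parlour#[on_hold]_Temporary_accounts_will_be_rolled_out_soon"
print(f"[[{escape_wikilink_fragment(target)}]]")
# [[Wiktionary:Beer_parlour#%5Bon_hold%5D_Temporary_accounts_will_be_rolled_out_soon]]
```

Only the fragment is touched, so the page title part of the target is serialized exactly as before.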

Change #1189255 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Ensure wikilink fragments are appropriately escaped

https://gerrit.wikimedia.org/r/1189255

For Tech news:

Anchor links that included the symbol # as well as the symbols [ or ] were not encoded, creating an erroneous link. This has been fixed.

Feel free to refine this.

Change #1192167 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.22.0-a25

https://gerrit.wikimedia.org/r/1192167

"Adding a link in Visual Editor which included the symbol [ or ] after a # created an erroneous link in the wikitext. This has been fixed."

Change #1192167 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.22.0-a25

https://gerrit.wikimedia.org/r/1192167