New wikitext editor should anchordecode internal links
Closed, Invalid · Public · 1 Estimated Story Points

Description

Steps to reproduce: paste a link to https://hu.wikipedia.org/wiki/Foo#b.C3.A1r into a huwiki editbox.
Expected result: [[Foo#bár]]
Actual result: [[Foo#b.C3.A1r]]
They both work, but the second one is rather ugly.
(Note that I could not reproduce this on enwiki, which just pastes the URL as it is. Does it use a newer version of the editor?)

Event Timeline

Restricted Application added a subscriber: Aklapper.
Jdforrester-WMF set the point value for this task to 1.
Jdforrester-WMF moved this task from To Triage to TR1: Releases on the VisualEditor board.
Jdforrester-WMF added a subscriber: Jdforrester-WMF.

Is this based on the browser's language locale rules?

No, anchorencoding is something MediaWiki does to transform [[Foo#bár]] into the URL https://hu.wikipedia.org/wiki/Foo#b.C3.A1r. It's kind of like percent encoding (but with an added layer of stupidity): encode to UTF-8, then encode bytes not allowed in the HTML4 ID attribute charset to allowed ones. This feature request is about NWE (and probably VE?) undoing that.
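
To make the two steps concrete, here is a minimal Python sketch of the legacy encoding as described above. It is an illustration only, not a port of Sanitizer::escapeId: the function name legacy_anchor_encode and the exact set of characters that pass through unchanged are assumptions.

```python
# Characters assumed to pass through unchanged (ASCII letters, digits and a
# few punctuation marks). The set used by MediaWiki itself may differ.
SAFE = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-_.:"
)


def legacy_anchor_encode(title: str) -> str:
    """Hypothetical helper: encode a section title the legacy way."""
    out = []
    # Spaces become underscores, then the title is encoded to UTF-8 bytes.
    for byte in title.replace(" ", "_").encode("utf-8"):
        ch = chr(byte)
        if ch in SAFE:
            out.append(ch)
        else:
            # Every other byte is written as a dot plus two hex digits.
            out.append(".%02X" % byte)
    return "".join(out)


print(legacy_anchor_encode("bár"))  # b.C3.A1r
```

The dot plays the role that % plays in percent encoding, which is why the result looks like a URL-encoded string with the wrong escape character.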

Uhh, I of course meant to say Foo#b.C3.A1r is the ugly one.

You can see how anchor encoding works in Sanitizer::escapeId and CoreParserFunctions::anchorEncode.

This is a trade-off as the ugly version is somewhat more robust: anchor encoding is not entirely reversible, for two reasons:

  • anchor encoding leaves . characters alone, so if the section title itself contains anchorencode-like sequences (i.e. the section title is == b.C3.A1r ==), anchor decoding will break it.
  • anchors are deduplicated. If the article has two sections called bár, the anchor for the first one will be b.C3.A1r and the anchor for the second one will be b.C3.A1r_2, so anchor decoding that will fail. This would make it impossible to insert links to the second instance of an identical section title. (Even more confusing when you have three sections called bár, bár and bár 2, in which case the anchors will be b.C3.A1r, b.C3.A1r_2 and b.C3.A1r_2_2 and anchor-decoding the second one will give you a working link to the third section.)
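
Both failure modes can be reproduced with a small Python sketch. Everything here is an assumption made for illustration: legacy_anchor_encode approximates the legacy encoding, naive_anchor_decode is a hypothetical decoder of the kind this task asks for, and assign_anchors imitates the deduplication described above rather than the parser's actual code.

```python
import re
from urllib.parse import unquote


def legacy_anchor_encode(title):
    """Approximate legacy encoding: UTF-8 bytes, unsafe ones as .XX."""
    data = title.replace(" ", "_").encode("utf-8")
    return "".join(
        chr(b) if re.fullmatch(r"[A-Za-z0-9\-_.:]", chr(b)) else ".%02X" % b
        for b in data
    )


def naive_anchor_decode(anchor):
    """Hypothetical decoder: treat every .XX as an encoded byte."""
    percent = re.sub(r"\.([0-9A-F]{2})", r"%\1", anchor)
    return unquote(percent).replace("_", " ")


def assign_anchors(titles):
    """Imitate deduplication: append _2, _3, ... until the anchor is unused."""
    used, result = set(), []
    for title in titles:
        anchor = legacy_anchor_encode(title)
        if anchor in used:
            n = 2
            while f"{anchor}_{n}" in used:
                n += 1
            anchor = f"{anchor}_{n}"
        used.add(anchor)
        result.append(anchor)
    return result


# 1. Two different titles share one anchor, so decoding has to guess.
print(legacy_anchor_encode("bár"), legacy_anchor_encode("b.C3.A1r"))
print(naive_anchor_decode("b.C3.A1r"))  # always "bár", never "b.C3.A1r"

# 2. Deduplication suffixes are indistinguishable from title text.
print(assign_anchors(["bár", "bár", "bár 2"]))
print(naive_anchor_decode("b.C3.A1r_2"))  # "bár 2"
```

In the second case the decoded text of the second section's anchor is identical to the literal title of the third section, so the decoder cannot tell from the anchor alone which section was meant.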

Now that I wrote it down, this does not seem like a very good trade-off... unless the editor can look up the target page and check which section is being referenced, which would make it significantly more complex.

None of these functions are available client-side, I assume?

There is mw.util.escapeId (the more complex stuff in the parser / parser function would not make much sense client-side), but you want the opposite direction; I don't think that exists, even on the server side.

T75092: Anchors to section names for non-ASCII letters are encoded in the URL / T152540: Migrate to HTML5 section ids are also relevant to this task (or will be, eventually).

Jdforrester-WMF lowered the priority of this task from Medium to Lowest. Mar 21 2017, 11:22 PM

If Foo#bár is used as link target in VisualEditor, this should be preserved and make its way into wikitext. I believe that is already the case.

The problem is that doing that essentially requires manually copying the text (not the wikitext or html) from a page's heading (e.g. from its table of contents?) and pasting it in the link target input field after a # (hash) symbol.

A much easier thing to do for users is going to the target page, clicking on the section in the TOC, and copying/pasting the link, e.g. https://hu.wikipedia.org/wiki/Foo#b.C3.A1r.

At that point, VisualEditor only has two bad choices: produce an external link, or produce an internal link based on the known article-path pattern. It currently does the latter and produces a link to [[Foo#b.C3.A1r]], which is also supported in wikitext (naturally, since the escaping is idempotent).

However, by then we've already lost. There is no way back, given that the encoding cannot reliably be reversed. Neither VisualEditor nor Parsoid can get the canonical name from this.

However, one thing we can do is discourage users from copying/pasting full links by giving them a better user interface in the first place.

VisualEditor's link editor could be made to support section linking. E.g. when typing Foo, it would autocomplete with the available sections on that page (by using the API somehow). At this point, we have all the information to make the best possible link.

Example from Google Docs (This example is for the current document only, but it shows the idea):

[Image: Screen Shot 2017-06-23 at 19.02.34.png]

The new HTML5 fragment escaping from https://gerrit.wikimedia.org/r/362326 (fd6e9ef2d4) is not idempotent anymore: bár will be encoded to b%C3%A1r and b%C3%A1r will be encoded to b%25C3%25A1r.
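
The difference in idempotence can be checked with a short Python sketch. Both functions are stand-ins, not MediaWiki's code: legacy_encode approximates the old dot-based scheme, and html5_link_encode approximates the new escaping with plain percent-encoding, which is an assumption about its behaviour beyond the two examples given above.

```python
import re
from urllib.parse import quote


def legacy_encode(fragment):
    """Approximate legacy encoding: unsafe UTF-8 bytes become .XX."""
    data = fragment.replace(" ", "_").encode("utf-8")
    return "".join(
        chr(b) if re.fullmatch(r"[A-Za-z0-9\-_.:]", chr(b)) else ".%02X" % b
        for b in data
    )


def html5_link_encode(fragment):
    """Stand-in for the HTML5 escaping: percent-encode, including '%'."""
    return quote(fragment.replace(" ", "_"), safe="")


once = legacy_encode("bár")
print(once, legacy_encode(once))      # b.C3.A1r b.C3.A1r

once = html5_link_encode("bár")
print(once, html5_link_encode(once))  # b%C3%A1r b%25C3%25A1r
```

The legacy output contains only characters the encoder leaves alone, so a second pass changes nothing. The percent sign is itself escaped by the new scheme, so encoding an already encoded fragment produces a different string.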

Is this still an issue, now that HTML5 section IDs have landed? I bet this task could be closed...

I can't get NWE to decode links at all now; they are pasted as is. VE decodes links (which use the new section encoding) correctly though.

WikiEditor can detect internal links from URLs and convert them to internal links:

The section Bar on the page Foo of the fictitious wiki www.example.org has the URL https://www.example.org/wiki/Foo#Bar. You are in the WikiEditor on https://www.example.org/w/index.php?title=Sandbox&action=edit. Click on the insert link button, insert the URL https://www.example.org/wiki/Foo#Bar into Target page or URL and click on Insert link. The following question is shown:

The URL you specified looks like it was intended as a link to another wiki page. Do you want to make it an internal link?

Click on Internal link. The value in Target page or URL is converted from https://www.example.org/wiki/Foo#Bar to Foo#Bar. The URL is successfully converted to an internal link.

The section Bár on the page Fóo of the fictitious wiki www.example.org has the URL https://www.example.org/wiki/F%C3%B3o#B%C3%A1r. You are in the WikiEditor on https://www.example.org/w/index.php?title=Sandbox&action=edit. Click on the insert link button, insert the URL https://www.example.org/wiki/F%C3%B3o#B%C3%A1r into Target page or URL and click on Insert link. The following question is shown:

The URL you specified looks like it was intended as a link to another wiki page. Do you want to make it an internal link?

Click on Internal link. The value in Target page or URL is converted from https://www.example.org/wiki/F%C3%B3o#B%C3%A1r to F%C3%B3o#B%C3%A1r and the following error message is shown: The requested page title contains invalid characters: "%C3".

When converting a URL to an internal link, the URL decoding of both the title and the anchor is missing. This is a valid bug.
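
As a rough illustration of the missing step, the conversion could percent-decode the title and the anchor separately before joining them. This Python sketch is not WikiEditor's code; the function name url_to_internal_link, the host and the /wiki/ article prefix are assumptions taken from the example wiki above.

```python
from urllib.parse import urlsplit, unquote


def url_to_internal_link(url, host="www.example.org", article_prefix="/wiki/"):
    """Hypothetical helper: turn a local article URL into a link target.

    Returns None when the URL does not look like a local article URL.
    """
    parts = urlsplit(url)
    if parts.netloc != host or not parts.path.startswith(article_prefix):
        return None
    # Decode the title and treat underscores as spaces.
    title = unquote(parts.path[len(article_prefix):]).replace("_", " ")
    target = title
    if parts.fragment:
        # Decode the anchor independently of the title.
        target += "#" + unquote(parts.fragment).replace("_", " ")
    return target


print(url_to_internal_link("https://www.example.org/wiki/Foo#Bar"))
print(url_to_internal_link("https://www.example.org/wiki/F%C3%B3o#B%C3%A1r"))
```

With this decoding, the second URL yields Fóo#Bár instead of F%C3%B3o#B%C3%A1r. The sketch only handles percent-encoded fragments; anchors in the older dot-encoded form would pass through unchanged.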


A different one, though? This one was about the new wikitext editor (VisualEditor in source mode), not WikiEditor.