
Parsoid does not convert underscores to spaces for interwiki links
Open, Needs Triage, Public

Description

As a deliberate design decision, Parsoid tries hard not to normalize titles in interwiki links because they are by definition links to "other wikis" and not all "other wikis" are MediaWiki wikis. For example, the meatball interwiki points to http://www.usemod.com/cgi-bin/mb.pl?$1 (a usemod wiki), and PMID/RFC magic links were treated as interwiki links to non-wikis. Because we can't assume that the target wiki uses the same title normalization as MediaWiki does, we avoid doing as much normalization as possible and just use the title as specified by the author. In particular: spaces are left as-is, and no capitalization of the initial letter is done.

Language links are treated as a type of interwiki link, so they are similarly un-normalized, although the metadata in ParserOutput is normalized.

The legacy parser applies MediaWiki title normalization to all link targets, including interwiki links. In particular, spaces are converted to underscores and the first letter is capitalized (EDIT: I don't think this is true, see Image with link parameter, interwiki target test case). This led to the incompatibility in T376043, and may cause other issues in the future.
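To make the difference concrete, here is a minimal Python sketch of the two behaviors described above. It is not Parsoid or legacy parser code: the function names are hypothetical, the percent-encoding step is just this sketch's choice for producing a valid URL, and first-letter capitalization is an opt-in flag because (per the EDIT above) it is unclear whether the legacy parser actually does it for interwiki targets.

```python
from urllib.parse import quote

# Interwiki URL pattern for the meatball prefix, as quoted in the description.
MEATBALL = "http://www.usemod.com/cgi-bin/mb.pl?$1"


def as_written(title: str) -> str:
    """Keep the author's title untouched (the current Parsoid approach)."""
    return title


def mediawiki_style(title: str, capitalize_first: bool = False) -> str:
    """Apply MediaWiki-like normalization: spaces become underscores.

    capitalize_first is off by default because it is an open question
    whether the legacy parser capitalizes interwiki targets.
    """
    normalized = title.strip().replace(" ", "_")
    if capitalize_first and normalized:
        normalized = normalized[0].upper() + normalized[1:]
    return normalized


def expand(url_pattern: str, title: str) -> str:
    """Substitute the title for $1, percent-encoding unsafe characters."""
    return url_pattern.replace("$1", quote(title, safe="/:"))


print(expand(MEATBALL, as_written("soft security")))
print(expand(MEATBALL, mediawiki_style("soft security")))
```

The two printed URLs differ only in how the space is represented, which is exactly the difference a non-MediaWiki target may or may not tolerate.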

This task stands in for a discussion of converting Parsoid behavior to match the behavior of the legacy parser.

Pros

  • Consistency with legacy parser behavior and with content, gadgets, third-party tools, etc.
  • The title processing code gets a little tidier and DRYer.
  • Avoids unnecessary redirects when following interwiki links.
  • Third-party clients get closer to a "canonical title" (still no guarantee it's not a redirect, though).

Cons

  • As described above, interwiki targets are not guaranteed to be MediaWiki wikis, so some titles on an external wiki might not be expressible in interwiki syntax and would have to be written as external links. Parsoid/VisualEditor would also need to know about those cases, since we autoconvert external links to interwiki links when possible.
  • At this point changing interwiki link encoding would be a breaking change to the MediaWiki DOM Spec and would require coordination with third party users of Parsoid output.

We discussed this issue at the Content-Transform-Team tech forum on 2024-11-12 and decided to leave the current behavior as-is for now. This task is a placeholder for a future reconsideration of that decision.

Event Timeline

Change #1087215 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/services/parsoid@master] Make interwiki and language link hrefs consistent with wikilink hrefs

https://gerrit.wikimedia.org/r/1087215

Another con: An article on another wiki that you've already visited doesn't get :visited styling when the href contains spaces instead of underscores.

Maybe the interwiki table needs a switch for whether or not the target title should be treated like a MediaWiki title? (After all, the vast majority of the interwiki link targets that are actually used – especially when you consider interlanguage links – go to MediaWiki wikis.)
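A rough sketch of what such a switch could look like, in Python for illustration only. The `InterwikiEntry` class, the `mediawiki_titles` field, and the `example.org` URL are all hypothetical; nothing here reflects the actual interwiki table schema or any existing Parsoid API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterwikiEntry:
    prefix: str
    url: str                # URL pattern containing $1
    mediawiki_titles: bool  # hypothetical switch: target uses MediaWiki title rules


def link_title(entry: InterwikiEntry, title: str) -> str:
    """Normalize the title only when the target is flagged as MediaWiki-like."""
    if entry.mediawiki_titles:
        return title.strip().replace(" ", "_")
    # Unknown wiki software: keep the title exactly as the author wrote it.
    return title


# The meatball URL comes from the task description; the second entry is made up.
meatball = InterwikiEntry("meatball", "http://www.usemod.com/cgi-bin/mb.pl?$1", False)
other_mw = InterwikiEntry("examplewiki", "https://example.org/wiki/$1", True)

print(link_title(meatball, "soft security"))
print(link_title(other_mw, "soft security"))
```

With a per-prefix flag like this, links to MediaWiki targets would get underscore hrefs (which would also address the :visited mismatch), while non-MediaWiki targets would keep today's behavior.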