Page MenuHomePhabricator

Section titles containing a question mark should be URL encoded in table of contents links
Open, Needs TriagePublicBUG REPORT

Description

Steps to replicate the issue (include links if applicable):

What happens?:

What should have happened instead?:

image.png (442×1 px, 47 KB)

  • Oh, this is happening on Phabricator too. See above.

image.png (131×1 px, 24 KB)

Software version (skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

Event Timeline

I misread, I thought the linked ticket T95473 was resolved, in reality it's still open.

I'll still leave this one open though, since that one is about interwiki links, and this one is about regular wikilinks.

@Novem_Linguae can you assist me on how the links are being converted and which section to look out for, as on searching I figured out the parser class is responsible for converting syntax to html if that's right?

Not sure. I'd have to dig into the code.

A good approach to this ticket might be to add a parser test to mediawiki/tests/parser/* somewhere that is something like...

!! test
URL-encode punctuation in hyperlinks (T326365)
!! wikitext
__TOC__

== Test? ==
!! html
paste the current parser output, then tweak the question mark in the TOC <a href=""> to be %3F instead of ?
!! end

Then start tweaking the parser code until the test passes.

See also T50940 for the same problem, but with page titles.

Change #1186601 had a related patch set uploaded (by Anshuman.rai.1004; author: Anshuman.rai.1004):

[mediawiki/services/parsoid@master] T326365: Percent-encode fragment identifiers in TOC links

https://gerrit.wikimedia.org/r/1186601

I don't know enough about Parsoid to review this. I've added a different reviewer to your patch that should be able to help.

I believe the correct place to start would be patching Sanitizer::escapeIdInternalUrl() in mediawiki-core, which currently reads:

	private static function escapeIdInternalUrl( string $id, string $mode ): string {
		$id = self::escapeIdInternal( $id, $mode );
		if ( $mode === 'html5' ) {
			$id = preg_replace( '/%([a-fA-F0-9]{2})/', '%25$1', $id );
		}
		return $id;
	}

This is extremely selective about %-encoding URLs, basically *only* encoding the % character itself in the "modern" html5 anchors. You are proposing to also urlencode ? . and ! (what else?), but only when they appear at the end of a link. That could be added to the implementation of escapeIdInternalUrl() above.

And then, as @Novem_Linguae suggests, a parser test should also be added for this behavior, and a separate patch prepared to change Parsoid's copy of Sanitizer::escapeIdInternalUrl() (in parsoid:src/Core/Sanitizer.php) the same way.

Change #1187098 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/services/parsoid@master] WIP: Protect trailing punctuation in titles and anchor fragments

https://gerrit.wikimedia.org/r/1187098