
section name "F&GT" shows as "F>" in table of contents, edit summaries, and element ID
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • Write this to a sandbox page, save:
      __TOC__

      == F&GT ==
      Test ~~~~
  • Then write a reply using DiscussionTools (not regular editing), save

What happens?:

  • table of contents says F>
  • DiscussionTools edit summary says /* F> */
  • HTML element ID is F>

What should have happened instead?:

  • table of contents says F&GT
  • DiscussionTools edit summary says /* F&GT */
  • HTML element ID is F&GT

Software version (skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

  • &gt; is the HTML symbol for > (greater than), but I would not expect it to render without the semicolon. Here there is no semicolon and it still renders.

Event Timeline

DiscussionTools edit summary is based on the HTML element ID, so fixing that should also fix DiscussionTools. (Relevant code is around these functions: https://codesearch.wmcloud.org/deployed/?repos=mediawiki/extensions/DiscussionTools&q=getLinkableTitle)

The bug is partially present in MW 1.39 already: https://patchdemo.wmflabs.org/wikis/1e1cc989d1/wiki/Talk:T355386?useskin=vector-2022 (only TOC text is affected and only in new Vector – later MediaWiki versions increasingly reused the new Vector TOC code for other things, so the issue appears in more places)

The bug is not present when using the new Parsoid parser: https://test.wikipedia.org/w/index.php?title=Talk:T355386&useparsoid=1


  • &gt; is the HTML symbol for > (greater than), but I would not expect it to render without the semicolon. Here there is no semicolon and it still renders.

In HTML, &gt without the semicolon is also a valid entity for >, but in wikitext it's not – there's a note about that here: https://gerrit.wikimedia.org/g/mediawiki/core/+/1d40c4039f7cb720f889fc0687c6622fff40ece7/includes/parser/Sanitizer.php#50
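To make the difference concrete, here is a small Python sketch. `html.unescape` follows the HTML rules, so it stands in for "what an HTML parser does". `decode_strict` and its pattern are hypothetical, written only to illustrate the "semicolon required" rule; they are not MediaWiki's Sanitizer code.

```python
import html
import re

# HTML rules: a few legacy names (gt, GT, amp, ...) decode even when the
# trailing semicolon is missing.
html_with = html.unescape("F&gt;")
html_without = html.unescape("F&GT")

# Assumption: an illustrative pattern for "wikitext-style" references,
# which must end in ';'. This is not MediaWiki's actual regex.
STRICT_REF = re.compile(r"&(?:[A-Za-z0-9]+|#[0-9]+|#[xX][0-9A-Fa-f]+);")

def decode_strict(text: str) -> str:
    """Decode only semicolon-terminated references; leave everything else."""
    return STRICT_REF.sub(lambda m: html.unescape(m.group(0)), text)

print(repr(html_with))              # 'F>'
print(repr(html_without))           # 'F>'   (HTML decodes the bare &GT)
print(repr(decode_strict("F&gt;"))) # 'F>'
print(repr(decode_strict("F&GT")))  # 'F&GT' (strict rule keeps it literal)
```

Under the strict rule the heading text survives as typed; under the HTML rule it turns into "F>", which matches what the report describes.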

I did a bit of debugging, and it looks like the problem is in Parser::finalizeHeadings(), where the TOC is computed. That code assumes that $text contains valid HTML, but in fact it contains almost-HTML that still follows wikitext rules for entities. So when it parses the heading snippet as HTML (to remove markup that can't be shown in the TOC, e.g. images), it corrupts the entities: the HTML parser decodes &GT as >, even though wikitext treats it as literal text.

(I recently touched that code in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/975064, which is the reason I looked into this bug report :) but this bad assumption was already there in the previous version of the code.)

I think the missing step is calling tidy() on the snippet before treating it as HTML.
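A rough Python model of the failure and of a normalization step of this kind. Everything here is an assumption made for illustration: `toc_text` and `normalize_wikitext_entities` are hypothetical helpers, the example heading is made up, and none of this is what tidy() or Parser::finalizeHeadings() actually does internally.

```python
import re
from html.parser import HTMLParser

class _TextOnly(HTMLParser):
    """Collect text content and drop all tags (e.g. images)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def toc_text(snippet: str) -> str:
    # Hypothetical helper: parse the snippet as HTML and keep only its text.
    parser = _TextOnly()
    parser.feed(snippet)
    parser.close()  # flush any text still buffered after the last tag
    return "".join(parser.parts).strip()

def normalize_wikitext_entities(snippet: str) -> str:
    # Hypothetical helper: escape each '&' that does not begin a
    # semicolon-terminated reference, so an HTML parser leaves it alone.
    return re.sub(
        r"&(?![A-Za-z0-9]+;|#[0-9]+;|#[xX][0-9A-Fa-f]+;)",
        "&amp;",
        snippet,
    )

heading = 'F&GT <img src="example.png">'

broken = toc_text(heading)                              # entity corrupted
fixed = toc_text(normalize_wikitext_entities(heading))  # text preserved

print(repr(broken))  # 'F>'
print(repr(fixed))   # 'F&GT'
```

The point of the sketch is the ordering: the entity rules are reconciled first, and only then is the snippet handed to something that applies HTML parsing rules.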

Change 991877 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/core@master] Parser: Convert wikitext entities to HTML entities in TOC

https://gerrit.wikimedia.org/r/991877

Change 1002605 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] Add parser tests for TOC behavior with french spacing and HTML entities

https://gerrit.wikimedia.org/r/1002605

Change 991877 merged by jenkins-bot:

[mediawiki/core@master] Parser: Convert wikitext entities to HTML entities in TOC

https://gerrit.wikimedia.org/r/991877

Change 1002605 merged by jenkins-bot:

[mediawiki/core@master] Add parser tests for TOC behavior with french spacing and HTML entities

https://gerrit.wikimedia.org/r/1002605