
HTML entities in wikitext are being over-parsed
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • Create text such as "&amp;ndash;"

What happens?:
  • This is output as "–"

What should have happened instead?:
  • This should be output as "&ndash;"

Software version (skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

See https://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_(technical)&oldid=1191182993#HTML_entity_over-parsing

Event Timeline

TheDJ triaged this task as Unbreak Now! priority. Edited Dec 22 2023, 9:46 AM
TheDJ edited subscribers, added: matmarex, TheDJ; removed: Batorsz.

I'm marking this as UBN.

This is an unexpected change to the wikitext language model, just before the Christmas break and the deploy break. That seems like a pretty undesirable situation.

The issue is as follows: instead of using a regex to deduplicate styles, we now pass the whole document through Remex, which apparently at some point decides to interpret "&amp;ndash;" as "&ndash;" (which then renders as "–"). This is indeed caused by the above patch, which I believe is safe to revert (pages parsed in the meantime may require a cache purge for the fix to take effect).
Interestingly, the issue does NOT trigger on Parsoid rendering, which I suspect may have something to do with "$options['isParsoidContent'] ?? false", which sets things to html5format vs. not in Remex.

I suspect there is something fishy in Remex that should be investigated, but the quick fix is probably to revert the patch.
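
To make the symptom concrete, here is a minimal Python sketch of the general failure class: a pass that decodes entities without re-escaping them on output. It is not Remex code and does not claim to show where Remex goes wrong; the function names (displayed_text, faithful_pass, lossy_pass) are hypothetical, and Python's html module stands in for an HTML parse/serialize round trip.

```python
import html


def displayed_text(html_fragment: str) -> str:
    """Approximate what a browser shows for a fragment of HTML text content."""
    return html.unescape(html_fragment)


def faithful_pass(html_fragment: str) -> str:
    """Decode to text, then re-escape on the way out (one decode, one encode)."""
    text = html.unescape(html_fragment)
    return html.escape(text, quote=False)


def lossy_pass(html_fragment: str) -> str:
    """Hypothetical buggy pass: decodes entities but never re-escapes them."""
    return html.unescape(html_fragment)


# HTML that should display the literal text "&ndash;"
parser_output = "&amp;ndash;"

print(displayed_text(faithful_pass(parser_output)))  # &ndash;
print(displayed_text(lossy_pass(parser_output)))     # –
```

The faithful pass leaves the fragment unchanged, so the reader still sees the literal "&ndash;". The lossy pass strips one level of escaping, so the browser decodes what is left and shows an en dash, which matches the behaviour reported in the description.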

Change 985120 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/core@master] Revert "Use Remex for DeduplicateStyles transform"

https://gerrit.wikimedia.org/r/985120

We should probably have a parser test with the "&amp;ndash;" combo btw. I did a quick grep and couldn't find any test case that had an escaped amp followed by an entity name (which makes sense, because otherwise we would have caught this, of course). But entity forms in general also seem not to be well guarded. A tests/parser/entities.txt might make sense after this.
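
For reference, the kind of search described above could look roughly like the Python sketch below. The pattern and the helper name find_escaped_entities are my own assumptions, not the grep that was actually run: it looks for an escaped ampersand immediately followed by something shaped like a named or numeric entity.

```python
import re

# "&amp;" followed directly by an entity-shaped tail such as "ndash;",
# "#8211;" or "#x2013;". Assumed pattern, for illustration only.
ESCAPED_AMP_ENTITY = re.compile(
    r"&amp;(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);"
)


def find_escaped_entities(text: str) -> list[str]:
    """Return every escaped-amp-plus-entity sequence found in text."""
    return [match.group(0) for match in ESCAPED_AMP_ENTITY.finditer(text)]


sample = "a &amp;ndash; b &ndash; c &amp;#8211; d &amp; e"
print(find_escaped_entities(sample))  # ['&amp;ndash;', '&amp;#8211;']
```

Running something like this over the existing parser test files would show which of them, if any, already exercise this combination.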

Agreed that it should exist; not _entirely_ sure that it would have caught this. Will have a look.

Change 985120 merged by jenkins-bot:

[mediawiki/core@master] Revert "Use Remex for DeduplicateStyles transform"

https://gerrit.wikimedia.org/r/985120

Change 985033 had a related patch set uploaded (by Reedy; author: Isabelle Hurbain-Palatin):

[mediawiki/core@wmf/1.42.0-wmf.10] Revert "Use Remex for DeduplicateStyles transform"

https://gerrit.wikimedia.org/r/985033

Change 985033 merged by jenkins-bot:

[mediawiki/core@wmf/1.42.0-wmf.10] Revert "Use Remex for DeduplicateStyles transform"

https://gerrit.wikimedia.org/r/985033

Mentioned in SAL (#wikimedia-operations) [2023-12-22T13:45:13Z] <reedy@deploy2002> Finished scap: T353920 (duration: 08m 02s)

The reverting patch is deployed; my own basic tests seem to confirm that the issue is resolved. Affected pages might require a reparse to display correctly.