HtmlHelper::modifyElements(…, $html5format = false) incorrectly encodes HTML entities
Closed, Resolved (Public)

Description

HtmlHelper::modifyElements(…, $html5format = false) incorrectly encodes HTML entities.

We added this option in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/979165, used it in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/977814, then found the problem in T353920 and had to revert that patch (https://gerrit.wikimedia.org/r/c/mediawiki/core/+/985033).

We should either resolve the problem (and maybe reinstate the reverted patch), or also revert the patch that added the option, since it can't be used safely.

Event Timeline

For the $html5format = false mode, we used RemexCompatFormatter, and therein lies the problem: although this is not documented, that class requires the Tokenizer to be run with the 'ignoreCharRefs' option. That option (and this is documented) causes runs of text, both in attributes and in text nodes, to keep their character references (HTML entities) unexpanded. RemexCompatFormatter expects input in that form, so when we passed it text whose entities had already been decoded, it did not re-encode them, which caused the bug.
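
The mismatch described above can be illustrated with a toy model in Python. This is only a sketch, not the Remex or MediaWiki code: the function names are hypothetical, and it assumes that a compat-style formatter leaves `&` untouched because it trusts its input to still contain unexpanded character references.

```python
import html


def tokenize_text(source: str, ignore_char_refs: bool) -> str:
    """Toy stand-in for a tokenizer's handling of a text run.

    With ignore_char_refs=True the character references stay unexpanded;
    otherwise they are decoded once, as a normal HTML tokenizer would do.
    """
    return source if ignore_char_refs else html.unescape(source)


def compat_style_format(text: str) -> str:
    """Toy formatter that ASSUMES its input still has unexpanded references.

    It escapes '<' and '>' but deliberately leaves '&' alone, so that an
    existing '&amp;' is not double-encoded into '&amp;amp;'.
    """
    return text.replace("<", "&lt;").replace(">", "&gt;")


def html5_style_format(text: str) -> str:
    """Toy formatter that expects decoded text and escapes '&' as well."""
    return html.escape(text, quote=False)


# Source markup for the literal text "&lt;b&gt;" (not a <b> tag).
source = "&amp;lt;b&amp;gt;"

# Matching pair: references left unexpanded, compat-style formatter.
ok = compat_style_format(tokenize_text(source, ignore_char_refs=True))

# Mismatched pair: references decoded, but the formatter does not re-encode '&'.
broken = compat_style_format(tokenize_text(source, ignore_char_refs=False))

# Decoded text with a formatter that does escape '&' round-trips correctly.
html5 = html5_style_format(tokenize_text(source, ignore_char_refs=False))

print(ok)
print(broken)
print(html5)
```

In this model the mismatched pair emits `&lt;b&gt;`, which a browser would display as `<b>` rather than as the original literal text. Either the tokenizer has to leave references unexpanded or the formatter has to escape `&`; mixing the two conventions loses one level of encoding.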

Change 989258 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/core@master] HtmlHelper: Fix entity encoding when $html5format = false

https://gerrit.wikimedia.org/r/989258

Change 989258 merged by jenkins-bot:

[mediawiki/core@master] HtmlHelper: Fix entity encoding when $html5format = false

https://gerrit.wikimedia.org/r/989258