
cc.resolveHtmlEntities() should be excluded inside <syntaxhighlight> tag
Closed, Resolved · Public · Bug Report

Description

cosmetic_changes.resolveHtmlEntities() replacements should be excluded within <pre> tags as well as inside <source>, <syntaxhighlight> and <nowiki>. Currently only <code> is excluded. See also this request

Event Timeline

Xqt created this task. Jun 3 2020, 1:42 PM
Restricted Application added a project: Pywikibot. Jun 3 2020, 1:42 PM
Restricted Application added subscribers: pywikibot-bugs-list, jeblad, Aklapper.
Xqt triaged this task as Low priority. Jun 3 2020, 1:45 PM
Xqt added a project: good first task.
Xqt changed the subtype of this task from "Task" to "Feature Request".

This is justified for syntaxhighlight/source but actually not for nowiki, pre, code.

Strange, because cosmetic_changes.resolveHtmlEntities() calls html2unicode(text, ignore=ignore, exceptions=['code']), where the exception list contains 'code' and the ampersand is in the ignore list.

The screenshot does not demonstrate what replacements bots do but how MediaWiki treats HTML entities inside these tags.
I believe 'code' was a mistake and should have originally been 'source'.

@matej_suchanek Could you try a different character and post results? Ampersand might be an exception on both sides (both PWB and MW)

ndash and szlig:


And numerical (unicode, hex-unicode)?

Yeah, numerical ones work the same, just tested. Okay, that means only syntaxhighlight needs an exception.
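
For illustration, the conclusion above (leave entities alone only inside syntaxhighlight/source blocks) could be sketched roughly as follows. This is a standalone stand-in built on the standard library, not Pywikibot's implementation or the content of any patch: the names EXCLUDED_TAGS and resolve_entities_outside are hypothetical, and the ignore list mentioned earlier in the thread is deliberately left out to keep the sketch short.

```python
import html
import re

# Hypothetical constant for this sketch: tags whose content is copied verbatim.
EXCLUDED_TAGS = ('syntaxhighlight', 'source')

# Matches a whole excluded block, from its opening tag to the matching close.
_EXCLUDED_RE = re.compile(
    r'<(%s)\b[^>]*>.*?</\1\s*>' % '|'.join(EXCLUDED_TAGS),
    re.DOTALL | re.IGNORECASE,
)


def resolve_entities_outside(text):
    """Resolve HTML entities everywhere except inside EXCLUDED_TAGS blocks."""
    parts = []
    pos = 0
    for match in _EXCLUDED_RE.finditer(text):
        # Convert the text before the excluded block ...
        parts.append(html.unescape(text[pos:match.start()]))
        # ... and keep the block itself untouched.
        parts.append(match.group())
        pos = match.end()
    # Convert whatever follows the last excluded block.
    parts.append(html.unescape(text[pos:]))
    return ''.join(parts)


sample = ('a &ndash; b <syntaxhighlight lang="html">&ndash;'
          '</syntaxhighlight> &#223;')
print(resolve_entities_outside(sample))
```

Named and numerical entities outside the block are converted alike, while the entity inside the syntaxhighlight block survives as written.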

Remember that the following code is sometimes used: &amp;nbsp; in <code> or similar tags, to intentionally display an HTML entity code &nbsp;.

This is sometimes used in documentation in code to copy, or to explain behavior.

The bot should not modify in this case.

This is avoided by blacklisting &amp; and others such as &gt; or &lt; from replacement.
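
A minimal sketch of that idea, assuming an ignore list keyed by code point: an entity is left as written whenever it would resolve to an ignored character. The names IGNORED_CODEPOINTS and resolve_entities are hypothetical, and the three code points shown are only the ones named above; the list Pywikibot actually uses may be longer.

```python
import html
import re

# Assumed ignore list for this sketch: &, < and > (code points 38, 60, 62).
IGNORED_CODEPOINTS = frozenset({38, 60, 62})

# Named, decimal and hexadecimal entities, each terminated by a semicolon.
_ENTITY_RE = re.compile(r'&(#\d+|#[xX][0-9a-fA-F]+|\w+);')


def resolve_entities(text, ignore=IGNORED_CODEPOINTS):
    """Resolve HTML entities unless they stand for an ignored character."""
    def repl(match):
        resolved = html.unescape(match.group())
        # Keep the entity as written if it resolves to an ignored character.
        if any(ord(char) in ignore for char in resolved):
            return match.group()
        return resolved
    return _ENTITY_RE.sub(repl, text)


print(resolve_entities('&amp;nbsp; &ndash; &lt;b&gt;'))
```

With the ignore list in place, &amp;nbsp; stays as it is, so the page keeps displaying the literal text &nbsp;. Calling resolve_entities with an empty ignore set would turn it into &nbsp;, which is exactly the change the bot should not make.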

Change 609297 had a related patch set uploaded (by Matěj Suchánek; owner: Matěj Suchánek):
[pywikibot/core@master] [bugfix] Avoid HTML entity substitution in <syntaxhighlight>

https://gerrit.wikimedia.org/r/609297

matej_suchanek renamed this task from cc.resolveHtmlEntities() should be excluded inside <pre> tag to cc.resolveHtmlEntities() should be excluded inside <syntaxhighlight> tag. Jul 3 2020, 9:58 AM
matej_suchanek changed the subtype of this task from "Feature Request" to "Bug Report".
Xqt added a comment. Jul 3 2020, 2:06 PM

The screenshot does not demonstrate what replacements bots do but how MediaWiki treats HTML entities inside these tags.
I believe 'code' was a mistake and should have originally been 'source'.

Don't think so, see T57222

Well, I have already demonstrated how MediaWiki behaves. If you take a look at the diffs in that task description, you will see the bot also replaced &amp; -> &. That is certainly unwanted, and there is a regression test which guards against this. But <code>...</code> does not escape HTML entities, so there is no point in excluding it (unless we want to give users a false sense of security).

Xqt added a comment. Jul 4 2020, 2:48 PM

I see, the point is that the ignore list already prevents the replacement.

Change 609297 merged by jenkins-bot:
[pywikibot/core@master] [bugfix] Avoid HTML entity substitution in <syntaxhighlight>

https://gerrit.wikimedia.org/r/609297

Xqt closed this task as Resolved. Jul 4 2020, 2:56 PM
Xqt claimed this task.