Page MenuHomePhabricator

Sanitizer::removeHTMLtags doesn't close tags correctly when $wgUseTidy is enabled
Closed, ResolvedPublic

Description

When a construct such as {{DISPLAYTITLE:<b>Foo bar<b>}} is used, the unbalanced tags leak and affect the rest of the page, as well as MediaWiki's interface.

Version: 1.24rc
URL: https://en.wikipedia.org/wiki/User:Jackmcbarn/65747

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 3:22 AM
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz65747.
bzimport added a subscriber: Unknown Object (MLST).

(hmm, why did this submit early?)

When a construct such as {{DISPLAYTITLE:<b>Foo bar<b>}} is used, the unbalanced tags leak and affect the rest of the page, as well as MediaWiki's interface.

Maybe an extension problem on wikipedia? I have tested on a clean 1.24wmf5 version without anything and AllowDisplayTitle=true and RestrictDisplayTitle=false, but if i use displytitle parser hook in te same format like given (unclosed b-tag) i become only the plain title without bold (it's written bold, if i close the tag correctly). So, on a fresh mediawiki installation i can not reproduce this error (correct me, if i does everything false :)).

I've tracked this problem down to Sanitizer::removeHTMLtags and $wgUseTidy. When $wgUseTidy is disabled (as it is on default MediaWiki installations), calling Sanitizer::removeHTMLtags on "<b>test<b>" correctly yields "<b>test&lt;b&gt;</b>". However, when it is enabled, calling it on "<b>test<b>" yields "<b>test<b>" back unchanged, which is incorrect and the cause of this bug.

Related to this, I've found that if you save the page with the following code on enwiki, it results in some very strange formatting:

<table> <table>Some text (The whitespace appears to be important.) See https://en.wikipedia.org/w/index.php?title=User:Mr._Stradivarius/Sandbox&oldid=630607976 for an example. You might need to try the link a few times; it appears to be an intermittent behaviour. Additionally, if this code is used on preview, it sometimes causes the input to be lost. I've opened bug 72343 about this.

Created attachment 16838
Strange formatting after using double unclosed table tags

Attached:

MediaWiki_double_table_tag_screenshot.png (652×1 px, 90 KB)

Izno added subscribers: ssastry, Izno.

The original "bug" is something Remex is still allowing. I wonder if it should be closed invalid due to the manual work all of those Wikimedians slaving over Linter. Should Sanitizer be trying to make fixes like the one it apparently made (and I would guess it still makes)? If it shouldn't, should that be a separate task somewhere to remove that from Sanitizer?

I think Mr. S's problem is probably reasonably resolved by the Linter categories related to table tags.

I think the correct fix is actually something like T114445: [RFC] Balanced templates, which will ensure that templates (or magic words like DISPLAYTITLE) don't "leak".

Another example of this type of bug: unclosed <font> tag affects the entire page due to leaking out of DISPLAYTITLE.

https://bcl.wikipedia.org/wiki/Paragamit:Cyrus_noto3at_bulaga

Would be harmful to Wikipedia design if more people get hold of this kind of bug.

Change 756116 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] WIP: Implement Sanitizer::removeHTMLtags() using Remex

https://gerrit.wikimedia.org/r/756116

sbassett closed subtask Restricted Task as Resolved.Jan 24 2022, 7:33 PM

Change 756116 merged by jenkins-bot:

[mediawiki/core@master] Add Sanitizer::removeSomeTags() which uses Remex to tokenize

https://gerrit.wikimedia.org/r/756116

cscott claimed this task.