Sanitizer fails to include image tags if incorrectly written as closed/void tags <img src= />
Open, NormalPublic

Description

Copied from the description of bug 35002 :

I noticed that the Sanitizer function fails in the cases where the image tag is written in XHMTL-style as a closed tag <img src='http://image-url' />".

The net is full of reports of such things (XHTML vs. HTML4.1 vs. HTML5)
read for example this http://tiffanybbrown.com/2011/03/23/html5-does-not-allow-self-closing-tags/ .

This is still a valid bug, and a big problem for my RSS extension if the adminstrator expressly allows the rendering of <img> tags.

In this case the Sanitizer is called as usual but gets additional food:

$extraInclude[] = "img";

> but *fails* to allow the img, escapes it still, which is wrong in that case <

In other words, and to make it unambiguously clear:

The sanitizer does still its work and sanitizes (fallback: it is still secure), even if you want it *not* to santize img tags.

Because I have no influence on the composition of images tags of the incoming source (RSS feed), and some of these source do not obey the rules for image tags, I ask the MediaWiki Sanitizer specialist to fix this specific problem of "closed image tags".

How to reproduce:

(excerpt and comnstructed example; part of E:RSS)

$extraInclude = array();
$extraExclude = array( "iframe" );

$extraInclude[] = "a";
$extraInclude[] = "img";

$text = '<img src="http://tctechcrunch2011.files.wordpress.com/2013/03/yodlee_logo_final_rgb_lrg.jpg">';

$ret = Sanitizer::removeHTMLtags( $text, null, array(), $extraInclude, $extraExclude );

wfDebug( "RSS: after Sanitizer::removeHTMLtags:text:" . print_r( $text, true ) . "\n" );
wfDebug( "RSS: after Sanitizer::removeHTMLtags:extraInclude: " . print_r( $extraInclude, true ) . "\n" );
wfDebug( "RSS: after Sanitizer::removeHTMLtags:extraExclude: " . print_r( $extraExclude, true ) . "\n" );
wfDebug( "RSS: after Sanitizer::removeHTMLtags:ret: " . print_r( $ret, true ) . "\n" );


Version: 1.22.0
Severity: normal
URL: http://openid-wiki.instance-proxy.wmflabs.org/wiki/Main_Page#RSS
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=35002

Details

Reference
bz46443
bzimport raised the priority of this task from to High.
bzimport set Reference to bz46443.
bzimport added a subscriber: Unknown Object (MLST).

see openid-wiki.instance-proxy.wmflabs.org/wiki/Main_Page#RSS how it looks like - no images, but escaped image tags

see http://openid-wiki.instance-proxy.wmflabs.org/wiki/Main_Page#RSS for an exmaple how it looks like (img tags are shown escaped, but not executed/rendered)

see http://feeds.feedburner.com/TechCrunch/ as example for such an RSS feed with closed img tags.

(In reply to comment #0)

How to reproduce:
================================================

....

$extraInclude[] = "img";
$text = '<img
src="http://tctechcrunch2011.files.wordpress.com/2013/03/
yodlee_logo_final_rgb_lrg.jpg">';

$ret = Sanitizer::removeHTMLtags( $text, null, array(), $extraInclude,
$extraExclude );

In my example above, the img tag was manually repaired, during my tests, and then I copied that too early to this bug.

What I meant as test case is:

$text = '<img
src="http://tctechcrunch2011.files.wordpress.com/2013/03/yodlee_logo_final_rgb_lrg.jpg" />';

i.e. with the closing />

The effect is immediately apparent: the "open" image <img src=... > will be rendered, the "closed" <img src=... /> will not. Again, I am only talking about the case $extraInclude[] = "img";

HTML5 has nothing to do with this. Of course HTML5 does not support self-closing <span />, that's XML. Such a feature has never existed in HTML and it's stupid to consider that HTML5 would change that when browsers have never supported that in HTML.

This is of course completely irrelevant to the bug report. Because besides this being an issue with the sanitizer and not the browser's HTML parser <img> IS a void tag and the browser parser would treat <img /> the same way as <img>.


Now for the bug, testing this myself:

$extraInclude = array( 'a', 'img' );

$text = '<img src="http://example.com/test.png">';
var_dump( Sanitizer::removeHTMLtags( $text, null, array(), $extraInclude ) );

string(50) "<img src="http&#58;//example.com/test.png"></img>\n"

$text = '<img src="http://example.com/test.png" />';
var_dump( Sanitizer::removeHTMLtags( $text, null, array(), $extraInclude ) );

string(47) "&lt;img src="http://example.com/test.png" /&gt;"

I see that the real issue is we simply have no real support for void tags. Since none of the tags we usually whitelist are void tags. We'll have to add support for that to the sanitizer.

In the non-void case I think that the /> behaviour is probably correct. We don't support self closing <div />'s. And turning <div /> into <div> like the HTML parser does is unexpected to the user. The sanest behaviour is to completely reject it so the user clearly can see something is wrong.

(In reply to comment #4)
I cannot change the composition of incoming feeds, see http://feeds.feedburner.com/TechCrunch/ as example for such an RSS feed with closed img tags.

People are complaining that the RSS extension is not working, or broken, but I fully rely on the Sanitizer! So that Sanitizer needs a correction - only allow the closing image tag also.

That's the message, and there is nothing to discuss.

if the Sanitizer is not fixed, the extension will not be working for feeds using closing image tags, basta.

hello, I do need the function as reported with this bug: that the sanitizer accepts image tags also when they have a closing " />". when "img" tags are part of the $extraInclude , i.e. allowed.

I would be happy, if this finds its way into core.

Aklapper lowered the priority of this task from High to Normal.Feb 13 2016, 12:40 PM
Aklapper added a subscriber: Aklapper.

It really would be nice to note in the RSS extension page that extracting any image information from the RSS feed is not possible.

I've spent quite a bit of time trying to extract the image html from the "description" tag of the RSS field and display it (using widgets or iframes or raw HTML). I think this bug is probably why it is not possible? Either the bug should be fixed (seems unlikely after 3 years) or a note on the extension page indicating that image information cannot be extracted.

I'm a bit frustrated about this - I've probably spent 5-8 hours trying to achieve getting the documented feature working and I'm not the only one who's frustrated (see here). I'll edit the extension page to note that at least since 2013 this feature is broken due to the sanitizer performance.

Happy to be told off if this is not appropriate.

@Dan.mulholland

RSS feeds are coming in a "zoo" of variants - please feel free to submit a pull request if you have a working fix for your problem. I can only tell you that image and image tags are not only difficult but can be dangerous for the page where these rendered.