I have just received a notification about an edit to a Flow board description. The text of the notification is not displayed properly; it looks like a conversion step is missing.
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
Use Remex in Sanitizer::stripAllTags() | mediawiki/core | master | +57 -12
SanitizerTest: Add tests for stripAllTags | mediawiki/core | master | +28 -0
Event Timeline
Checked in testwiki (wmf.7). The issue is not reproducible there. There might be some other content on the Flow board that interferes with proper rendering. What I did:
- Edit empty Flow board description with
<small>[[Discussion_utilisateur:Dartyytrad/Archive_1|Consulter les archives]]</small>
The diff is:
- The received notification looks properly rendered:
It's probably because of the template at the beginning of the board description in question: {{User:irønie/tamago|30|08|2007|80|Kwiki}}.
(disregard, it looks slightly different but it isn't fundamentally different).
I've figured out why this happens. The Parsoid HTML for a template transclusion where one of the parameters contains an HTML tag looks as follows: (the wikitext for this example was {{echo|<small>[[Foo|Bar]]</small>}} Whee!)
<p data-parsoid='{"dsr":[0,41,0,0]}'> <small about="#mwt1" typeof="mw:Transclusion" data-parsoid='{"stx":"html","dsr":[0,35,null,null],"pi":[[{"k":"1"}]]}' data-mw='{"parts":[{"template":{"target":{"wt":"echo","href":"./Template:Echo"},"params":{"1":{"wt":"&lt;small>[[Foo|Bar]]&lt;/small>"}},"i":0}}]}' > <a rel="mw:WikiLink" href="./Foo" title="Foo">Bar</a> </small> Whee! </p>
Note that in data-mw.parts[0].params[1].wt, the HTML tags are encoded as &lt;small> and &lt;/small>: the opening angle bracket is escaped but the closing angle bracket isn't.
This seems to really confuse Sanitizer::stripAllTags():
> $content = '<p data-parsoid=\'{"dsr":[0,41,0,0]}\'><small about="#mwt1" typeof="mw:Transclusion" data-parsoid=\'{"stx":"html","dsr":[0,35,null,null],"pi":[[{"k":"1"}]]}\' data-mw=\'{"parts":[{"template":{"target":{"wt":"echo","href":"./Template:Echo"},"params":{"1":{"wt":"&lt;small>[[Foo|Bar]]&lt;/small>"}},"i":0}}]}\'><a rel="mw:WikiLink" href="./Foo" title="Foo">Bar</a></small> Whee!</p>';
> $content2 = '<p data-parsoid=\'{"dsr":[0,41,0,0]}\'><small about="#mwt1" typeof="mw:Transclusion" data-parsoid=\'{"stx":"html","dsr":[0,35,null,null],"pi":[[{"k":"1"}]]}\' data-mw=\'{"parts":[{"template":{"target":{"wt":"echo","href":"./Template:Echo"},"params":{"1":{"wt":"&lt;small&gt;[[Foo|Bar]]&lt;/small&gt;"}},"i":0}}]}\'><a rel="mw:WikiLink" href="./Foo" title="Foo">Bar</a></small> Whee!</p>';
> echo Sanitizer::stripAllTags($content);
[[Foo|Bar]]</small>"}},"i":0}}]}'>Bar Whee!
> echo Sanitizer::stripAllTags($content2);
Bar Whee!
So it appears this is a bug in MW core's sanitizer. I'll investigate more later.
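The failure mode can be reproduced outside MediaWiki with a deliberately naive stripper. The sketch below is not the real Sanitizer::stripAllTags() code; naiveStripTags() is a hypothetical stand-in that removes everything from each "<" to the next ">" and then decodes entities, which is enough to show why an unescaped ">" inside an attribute value ends the tag too early. The input is a shortened version of the Parsoid HTML above.

```php
<?php
// Hypothetical stand-in, NOT the actual Sanitizer implementation:
// it trusts that every ">" closes the most recent "<".
function naiveStripTags( string $html ): string {
	// Remove everything from each "<" up to the next ">", ignoring quotes.
	$text = preg_replace( '/<[^>]*>/', '', $html );
	// Decode entities afterwards, as a plain-text conversion would.
	return html_entity_decode( $text, ENT_QUOTES | ENT_HTML5, 'UTF-8' );
}

// "<" escaped, ">" left raw inside the attribute value (the Parsoid case).
$unbalanced = <<<'HTML'
<small data-mw='{"wt":"&lt;small>[[Foo|Bar]]&lt;/small>"}'><a href="./Foo">Bar</a></small> Whee!
HTML;

// Both angle brackets escaped.
$balanced = <<<'HTML'
<small data-mw='{"wt":"&lt;small&gt;[[Foo|Bar]]&lt;/small&gt;"}'><a href="./Foo">Bar</a></small> Whee!
HTML;

echo naiveStripTags( $unbalanced ), "\n"; // [[Foo|Bar]]</small>"}'>Bar Whee!
echo naiveStripTags( $balanced ), "\n";   // Bar Whee!
```

The first result leaks the tail of the attribute value in the same way as the notification text; the second is clean because no raw ">" appears before the real end of the tag.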
Sanitizer.php seems to simply assume that all instances of < and > are always encoded. That's a somewhat reasonable assumption IMO, since the only software that I'm aware of that fails to encode these is Parsoid. It would probably be difficult to fix the Sanitizer, especially since Parsoid outputs < and > in an unbalanced way (< is encoded but > is not); the Sanitizer code would have to be aware of quotes and I fear it'd basically turn into half an HTML parser.
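To make the "aware of quotes" point concrete, here is a minimal sketch of what such a scanner might look like. quoteAwareStripTags() is a hypothetical helper, not anything in MediaWiki, and it only tracks quoted attribute values; comments, CDATA, script content and stray "<" characters in text are all ignored, which is exactly the part that would grow into half an HTML parser.

```php
<?php
// Hypothetical helper, not part of MediaWiki: skips tags while respecting
// quoted attribute values, so a raw ">" inside quotes does not end the tag.
function quoteAwareStripTags( string $html ): string {
	$out = '';
	$inTag = false;
	$quote = null; // quote character currently open inside a tag, if any
	$len = strlen( $html );
	for ( $i = 0; $i < $len; $i++ ) {
		$ch = $html[$i];
		if ( !$inTag ) {
			if ( $ch === '<' ) {
				$inTag = true;
			} else {
				$out .= $ch;
			}
		} elseif ( $quote !== null ) {
			// Inside a quoted attribute value: only the matching quote matters.
			if ( $ch === $quote ) {
				$quote = null;
			}
		} elseif ( $ch === '"' || $ch === "'" ) {
			$quote = $ch;
		} elseif ( $ch === '>' ) {
			$inTag = false;
		}
	}
	return html_entity_decode( $out, ENT_QUOTES | ENT_HTML5, 'UTF-8' );
}

$html = <<<'HTML'
<small data-mw='{"wt":"&lt;small>[[Foo|Bar]]&lt;/small>"}'><a href="./Foo">Bar</a></small> Whee!
HTML;

echo quoteAwareStripTags( $html ), "\n"; // Bar Whee!
```

Even this small state machine needs three pieces of state, and it still mishandles anything beyond well-formed start and end tags.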
Why are we feeding Parsoid output into Sanitizer::stripAllTags()?
We do have real HTML parsers, you could simply ask one of them to do the job properly...
In that case why have Sanitizer::stripAllTags() at all, and why not have it wrap around a proper HTML parser?
In general, I would try to feed only PHP parser output to the PHP Sanitizer, and only Parsoid output to Parsoid's Sanitizer. The two components of each parser are too tightly intertwined for any warranty to be offered when combining them willy-nilly. (A long-term goal of mine is to write a proper parser-independent specification for the sanitizer...)
Anyway, as part of T89331 you have a real HTML parser in core: RemexHTML. Based on https://github.com/wikimedia/mediawiki-libs-RemexHtml/blob/master/bin/test.php it appears that something similar to this ought to do the job "properly":
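class StripTagHandler implements RemexHtml\Tokenizer\TokenHandler {
	private $text = "";

	// Append each run of character data; tags contribute nothing.
	function characters( $text, $start, $length, $sourceStart, $sourceLength ) {
		$this->text .= substr( $text, $start, $length );
	}

	// The other TokenHandler methods (startTag, endTag, comment, ...) would need empty bodies.
}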
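The same idea, letting a real parser decide where tags end and keeping only the character data, can be tried without RemexHtml. The sketch below swaps in PHP's bundled DOM extension (DOMDocument, backed by libxml2) purely so that it runs standalone; it is not what the patch uses, domStripTags() is a hypothetical name, and it assumes ext-dom is enabled and the input is ASCII (loadHTML needs an encoding hint for anything else).

```php
<?php
// Hypothetical helper using PHP's DOM extension instead of RemexHtml.
// Assumes ext-dom is available and the input is plain ASCII.
function domStripTags( string $html ): string {
	// Collect parser warnings internally instead of emitting them.
	$previous = libxml_use_internal_errors( true );
	$doc = new DOMDocument();
	$doc->loadHTML( $html );
	libxml_clear_errors();
	libxml_use_internal_errors( $previous );
	// textContent concatenates all text nodes and drops attributes entirely.
	return trim( $doc->documentElement->textContent );
}

$parsoidLike = <<<'HTML'
<p data-mw='{"wt":"&lt;small>[[Foo|Bar]]&lt;/small>"}'><a href="./Foo">Bar</a> Whee!</p>
HTML;

echo domStripTags( $parsoidLike ), "\n";
```

Because the parser tokenizes the quoted attribute value as a unit, the raw ">" inside data-mw never reaches the text output.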
Change 391347 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/core@master] SanitizerTest: Add tests for stripAllTags
Change 391348 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/core@master] [WIP] Use Remex in Sanitizer::stripAllTags()
Change 391348 merged by jenkins-bot:
[mediawiki/core@master] Use Remex in Sanitizer::stripAllTags()
Checked in betalabs for some combinations/modifications of the given problematic edit. All seems to be parsed correctly.
Specifically, for the text {{Echo|<small>[[Foo|Bar]]</small>}}, testwiki (wmf.8) would display:
With the fix, betalabs displays it properly: