Page MenuHomePhabricator

Bad render of notifications about edition of flow board description
Closed, ResolvedPublic

Description

I have just received this notification of this change, about the edition of a flow board desc. The text of the notification is not displayed properly, it would seem that a conversion is missing.

bug flow notif texte (123×877 px, 15 KB)

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Checked in testwiki (wmf.7). The issue is not reproducible there. There might be some other content on the Flow board that interfere with proper rendering. What I did

  1. Edit empty Flow board description with
<small>[[Discussion_utilisateur:Dartyytrad/Archive_1|Consulter les archives]]</small>

The diff is:

Screen Shot 2017-11-08 at 12.39.07 PM.png (412×818 px, 70 KB)

  1. The received notification looks properly rendered:
    Screen Shot 2017-11-08 at 12.38.51 PM.png (255×518 px, 41 KB)

It's probably because of the template at the beginning of the board description in question: {{User:irønie/tamago|30|08|2007|80|Kwiki}}.

Yup, that's it. I just made this edit and got this notification:

flow-notif-wikitext.png (105×502 px, 13 KB)

...but the one I got for an edit to my talk page looks different:

flow-notif-wikitext-usertalk.png (91×508 px, 13 KB)

...but the one I got for an edit to my talk page looks different:

flow-notif-wikitext-usertalk.png (91×508 px, 13 KB)

(disregard, it looks slightly different but it isn't fundamentally different).

I've figured out why this happens. The Parsoid HTML for a template transclusion where one of the parameters contains an HTML tag looks as follows: (the wikitext for this example was {{echo|<small>[[Foo|Bar]]</small>}} Whee!)

<p data-parsoid='{"dsr":[0,41,0,0]}'>
    <small about="#mwt1" typeof="mw:Transclusion"
        data-parsoid='{"stx":"html","dsr":[0,35,null,null],"pi":[[{"k":"1"}]]}'
        data-mw='{"parts":[{"template":{"target":{"wt":"echo","href":"./Template:Echo"},"params":{"1":{"wt":"&lt;small>[[Foo|Bar]]&lt;/small>"}},"i":0}}]}'
    >
            <a rel="mw:WikiLink" href="./Foo" title="Foo">Bar</a>
    </small>
    Whee!
</p>

Note that in data-mw.parts[0].params[1].wt, the HTML tags are encoded as &lt;small> and &lt;/small>: the opening angle bracket is escaped but the closing angle bracket isn't.

This seems to really confuse Sanitizer::stripAllTags():

> $content = '<p data-parsoid=\'{"dsr":[0,41,0,0]}\'><small about="#mwt1" typeof="mw:Transclusion" data-parsoid=\'{"stx":"html","dsr":[0,35,null,null],"pi":[[{"k":"1"}]]}\' data-mw=\'{"parts":[{"template":{"target":{"wt":"echo","href":"./Template:Echo"},"params":{"1":{"wt":"&lt;small>[[Foo|Bar]]&lt;/small>"}},"i":0}}]}\'><a rel="mw:WikiLink" href="./Foo" title="Foo">Bar</a></small> Whee!</p>';
> $content2 = '<p data-parsoid=\'{"dsr":[0,41,0,0]}\'><small about="#mwt1" typeof="mw:Transclusion" data-parsoid=\'{"stx":"html","dsr":[0,35,null,null],"pi":[[{"k":"1"}]]}\' data-mw=\'{"parts":[{"template":{"target":{"wt":"echo","href":"./Template:Echo"},"params":{"1":{"wt":"&lt;small&gt;[[Foo|Bar]]&lt;/small&gt;"}},"i":0}}]}\'><a rel="mw:WikiLink" href="./Foo" title="Foo">Bar</a></small> Whee!</p>';

> echo Sanitizer::stripAllTags($content);
[[Foo|Bar]]</small>"}},"i":0}}]}'>Bar Whee!

> echo Sanitizer::stripAllTags($content2);
Bar Whee!

So it appears this is a bug in MW core's sanitizer. I'll investigate more later.

Sanitizer.php seems to simply assume that all instances of < and > are always encoded. That's a somewhat reasonable assumption IMO, since the only software that I'm aware of that fails to encode these is Parsoid . It would probably be difficult to fix the Sanitizer, especially since Parsoid outputs < and > in an unbalanced way (< is encoded but > is not); the Sanitizer code would have to be aware of quotes and I fear it'd basically turn into half an HTML parser.

Why are we feeding Parsoid output into Sanitizer.php::stripAllTags() ?

We do have real HTML parsers, you could simply ask one of them to do the job properly...

Why are we feeding Parsoid output into Sanitizer.php::stripAllTags() ?

We do have real HTML parsers, you could simply ask one of them to do the job properly...

In that case why have Sanitizer::stripAllTags() at all, and why not have it wrap around a proper HTML parser?

In general, I would try to feed only PHP parser output to the PHP Sanitizer, and only Parsoid output to Parsoid's Sanitizer. The two components of each parser are too tightly intertwined for any warranty to be offered when combining them willy-nilly. (A long-term goal of mine is to write a proper parser-independent specification for the sanitizer...)

Anyway, as part of T89331 you have a real HTML parser in core: RemexHTML. Based on https://github.com/wikimedia/mediawiki-libs-RemexHtml/blob/master/bin/test.php it appears that something similar to this ought to to the job "properly":

class StripTagHandler implements RemexHtml\Tokenizer\TokenHandler {
  private $text="";
  function characters( $text, $start, $length, $sourceStart, $sourceLength) {
    $this->text .= substr($text, $start, $length);
  }
}

Change 391347 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/core@master] SanitizerTest: Add tests for stripAllTags

https://gerrit.wikimedia.org/r/391347

Change 391348 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/core@master] [WIP] Use Remex in Sanitizer::stripAllTags()

https://gerrit.wikimedia.org/r/391348

Change 391348 merged by jenkins-bot:
[mediawiki/core@master] Use Remex in Sanitizer::stripAllTags()

https://gerrit.wikimedia.org/r/391348

Checked in betalabs for some combinations/modifications of the given problematic edit. All seems to be parsed correctly.

Specifically, for the text {{Echo|<small>[[Foo|Bar]]</small>}}, testwiki (wmf.8) would display:

Screen Shot 2017-11-16 at 4.23.15 PM.png (195×660 px, 41 KB)

With the fix, betalabs displays it properly:

Screen Shot 2017-11-16 at 4.35.35 PM.png (199×586 px, 31 KB)