Page MenuHomePhabricator

TemplateStyles breaks a paragraph if a file is inserted inline
Closed, ResolvedPublic

Description

If you insert a template with TemplateStyles and add a file inline in the same paragraph, the paragraph breaks with a <link> being a separator between two faulty paragraphs. If no <templatestyles> is present, this type of layout breaks into one paragraph and a file.

This is not a good example of wiki text markup, but since it doesn’t cause issues without TemplateStyles it’s probably better to fix this.

Original report:
https://ru.wikipedia.org/wiki/Википедия:Форум/Технический#шаблон_Comment

Example:
https://ru.wikipedia.org/?oldid=96107598#Вмешательство_правительства

Event Timeline

stjn renamed this task from TemplateStyles breaks the paragraph if inline file is inserted in it to TemplateStyles breaks a paragraph if a file is inserted inline.Nov 6 2018, 9:23 PM

So if I read things correctly, the wikitext of the paragraph is something like

abc {{comment|def}} ghi [[File:Foo.jpg|right|thumb|250px|jkl]]

which expands to

abc <templatestyles src="Template:Comment/styles.css" /><span class="ts-comment-commentedText">def</span> ghi [[File:Foo.jpg|right|thumb|250px|jkl]]

and the HTML is

<p>abc </p><link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r93295199"/><p><span class="ts-comment-commentedText">def</span> ghi </p><div class="thumb tright">(lots of image markup)</div>

That example has

entitled {{Greek|Πιναξ}}. There is no

in the wikitext, which expands to

entitled <templatestyles src="Greek/fonts.css" /><span lang="grc" xml:lang="grc" class="grc">Πιναξ</span>. There is no

and the HTML is

entitled </p><style data-mw-deduplicate="TemplateStyles:r8904693">[lots of CSS]</style><p><span lang="grc" class="grc">Πιναξ</span>. There is no``

but unlike the original example, I can't reproduce it in Special:ExpandTemplates with just this snippet. Maybe the `<ref>` tag that's on the same line is somehow related?

We're getting this problem in the Page namespace sometimes: https://en.wikisource.org/wiki/Page:The_grammar_of_Dionysios_Thrax.djvu/9 but when this page is transcluded, the problem disappears.

Hmm. This looks like a pretty naive implementation. I had expected TemplateStyles to parse out any <templatestyles> elements and inserting them in the <head> element (or through ResourceLoader or something): expanding them inline is going to create loads of problems, of which this is a perfect example.

The details will vary a bit by which version of the HTML spec you look at, but going by HTML 5.2 (December 2017), the content model for <p> elements is "phrasing content" (roughly what pre-5 versions used to call "inline elements"). <style> and almost all <link> elements are "flow content" (roughly analogous to what pre-5 versions called "block elements"), and thus are not allowed inside a <p> element. This causes a well-behaved parser to close the <p> element (insert a </p> end tag), and something similar after the TemplateStyles code (exactly what depends on the specific element hierarchy where this happens).

The safest, and most likely to work in most situations, way to handle this seems to be to emit a <link rel="stylesheet" /> element. As of HTML 5.2 (I haven't checked when this was added), only <link> elements whose rel-attribute has the value "stylesheet" are "body-ok", and the content model for <p> allows <link> elements if they are "body-ok".

<style> elements and <link> with a rel-attribute value of "mw-deduplicated-inline-style" will break markup in lots of cases. And I see no obvious way to work around it for editors.

(the rules and content models involved will be somewhat different in pre-html5 versions, but the general issue will be the same: you can't just throw out <link> and <style> in random places and expect it to work)

This issue is not caused by browser logic, which allows to put <style> anywhere as of HTML 5.2 (and even before then, even if it would’ve been semantically incorrect, it would still work in browsers), it is caused by inaccurate parsing rules of MediaWiki wikitext parser. There’s nothing preventing us to fix those rules, especially since browser logic doesn’t require these bugs to happen.

A solution based on moving TS tags into <head> was rejected previously because it would increase time to show a page. Maybe you’re onto something in that we could potentially move tags before paragraphs and resolve the issue that way (if that’s possible programmatically), but the general tone of your comment suggesting that putting <style> tags outside of <head> is strictly wrong is the wrong one in itself.

@stjn I did not say that <style> outside <head> is wrong (it isn't: <style> is allowed anywhere "flow content" is allowed).

What I said was that <style> and <link> are valid either where "metadata content" or "flow content" is allowed (sections 4.2.6 and 4.2.4 of the HTML 5.2 specification). However, the content model for the <p> element is "phrasing content" (section 4.4.1). "Phrasing content" is defined in section 3.2.4.2.5 and contains neither <style> nor <link>, except in the single specific exception when the <link> element has a rel attribute with the value stylesheet. In this latter case, the <link> element is defined to have the property body-ok (section 4.8.6), which are "allowed in the body", which in turn are allowed in the content model for "phrasing content".

Thus, <style> and <link rel="mw-deduplicated-inline-style"> are both invalid inside a <p> element.

When a parser (whether in a web browser or inside MediaWiki) encounters an element that is not allowed by the context's content model, the prescribed error recovery algorithm is as described in section 8.2.8.2 (strictly speaking it is as described, normatively, in chapter 8.2, but that's several hundred pages of prose description of the algorithm which is useless for our purposes here). In practice, what will happen is that the currently open <p> element gets closed; which will be visible to end users as an unexpected paragraph break.

The short term fix, as best I can tell, is to only output <link rel="stylesheet"> (which is allowed inside a <p> element). This may still cause issues in some contexts, but will definitely be in fewer cases than the current.

The right fix is to extract all these CSS fragments and emit them somewhere we can guarantee that they are valid. The most obvious being as inline <style> element(s) in the <head>, or somewhere before the .mw-page-content, or even at the end of <body> (which has other issues, but is at least guaranteed to be valid).

Parser in the web browser is not the problem, that’s what I’m trying to tell. The proposed strategy with <link rel="stylesheet"> is unfeasible, since the point of inline styles is exactly to move away from the additional browser requests. The strategy with extracting <style> elements in <head> was declined before and for a good reason (increasing the time of first paint). The strategy of putting them in the end of <body> will add flashes of unstyled content.

Did anyone actually tested that first paint time increase?

The original discussion about template styles is T155813: Decide on storage and delivery method for TemplateStyles CSS which discusses these tradeoffs about where <style> tags appear.

But, let us please not complicate this discussion. This is a bug in the paragraph wrapping logic in the parser and that is where it needs to be fixed.

Anomie subscribed.

After some experimentation, the issue seems to be in RemexHtml when it's adding the wrapping <p> tags:

anomie@mwmaint1002:~$ mwrepl enwiki
hphpd> $tidy = new MediaWiki\Tidy\RemexDriver( $wgTidyConfig );
hphpd> =$tidy->tidy( '<p>foo <style>/* ... */</style> bar</p>' );
"<p>foo <style>/* ... */</style> bar</p>"
hphpd> =$tidy->tidy( 'foo <style>/* ... */</style> bar' );
"<p>foo </p><style>/* ... */</style><p> bar</p>"
hphpd> =$tidy->tidy( '<div>foo <style>/* ... */</style> bar</div>' ); 
"<div>foo <style>/* ... */</style> bar</div>"
hphpd> =$tidy->tidy( '<blockquote>foo <style>/* ... */</style> bar</blockquote>' );
"<blockquote><p>foo </p><style>/* ... */</style><p> bar</p></blockquote>"

So any wikitext construct that causes the Parser to not output <p> tags itself around something Remex will add <p> tags to will result in this bug. In T208901#4726406 it's the <div> from the image syntax, while in T208901#4855856 with the <ref> added it's the <div> around the auto-generated reference list that winds up appended to the end of the paragraph there.

After some experimentation, the issue seems to be in RemexHtml when it's adding the wrapping <p> tags:

anomie@mwmaint1002:~$ mwrepl enwiki
hphpd> $tidy = new MediaWiki\Tidy\RemexDriver( $wgTidyConfig );
hphpd> =$tidy->tidy( '<p>foo <style>/* ... */</style> bar</p>' );
"<p>foo <style>/* ... */</style> bar</p>"
hphpd> =$tidy->tidy( 'foo <style>/* ... */</style> bar' );
"<p>foo </p><style>/* ... */</style><p> bar</p>"
hphpd> =$tidy->tidy( '<div>foo <style>/* ... */</style> bar</div>' ); 
"<div>foo <style>/* ... */</style> bar</div>"
hphpd> =$tidy->tidy( '<blockquote>foo <style>/* ... */</style> bar</blockquote>' );
"<blockquote><p>foo </p><style>/* ... */</style><p> bar</p></blockquote>"

So any wikitext construct that causes the Parser to not output <p> tags itself around something Remex will add <p> tags to will result in this bug. In T208901#4726406 it's the <div> from the image syntax, while in T208901#4855856 with the <ref> added it's the <div> around the auto-generated reference list that winds up appended to the end of the paragraph there.

RemexCompatMunger in includes/tidy (not RemexHtml) has to do the paragraph wrapping on bare unwrapped text since doBlockLevels doesn't do the right thing (T134469: doBlockLevels() inserts <p> and </p> randomly with no regard for HTML validity). So, in this case, that code probably needs an exception for style tags with templatestyle attributes. In case it is helpful, that RemexCompatMunger paragraph wrapping logic is effectively the SAX-based implementation of this Parsoid DOM pass.

Arlo's comment here says that RemexCompatMunger treats style as a block tag.

The details will vary a bit by which version of the HTML spec you look at, but going by HTML 5.2 (December 2017), the content model for <p> elements is "phrasing content" (roughly what pre-5 versions used to call "inline elements"). <style> and almost all <link> elements are "flow content" (roughly analogous to what pre-5 versions called "block elements"), and thus are not allowed inside a <p> element.

Browsers use the WHATWG spec, according to which neither <style> nor <link> without stylesheet is allowed in the body. I don't think the W3C version of the spec has much relevance these days. The technically invalid usage of these tags was discussed when the feature was written, but there is no obviously better alternative and <style> or metadata in the body is common practice so browsers are prepared to handle it in practice.

This causes a well-behaved parser to close the <p> element (insert a </p> end tag), and something similar after the TemplateStyles code (exactly what depends on the specific element hierarchy where this happens).

It doesn't. Why would it? Style and link tags get hoisted into the head without otherwise perturbing the document tree. It's not even a parse error. See 8.2.5.4.7 in the W3C spec or 12.2.6.4.7 in the WHATWG spec; but anyway this was common practice long before the HTML5 spec existed as all kinds of things put style in the body.

I found a workaround for this. you can make use of <indicator> tag surrounding <templatestyles> tag to inject the TemplateStyles. In English Wikisource I used the following source codes to do, you can get it in this commit:

<indicator name="greek-text"><templatestyles src="Greek/fonts.css" /></indicator>

With this indicator the TemplateStyles loaded only once, but not affect any paragraphs. It would be nice if TemplateStyles extension can make the similar implamentation.

With this indicator the TemplateStyles loaded only once, but not affect any paragraphs. It would be nice if TemplateStyles extension can make the similar implamentation.

TemplateStyles will not be doing that sort of thing, as that would increase time to rendering of the above-the-fold content and T155813 concluded that we specifically don't want to do that.

TemplateStyles will not be doing that sort of thing, as that would increase time to rendering of the above-the-fold content and T155813 concluded that we specifically don't want to do that.

So is there anyway to avoid that?

Change 486825 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] RemexCompatMunger: Don't split p-wrapping on style/link tags

https://gerrit.wikimedia.org/r/486825

Change 486825 merged by jenkins-bot:
[mediawiki/core@master] RemexCompatMunger: Don't split p-wrapping on style/link tags

https://gerrit.wikimedia.org/r/486825

Change 488100 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Update pwrap.js wrt templatestyles p-wrapping expectations

https://gerrit.wikimedia.org/r/488100

Change 488100 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Update pwrap.js wrt templatestyles p-wrapping expectations

https://gerrit.wikimedia.org/r/488100

ssastry claimed this task.

Yes.