Tidy strips whitespace after HTML tags AND adds newlines between HTML tags
Closed, DeclinedPublic

Description

Tidy strips whitespace inside HTML tags, so <li> foo </li> or <td> bar </td> is output as <li>foo</li> or <td>bar</td>. Separately, it also adds a newline between adjacent tags, so <li>x</li><li>y</li> is output as <li>x</li>\n<li>y</li>. In most cases, this dual whitespace mangling has no effect on rendering or display. However, where a CSS rule makes whitespace significant (e.g. white-space: nowrap, or the hlist rules in enwiki's common.css, which use display:inline for list items), the stripped whitespace inside tags or the added newlines between tags become visible. Any other tool that processes wikitext / HTML output without mangling whitespace can therefore render such pages differently.
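To make the two behaviors concrete, here is a minimal Python sketch that imitates them with regular expressions. This is not Tidy's actual implementation; the function names and patterns are assumptions chosen only to reproduce the examples in the description.

```python
import re


def strip_inner_whitespace(html: str) -> str:
    """Drop whitespace right after an opening tag and right before a closing tag."""
    # "<li> foo" -> "<li>foo" (opening tags only; closing tags start with "</")
    html = re.sub(r"(<[a-zA-Z][^>]*>)\s+", r"\1", html)
    # "foo </li>" -> "foo</li>"
    html = re.sub(r"\s+(</[a-zA-Z][^>]*>)", r"\1", html)
    return html


def add_newlines_between_tags(html: str) -> str:
    """Insert a newline between a closing tag and an opening tag that follows it directly."""
    return re.sub(r"(</[a-zA-Z][^>]*>)(<[a-zA-Z])", r"\1\n\2", html)
```

With these helpers, strip_inner_whitespace("<li> foo </li>") gives "<li>foo</li>", and add_newlines_between_tags("<li>x</li><li>y</li>") gives "<li>x</li>\n<li>y</li>", matching the two cases described above.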

Original bug description below
When I go to edit just about any page invoking the .hlist CSS (ref common.css), I see a preview where any leading curved bracket has a space inserted after it (which is not present in the wikitext nor indicated by the css). Example template using sublists.

I would expect this is a Parsoid delta. Firefox 50+ on Windows 7/10.

Screenshot comparing View to VE-edit-mode. Note the added spaces.

Selection_002.png (730×1 px, 155 KB)

Event Timeline

Izno renamed this task from Preview of .hlist CSS fpr sublists on en.WP differs from PHP render to Preview of .hlist CSS for sublists on en.WP differs from PHP render.Jan 18 2017, 3:41 PM

This is Tidy behaving nonsensically and pretending as if messing with whitespace doesn't matter. This is not a Parsoid issue.

But, if you take a list like this (note the whitespace after the bullet)

* a

the output is "<li> a</li>" (note the whitespace). Tidy then goes ahead and removes the whitespace completely (which is broken behavior). That is the reason why the navbox you mention displays as it does. Parsoid doesn't remove the whitespace.

We are in the process of replacing Tidy (see T89331 and https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy) and pages and templates that rely on this whitespace mangling behavior of Tidy would need to be fixed up. We'll talk about this and see if there is a way to surface these issues automatically.

So, you can get the correct behavior in Parsoid (and VE) and in the future when we replace Tidy by removing whitespace after bullets in https://en.wikipedia.org/w/index.php?title=Template:The_Legend_of_Zelda
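As a rough illustration of that workaround (removing whitespace after bullets in a template's wikitext), here is a minimal Python sketch. The function name and the regular expression are assumptions for illustration, not part of MediaWiki or Parsoid, and it only looks at * and # markers at the start of a line.

```python
import re

# One or more list markers at the start of a line, followed by spaces or tabs.
_BULLET_SPACE = re.compile(r"^([*#]+)[ \t]+", flags=re.MULTILINE)


def remove_space_after_bullets(wikitext: str) -> str:
    """Return wikitext with the whitespace after leading list markers removed."""
    return _BULLET_SPACE.sub(r"\1", wikitext)
```

For example, remove_space_after_bullets("* a\n** b") returns "*a\n**b". Lines that do not start with a list marker are left alone.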

ssastry renamed this task from Preview of .hlist CSS for sublists on en.WP differs from PHP render to Tidy strips whitespace after <li> tags.Jan 19 2017, 3:52 PM
ssastry triaged this task as Medium priority.

This is Tidy behaving nonsensically and pretending as if messing with whitespace doesn't matter. This is not a Parsoid issue. [snip] We are in the process of replacing Tidy (see T89331 and https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy) and pages and templates that rely on this whitespace mangling behavior of Tidy would need to be fixed up. We'll talk about this and see if there is a way to surface these issues automatically.

Do you realize the predominant (nay, overwhelming) method of indicating list items on probably each and every wiki (including 3rd parties) is to include a space between the list item wikitext (; : * #) and the text-content of that list item?

In the vast majority of cases, that whitespace does not make a difference to rendering. It is only when there is special CSS that affects whitespace that this matters.

But, that apart, my observation is more that the rendering behavior you reported here is not a deliberately specified behavior of wikitext as far as I can tell. It just happens to work that way because of how Tidy does it. So, if the consensus is that whitespace between list items and the text content should be ignored, then that should be deliberately accepted as wikitext parsing behavior and both the PHP parser and Parsoid should be fixed to implement that behavior explicitly.

I guess I am also lacking some knowledge here about HTML5 that is not easily discoverable and that may have influenced my original report: Where in the HTML 5 specification is the expectation established that whitespace post-element pre-element-content-of-interest is retained exactly?

ssastry renamed this task from Tidy strips whitespace after <li> tags to Tidy strips whitespace after HTML tags.Jan 24 2017, 2:30 AM
ssastry renamed this task from Tidy strips whitespace after HTML tags to Tidy strips whitespace after HTML tags AND adds newlines between HTML tags.
ssastry updated the task description.

I guess I am also lacking some knowledge here about HTML5 that is not easily discoverable and that may have influenced my original report: Where in the HTML 5 specification is the expectation established that whitespace post-element pre-element-content-of-interest is retained exactly?

The HTML5 tree builder specification has a detailed algorithm for how an HTML document should be parsed. As far as I can tell, it does not do any whitespace normalization. So, any whitespace characters seen in the input are carried over into the DOM faithfully (with some exceptions around <head>, <html> and <doctype> elements).
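A quick way to see whitespace being carried through unchanged is to collect the text chunks a parser reports. The sketch below uses Python's standard-library html.parser, which is an assumption made for convenience: it is a simple event-based parser, not a spec-compliant HTML5 tree builder, but it likewise passes text through without normalizing whitespace. The class and function names are illustrative.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Record every text chunk exactly as the parser reports it."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def text_nodes(html: str) -> list[str]:
    """Return the text chunks found in html, whitespace included."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Here text_nodes("<li> a</li>") returns [" a"], with the leading space intact, and text_nodes("<li>x</li><li>y</li>") returns ["x", "y"], with no newline invented between the two items.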

Separately, note that if a browser / library were free to mangle whitespace within and outside elements arbitrarily, then CSS rules like display:inline (when applied to "block" elements) or white-space:nowrap would never behave consistently across user agents: Firefox might choose to remove whitespace while Chrome might leave it in place, and the page would then look different depending on which browser was used to view it.

@Izno: HTML 4 specified how whitespace in text nodes should be displayed, requiring that user agents ignore or collapse whitespace in many contexts. Tidy took advantage of this in order to "pretty print" the HTML -- removing whitespace from the start and end of elements, and adding line breaks between elements, in a way which was (at the time) not user visible. However, this was broken by the introduction of CSS, which specified whitespace handling which conflicted with the requirements of HTML 4. Browsers soon followed CSS, not HTML 4.

HTML 5 is very specific about the retention of whitespace when constructing a DOM. This does not conflict with HTML 4, which did not have a concept of a DOM. But the whitespace collapsing rules from HTML 4 were not reproduced in HTML 5 -- this is left as a responsibility of CSS.

So in summary, HTML 5 requires that whitespace be preserved in the DOM, and CSS defines a concept of significant whitespace which contradicts HTML 4.

The HTML5 tree builder specification has a detailed algorithm for how an HTML document should be parsed. As far as I can tell, it does not do any whitespace normalization. [snip]

[snip] So in summary, HTML 5 requires that whitespace be preserved in the DOM, and CSS defines a concept of significant whitespace which contradicts HTML 4.

Okay, I'm satisfied that this is an artifact produced because of the difference between using HTML 5 construction rules and using HTML 4 construction and/or cleaning rules.

In that case, I might indeed recommend allowing for spaces in Parsoid/PHP parser, at least in list item-related wikitext. The behavior difference between a normal list and one which uses some CSS or some such would be inconsistent to end-users reviewing the wikitext of a template versus that of a normal page (at least for old-timers, who are the predominant users of wikitext these days I would guess). The description's example of hlist makes the delta observable (and hlist is surely the best alternative for WCAG reasons). Templates adding hlist are also used in article space--reference en:Template:Flatlist and en:Template:Hlist (one of which enjoys some 70k uses and the other of which enjoys some 110k uses), so not fixing this would produce an inconsistency even within article text of how bullets need to behave if the whitespace is not collapsed.

Additionally, the whitespace in the wikitext aids legibility of wikitext, especially when used with sublists. It's hard to figure out where a sublist begins--or even whether it begins--when the bullets are "scrunched up" next to the list item's content.

Were this problem put in the "to fix" bucket, I would caution on dirty diffs removing whitespace between list item wikitext and list item content, which might occur if the parser of choice were to "enforce" "good" behavior (i.e. no whitespace).

Note that we don't intend to change behaviour of wikitext. So, yes, you can continue to use whitespace after bullets. If anything, we are saying that once Tidy is removed, whitespace in wikitext will be preserved as it exists.

The discussion here is about whitespace in generated HTML. Tidy is *removing* that whitespace in list items which a Tidy replacement (or Parsoid) won't do anymore. But, it seems you are asking us to "fix" wikitext so that leading / trailing whitespace in list items (and possibly table cells?) is stripped.

Were this problem put in the "to fix" bucket, I would caution on dirty diffs removing whitespace between list item wikitext and list item content, which might occur if the parser of choice were to "enforce" "good" behavior (i.e. no whitespace).

No dirty diffs since there is no plan to strip whitespace now. But, if there is a proposal to make whitespace stripping part of wikitext parsing, when that is implemented, yes, Parsoid will take care not to dirty diff.

The discussion here is about whitespace in generated HTML. Tidy is *removing* that whitespace in list items which a Tidy replacement (or Parsoid) won't do anymore.

Yes, I got that.

But, it seems you are asking us to "fix" wikitext so that leading / trailing whitespace in list items (and possibly table cells?) is stripped.

That is indeed what I'm proposing. I suppose you could do this for any HTML structure, but at least those listed. Maybe extend that list to all with a wikimarkup equivalent (so I suppose that also includes headers, table header cells, table summaries, tables themselves?, though the pain with these would not be felt so obviously).
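For illustration only, the proposal for list items could look something like the following Python sketch. Everything here is hypothetical: the function name, the strip flag, and the single-line handling are assumptions, and neither the PHP parser nor Parsoid is implemented this way. Nesting, multi-line items, and ; / : definition lists are ignored.

```python
import re
from typing import Optional

# A list line: one or more * or # markers, then the item content.
_LIST_LINE = re.compile(r"^(?P<marker>[*#]+)(?P<content>.*)$")


def list_line_to_li(line: str, strip: bool = True) -> Optional[str]:
    """Convert one wikitext list line to an <li> element, or None if it is not a list line.

    strip=True models the proposed behavior (trim leading/trailing whitespace);
    strip=False models whitespace being preserved as written.
    """
    match = _LIST_LINE.match(line)
    if match is None:
        return None
    content = match.group("content")
    if strip:
        content = content.strip()
    return f"<li>{content}</li>"
```

With strip=True, list_line_to_li("* a") yields "<li>a</li>"; with strip=False it yields "<li> a</li>", which is the whitespace-preserving output discussed earlier in this thread.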

Were this problem put in the "to fix" bucket, I would caution on dirty diffs removing whitespace between list item wikitext and list item content, which might occur if the parser of choice were to "enforce" "good" behavior (i.e. no whitespace).

No dirty diffs since there is no plan to strip whitespace now. But, if there is a proposal to make whitespace stripping part of wikitext parsing, when that is implemented, yes, Parsoid will take care not to dirty diff.

Thanks for your feedback. We'll discuss this within the team as to how to proceed with this.

T157418: RFC: Make some aspects of Tidy's whitespace stripping behavior part of wikitext parsing "spec" addressed the most important pieces of this issue that affect hlist formatting. We are not going to replicate Tidy behavior wrt "pretty-printing" of HTML.