Consider not removing multiple blank lines/white space between paragraphs
Closed, ResolvedPublic

Description

Some wikis want to use repeated blank lines to create extra white space on a page. This doesn't work in the visual editor (and is discouraged by the Manual of Style at the large Wikipedias), although it does work if you put multiple blank lines in between paragraphs in the wikitext mode.

Please consider keeping the extra blank space.

Related Objects

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJan 11 2018, 6:38 PM

This normalisation is done by Parsoid, but seemingly not documented here: https://www.mediawiki.org/wiki/Parsoid/Normalizations nor do I remember it being added.

This normalisation is done by Parsoid, but seemingly not documented here: https://www.mediawiki.org/wiki/Parsoid/Normalizations nor do I remember it being added.

I am confused now ... I didn't think this was Parsoid related. I still don't have a reproducible example of what the complaint is (from https://www.mediawiki.org/wiki/Topic:Tuhfcu8edr2h8xso )

Reproduction should be very easy:

  1. Open the visual editor.
  2. Type something. Press return ten times. Type something else.
  3. Save page.
  4. Discover that your ten blank lines are now just one.

Thanks. Yes, I see that now. So, it is not some edge case behavior.

[subbu@earth tools] echo "<p>foo</p><p></p><p></p><p></p><p>bar</p>" | parse.js --html2wt
foo

bar

I was mistaken about it on the flow thread then. This is a Parsoid bug. Looks like this has been the behavior in Parsoid since the very beginning since there isn't any special normalization we are doing for this.

ssastry moved this task from Backlog to html2wt on the Parsoid board.Jan 12 2018, 11:38 PM
ssastry triaged this task as Normal priority.
ssastry raised the priority of this task from Normal to High.Feb 20 2018, 5:21 AM
cscott added a subscriber: cscott.Feb 20 2018, 4:14 PM

T175421: Parsoid doesn't insert same spacers when article text starts with two or more newlines. may be related (at least in the sense that it could be fixed at the same time, if not actually due to the same cause).

ssastry claimed this task.Feb 21 2018, 11:29 PM
ssastry added a comment.EditedFeb 22 2018, 5:06 AM

@Esanders, so, in trying to fix this bug in Parsoid, I was playing around with HTML snippets in the browser and I looked at the following three:

<h1>br tags wrapped in p tags</h1>
<p>foo</p>
<p><br/><br/></p>
<p>boo</p>

<h1>br tags</h1>
<p>foo</p>
<br/><br/>
<p>boo</p>

<h1>Empty p-tags</h1>
<p>foo</p>
<p></p>
<p>boo</p>

When I look at the rendering in the browser, only the first 2 snippets generate the necessary whitespace. The browser suppresses the empty paragraph effectively. Given that, shouldn't VE be generating br tags instead of empty p-tags?

But, one simple solution to tackling this in Parsoid would be to implement a normalization wherein empty p-tags are normalized to <br/> tags. While we can do that relatively easily, is there a reason why VE is generating empty p-tags when editors insert empty lines?

238482n375 lowered the priority of this task from High to Lowest.Jun 15 2018, 8:03 AM
238482n375 removed ssastry as the assignee of this task.
238482n375 moved this task from Next Up to In Code Review on the Analytics-Kanban board.
238482n375 edited subscribers, added: 238482n375; removed: Aklapper.

SG9tZVBoYWJyaWNhdG9yCk5vIG1lc3NhZ2VzLiBObyBub3RpZmljYXRpb25zLgoKICAgIFNlYXJjaAoKQ3JlYXRlIFRhc2sKTWFuaXBoZXN0ClQxOTcyODEKRml4IGZhaWxpbmcgd2VicmVxdWVzdCBob3VycyAodXBsb2FkIGFuZCB0ZXh0IDIwMTgtMDYtMTQtMTEpCk9wZW4sIE5lZWRzIFRyaWFnZVB1YmxpYwoKICAgIEVkaXQgVGFzawogICAgRWRpdCBSZWxhdGVkIFRhc2tzLi4uCiAgICBFZGl0IFJlbGF0ZWQgT2JqZWN0cy4uLgogICAgUHJvdGVjdCBhcyBzZWN1cml0eSBpc3N1ZQoKICAgIE11dGUgTm90aWZpY2F0aW9ucwogICAgQXdhcmQgVG9rZW4KICAgIEZsYWcgRm9yIExhdGVyCgpUYWdzCgogICAgQW5hbHl0aWNzLUthbmJhbiAoSW4gUHJvZ3Jlc3MpCgpTdWJzY3JpYmVycwpBa2xhcHBlciwgSkFsbGVtYW5kb3UKQXNzaWduZWQgVG8KSkFsbGVtYW5kb3UKQXV0aG9yZWQgQnkKSkFsbGVtYW5kb3UsIEZyaSwgSnVuIDE1CkRlc2NyaXB0aW9uCgpPb3ppZSBqb2JzIGhhdmUgYmVlbiBmYWlsaW5nIGF0IGxlYXN0IGEgZmV3IHRpbWVzIGVhY2guIE1vcmUgaW52ZXN0aWdhdGlvbiBuZWVkZWQuCkpBbGxlbWFuZG91IGNyZWF0ZWQgdGhpcyB0YXNrLkZyaSwgSnVuIDE1LCA3OjIxIEFNCkhlcmFsZCBhZGRlZCBhIHN1YnNjcmliZXI6IEFrbGFwcGVyLiC3IFZpZXcgSGVyYWxkIFRyYW5zY3JpcHRGcmksIEp1biAxNSwgNzoyMSBBTQpKQWxsZW1hbmRvdSBjbGFpbWVkIHRoaXMgdGFzay5GcmksIEp1biAxNSwgNzoyMiBBTQpKQWxsZW1hbmRvdSB1cGRhdGVkIHRoZSB0YXNrIGRlc2NyaXB0aW9uLiAoU2hvdyBEZXRhaWxzKQpKQWxsZW1hbmRvdSBhZGRlZCBhIHByb2plY3Q6IEFuYWx5dGljcy1LYW5iYW4uCkpBbGxlbWFuZG91IG1vdmVkIHRoaXMgdGFzayBmcm9tIE5leHQgVXAgdG8gSW4gUHJvZ3Jlc3Mgb24gdGhlIEFuYWx5dGljcy1LYW5iYW4gYm9hcmQuCkNoYW5nZSBTdWJzY3JpYmVycwpDaGFuZ2UgUHJpb3JpdHkKQXNzaWduIC8gQ2xhaW0KTW92ZSBvbiBXb3JrYm9hcmQKQ2hhbmdlIFByb2plY3QgVGFncwpBbmFseXRpY3MtS2FuYmFuCtcKU2VjdXJpdHkK1wpXaWtpbWVkaWEtVkUtQ2FtcGFpZ25zIChTMi0yMDE4KQrXClNjYXAK1wpTY2FwIChTY2FwMy1BZG9wdGlvbi1QaGFzZTIpCtcKQWJ1c2VGaWx0ZXIK1wpEYXRhLXJlbGVhc2UK1wpIYXNodGFncwrXCkxhYnNEQi1BdWRpdG9yCtcKTGFkaWVzLVRoYXQtRk9TUy1NZWRpYVdpa2kK1wpMYW5ndWFnZS0yMDE4LUFwci1KdW5lCtcKTGFuZ3VhZ2UtMjAxOC1KYW4tTWFyCtcKSEhWTQrXCkhBV2VsY29tZQrXCkJvbGQKSXRhbGljcwpNb25vc3BhY2VkCkxpbmsKQnVsbGV0ZWQgTGlzdApOdW1iZXJlZCBMaXN0CkNvZGUgQmxvY2sKUXVvdGUKVGFibGUKVXBsb2FkIEZpbGUKTWVtZQpQcmV2aWV3CkhlbHAKRnVsbHNjcmVlbiBNb2RlClBpbiBGb3JtIE9uIFNjcmVlbgoyMzg0ODJuMzc1IGFkZGVkIHByb2plY3RzOiBTZWN1cml0eSwgV2lraW1lZGlhLVZFLUNhbXBhaWducyAoUzItMjAxOCksIFNjYXAgKFNjYXAzLUFkb3B0aW9uLVBoYXNlMiksIEFidXNlRmlsdGVyLCBEYXRhLXJlbGVhc2UsIEhhc2h0YWdzLCBMYWJzREItQXVkaXRvciwgTGFkaWVzLVRoYXQtRk9TUy1NZWRpYVdpa2ksIExhbmd1YWdlLTIwMTgtQXByLUp1bmUsIExhbmd1YWdlLTIwMTgtSmFuLU1hciwgSEhWTSwgSEFXZWxjb21lLlBSRVZJRVcKMjM4NDgybjM3NSBtb3ZlZCB0aGlzIHRhc2sgZnJvbSBJbiBQcm9ncmVzcyB0byBJbiBDb2RlIFJldmlldyBvbiB0aGUgQW5hbHl0aWNzLUthbmJhbiBib2FyZC4KMjM4NDgybjM3NSByZW1vdmVkIEpBbGxlbWFuZG91IGFzIHRoZSBhc3NpZ25lZSBvZiB0aGlzIHRhc2suCjIzODQ4Mm4zNzUgdHJpYWdlZCB0aGlzIHRhc2sgYXMgTG93ZXN0IHByaW9yaXR5LgoyMzg0ODJuMzc1IHJlbW92ZWQgc3Vic2NyaWJlcnM6IEFrbGFwcGVyLCBKQWxsZW1hbmRvdS4KQ29udGVudCBsaWNlbnNlZCB1bmRlciBDcmVhdGl2ZSBDb21tb25zIEF0dHJpYnV0aW9uLVNoYXJlQWxpa2UgMy4wIChDQy1CWS1TQSkgdW5sZXNzIG90aGVyd2lzZSBub3RlZDsgY29kZSBsaWNlbnNlZCB1bmRlciBHTlUgR2VuZXJhbCBQdWJsaWMgTGljZW5zZSAoR1BMKSBvciBvdGhlciBvcGVuIHNvdXJjZSBsaWNlbnNlcy4gQnkgdXNpbmcgdGhpcyBzaXRlLCB5b3UgYWdyZWUgdG8gdGhlIFRlcm1zIG9mIFVzZSwgUHJpdmFjeSBQb2xpY3ksIGFuZCBDb2RlIG9mIENvbmR1Y3QuILcgV2lraW1lZGlhIEZvdW5kYXRpb24gtyBQcml2YWN5IFBvbGljeSC3IENvZGUgb2YgQ29uZHVjdCC3IFRlcm1zIG9mIFVzZSC3IERpc2NsYWltZXIgtyBDQy1CWS1TQSC3IEdQTApZb3VyIGJyb3dzZXIgdGltZXpvbmUgc2V0dGluZyBkaWZmZXJzIGZyb20gdGhlIHRpbWV6b25lIHNldHRpbmcgaW4geW91ciBwcm9maWxlLCBjbGljayB0byByZWNvbmNpbGUu

Aklapper raised the priority of this task from Lowest to High.Jun 15 2018, 11:20 AM
Aklapper assigned this task to ssastry.
Restricted Application added a subscriber: Daimona. · View Herald TranscriptJun 15 2018, 11:20 AM
Daimona removed a subscriber: Daimona.Jun 15 2018, 12:11 PM

Change 465656 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] WIP: Normalize DOM to convert empty <p></p> tags to <p>?<br/></p>?

https://gerrit.wikimedia.org/r/465656

If we need to output <p><br></p> instead of <p></p> in the HTML we send to Parsoid, then I suppose we can do that.

But I'm not sure if VisualEditor or Parsoid is the better place to fix this.

<h1>Empty p-tags</h1>
<p>foo</p>
<p></p>
<p>boo</p>

When I look at the rendering in the browser, only the first 2 snippets generate the necessary whitespace. The browser suppresses the empty paragraph effectively. Given that, shouldn't VE be generating br tags instead of empty p-tags?

I will note that this applies to the following snippets as well. The empty element has no visible rendering (due to margin collapsing):

<h1>foo</h1>
<h1></h1>
<h1>boo</h1>
<pre>foo</pre>
<pre></pre>
<pre>boo</pre>
<blockquote>foo</blockquote>
<blockquote></blockquote>
<blockquote>boo</blockquote>

But, we probably shouldn't be inserting <br> tags inside empty elements other than <p>, because apparently they are not ignored by Parsoid (and serialize to <br> in wikitext)? This behavior is very difficult for me to understand.

Also… when Parsoid parses a bunch of newlines in wikitext and outputs a bunch of <p><br></p> structures for them, it should mark the <br> node somehow so that we can ignore it in VisualEditor. Currently VisualEditor treats it as any other <br> tag and allows inserting content on either side of it (essentially, rendering it as two empty lines instead of one).

Change 465761 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[VisualEditor/VisualEditor@master] [DNM] Output <br> inside empty <p></p> in Parsoid HTML

https://gerrit.wikimedia.org/r/465761

But, we probably shouldn't be inserting <br> tags inside empty elements other than <p>, because apparently they are not ignored by Parsoid (and serialize to <br> in wikitext)? This behavior is very difficult for me to understand.

You are right, my original reasoning on relying on how HTML renders those is flawed since none of those empty tags render in HTML either.

Refreshing my memory about the code, the reason paragraphs seem to be special is because p-tags map to newlines and we normalize newline counts to ensure core semantics are preserved. More specifically, newlines in the wt -> html direction map to either <p><br/></p> or <p><br/>some content here </p>. So, in the html -> wt direction, Parsoid expects <br/> to force extra newlines. The absence of <br/> is causing Parsoid to normalize newline counts. I think this behavior is correct since it is a mirror of what the wt -> html generates.

So, the qn. is if Parsoid should accept <p></p> and normalize it to the expected form (my patch) or should VE return HTML that matches Parsoid's expectations. Given that Parsoid normalizes other empty forms to what the core html -> wt code expects, I suppose it is reasonable to have Parsoid normalize it to either <p><br/> (1 newline) or <p><br/></p> (2 newlines), i.e. be more liberal about how it treats variations from canonical HTML. Thoughts?

Relatedly, but not directly related to this patch, this once again highlights missing documentation / spec about canonical HTML Parsoid expects (for canonical wikitext output) and how it handles variations from that. At our recent offsite, we discussed document a html -> wt spec (not just wt -> html spec that we currently publish) that makes explicit and documents this behavior and identifies inconsistencies that can be fixed.

Change 465656 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Scrub DOM to convert <p></p> sequences to NL-emitting normal forms

https://gerrit.wikimedia.org/r/465656

Change 465761 abandoned by Bartosz Dziewoński:
[DNM] Output <br> inside empty <p></p> in Parsoid HTML

Reason:
No longer needed after https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/ /465656

https://gerrit.wikimedia.org/r/465761

ssastry closed this task as Resolved.Thu, Dec 6, 4:13 PM
Restricted Application added a project: User-Ryasmeen. · View Herald TranscriptThu, Dec 6, 4:13 PM