Page MenuHomePhabricator

Parsoid "HTML4BlockTags" tokens are a lie
Open, MediumPublic

Description

  1. It contains HTML5 tags
  2. It contains things which are inline: button, canvas, embed, map, object, progress, video (https://developer.mozilla.org/en-US/docs/Web/HTML/Inline_elements#Elements)

We duplicate this list in VE, so that would need to be fixed too.

Event Timeline

It contains HTML5 tags

https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/381894 is the source of this confusion ... so we just need a different name. :)

It contains things which are inline: button, canvas, embed, map, object, progress, video (https://developer.mozilla.org/en-US/docs/Web/HTML/Inline_elements#Elements)

Looks like this list came from a refactoring in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Parsoid/+/78092 which existed in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Parsoid/+/78092/6/js/lib/mediawiki.Util.js
And, it looks like the other tags came (in Sep 2012) from https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Parsoid/+/24502/4/js/lib/mediawiki.Util.js .. there is no explanation as to why I added those additional tags there.

Did https://developer.mozilla.org/en-US/docs/Web/HTML/Block-level_elements at one point have these other tags listed there?

But, yes, looks like we need a bunch of renaming and updating to do here.

Did https://developer.mozilla.org/en-US/docs/Web/HTML/Block-level_elements at one point have these other tags listed there?

Since https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Parsoid/+/78092/6/js/lib/mediawiki.wikitext.constants.js added a link to that webpage, that is one speculative guess especially since @cscott +2ed that refactor. :) Otherwise, where might I have pulled that rather arbitrary list of tags from?

If only they used a wiki, we could have resolved this mystery. :)

It occured to me that archive.org is the next best thing we have to a wiki ... and https://web.archive.org/web/20140719053636/https://developer.mozilla.org/en-US/docs/Web/HTML/Block-level_elements is the oldest archived version and looks like my speculation was correct. This version from 2014 lists video, aside, hr, canvas.

So, it is interesting (and we never realized this) that a bunch of tags changed their block-vs-inline behavior since we first created this list ... either a documentation issue or a change in html5 spec wrt to their flow / phrase behavior.

Maybe we need to call that list NonPhrasingFlowTags and remove the phrasing tags from that list.

Re: <video>, the comment there says,

# However, you probably want to use `TokenUtils.isBlockTag()`, where some
# exceptions are being made.

That's from https://github.com/wikimedia/parsoid/commit/b05fcc3ffa48fe9b44249978c6115a636e48dadc#diff-74497db19a79a3c03be1291d860ec014

where TokenUtils::isBlockTag() has,

return $name !== 'video' && isset( Consts::$HTML['HTML4BlockTags'][$name] );

The history of that is here,
https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/335956/26//COMMIT_MSG#12
where it's argued,

Yeah, but my point is we can't do that by lying about whether video is an html4 block tag (which it is) because someone will inevitably want to "maintain" the list by resynchronizing it with the actual list of html4 block tags and be confused when everything breaks. Instead of lying about the html4 properties, we need to tweak the p-wrapper (via Util.isBlockTag) for now... at least until we remove it/rewrite it to use HTML5 properties/something else.

Change 604878 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Update Parsoid's understanding of "block" and "inline" tags

https://gerrit.wikimedia.org/r/604878

Change 604878 abandoned by Subramanya Sastry:
Update Parsoid's understanding of "block" and "inline" tags

https://gerrit.wikimedia.org/r/604878

ssastry triaged this task as Medium priority.Jun 15 2020, 5:16 PM
ssastry raised the priority of this task from Medium to Needs Triage.
ssastry triaged this task as Medium priority.
ssastry moved this task from Needs Triage to Tech Debt / Big changes on the Parsoid board.

Change 608970 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] [WIP] Limit to tags which generate implied end tags

https://gerrit.wikimedia.org/r/608970

Change 605281 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] [WIP] Use $onlyInlineElements from Remex when DOM p-wrapping

https://gerrit.wikimedia.org/r/605281

Change 608968 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] [WIP] Move doBlockLevel constants to WikitextConstants

https://gerrit.wikimedia.org/r/608968

Change 608969 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] [WIP] Combine the doBlockLevel tags and use them in the token stream

https://gerrit.wikimedia.org/r/608969

Change 610182 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] [WIP] Get rid of WikitextConstants::$HTML['HTML4BlockTags']

https://gerrit.wikimedia.org/r/610182

Change 610183 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] [WIP] Get rid of WikitextConstants::$SolSpaceSensitiveTags

https://gerrit.wikimedia.org/r/610183

Change 610184 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] [WIP] Remove Consts::$BlockScopeOpenTags && Consts::$BlockScopeCloseTags

https://gerrit.wikimedia.org/r/610184

Change 610185 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] [WIP] Get rid of $WeakIndentPreSuppressingTags and $StrongIndentPreSuppressingTags

https://gerrit.wikimedia.org/r/610185

Wrt HTML4BlockTags use and a general historical philosophy in Parsoid, https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/608968/3/src/Config/WikitextConstants.php#26, @ssastry says,

The original intention was indeed to deliberate[ly] break some forms of compatibility for more coherence.

However, over time, we've backtracked from that ideal where the breakages were too great, to the point where the contents of that set are now all muddled.

The above patch chain is effectively a reset. It uses nomenclature from the legacy parser so that we know how things stand and where they came from. It's a point of departure so that we can lint and deliberately achieve that cohesion, in the tidy to Remex way.

We tried to use the notion of "block tags" as uniformly as possible throughout Parsoid and where block tag is a notion that wikitext inherits from its HTML4 roots and hence needs to be preserved for legacy compatibility reasons. But, as you note, over the years, we have had to introduce a number of potentially confusing concepts like weak / strong indent-pre suppressing tags, block scope opening/closing tags as Parsoid sought to achieve greater compatibility with core parser output.

If it were just HTML4BlockTags that is at issue. it is possible to rectify the corruption of that set (as I set out to do in https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/604878).

I think you are digging beyond just that fix and looking at the notion of the block tag (and other associated wikitext constants I mentioend above) and see how and why we are using those constants and are trying to go back to doing a full reset to use core parser 'doBlockLevel' notions (and in the process, discarding HTML notions and all the other artificial constants we ended up introducing along the way as part of the backtracking you flag there).

All that said, I am okay with this reset with the understanding that once we switch over to Parsoid, as you note, we can clean out some of these inherited notions with more deliberation using linter-based strategies we used as we transitioned from Tidy to Remex.

Thanks for the deep dive! :-)

Change 605281 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Use $onlyInlineElements from Remex when DOM p-wrapping

https://gerrit.wikimedia.org/r/605281

Change 610945 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Rewrite sep indent pre suppressing in terms of block scope

https://gerrit.wikimedia.org/r/610945

Change 612635 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/vendor@master] Bump Parsoid to v0.12.0-a22

https://gerrit.wikimedia.org/r/612635

Change 612634 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/vendor@wmf/1.35.0-wmf.41] Bump Parsoid to v0.12.0-a22

https://gerrit.wikimedia.org/r/612634

Change 612635 merged by jenkins-bot:
[mediawiki/vendor@master] Bump Parsoid to v0.12.0-a22

https://gerrit.wikimedia.org/r/612635

Change 612498 had a related patch set uploaded (by Jforrester; owner: C. Scott Ananian):
[mediawiki/vendor@REL1_35] Bump Parsoid to v0.12.0-a22

https://gerrit.wikimedia.org/r/612498

Change 608968 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Move doBlockLevel constants to WikitextConstants

https://gerrit.wikimedia.org/r/608968

Change 608969 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Combine doBlockLevel tags to form a notional wikitextBlockElems

https://gerrit.wikimedia.org/r/608969

Change 612634 merged by jenkins-bot:
[mediawiki/vendor@wmf/1.35.0-wmf.41] Bump Parsoid to v0.12.0-a22

https://gerrit.wikimedia.org/r/612634

Change 612498 merged by Jforrester:
[mediawiki/vendor@REL1_35] Bump Parsoid to v0.12.0-a22

https://gerrit.wikimedia.org/r/612498

Mentioned in SAL (#wikimedia-operations) [2020-07-14T19:52:46Z] <jforrester@deploy1001> Synchronized php-1.35.0-wmf.41/vendor/wikimedia/parsoid/: T252448 T255190 Bump Parsoid to v0.12.0-a23 (duration: 01m 06s)

Change 608970 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Be precise about why list item ends are implied

https://gerrit.wikimedia.org/r/608970

Change 610182 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Get rid of WikitextConstants::$HTML['HTML4BlockTags']

https://gerrit.wikimedia.org/r/610182

Change 610183 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Get rid of WikitextConstants::$SolSpaceSensitiveTags

https://gerrit.wikimedia.org/r/610183

Change 610184 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Get rid of WikitextConstants::$BlockScope(Open|Close)Tags

https://gerrit.wikimedia.org/r/610184

Change 610185 abandoned by Arlolra:
[mediawiki/services/parsoid@master] Get rid of WikitextConstants::$(Strong|Weak)IndentPreSuppressingTags

Reason:
Squashed with https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/ /610945

https://gerrit.wikimedia.org/r/610185

Change 614872 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/vendor@master] Bump parsoid to 0.13.0-a1

https://gerrit.wikimedia.org/r/614872

Change 614872 merged by jenkins-bot:
[mediawiki/vendor@master] Bump parsoid to 0.13.0-a1

https://gerrit.wikimedia.org/r/614872

Arlolra edited projects, added VisualEditor; removed Technical-Debt, Patch-For-Review.

We duplicate this list in VE, so that would need to be fixed too.

@Esanders As of T255190#6313944, we've removed HTML4BlockTags. But why are you copying this list? Is it to mimic some Parsoid specific behaviour that we should account for or just a convenient place to lookup what was once considered block tags?

@Esanders we are done with this on our end. So, untagged Parsoid. Feel free to create a new task for VE or adopt this.

Krinkle renamed this task from HTML4BlockTags is a lie to Parsoid "HTML4BlockTags" tokens are a lie.Jul 23 2020, 7:16 PM

Change 610945 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Rewrite sep indent pre suppressing in terms of block scope

https://gerrit.wikimedia.org/r/610945

Change 616335 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/vendor@master] Bump parsoid to 0.13.0-a2

https://gerrit.wikimedia.org/r/616335

Change 616335 merged by jenkins-bot:
[mediawiki/vendor@master] Bump parsoid to 0.13.0-a2

https://gerrit.wikimedia.org/r/616335

JTannerWMF added subscribers: ppelberg, JTannerWMF.

@ssastry says the leftover work is on the Editing team so @ppelberg is going to check in with @Esanders to prioritize

@Esanders As of T255190#6313944, we've removed HTML4BlockTags. But why are you copying this list? Is it to mimic some Parsoid specific behaviour that we should account for or just a convenient place to lookup what was once considered block tags?

VE needs to know the difference between inline and block tags to sensibly split and place new content, for example inline content can be inserted into a paragraph (and indeed must be wrapped in a paragraph or some other content branch), but inserting a table into a paragraph should result in the paragraph being split.

@ssastry says the leftover work is on the Editing team so @ppelberg is going to check in with @Esanders to prioritize

Talked with Ed, he's going to finish up the implementation in VE.

Reason: implementation is straightforward and Ed has been thinking about this so he has context about what needs to be done where.

VE needs to know the difference between inline and block tags to sensibly split and place new content, for example inline content can be inserted into a paragraph (and indeed must be wrapped in a paragraph or some other content branch), but inserting a table into a paragraph should result in the paragraph being split.

To provide a bit of background on the work that was done above. HTML5 does away with the concepts of inline and block.

Parsoid has two places where it does paragraph wrapping, similar to the legacy parser.

The token stream paragraph wrapping mimics what's happening in the legacy parser's doBlockLevels,
https://github.com/wikimedia/parsoid/blob/master/src/Wt2Html/TT/ParagraphWrapper.php
using the tags defined there,
https://github.com/wikimedia/parsoid/blob/master/src/Config/WikitextConstants.php#L139-L160

Parsoid also does DOM based wrapping, similar to what Remex does,
https://github.com/wikimedia/parsoid/blob/master/src/Wt2Html/PP/Processors/PWrap.php
using HTML5 phrasing content,
https://github.com/wikimedia/parsoid/blob/master/src/Config/WikitextConstants.php#L206-L220

I imagine you want something like the latter (HTML5 vs wikitext concepts). Here's the function that should give what's considered a block,
https://github.com/wikimedia/parsoid/blob/master/src/Utils/DOMUtils.php#L149-L159

Unless there are significant htm2html considerations, does VE have to know about these broken wikitext notions? I think it is better for VE to focus on HTML(5) concepts.

I'm moving this from "Ready for Development" considering I do not understand this work to be high priority and thus do not foresee us working on this in the near-term.

@Esanders if you think this work is higher priority than I'm currently understanding it to be, please let me know as much.