
Move single-line-context-enforcement and meta-tag hoisting from headings to DOM normalization phase
Open, Medium, Public

Description

Normalizing the DOM in these ways before passing it to the serializer simplifies serialization and makes for a simpler, more robust codebase.

Event Timeline

ssastry raised the priority of this task from to Medium.
ssastry updated the task description.
ssastry subscribed.

Change 213855 had a related patch set uploaded (by Arlolra):
Move meta-tag hoisting from <h*>s to DOM normalization phase

https://gerrit.wikimedia.org/r/213855

Meta-tags are done in the above patch.

Single-line-context-enforcement probably isn't going to be so straightforward (at least I'd need to think about it / research some more).

The general idea would just be to climb the tree and check if we're inside a heading or a list item.
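A minimal sketch of that tree climb, for illustration only. The `Node` class, the `SINGLE_LINE_TAGS` set, and the function name are hypothetical stand-ins, not Parsoid's actual DOM types or API:

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: headings and list items are the single-line-context
# constructs, as named in the comment above. Other constructs may apply.
SINGLE_LINE_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "li"}


@dataclass
class Node:
    """Hypothetical, stripped-down DOM node: a tag name and a parent link."""
    name: str
    parent: Optional["Node"] = None


def in_single_line_context(node: Node) -> bool:
    """Climb the ancestor chain; True if any ancestor is a heading or li."""
    cur = node.parent
    while cur is not None:
        if cur.name in SINGLE_LINE_TAGS:
            return True
        cur = cur.parent
    return False


# Usage: <body><ul><li><b>..</b></li></ul><p>..</p></body>
body = Node("body")
ul = Node("ul", body)
li = Node("li", ul)
bold = Node("b", li)
para = Node("p", body)

print(in_single_line_context(bold))  # bold sits inside an li
print(in_single_line_context(para))  # para has no heading/li ancestor
```

Note that with this naive climb, items of a list nested inside an `li` would also be reported as being in a single-line context, so a real version would probably need some boundary at `ul`/`ol`.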

There are a few tricky bits in the current implementation where sl-context is temporarily disabled to preserve newlines.

  • Templates: but here we'd just need to skip encapsulated content, which the normalizer already does.
  • List elements (ul or ol, as opposed to list item elements, li): we're currently doing that to preserve separators between list items which would otherwise be suppressed. This wouldn't be an issue with the above idea.

However, separators in general would be an issue. For example, categories serialize to a new line and would need to be aware of their context. The current stack-based approach takes care of this pretty cleanly.

Is this the optional normalization phase, or something you'll apply to all html2wt conversions?

@GWicke: It is currently applied to all html2wt conversions and would remain so. It's more of a code reorganization. The implementation was done in T52683.

Change 213855 merged by jenkins-bot:
Move meta-tag hoisting from <h*>s to DOM normalization phase

https://gerrit.wikimedia.org/r/213855

  • We serialize a category to a new line only if it is a newly inserted category link. In any case, VE/clients shouldn't be inserting category links inside list items in the first place. Worth testing in VE.
  • We cannot handle arbitrary HTML via newline normalization only -- in some scenarios, serialization will require switching from wikitext to html tag mode. But, we aren't there yet.
  • If list items / headings / other single-line-context wikitext constructs only had phrasing content (no P/PRE tags), deep newline suppression would always work. (Exception: We do handle a p-tag inside li/td children right now via separator constraints, I think).
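The phrasing-content condition in the last point could be checked with something like the sketch below. It is an illustration under assumptions: `Node`, `BLOCK_TAGS`, and `has_only_phrasing_content` are hypothetical names, and the tag set only covers the block tags mentioned in this thread (P, PRE, tables), not the full HTML content model.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: only the block-level tags mentioned in this thread.
BLOCK_TAGS = {"p", "pre", "table"}


@dataclass
class Node:
    """Hypothetical, stripped-down DOM node: a tag name and child nodes."""
    name: str
    children: List["Node"] = field(default_factory=list)


def has_only_phrasing_content(node: Node) -> bool:
    """True if no descendant of `node` is one of BLOCK_TAGS."""
    for child in node.children:
        if child.name in BLOCK_TAGS:
            return False
        if not has_only_phrasing_content(child):
            return False
    return True


# Usage: an li with inline markup only, and one with a nested paragraph.
simple_li = Node("li", [Node("b", [Node("#text")])])
risky_li = Node("li", [Node("span", [Node("p", [Node("#text")])])])

print(has_only_phrasing_content(simple_li))  # inline markup only
print(has_only_phrasing_content(risky_li))   # contains a nested p
```

Under this reading, deep newline suppression would be applied only to nodes that pass the check, and anything else would stay on the existing separator-constraint path.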

It would be good to simplify the core serialization code by migrating this logic out, but yes, moving it from the core serializer into the normalization routine would degrade functionality, since we can handle more scenarios right now by leaving it in core. Still, as long as VE doesn't do crazy things like trying to insert multiple paragraphs, tables, pres, etc. inside these single-line-context constructs, it is okay. Let us sit on this for a little while.

We have been hitting this Parsoid-VE edge several times in recent months. VE wants to be an HTML-only editor (without any knowledge of wikitext), and Parsoid expects the HTML to conform to certain wikitext norms for proper serialization. There is also a gap in our DOM spec: it covers only the HTML that Parsoid generates, not what Parsoid can handle as input. We clearly cannot handle arbitrary HTML, but we don't specify the content model that we accept and guarantee serialization to "good wikitext" for. If we can close that gap, it becomes easier to design normalization passes / libraries for massaging cut-paste from different sources, etc.

Arlolra renamed this task from Technical-Debt: Serializer: Move single-line-context-enforcement and meta-tag hoisting from headings to DOM normalization phase to Move single-line-context-enforcement and meta-tag hoisting from headings to DOM normalization phase. Jun 3 2015, 6:58 PM

An alternative idea from @ssastry on https://gerrit.wikimedia.org/r/#/c/223603/

At the site where separators are being emitted, you could look at the context to see whether the node is in a single-line context, by walking up the DOM, examining siblings, etc. However, since that can be expensive to do for every separator, perhaps you could introduce a 'maybeSLC' flag in state that is set whenever you enter lists, and then examine separators only when the flag is set?
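A rough sketch of how such a flag might gate the expensive check. Everything here is hypothetical: `SerializerState`, `enter_list`/`leave_list`, `emit_separator`, and the `dom_walks` counter are illustrative names, not Parsoid code. Using a depth counter rather than a boolean, and stripping newlines as the enforcement step, are both assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

SINGLE_LINE_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "li"}


@dataclass
class Node:
    """Hypothetical, stripped-down DOM node."""
    name: str
    parent: Optional["Node"] = None


def in_single_line_context(node: Node) -> bool:
    """The expensive part: walk up the ancestors looking for h*/li."""
    cur = node.parent
    while cur is not None:
        if cur.name in SINGLE_LINE_TAGS:
            return True
        cur = cur.parent
    return False


class SerializerState:
    """Hypothetical serializer state carrying the 'maybeSLC' flag."""

    def __init__(self) -> None:
        self.list_depth = 0  # assumption: a counter, so nested lists work
        self.dom_walks = 0   # instrumentation for this example only

    @property
    def maybe_slc(self) -> bool:
        return self.list_depth > 0

    def enter_list(self) -> None:
        self.list_depth += 1

    def leave_list(self) -> None:
        self.list_depth -= 1


def emit_separator(state: SerializerState, node: Node, sep: str) -> str:
    """Return the separator to emit before `node`."""
    if not state.maybe_slc:
        return sep  # fast path: no DOM walk outside lists
    state.dom_walks += 1
    if in_single_line_context(node):
        return sep.replace("\n", "")  # assumption: enforcement drops newlines
    return sep
```

Outside a list, `emit_separator` returns the separator untouched without climbing the DOM. Between `enter_list()` and `leave_list()` it pays for the climb and drops newlines only for nodes that really sit inside a heading or list item.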