Migrate some semantic information from data-parsoid to data-mw
Open, NormalPublic

Description

Whether an HTML DOM element was generated by a native wikitext construct or by a literal HTML tag is currently recorded as the "stx": "html" flag in data-parsoid. data-parsoid is considered private and is not part of the public interface: clients are free to inspect it, but there is no guarantee that its spec and format won't change without notice. data-mw, by contrast, is public semantic information.

This HTML syntax flag looks like semantic information as far as serialization is concerned. For example, clients currently have no way to specify that Parsoid should generate an HTML tag during serialization, since data-parsoid is not something they should be modifying.

This also matters for https://gerrit.wikimedia.org/r/#/c/208993/, where we currently blanket-strip data-parsoid; as a result, some information is lost where the subst-ing produces HTML tags.
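As a rough illustration of the migration described above, here is a minimal Python sketch, not Parsoid's actual code. It assumes an element's attributes are available as a plain dict of strings; the function name `strip_data_parsoid` and the public key name (defaulting to "stx") are hypothetical, since the data-mw name is still undecided. It drops data-parsoid but first copies the HTML syntax flag into data-mw so it survives the stripping.

```python
import json


def strip_data_parsoid(attrs, public_key="stx"):
    """Return a copy of attrs without data-parsoid.

    If data-parsoid carried "stx": "html", the flag is preserved under
    public_key in data-mw. public_key is a hypothetical name; the real
    key has not been agreed on.
    """
    out = dict(attrs)
    # Remove the private blob and parse it (empty object if absent).
    private = json.loads(out.pop("data-parsoid", "{}"))
    if private.get("stx") == "html":
        # Merge into any existing data-mw rather than overwriting it.
        public = json.loads(out.get("data-mw", "{}"))
        public[public_key] = "html"
        out["data-mw"] = json.dumps(public, sort_keys=True)
    return out


if __name__ == "__main__":
    element = {"data-parsoid": '{"stx": "html", "dsr": [0, 8, 3, 4]}'}
    print(strip_data_parsoid(element))
```

Elements whose data-parsoid has no "stx": "html" entry simply lose data-parsoid and gain nothing in data-mw, so wikitext syntax stays the implicit default.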

ssastry created this task. May 29 2015, 10:42 PM
ssastry updated the task description.
ssastry raised the priority of this task to Normal.
ssastry added a subscriber: ssastry.
Restricted Application added a subscriber: Aklapper.
GWicke moved this task from Backlog to Future on the RESTBase board. Jun 29 2015, 5:26 PM
cscott added a subscriber: cscott. Feb 11 2016, 7:18 PM

Can we bikeshed on a proper name? "stx: html" isn't really intuitive for an externally-visible API.

data-mw='{"repr":"html"}' or data-mw-repr="html"? repr being short for representation. But that's not terribly better than stx. Better ideas?

The values are html and wt? Is it an observation or a command to Parsoid? If the latter, should it be serialiseAs or something?

<b serializeAs="html">
<b serializeAs="wt">

Of course one or the other of those (the latter?) would presumably be the default and wouldn't need to be spelled out.

The values are html and wt? Is it an observation or a command to Parsoid? If the latter, should it be serialiseAs or something?

Like all data-mw info, it is read/write. It is best to think of it as a property declaration.

An alternative to 'stx': 'html' would be 'html': true

Yeah, though the serialisation format might (?!) differ between tags, e.g. <video>? Also <b serializeAs="md"> for future-proofing. :-)

I am going to veto the 'serializeAs' attribute. Let us continue to use declarations.

cscott added a comment. Edited Feb 11 2016, 7:29 PM

@ssastry oh, agreed. I just spelled it as an attribute rather than embedded within data-mw because I was lazy about typing.

<b data-mw='{"serializeAs": "html" }'>

But I'm not sure I agree with @Jdforrester-WMF that this should be in the imperative mood and be a "command to Parsoid". This is also output from the wt2html pass. It ought to describe the content's preferred format, not command Parsoid.

But I also agree with @Jdforrester that we might want to leave room for future fine distinctions in wikitext presentation. Maybe format is a good name? format=html and format=wt, with future expandability for tags which can be serialized in multiple ways. Maybe even format=wt-singleline or something like that to handle some whitespace issues?
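To make the "format" idea concrete, here is a small Python sketch of how a serializer might consult such a declaration for a bold element. It is only an illustration under assumptions: the "format" key is the name floated above and not an agreed API, `serialize_bold` is a hypothetical helper, and falling back to wikitext for unrecognized values is one possible policy for a hint, not a decided one.

```python
import json


def serialize_bold(text, data_mw="{}"):
    """Serialize bold content to wikitext, honoring a "format" declaration.

    "format" is a hypothetical data-mw key; "wt" is assumed to be the
    default when the key is absent.
    """
    hint = json.loads(data_mw).get("format", "wt")
    if hint == "html":
        # Preserve the literal HTML tag in the wikitext output.
        return "<b>" + text + "</b>"
    # "wt" and any unrecognized value fall back to native wikitext,
    # treating the declaration as a preference rather than a command.
    return "'''" + text + "'''"


if __name__ == "__main__":
    print(serialize_bold("bold"))
    print(serialize_bold("bold", '{"format": "html"}'))
```

Because unknown values degrade to the default, finer-grained values could be added later without breaking serializers that do not know about them.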

This is also output from the wt2html pass.

Yes, but in the context of "on round-trip you must re-serialise as this lest we break things", right?

I'm not totally sure. You might have a third party data source of Parsoid-format DOM, where you're just indicating "this was originally an html tag" or "this will be more readable if you format it in the following way when rendered as wikitext". And serializeAs implies there's only one serialization (to wikitext) -- it might be more forward-thinking to scope this as a hint about *wikitext* serialization specifically.

Note that this patch, whenever it is written and deployed, could result in a complete re-rendering of much of the stored content in RESTBase, since most pages have some HTML tag or other. The thing to check is whether we set these attributes on templated content. If not, fewer pages will be affected, but a large number of pages are still likely to get re-rendered.

GWicke moved this task from Backlog to watching on the Services board. Jul 11 2017, 10:56 PM
GWicke edited projects, added Services (watching); removed Services.