Move data-parsoid and possibly data-mw out of the DOM, add uids
Closed, Resolved · Public · 0 Estimated Story Points

Description

We need a general way to associate information with DOM nodes without having that information inline. The current idea is to set a UID on each DOM node that has associated information, and use that as the key to externally stored metadata. This can then be applied to remove private information like data-parsoid from the DOM we send to the client.

An issue to consider is copy & pasting between pages of the same wiki or even different wikis.

A simple and safe solution would be to discard all associated private information for modified (copy & pasted) content. This means that we would have to leave all semantic information (data-mw primarily) in the DOM even on page views. It also means that blame map information for example would be lost when a paragraph is moved around.

An alternative would be to make uids unique in a wiki, or even across wikis. Example: <wiki id>:<revision id>:<node id>. 1000:40233066:100000 for example can be encoded as Po:CZehq:Yag. This would allow us to move data-mw out of the view DOM as well, and would open up interesting ways to preserve associated metadata like blame maps across copy & pastes. The wiki id would need to be unique though, and there would need to be a public API to retrieve associated metadata. When the wiki id is not recognized or data retrieval fails, we might lose the associated data-mw as well.
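The encoded example above is consistent with writing each numeric component in base 64, most significant digit first, using the standard base64 alphabet (A-Z, a-z, 0-9, +, /). A minimal sketch under that assumption; the function names are hypothetical and not part of Parsoid:

```python
# Digit alphabet inferred from the example in this task. A real scheme might
# swap '+' and '/' for characters that are friendlier in ids and URLs.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def encode_number(n: int) -> str:
    """Write a non-negative integer in base 64, most significant digit first."""
    if n < 0:
        raise ValueError("ids must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 64)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def decode_number(s: str) -> int:
    """Inverse of encode_number."""
    n = 0
    for ch in s:
        n = n * 64 + ALPHABET.index(ch)
    return n


def encode_uid(wiki_id: int, revision_id: int, node_id: int) -> str:
    """Build '<wiki id>:<revision id>:<node id>' with each part encoded."""
    return ":".join(encode_number(p) for p in (wiki_id, revision_id, node_id))


def decode_uid(uid: str) -> tuple:
    """Split an encoded uid back into (wiki_id, revision_id, node_id)."""
    wiki_id, revision_id, node_id = (decode_number(p) for p in uid.split(":"))
    return wiki_id, revision_id, node_id


print(encode_uid(1000, 40233066, 100000))  # Po:CZehq:Yag
```

Decoding `Po:CZehq:Yag` with `decode_uid` gives back `(1000, 40233066, 100000)`, so a paste target could recover the wiki, revision and node from the id alone.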


Version: unspecified
Severity: normal

See also: https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec/Element_IDs

Details

Reference
bz52936

Event Timeline

bzimport raised the priority of this task to High. (Nov 22 2014, 1:51 AM)
bzimport added a project: Parsoid.
bzimport set Reference to bz52936.

As for data-mw, moving inlined data-mw into a single global data-mw JSON (or JSON-LD) object with information about all typed nodes might be a good intermediate fix.

data-mw = {
  "#mwt1": {
    "@type": "mw:Transclusion",
    target: {...},
    params: {...}
  },
  "#mwt2": {
    "@type": "mw:Extension",
    target: "math",
    attrs: {...},  // or could be called params as well
    body: {...}
  },
  ...
}

This way, the DOM and the data about the DOM are kept separate and can be served separately if necessary. Clients that don't care about this information can ignore it completely, and the DOM itself is not bloated. It also eliminates one level of escaping, and the data can be processed concurrently by clients like VE.
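A rough sketch of that separation, assuming well-formed XHTML input and Python's standard `xml.etree.ElementTree` for parsing. The function name `extract_data_mw` and the `mwtN` id generation are illustrative assumptions, not Parsoid's actual implementation:

```python
import json
import xml.etree.ElementTree as ET


def extract_data_mw(xhtml: str):
    """Move inline data-mw attributes into one object keyed by '#<id>'.

    Returns the stripped markup and the collected data-mw object.
    """
    root = ET.fromstring(xhtml)
    used_ids = {el.get("id") for el in root.iter() if el.get("id")}
    data_mw = {}
    counter = 0
    for el in root.iter():
        raw = el.attrib.pop("data-mw", None)
        if raw is None:
            continue
        if el.get("id") is None:
            # Assign the next free id of the (assumed) form mwtN.
            counter += 1
            while f"mwt{counter}" in used_ids:
                counter += 1
            el.set("id", f"mwt{counter}")
            used_ids.add(f"mwt{counter}")
        # The attribute value is JSON; parsing it here removes one
        # level of escaping for consumers.
        data_mw["#" + el.get("id")] = json.loads(raw)
    return ET.tostring(root, encoding="unicode"), data_mw


doc = (
    '<div><span typeof="mw:Transclusion" '
    'data-mw=\'{"target": {"wt": "echo"}}\'>hi</span><p>plain</p></div>'
)
html, data_mw = extract_data_mw(doc)
```

After this call, `html` no longer contains any `data-mw` attribute and `data_mw` holds the parsed object under `"#mwt1"`, so the two can be served or ignored independently.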

@subbu: Before we can remove data-mw from the content, we will need a solution for copy & pasting from a view. HTML copied from a read-only page will carry only its attributes, not data-mw. A paste target (VE for example) would need to be able to retrieve the associated metadata like data-mw solely based on the attributes in the pasted HTML fragment. Hence the UID scheme above.
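The lookup a paste target would need can be sketched as follows. This is only an illustration: `resolve_pasted_metadata` and the `fetch` callable are hypothetical stand-ins for whatever public API would serve the metadata, and a failed lookup simply yields no metadata, matching the "discard on failure" fallback described in the task.

```python
from typing import Callable, Dict, Iterable, Optional

Metadata = Dict[str, object]


def resolve_pasted_metadata(
    uids: Iterable[str],
    fetch: Callable[[str], Metadata],
) -> Dict[str, Optional[Metadata]]:
    """Look up metadata for each uid found in a pasted fragment.

    `fetch` is a hypothetical retrieval function. Unknown ids or
    retrieval errors map to None, meaning the metadata is lost.
    """
    resolved: Dict[str, Optional[Metadata]] = {}
    for uid in uids:
        try:
            resolved[uid] = fetch(uid)
        except (KeyError, OSError):
            resolved[uid] = None
    return resolved


# Stand-in for a remote metadata store, keyed by uid.
store = {"Po:CZehq:Yag": {"@type": "mw:Transclusion"}}
result = resolve_pasted_metadata(["Po:CZehq:Yag", "unknown"], store.__getitem__)
```

Here the first id resolves to its stored metadata and the unrecognized one resolves to `None`, so the paste target can treat that node as plain content.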

As an interim solution until we have separate storage for data-parsoid, we should consider moving data-parsoid into a single JSON structure in the head of the document and inserting locally unique ids to reference it.

This will make it easier to strip this out in the VE frontend (ideally just with a regexp), and can make our output usable for mobile.
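As an illustration of the regexp-based stripping, here is a small sketch. The container it matches (a `script` element with id `mw-data-parsoid`) is purely an assumption made for the example; the task does not say which element or id holds the JSON.

```python
import re

# Hypothetical container for the JSON blob in <head>. The producer is
# assumed to escape any "</script>" sequence inside the JSON payload.
DATA_PARSOID_BLOCK = re.compile(
    r'<script\b[^>]*\bid="mw-data-parsoid"[^>]*>.*?</script>',
    re.DOTALL,
)


def strip_data_parsoid(html: str) -> str:
    """Remove the single data-parsoid block, leaving the rest untouched."""
    return DATA_PARSOID_BLOCK.sub("", html, count=1)


page = (
    "<html><head><title>T</title>"
    '<script type="application/json" id="mw-data-parsoid">'
    '{"mwt1": {"example": true}}'
    "</script></head>"
    '<body><p id="mwt1">Hello</p></body></html>'
)
stripped = strip_data_parsoid(page)
```

Because all private data sits in one well-delimited block, a front-end can drop it without parsing the document, and the ids on the body elements stay in place for later lookups.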

Subbu and I developed a solution for this based on simple id attributes and revision URL injection on copy. See the spec for the details:
https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec/Element_IDs

I'll start by assigning ids and moving all data-parsoid info into the head of the document as a stop-gap until we have separate metadata storage in the revision store (see bug 49143). This makes it easier to strip it in a front-end to minimize the transferred size.

Change 88395 had a related patch set uploaded by Arlolra:
WIP: Move data-parsoid into a JSON structure in the head

https://gerrit.wikimedia.org/r/88395

Change 88395 merged by jenkins-bot:
Move data-parsoid into a JSON structure outside the DOM

https://gerrit.wikimedia.org/r/88395

Support for separate data-parsoid is now merged. The next step will be to store this separately in Rashomon, so that our DOM output is actually free of this data. Keeping this bug open to track work on that as well.

Next steps:

  • Return a (JSON?) compound response with separate data-parsoid, data-mw & HTML from a Parsoid web API end point
  • Accept the same as an input, or (alternatively) pull data-parsoid from restbase separately.
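The two steps above can be sketched as a round trip. The key names (`html`, `data-parsoid`, `data-mw`), the `#<id>` keying, and both function names are assumptions for illustration; the actual response format is defined by the merged patches, not by this example.

```python
import json
import xml.etree.ElementTree as ET


def build_response(html: str, data_parsoid: dict, data_mw: dict) -> str:
    """Bundle markup and metadata into one JSON compound response."""
    return json.dumps(
        {"html": html, "data-parsoid": data_parsoid, "data-mw": data_mw}
    )


def inline_metadata(body: str) -> str:
    """Accept the compound form and re-attach metadata to elements by id."""
    payload = json.loads(body)
    root = ET.fromstring(payload["html"])  # assumes well-formed XHTML
    for el in root.iter():
        el_id = el.get("id")
        if el_id is None:
            continue
        for attr in ("data-parsoid", "data-mw"):
            meta = payload.get(attr, {}).get("#" + el_id)
            if meta is not None:
                el.set(attr, json.dumps(meta))
    return ET.tostring(root, encoding="unicode")


body = build_response('<p id="mwt1">Hello</p>', {"#mwt1": {"src": "Hello"}}, {})
restored = inline_metadata(body)
```

With this shape, a client that only wants to render the page reads `html` and ignores the rest, while an editor posts the whole compound back so the metadata can be matched up by id.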

Change 159111 had a related patch set uploaded by Arlolra:
Return a JSON response with separate html and data-parsoid

https://gerrit.wikimedia.org/r/159111

Change 159111 merged by jenkins-bot:
Return a JSON response with separate html and data-parsoid

https://gerrit.wikimedia.org/r/159111

All patches mentioned in this report were merged - is there more work left to do here (if yes: please reset the bug report status to NEW or ASSIGNED), or can you close this ticket as RESOLVED FIXED?

T54091 is already blocked in Firefox because it refuses to copy RDFa.

Sure, but providing a good experience for 70% of our users is better than none.

See also: T78676

Summary: To really make a difference for large / template-heavy pages, we need to remove data-mw as well. We should also look into stripping embedded data-parsoid from data-mw attributes.

The main part of this task, "Move data-parsoid out of the DOM, add uids", is complete. For data-mw, see: T78676