Page MenuHomePhabricator

Consistently use the same render for html2wt processing after an edit
Closed, ResolvedPublic

Description

Parsoid element ids are not stable at this point. A template re-expansion and possibly even a non-deterministic processing order can currently change the ID assignment between different renders of the same article revision.

Currently we only require the oldid to be passed in for html2wt conversion. Given the oldid, we'll then send html and data-parsoid of the latest render to Parsoid. However, Parsoid assumes that the received data-parsoid actually corresponds to the edited HTML's render. If ID assignment changed, this can lead to major corruptions.

I see two main solutions to this issue:

  • Parsoid: T93715: [EPIC] Make Parsoid HTML output completely deterministic. It should be possible to do this for the vast majority of cases, but there will still be some extreme cases of templates restructuring the page that could change the ID assignment.
  • Require clients to pass in the original etag (tid) in the html2wt request, possibly in a If-Match header or the path and use that to retrieve the exact render this edit is based on. This is safe, but more complex for clients and impinges on RESTBase's ability to garbage-collect older renders in a timely manner (see T94196).

See also:

Event Timeline

GWicke raised the priority of this task from to Needs Triage.
GWicke updated the task description. (Show Details)
GWicke added projects: RESTBase, Parsoid.
GWicke added a subscriber: GWicke.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 30 2015, 3:59 PM
GWicke triaged this task as High priority.Mar 30 2015, 4:00 PM
GWicke set Security to None.

I don't think this is related to T93715 at all .. T93715 is related to about-ids. element ids used for data-parsoid are assigned in a pass on the final DOM, and as long as the DOM has not changed between renders, the element ids should be identical. And, if rendering has changed because of template updates, then nothing we do in Parsoid can fix this. So, It seems like the reliable solution is to rely on etags for finding the right version for fetching data-parsoid.

ssastry added a subscriber: Arlolra.EditedMar 30 2015, 4:20 PM

An alternative is to change the id-assignment mechanism in Parsoid ... we would have to base ids using some form of tree-hashing.

  • Assign ids based on a depth-first traversal
  • For leaves, assign ids based on a md5-hash of leaf.textContent (or something similar)
  • For a non-leaf node, assign ids based on a md5-hash of the attributes of the node and the previously assigned elt ids of child nodes

This ensures that if a DOM node gets the same id across renders, then the content hasn't changed. So, in the worst case, a re-render introduces normalization-based dirty-diffs because of missing data-parsoid entries.

Of course, changing the id assignment => all the existing RESTBase content is effectively invalidated.

@ssastry, T93715 is generally about making the HTML output as deterministic as possible. About ids are one special issue mentioned there, but so are Cite ids or element ids in the presence of template re-renders.

I was also wondering about a tree-based assignment, but more along the lines of T87556. Anchoring ids to sections would at least contain the effect of most non-determinism in a subtree, rather than the global change we currently see where all ids past a change point are re-numbered.

GWicke updated the task description. (Show Details)Mar 30 2015, 6:54 PM

@GWicke: All I am saying is that the issue here is not about non-determinism at all. Element ids are assigned deterministically (cite ids and about ids from T93715 are not relevant to this problem). The only time element id assignment changes is when templating change triggered rerendering leads to change in output, in which case, you do want element ids changed. Besides the alternative of etags, content-based id assignment eliminates mass changes and keeps the old ids usable.

@ssastry, we want to minimize the change in element ids after template expansion changes. An additional element produced by a template should not cause all following elements to receive new ids.

I do think that structural ids can help to minimize the amount of change. Content hash based ids don't seem to have that property, as even minor changes (ex: text of template changed) will trigger a reassignment all the way up the tree.

As discussed on IRC, without assigning element ids to all elements on a path, structural ids as proposed in T87556 don't prevent page corruption in all scenarios since any edit that simply moves around nodes between two paths can assign incorrect metadata to the edits. This is especially an issue for selser since DSR info is stored in data-parsoid (we saw an example of this with duplicate ids).

So, the guaranteed safe solution is to pick the right render on which an edit is based.

<gwicke> we can ask clients to send an if-match header with the original etag
<gwicke> but we don't want to necessarily keep every render around forever
<subbu> seems reasonable to me since it shouldn't be too difficult .. but yes, extra bookkeeping involved on the lcinet side.
<gwicke> could set a reasonable TTL though
<gwicke> like a day
<gwicke> alternatively, Parsoid could provide a uuid in a etag header, and also include that uuid in the html head
<gwicke> we could then pick that out of a client's returned html
<subbu> how would you pick the render based on it?
<subbu> add it as a key?
<gwicke> we'd use that etag to store the render
<gwicke> rather than creating one from scratch, as we do now
<gwicke> of course, we could also add the etag in the html on the way out
<gwicke> feels hacky though
<gwicke> for native Parsoid HTML for views that won't work too well either
<subbu> ya .. ok .. so, we could add a meta tag in the header.

So, let us pick an etag-based solution that can work for both RESTBase and clients. I agree that ideally we would have a solution that doesn't burden clients with additional bookkeeping. In that regard, meta-tag in the header seems reasonable.

It sounds like we could make some short-term progress by adding a <meta property="mw:TimeUuid" content="e81e03cc-cfc8-11e4-9234-ada4f57e984b"/> in the head, which would match the etag header. This should already work with VisualEditor as-is.

@ssastry, @Catrope: sounds like a reasonable plan?

It sounds like we could make some short-term progress by adding a <meta property="mw:TimeUuid" content="e81e03cc-cfc8-11e4-9234-ada4f57e984b"/> in the head, which would match the etag header. This should already work with VisualEditor as-is.
@ssastry, @Catrope: sounds like a reasonable plan?

Sounds good to me.

Works for me. No changes required in any other RESTBase clients either .. FYI: @cscott, @Kelson

GWicke added a comment.EditedMar 30 2015, 11:58 PM

Patch supporting both the meta tag and If-Match header in https://github.com/wikimedia/restbase/pull/222. It does inject the meta tag at save time if not set by Parsoid.

GWicke closed this task as Resolved.EditedApr 1 2015, 1:24 AM
GWicke claimed this task.

Resolving since the patch is merged.

We'll likely deploy tomorrow, after which content with the mw:TimeUuid meta tag will start to be stored for new renders. It won't be added to stored content, so the percentage of HTML that transparently round-trips will only grow gradually.

We prefer if clients send the If-Match header explicitly instead of relying on the meta tags in HTML content.