Page MenuHomePhabricator

Generate ids based on dsr info
Open, NormalPublic

Description

Quoting @ssastry in T222639#5164284,

So, I now have an idea for id assignment that can, in the limited case of VE (and other editing client) edits, enable us to reuse data-parsoid from a non-matching render for the same revision. Id assignment across the document itself won't be stable and that cannot be guaranteed (because wikitext). But, since html2wt only cares about the top-level page, and since top-level page wikitext itself doesn't change for a revision, all we need to do is ensure that id assignment for HTML nodes coming from the top-level page is identical across parses. And, one such mechanism is to use wikitext-offset as the id which we already compute and is present in data-parsoid.dsr. So, instead of using a single counter for the entire document, we use multiple id assignment schemes, (a) one for top-level page content that is based on wikitext offsets, and (b) another counter with a different prefix for all other encapsulated content. Of course, if we use wikitext offsets, the ids can become large (imagine an id like X_54375), so we still need to think of some compression scheme .. for example by mapping the offsets to a different counter, [ 0 => 1, 3 => 2, 20 => 3, ...., 54375 => 38, .. ] and so on. Details need to be worked out, but if we can do this, then we can make selser work even when the exact render of the revision isn't available. This is not something we will work on now, but, something we could implement. This will also need a HTML version bump when we implement this.

Event Timeline

Arlolra created this task.May 23 2019, 6:23 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 23 2019, 6:23 PM
Arlolra triaged this task as Normal priority.May 23 2019, 6:23 PM

For the most part, this corresponds with the code here,
https://github.com/wikimedia/parsoid/blob/master/lib/wt2html/pp/handlers/CleanUp.js#L194-L214

The checks there can be replaced with Util.isValidDSR(dp.dsr). Multiple id assignment schemes don't seem necessary.

Basically, this amounts to, if we have valid dsr info, use that for generating an id. If not, discard data-parsoid for that node.

You may have DSR offsets for template source which may match DSR offsets for top-level content and can lead to duplicate ids. So you have to account for that .. hence the multiple id schemes thought.

dsr isn't applied in templated regions (all data-parsoid is dropped in there as is), it only corresponds to the top-level page.

So, that is good reg template content & dsr. But, there is still one other area that might necessitate the need for different scheme for id assignment for non top-level content. If we ever assign data-mw to content in templates that doesn't come from the top-level page (ex: media), and since we'll ge be removing inline data-mw, they will need id assigned to them and those will need a different scheme than top-level content.

If there's a solution to assign stable ids to the nested content, why not just use that for the entire page?

In other words, if we have to account for external data-mw, this doesn't work.

ssastry added a comment.EditedMay 23 2019, 7:47 PM

We don't have a stable id solution for the entire page (because templates can change arbitrarily across parses of the same top level revision). This is a solution for us to be able to use selser even when we don't have a stable id assignment solution.

We don't have a stable id solution for the entire page (because templates can change arbitrarily across parses of the same top level page). This is a solution for us to be able to use selser even when we don't have a stable id assignment solution.

In other words, for us to be able to use selser reliably, we only need to guarantee stable id assignment of content nodes that are present on the top-level page. All other nodes can use a different id assignment scheme.

Right, I think I've exhausted my usefulness here, thanks.

I guess the next steps are,

  • Figure out the encoding scheme for dsr info
  • Look into edge cases around where dsr info might be missing and the implications thereof

Since templated regions might expand or contract between parses, the encoding should probably include both dsr[0] and dsr[1]. Contractions that expose nodes could result in additional dirty diff'ing, but shouldn't produce corruption.

Change 512228 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] [WIP] Remove cite-specific leak from cleanup

https://gerrit.wikimedia.org/r/512228

Change 512228 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Remove cite-specific leak from cleanup

https://gerrit.wikimedia.org/r/512228