So, I now have an idea for id assignment that can, in the limited case of VE (and other editing client) edits, enable us to reuse data-parsoid from a non-matching render for the same revision. Id assignment across the document itself won't be stable and that cannot be guaranteed (because wikitext). But, since html2wt only cares about the top-level page, and since top-level page wikitext itself doesn't change for a revision, all we need to do is ensure that id assignment for HTML nodes coming from the top-level page is identical across parses. And, one such mechanism is to use wikitext-offset as the id which we already compute and is present in data-parsoid.dsr. So, instead of using a single counter for the entire document, we use multiple id assignment schemes, (a) one for top-level page content that is based on wikitext offsets, and (b) another counter with a different prefix for all other encapsulated content. Of course, if we use wikitext offsets, the ids can become large (imagine an id like X_54375), so we still need to think of some compression scheme .. for example by mapping the offsets to a different counter, [ 0 => 1, 3 => 2, 20 => 3, ...., 54375 => 38, .. ] and so on. Details need to be worked out, but if we can do this, then we can make selser work even when the exact render of the revision isn't available. This is not something we will work on now, but, something we could implement. This will also need a HTML version bump when we implement this.
For the most part, this corresponds with the code here,
The checks there can be replaced with Util.isValidDSR(dp.dsr). Multiple id assignment schemes don't seem necessary.
You may have DSR offsets for template source which may match DSR offsets for top-level content and can lead to duplicate ids. So you have to account for that .. hence the multiple id schemes thought.
So, that is good reg template content & dsr. But, there is still one other area that might necessitate the need for different scheme for id assignment for non top-level content. If we ever assign data-mw to content in templates that doesn't come from the top-level page (ex: media), and since we'll ge be removing inline data-mw, they will need id assigned to them and those will need a different scheme than top-level content.
If there's a solution to assign stable ids to the nested content, why not just use that for the entire page?
In other words, if we have to account for external data-mw, this doesn't work.
We don't have a stable id solution for the entire page (because templates can change arbitrarily across parses of the same top level revision). This is a solution for us to be able to use selser even when we don't have a stable id assignment solution.
In other words, for us to be able to use selser reliably, we only need to guarantee stable id assignment of content nodes that are present on the top-level page. All other nodes can use a different id assignment scheme.
Since templated regions might expand or contract between parses, the encoding should probably include both dsr and dsr. Contractions that expose nodes could result in additional dirty diff'ing, but shouldn't produce corruption.