Page MenuHomePhabricator

VisualEditor should request Parsoid HTML with ?stash=true query parameter
Closed, ResolvedPublic

Description

After simplifying storage for Parsoid HTML, we will only guarantee that a matching data-parsoid is stored if a stash=true query parameter is provided by the visual editor. In case it's not, the content is returned but a corresponding data-parsoid might be missing once the modified revision is submitted for a transform.

Event Timeline

Pchelolo renamed this task from VisualEditor should request Parsed HTML with ?stash=true query parameter to VisualEditor should request Parsoid HTML with ?stash=true query parameter.May 6 2019, 4:34 PM

@Pchelolo: this doesn't just affect VE but all Parsoid clients that might fetch html for editing and might then post edited html for save. So, couple of comments (a) this should actually be documented in the API docs (swagger?) alongwith the implications of what happens if a client doesn't comply ... which brings me to the next point (b) if data-parsoid is missing, we should make sure RESTBase sends data that Parsoid serializes correctly ... the output should still be correct but it may be normalized (and hence cause a dirty diff).

(a) this should actually be documented in the API docs (swagger?) alongwith the implications of what happens if a client doesn't comply

That will be done in a PR that I'm working on.

(b) so, should we send no data-parsoid entirely, or send 'whatever we have even if not matching'?

(b) so, should we send no data-parsoid entirely, or send 'whatever we have even if not matching'?

Please do not send "whatever you have for that revid" because parses and id assignment isn't stable across runs on the same wikitext (because of wikitext semantics as we've discussed in the past).

@Arlolra can confirm but if you skip data-parsoid, Parsoid does regular normalizing serialization.

@ssastry @Arlolra so the TL;DR is: if we have the exact render data-parsoid, send it, otherwise just send the modified HTML. Is that assessment correct? Note that if we don't have the exact render data-parsoid, we will not have the exact HTML render either to send.

Change 508518 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/extensions/VisualEditor@master] ArticleTargetLoader: Add stash=true query param to RB HTML fetching

https://gerrit.wikimedia.org/r/508518

@ssastry @Arlolra so the TL;DR is: if we have the exact render data-parsoid, send it, otherwise just send the modified HTML. Is that assessment correct?

Yes, if you have exact render data-parsoid send (a) modified html (b) matching old html + data-parsoid ... this is what you do now.
If you don't have exact render data-parsoid, only send modified HTML and nothing else

Note that if we don't have the exact render data-parsoid, we will not have the exact HTML render either to send.

If you don't have data-parsoid, then old HTML is useless since old html + matching data-parsoid is used for selective serialization. If not, just modified HTML is used for regular normalizing serialization.

So, I now have an idea for id assignment that can, in the limited case of VE (and other editing client) edits, enable us to reuse data-parsoid from a non-matching render for the same revision. Id assignment across the document itself won't be stable and that cannot be guaranteed (because wikitext). But, since html2wt only cares about the top-level page, and since top-level page wikitext itself doesn't change for a revision, all we need to do is ensure that id assignment for HTML nodes coming from the top-level page is identical across parses. And, one such mechanism is to use wikitext-offset as the id which we already compute and is present in data-parsoid.dsr. So, instead of using a single counter for the entire document, we use multiple id assignment schemes, (a) one for top-level page content that is based on wikitext offsets, and (b) another counter with a different prefix for all other encapsulated content. Of course, if we use wikitext offsets, the ids can become large (imagine an id like X_54375), so we still need to think of some compression scheme .. for example by mapping the offsets to a different counter, [ 0 => 1, 3 => 2, 20 => 3, ...., 54375 => 38, .. ] and so on. Details need to be worked out, but if we can do this, then we can make selser work even when the exact render of the revision isn't available. This is not something we will work on now, but, something we could implement. This will also need a HTML version bump when we implement this.

Yes, if you have exact render data-parsoid send (a) modified html (b) matching old html + data-parsoid ... this is what you do now.
If you don't have exact render data-parsoid, only send modified HTML and nothing else

If you don't have data-parsoid, then old HTML is useless since old html + matching data-parsoid is used for selective serialization. If not, just modified HTML is used for regular normalizing serialization.

ack

So, I now have an idea for id assignment that can, in the limited case of VE (and other editing client) edits, enable us to reuse data-parsoid from a non-matching render for the same revision. Id assignment across the document itself won't be stable and that cannot be guaranteed (because wikitext). But, since html2wt only cares about the top-level page, and since top-level page wikitext itself doesn't change for a revision, all we need to do is ensure that id assignment for HTML nodes coming from the top-level page is identical across parses. And, one such mechanism is to use wikitext-offset as the id which we already compute and is present in data-parsoid.dsr. So, instead of using a single counter for the entire document, we use multiple id assignment schemes, (a) one for top-level page content that is based on wikitext offsets, and (b) another counter with a different prefix for all other encapsulated content. Of course, if we use wikitext offsets, the ids can become large (imagine an id like X_54375), so we still need to think of some compression scheme .. for example by mapping the offsets to a different counter, [ 0 => 1, 3 => 2, 20 => 3, ...., 54375 => 38, .. ] and so on. Details need to be worked out, but if we can do this, then we can make selser work even when the exact render of the revision isn't available.

Correct. I was thinking something in the same vein - assign top-page IDs before expanding templates and go for a second round afterwards. I'm not sure I understand the reasoning behind the mapping, but one simple trick might be to use something like a 128-base system, which can then map to printable ASCII chars. But, that's probably off-topic for the task at hand.

This is not something we will work on now, but, something we could implement. This will also need a HTML version bump when we implement this.

It would be useful to have it now for the transition period during our storage simplification project (for on-going edits), but this might come useful later during the Parsoid switch from JS to PHP too.

mobrovac triaged this task as High priority.

@Pchelolo @Esanders Gerrit 508518 is ready to go now. Given that the RESTBase side for supporting this is already in place in production, could we SWAT this on Monday? Doing so would allow us to continue with deploying the next step for RESTBase.

Change 508518 merged by jenkins-bot:
[mediawiki/extensions/VisualEditor@master] ArticleTargetLoader: Add stash=true query param to RB HTML fetching

https://gerrit.wikimedia.org/r/508518

Change 508867 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/extensions/VisualEditor@wmf/1.34.0-wmf.4] ArticleTargetLoader: Add stash=true query param to RB HTML fetching

https://gerrit.wikimedia.org/r/508867

Change 508867 merged by jenkins-bot:
[mediawiki/extensions/VisualEditor@wmf/1.34.0-wmf.4] ArticleTargetLoader: Add stash=true query param to RB HTML fetching

https://gerrit.wikimedia.org/r/508867

Mentioned in SAL (#wikimedia-operations) [2019-05-13T11:43:54Z] <reedy@deploy1001> Synchronized php-1.34.0-wmf.4/extensions/VisualEditor/: T222639 (duration: 00m 52s)

The patch has been SWATted, so now VE provides an appropriate query parameter and the responses are not cached. This particular one is done. Resolving.