
Parsoid.php entry points should accept PageBundles for (html/dom)2wikitext
Open, Medium, Public

Description

Parsoid's REST API endpoints currently perform some unnecessary HTML string -> DOM object -> HTML string conversions to satisfy Parsoid.php's interface, which is a porting carryover from Parsoid/JS, where we had to serialize the DOM to a string before handing it off to a worker process. In the Parsoid/PHP world, the resulting DOM -> HTML -> DOM conversion is just an inefficiency, which we can eliminate by having the Parsoid.php entry points accept DOM as well.

Note that there are a number of slightly different representations we use for both HTML strings (inline data-mw attributes vs data-mw in JSON blob in <head> vs separate JSON blob) and DOM (parsed versions of the different HTML variants, plus the 'internal' version where data-mw is stored in a separate Bag hanging off the DOM). We need to be careful to distinguish which of these is our I/O type.

Event Timeline

ssastry triaged this task as Medium priority. Oct 6 2020, 6:22 PM
ssastry moved this task from Needs Triage to Tech Debt / Big changes on the Parsoid board.
ssastry moved this task from Tech Debt / Big changes to Performance on the Parsoid board.

I think our entry points should actually take PageBundles, and we should have factory methods to create PageBundles efficiently from DOM without serializing to string.

The difference is that we have a number of different ways to represent embedded data-parsoid/data-mw, and the PageBundle factories should abstract over those details. You call the factory method that matches how data-parsoid/data-mw is represented in your input, and we eventually normalize the PageBundle into the DOM-with-Bag internal representation or into one of the various external representations.

More discussion: https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/605956/4#message-d8c218222404b3e90f6fe0d154ca283d4c2db8f4
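
A minimal sketch of the factory idea from the comment above, written in Python purely for illustration. Everything here is an assumption: the `PageBundle` class, its method names, and the id-keyed side table are hypothetical stand-ins and are not Parsoid's actual PHP API. The point is only that each input representation gets its own factory, and that a string is produced only at the output boundary.

```python
import json
from dataclasses import dataclass, field
from xml.dom import minidom


def _elements(node):
    """Yield node (if it is an element) and every descendant element."""
    if node.nodeType == node.ELEMENT_NODE:
        yield node
    for child in node.childNodes:
        yield from _elements(child)


@dataclass
class PageBundle:
    """Hypothetical stand-in: a DOM plus data-mw held in a side table keyed by element id."""
    dom: minidom.Document
    data_mw: dict = field(default_factory=dict)

    @classmethod
    def from_inline_attributes(cls, dom):
        """Factory for a DOM whose elements carry data-mw as inline JSON attributes.

        Assumes every annotated element has an id. The attributes are moved
        into the side table; the DOM is never serialized to a string.
        """
        bag = {}
        for el in _elements(dom):
            if el.hasAttribute("data-mw"):
                bag[el.getAttribute("id")] = json.loads(el.getAttribute("data-mw"))
                el.removeAttribute("data-mw")
        return cls(dom, bag)

    @classmethod
    def from_separate_blob(cls, dom, blob):
        """Factory for a DOM that arrives with an already separate id -> data-mw mapping."""
        return cls(dom, dict(blob))

    def to_inline_html(self):
        """Serialize to the inline-attribute external form, leaving self.dom untouched."""
        root = self.dom.documentElement.cloneNode(True)
        for el in _elements(root):
            key = el.getAttribute("id")
            if key in self.data_mw:
                el.setAttribute("data-mw", json.dumps(self.data_mw[key]))
        return root.toxml()


# Usage: two different input representations normalize to the same bundle.
inline_dom = minidom.parseString(
    '<body><p id="mwAQ" data-mw=\'{"parts": ["x"]}\'>hi</p></body>'
)
bundle_a = PageBundle.from_inline_attributes(inline_dom)

plain_dom = minidom.parseString('<body><p id="mwAQ">hi</p></body>')
bundle_b = PageBundle.from_separate_blob(plain_dom, {"mwAQ": {"parts": ["x"]}})

print(bundle_a.data_mw == bundle_b.data_mw)
```

Callers pick the factory that matches their input, and code downstream of the factories only ever sees the normalized form.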

cscott renamed this task from Parsoid.php entry points should accept DOM as well as HTML to Parsoid.php entry points should accept DOM objects as well as HTML strings. Oct 6 2020, 7:42 PM
cscott updated the task description.

> I think our entry points should actually take PageBundles, and we should have factory methods to create PageBundles efficiently from DOM without serializing to string.

Possibly. That would also address Arlo's dismay with my "as well as" phrasing in the title and description.

Change 632576 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] [WIP] Pass dom to entrypoint instead of string

https://gerrit.wikimedia.org/r/632576

Change 632576 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Pass dom to entrypoint instead of string

https://gerrit.wikimedia.org/r/632576

We can repurpose this task for switching the entry points to accept PageBundles.

Arlolra renamed this task from Parsoid.php entry points should accept DOM objects as well as HTML strings to Parsoid.php entry points should accept PageBundles for (html/dom)2wikitext. Oct 20 2020, 10:23 PM
Arlolra removed Arlolra as the assignee of this task.
Arlolra subscribed.

Change 638161 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a14

https://gerrit.wikimedia.org/r/638161

Change 638161 merged by jenkins-bot:
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a14

https://gerrit.wikimedia.org/r/638161