Page MenuHomePhabricator

(Optimization) Investigate and avoid multiple parsing of input HTML
Open, LowPublic

Description

ParsoidHandler and HTMLTransformInput does multiple parses and serialization of the input HTML or PageBundle HTML to get the DOM and back to HTML (respectively) when we should really only parse the HTML once and cache the document, manipulate it as much as needed and reuse later if/when we need it.

The DOM is what we apply modifications to so that should be passed around and data-parsoid / data-mw applied along with the correct content version rather than re-parsing the HTML every time we need it.

Investigation (multi-parse and serialize)

  • Parsing and serialization happens when we request a downgrade of the original page bundle to a specified content version, see relevant code below

$pageBundle->html = $newPageBundle->toHtml(); (in Parsoid.php) calls

public function toHtml(): string {
	$doc = DOMUtils::parseHTML( $this->html );
	self::apply( $doc, $this );
	return ContentUtils::toXML( $doc );
}

(in PageBundle.php) which parses and serializes when applying 999 -> 2 downgrade.

  • Another parsing happens in Parsoid.php still in the downgrade() method:

$doc = DOMUtils::parseHTML( $pageBundle->html ); parses the HTML in the page bundle and then serializes it in L467

$pageBundle->html = ContentUtils::toXML( $doc );
  • Another parsing happens when we get the originalBody on $this->originalBody = DOMCompat::getBody( $this->parseHTML( $pb->html ) ); in HTMLTransformInput/HTMLTransform but no serialization happens here since we cache an input Element.
  • Finally, the original caller getOriginalBody() does another parsing $doc = $this->parseHTML( $pb->html ); .

That calling getOriginalBody() parses the original HTML 4 times and serializes it 2 times. Is that really necessary?

Event Timeline

Brain dump:

@ssastry, @MSantos, can you confirm that the investigation above is indeed the case? Passing down a document object could have been fairly easier since at the end of the day, we need a document instead of a page bundle (when we call getOriginalBody() or at least make it in a way that we will parse once and serialize once instead of multiple times.

Changing the interfaces of downgrade() and downgrade999to2(), etc to start accepting a document object seems like a not so straight forward thing to do now?

I also see that PageBundle is the DTO that is used internally in the context of client fetch (from server), receive and send (back to server) cycle and if we change the interfaces mentioned above to accept a document, the cycle (fetch, receive and send) may have to change a bit or maybe this is really needed only for the downgrade logic on server side?


Okay, thinking about this a little bit more, we will need both the page bundle and the corresponding document for this to work. Downgrade really happens on the document object then wrapped in the page bundle as the DTO (from what I understand).

So I'm thinking that in HTMLTransformInput, if there is no original body object available and there is original HTML supplied (L#267), we can get the original page bundle, parse it to get the corresponding document and pass both into the downgrade logic with the corresponding target content version.

But the problem is that we will have to make Parsoid::downgrade() to support both page bundle and document as input which will intend affect how the actual downgrade999to2 works since it will have to support both document and page bundle as well.

Change 821251 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[mediawiki/services/parsoid@master] Make Parsoid::downgrade() to support Document & PageBundle

https://gerrit.wikimedia.org/r/821251

Change 821260 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[mediawiki/core@master] Avoid multiple parses of PageBundle HTML in downgrade path

https://gerrit.wikimedia.org/r/821260

Ya, there are definitely multiple html parses and serializations on the downgrade path. We have never really exercised the downgrade functionality in Parsoid yet (except for tests), so we never really got around to looking at performance and optimizing it. But, yes, the interfaces can definitely be improved to reduce extra parse and serialization steps.

Change 822062 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[mediawiki/services/parsoid@master] EXP: Enable PageBundle to support a DOMDocument

https://gerrit.wikimedia.org/r/822062

daniel triaged this task as Medium priority.Sep 5 2022, 9:22 AM

I'd love to have an HtmlContainer class. Something like this:

class HtmlContainer {
  function getHtml();
  function getDocument();
  function getElement();

  function setHtml(); // maybe?!
  function domUpdated(); // After the DOM was modified
}

The object would hold an HTML representation and a DOM structure. It could be initialized based on either one, by a factory. The missing representation would be constructed lazily, when needed.

Methods that currently take an HTML string or a Document could in the future take an instance of HtmlContainer OR whatever they previously accepted, and construct the HtmlContainer as needed.

daniel lowered the priority of this task from Medium to Low.Jan 9 2023, 9:51 AM

Change 821251 abandoned by D3r1ck01:

[mediawiki/services/parsoid@master] [WIP] Make Parsoid::downgrade() to support Document & PageBundle

Reason:

https://gerrit.wikimedia.org/r/821251

Change 822062 abandoned by D3r1ck01:

[mediawiki/services/parsoid@master] [WIP] EXP: Enable PageBundle to support a DOMDocument

Reason:

https://gerrit.wikimedia.org/r/822062

Change 821260 abandoned by D3r1ck01:

[mediawiki/core@master] [WIP] Avoid multiple parses of PageBundle HTML in downgrade path

Reason:

https://gerrit.wikimedia.org/r/821260