ParsoidHandler and HTMLTransformInput perform multiple parses and serializations of the input HTML (or PageBundle HTML) to get the DOM and to get back to HTML, respectively. We should really parse the HTML only once, cache the document, manipulate it as much as needed, and reuse it later if and when we need it.
The DOM is what we apply modifications to, so the DOM is what should be passed around, with data-parsoid / data-mw applied along with the correct content version, rather than re-parsing the HTML every time we need it.
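To illustrate the "parse once, cache the document" idea, here is a minimal sketch. It deliberately uses PHP's built-in DOMDocument rather than Parsoid's own DOMUtils / ContentUtils helpers, and the CachedDocument class, its methods, and its parseCount counter are hypothetical names invented for this example, not existing Parsoid code.

```php
<?php
// Hypothetical holder (NOT part of Parsoid): parses the HTML lazily and
// at most once, then hands out the same cached DOMDocument to every caller.
final class CachedDocument {
    private string $html;
    private ?DOMDocument $doc = null;
    // Illustration only: counts how often the HTML was actually parsed.
    public int $parseCount = 0;

    public function __construct( string $html ) {
        $this->html = $html;
    }

    public function getDocument(): DOMDocument {
        if ( $this->doc === null ) {
            $doc = new DOMDocument();
            // Silence libxml warnings about markup it does not recognize.
            $doc->loadHTML( $this->html, LIBXML_NOERROR | LIBXML_NOWARNING );
            $this->doc = $doc;
            $this->parseCount++;
        }
        return $this->doc;
    }

    public function getBody(): ?DOMElement {
        return $this->getDocument()->getElementsByTagName( 'body' )->item( 0 );
    }

    // Serialize only when HTML is really needed, from the cached DOM.
    public function toHtml(): string {
        return (string)$this->getDocument()->saveHTML();
    }
}

// Usage: several consumers touch the DOM, but the HTML is parsed once.
$input = new CachedDocument( '<p data-x="1">Hello</p>' );
$body = $input->getBody();
$p = $input->getDocument()->getElementsByTagName( 'p' )->item( 0 );
$p->setAttribute( 'data-x', '2' );   // modify the cached DOM in place
$html = $input->toHtml();            // single serialization at the end
echo $input->parseCount, "\n";       // prints 1
```

In the real code the cached object would presumably hold the Parsoid document together with its data-parsoid / data-mw and content version, but how that is wired into PageBundle and HTMLTransformInput is left open here.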
Investigation (multi-parse and serialize)
- Parsing and serialization happen when we request a downgrade of the original page bundle to a specified content version. See the relevant code below.
  In Parsoid.php, the line
      $pageBundle->html = $newPageBundle->toHtml();
  calls the following method in PageBundle.php:
      public function toHtml(): string {
          $doc = DOMUtils::parseHTML( $this->html );
          self::apply( $doc, $this );
          return ContentUtils::toXML( $doc );
      }
  which parses and serializes when applying the 999 -> 2 downgrade.
- Another parse happens in Parsoid.php, still in the downgrade() method:
      $doc = DOMUtils::parseHTML( $pageBundle->html );
  This parses the HTML in the page bundle, which is then serialized again in L467:
      $pageBundle->html = ContentUtils::toXML( $doc );
- Another parse happens when we get the originalBody in HTMLTransformInput/HTMLTransform:
      $this->originalBody = DOMCompat::getBody( $this->parseHTML( $pb->html ) );
  No serialization happens here, since we cache an input Element.
- Finally, the original caller, getOriginalBody(), parses once more:
      $doc = $this->parseHTML( $pb->html );
In total, calling getOriginalBody() parses the original HTML 4 times and serializes it 2 times. Is that really necessary?