
Create ContentHolder interface
Closed, Resolved · Public

Description

In order to pass HTML as a parsed DOM tree from Parsoid to core while allowing compatibility with legacy users in core which expect an HTML string, we should create a ContentHolder interface. A ContentHolder holds either a parsed DOM or HTML string and tries to avoid unnecessary serialization and reparsing when different clients of ContentHolder are chained together.

Further discussion in https://www.mediawiki.org/wiki/Parsoid/OutputTransform/ContentHolder
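
As a rough illustration of the idea in the description, here is a minimal sketch in Python rather than MediaWiki's PHP. The class name `ContentHolderSketch`, its methods, and the three client functions are hypothetical and are not the ContentHolder API in core. `xml.dom.minidom` stands in for a real HTML parser, so the input must be well-formed XML. The holder keeps only one authoritative form at a time and converts only when a client asks for the other form.

```python
from xml.dom import minidom


class ContentHolderSketch:
    """Hypothetical holder: stores content as a string or a DOM, never both."""

    def __init__(self, html):
        self._html = html  # string form, or None while the DOM is authoritative
        self._dom = None   # DOM form, or None while the string is authoritative
        self.parses = 0
        self.serializations = 0

    def get_dom(self):
        if self._dom is None:
            self._dom = minidom.parseString(self._html)
            self._html = None  # callers may mutate the DOM, so drop the string
            self.parses += 1
        return self._dom

    def get_html(self):
        if self._html is None:
            self._html = self._dom.documentElement.toxml()
            self._dom = None
            self.serializations += 1
        return self._html

    def set_html(self, html):
        self._html = html
        self._dom = None


def add_class(holder):
    # DOM client: mutates the tree in place.
    holder.get_dom().documentElement.setAttribute("class", "mw-parser-output")


def add_lang(holder):
    # Second DOM client: reuses the tree left by the previous client.
    holder.get_dom().documentElement.setAttribute("lang", "en")


def legacy_replace(holder):
    # Legacy client: only understands HTML strings.
    holder.set_html(holder.get_html().replace("Hello", "Hi"))


holder = ContentHolderSketch("<div><p>Hello</p></div>")
add_class(holder)
add_lang(holder)
legacy_replace(holder)
print(holder.parses, holder.serializations)
```

In this sketch, chaining the two DOM clients costs a single parse and no serialization; the string is only regenerated when the legacy client asks for it.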

Related Objects

Status    Subtype           Assigned  Task
Open                        None
Open                        ihurbain
Open                        ihurbain
Resolved                    ihurbain
Open                        None
Open                        None
Resolved                    ihurbain
Open                        cscott
Resolved                    ihurbain
Resolved                    ihurbain
Resolved                    cscott
Resolved                    ihurbain
Resolved                    cscott
Open                        cscott
Open      BUG REPORT        None
Open                        ihurbain
Resolved                    ihurbain
Resolved                    ihurbain
Open                        ihurbain
Resolved                    cscott
Open                        ihurbain
Open                        None
Resolved  PRODUCTION ERROR  cscott
Open                        ihurbain

Event Timeline

Restricted Application added a subscriber: Aklapper.

Adding T346829 as a subtask, since the HtmlHolder interface needs to be able to serialize/deserialize itself in order to be stored in ParserCache. Technically we could probably get away without this by using explicit serialization/deserialization code in ParserOutput, but (as described in the HtmlHolder proposal) ideally we would like to be able to customize the on-disk representation for fast access and use, independent of the details of the HTML string/DOM model formats defined by the HtmlHolder abstraction.
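
To make the self-serialization point concrete, here is a minimal sketch in Python rather than PHP. The class `CachedHtmlSketch`, its `to_cache_blob()`/`from_cache_blob()` methods, and the versioned JSON layout are all assumptions for illustration; they are not the mechanism from T346829 or the ParserCache format. The point is only that the holder owns its stored representation, so the cache layer never needs to know whether the content lives as a string or a DOM in memory.

```python
import json


class CachedHtmlSketch:
    """Hypothetical holder that controls its own on-disk representation."""

    FORMAT_VERSION = 1  # assumed version tag, lets the stored format evolve

    def __init__(self, html):
        self.html = html

    def to_cache_blob(self):
        # The holder, not the cache, decides what gets written.
        return json.dumps({"v": self.FORMAT_VERSION, "html": self.html})

    @classmethod
    def from_cache_blob(cls, blob):
        data = json.loads(blob)
        if data.get("v") != cls.FORMAT_VERSION:
            raise ValueError("unsupported cache format version")
        return cls(data["html"])


blob = CachedHtmlSketch("<p>Hello</p>").to_cache_blob()
restored = CachedHtmlSketch.from_cache_blob(blob)
print(restored.html)
```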

MSantos triaged this task as High priority. Oct 5 2023, 2:27 PM

This task has been partially subsumed by T374616: A flexible fragment type for transclusions, which includes separate HtmlPFragment and DomPFragment types and methods to convert between them.

Remaining work:

  • The HtmlHolder interface was envisioned to hold a full Document; the fragment types from T374616 hold DocumentFragments only, which are disconnected from the Document.
  • The HtmlHolder interface was designed to cache the "last requested form" of the document, so that repeated string-to-string transforms (or repeated DOM-to-DOM transforms) did not require serialization/parse in between. The fragment types mostly provide this, since methods are provided to cast to a specific PFragment type, which is a no-op if the fragment is already of the requested type.
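
The "cast is a no-op if already the requested type" behaviour from the second item can be sketched as follows, in Python rather than PHP. `HtmlFrag` and `DomFrag` are hypothetical stand-ins, not the HtmlPFragment/DomPFragment classes from T374616, and `xml.dom.minidom` is used in place of a real HTML parser, so the input must be well-formed XML.

```python
from xml.dom import minidom


class HtmlFrag:
    """Hypothetical string-backed fragment."""

    def __init__(self, html):
        self.html = html

    def as_html(self):
        return self  # already the requested type: no work done

    def as_dom(self):
        # Wrap in a dummy root so several sibling nodes parse as one document.
        doc = minidom.parseString("<body>" + self.html + "</body>")
        return DomFrag(list(doc.documentElement.childNodes))


class DomFrag:
    """Hypothetical DOM-backed fragment (a list of sibling nodes)."""

    def __init__(self, nodes):
        self.nodes = nodes

    def as_dom(self):
        return self  # already the requested type: no work done

    def as_html(self):
        return HtmlFrag("".join(node.toxml() for node in self.nodes))


frag = HtmlFrag("<p>a</p><p>b</p>")
dom_frag = frag.as_dom()
print(dom_frag.as_dom() is dom_frag)
```

A chain of DOM-to-DOM stages that each call `as_dom()` on their input therefore pays for at most one parse in this sketch, which is the property the "last requested form" cache was meant to give.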

The output transform pipeline is a little bit unusual in that ParserOutput::getRawText() contains a complete Document for Parsoid, but that Document is almost immediately stripped to just the body contents by the first parsoid-only stage in the pipeline. For legacy content, ::getRawText() always contains only the body content.

It is likely that the solution here is to replumb OutputTransform to accept and return a PFragment at each stage, but there will still be some impedance mismatch at the entry to the pipeline where the Parsoid Document needs to be converted to a DocumentFragment representing only its body children.
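
The entry-point conversion mentioned above (full Document in, fragment of body children out) could look roughly like this, sketched in Python with `xml.dom.minidom` rather than the DOM implementation used by Parsoid. The function name `body_fragment` is hypothetical, and the input is assumed to be well-formed XML with exactly one `body` element.

```python
from xml.dom import minidom


def body_fragment(doc):
    """Move the children of <body> into a new DocumentFragment."""
    body = doc.getElementsByTagName("body")[0]
    frag = doc.createDocumentFragment()
    while body.firstChild is not None:
        # appendChild detaches the node from <body> before attaching it.
        frag.appendChild(body.firstChild)
    return frag


doc = minidom.parseString(
    "<html><head><title>t</title></head>"
    "<body><p>a</p><p>b</p></body></html>"
)
frag = body_fragment(doc)
print("".join(node.toxml() for node in frag.childNodes))
```

Note that this sketch moves the nodes rather than cloning them, so the original Document is left with an empty body; whether the real pipeline should move or clone is a design decision this example does not settle.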

Change #1148388 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/core@master] Introduce ContentHolder

https://gerrit.wikimedia.org/r/1148388

Change #1148388 abandoned by Isabelle Hurbain-Palatin:

[mediawiki/core@master] Introduce ContentHolder

Reason:

in favor of Ia2638e3e692ce5ec3b805384c388d1a7786b8a31

https://gerrit.wikimedia.org/r/1148388

ihurbain renamed this task from Create HtmlHolder interface to Create HtmlHolder / ContentHolder interface. Jul 1 2025, 2:51 PM
ihurbain claimed this task.

Assigning to Scott during my sabbatical to finish handling the ParserOutput/ContentHolder integration.

cscott renamed this task from Create HtmlHolder / ContentHolder interface to Create ContentHolder interface. Sep 12 2025, 4:29 PM
cscott updated the task description.

Change #1163428 had a related patch set uploaded (by C. Scott Ananian; author: Isabelle Hurbain-Palatin):

[mediawiki/core@master] Back ParserOutput with a ContentHolder rather than rawText+extensiondata

https://gerrit.wikimedia.org/r/1163428

Change #1187007 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] ContentHolder: normalize argument names, add default argument values

https://gerrit.wikimedia.org/r/1187007

Change #1187008 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] Use ContentHolder to implement/simplify ContentDOMTransformStage

https://gerrit.wikimedia.org/r/1187008

Change #1163428 merged by jenkins-bot:

[mediawiki/core@master] Back ParserOutput with a ContentHolder rather than rawText+extensiondata

https://gerrit.wikimedia.org/r/1163428

Change #1187007 merged by jenkins-bot:

[mediawiki/core@master] ContentHolder: normalize argument names, add default argument values

https://gerrit.wikimedia.org/r/1187007

Change #1187008 merged by jenkins-bot:

[mediawiki/core@master] Use ContentHolder to implement/simplify ContentDOMTransformStage

https://gerrit.wikimedia.org/r/1187008