
Parsoid should generate the <head> on the core side, from the ParserOutput metadata
Open, Needs Triage, Public

Description

The <head> section of the document produced by Parsoid is only used by the REST APIs, and ought to live in core as part of the REST API implementation rather than Parsoid.

Rendering the <head> based on the metadata in the ParserOutput would allow us to:

  • Remove the need to store different ParserOutput contents for legacy (which is just <body> innerHTML) and Parsoid (which is the entire <html> document), reducing maintainer confusion and eliminating unnecessary ParserCache storage for the common case (read views).
  • Eventually allow us to implement a more-complete mapping of ParserOutput information into <head> metadata, for those who need that -- for example, REST API users who do want an all-in-one document and who want access to category and other information which is stored in the ParserOutput but only imperfectly reflected in the output HTML. (For example, T418792: Expose MediaWiki Parser render_id as a response header in relevant MW REST API endpoints.)
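
As a rough illustration of rendering the <head> from metadata rather than storing it, here is a minimal Python sketch. It is not MediaWiki code: the metadata dict, its keys (title, categories), and the "example:category" property name are all hypothetical stand-ins for whatever the ParserOutput actually carries.

```python
from html import escape


def render_head(metadata: dict) -> str:
    """Build a <head> string from a dict standing in for ParserOutput metadata.

    The keys used here ("title", "categories") are assumptions for
    illustration only; they are not the real ParserOutput accessors.
    """
    parts = ['<meta charset="utf-8"/>']
    title = metadata.get("title")
    if title is not None:
        parts.append(f"<title>{escape(title)}</title>")
    # Hypothetical mapping: one <meta> element per category.
    for category in metadata.get("categories", []):
        parts.append(
            f'<meta property="example:category" content="{escape(category)}"/>'
        )
    return "<head>" + "".join(parts) + "</head>"


def wrap_full_document(body_html: str, metadata: dict) -> str:
    """Wrap stored body-only HTML into a full document on demand."""
    return (
        "<!DOCTYPE html>\n"
        f"<html>{render_head(metadata)}<body>{body_html}</body></html>"
    )
```

With this shape, only the body-only HTML needs to be cached; the full document is assembled at response time for the callers that ask for it.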

To make this change without breaking existing endpoints, we're going to start by having ContentHolder track whether it is currently holding full-document content or not, and add a new ContentHolder method, getRawContentHolderText(), which will be used by any endpoint which *wants a full document result*. We'll start with an audit of all codepaths which use ParserOutput, and shift each one which actually wants a full document to getRawContentHolderText() to document and distinguish this case. Our initial audit of these users:

Sites which should use ::getRawContentHolderText() as they expect full-document output

| Site | Notes |
| --- | --- |
| Rest/Handler/PageHTMLHandler.php:113 | html + with_html; shared by renderer and shadow helper paths |
| Rest/Handler/RevisionHTMLHandler.php:85 | html output |
| Rest/Handler/RevisionHTMLHandler.php:90 | with_html output |
| Rest/Handler/ParsoidHandler.php:781 | wt2html FORMAT_HTML; trickiest: edit/fragment flavor logic |

Sites which want body-only content; they will continue to use ::getContentHolderText()

| Site | Why body-only is correct |
| --- | --- |
| Status/StatusFormatter.php:347 | message HTML → text |
| Language/Message/Message.php:1075,1078 | message HTML, stripOuterParagraph |
| Page/ParserOutputAccess.php:908 | post-pipeline; appends <!--debug--> comment to body-only text (also a write site) |
| Content/WikiTextStructure.php:158 | search indexing; parses body to DOM, extracts text |
| Revision/RevisionRenderer.php:278 | concatenates per-slot body HTML |
| Api/ApiParse.php:502,513 | action API returns body HTML |
| Output/OutputPage.php:2573,2590,2676,2708,2732 | page view; skin wraps body content |
| Api/ApiQueryRevisionsBase.php:688 | runOutputPipeline(...)->… (pipeline ran) |
| EditPage/EditPage.php:3458,3464 | diff of body content (3464 runs pipeline) |
| Specials/SpecialExpandTemplates.php:113 | ->run(...)->… (pipeline ran) |
| Specials/SpecialRecentChanges.php:473 | RC body content |
| JobQueue/Jobs/RefreshLinksJob.php:459 | equality compare of cached vs fresh output; both normalize identically |
| Installer/Installer.php:766 | ->getContentHolderText() after pipeline |
| Actions/McrUndoAction.php:343 | pipeline output |
| Actions/InfoAction.php:197 | info page body |
| FileRepo/File/LocalFile.php:2622 | runOutputPipeline(...)->… |
| Parser/ParserOutput.php:1111 | setText() deprecated back-compat (returns previous value) |
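
The split between the two accessors can be modelled roughly as follows. This is a Python sketch of the idea only, not the PHP ContentHolder: the class name, the get_raw_text()/get_body_text() methods (standing in for ::getRawContentHolderText() and ::getContentHolderText()), and the mismatches list are assumptions made for illustration.

```python
class ContentHolderModel:
    """Toy model of a holder that tracks whether it stores a full document."""

    def __init__(self, html: str, is_full_document: bool) -> None:
        self.html = html
        self.is_full_document = is_full_document
        # Call sites whose classification disagreed with the stored content.
        self.mismatches = []

    def get_raw_text(self, caller: str = "unknown") -> str:
        """Accessor for call sites that expect a full document.

        Nothing is transformed at this stage; a disagreement is only
        recorded, so the audit can be validated without changing behavior.
        """
        if not self.is_full_document:
            self.mismatches.append(caller)
        return self.html

    def get_body_text(self) -> str:
        """Accessor for call sites that expect body-only content.

        Returns the stored text unchanged in this first stage.
        """
        return self.html
```

Recording mismatches instead of raising keeps the first stage free of behavior changes while still showing which audited sites were classified incorrectly.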

We also want to maintain the output of these external API endpoints:

| Endpoint | Expected shape |
| --- | --- |
| GET /v1/page/Main_Page/html | full document |
| GET /v1/page/Main_Page/with_html | JSON, html = full document |
| GET /v1/revision/{id}/html | full document |
| GET /v1/revision/{id}/with_html | JSON, html = full document |
| POST /v1/transform/wikitext/to/html/Main_Page | full document (edit flavor) |
| POST …/to/html/Main_Page?body_only=true | body-only fragment |
| GET /api.php?action=parse&page=Main_Page&parser=parsoid&formatversion=2 | body-only in .parse.text |
| GET /api.php?action=parse&page=Main_Page&parser=legacy&formatversion=2 | body-only in .parse.text |
| GET /api.php?action=visualeditor&page=Main_Page&paction=parse&formatversion=2 | full document in .visualeditor.content |
| POST /localhost/v3/transform/wikitext/to/pagebundle/Main_Page (Parsoid extension) | full document in .html.body |
| POST /localhost/v3/transform/wikitext/to/html/Main_Page (Parsoid extension) | full document |
| POST /v1/transform/wikitext/to/html/Main_Page?stash=true (requires auth) | full document + stash-key ETag |
| POST /v1/transform/html/to/wikitext/Main_Page w/ If-Match: <stash ETag> (requires auth) | recovers original wikitext (selser) |

We expect that passing an Accept-Language header to invoke language conversion will maintain the same output shape (full document or body-only):

| Endpoint (+ Accept-Language: en-x-piglatin) | Expected shape |
| --- | --- |
| GET page/html, page/with_html, revision/html, revision/with_html | full document |
| POST transform wikitext→html (edit) | full document |
| POST transform wikitext→html ?body_only=true | body-only (T428485) |
| POST v3 pagebundle, v3 wikitext→html | full document |
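
A regression check for these expectations could classify each response body by shape. The Python sketch below is one possible heuristic, not an existing test in core: response_shape() and the EXPECTED mapping are hypothetical, and the sketch makes no network requests; it only classifies HTML strings that a test harness would supply.

```python
import re

# A full document starts with an optional doctype followed by <html ...>.
_HTML_OPEN = re.compile(r"^\s*(<!DOCTYPE[^>]*>\s*)?<html[\s>]", re.IGNORECASE)


def response_shape(html: str) -> str:
    """Classify an HTML string as 'full document' or 'body-only'."""
    return "full document" if _HTML_OPEN.match(html) else "body-only"


# A few rows from the tables above, keyed by endpoint (illustrative subset).
EXPECTED = {
    "GET /v1/page/Main_Page/html": "full document",
    "GET /v1/revision/{id}/html": "full document",
    "POST /v1/transform/wikitext/to/html/Main_Page": "full document",
}


def check(endpoint: str, html: str) -> bool:
    """Return True when the observed shape matches the expected one."""
    return response_shape(html) == EXPECTED[endpoint]
```

For the JSON endpoints, the same classifier would be applied to the extracted html field rather than to the raw response.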

Implementation plan

The transition will be made in several steps. At the end, the ContentHolder will only hold body-only content, and any client which wants a full document will access it via an HtmlPageBundle. The conversion from ParserOutput to HtmlPageBundle will add the <head> and <body> wrappers.

  1. Track full-document status in ContentHolder and introduce ContentHolder::getAsRawHtmlString() to mark users who expect a full document. No behavior change yet, but we can add assertions to test that our classifications are correct.
  2. ContentHolder::getAsHtmlString(BODY_FRAGMENT) will now strip the result to always return body-only content (regardless of whether the raw html is full-document or not). Anything which needed the full document should be using ::getAsRawHtmlString().
  3. Deprecate ::getAsRawHtmlString() and incrementally switch clients who are using it to HtmlPageBundle using htmlPageBundleFromParserOutput with the asFullDocument flag.
  4. Once there are no more users of ::getAsRawHtmlString(), we can flip Parsoid to start storing body_only content in the ParserCache. We still need the on-demand strip in step #2 because old cache contents will still be full documents.
  5. Once the ParserCache expiration time has passed, we can remove the lazy strip and/or strip during deserialization.
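
The on-demand strip from step 2 can be sketched as below. This is a simplified Python illustration, not the planned PHP implementation: strip_to_body() is a hypothetical name, and the regular expression assumes no attribute value on <body> contains a literal ">" (a real implementation would more likely work on the parsed DOM).

```python
import re

# Capture everything between the opening <body ...> and the last </body>.
_BODY = re.compile(r"<body\b[^>]*>(.*)</body\s*>", re.IGNORECASE | re.DOTALL)


def strip_to_body(html: str, is_full_document: bool) -> str:
    """Return body-only HTML regardless of how the content was stored.

    Content already stored as body-only passes through untouched, so the
    same accessor works for old (full-document) and new cache entries.
    """
    if not is_full_document:
        return html
    match = _BODY.search(html)
    if match is None:
        raise ValueError("flagged as full document but no <body> element found")
    return match.group(1)
```

Because the strip is keyed on the tracked flag rather than on sniffing the content, it becomes a no-op once every cached entry is body-only, which is what allows it to be removed in step 5.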

Event Timeline

Request from @Ottomata is to include limit report data, including cache key info, as well. Basically a version of the RenderDebugInfo stage, but putting it into a <script> tag in the <head> or something like that.

I don't understand all the internals, but for our use cases (and occasional debugging), having this information in MW REST API response headers would be nice. We'd prefer not to have to parse the html content for metadata about the rendering.

Related? T418792: Expose MediaWiki Parser render_id as a response header in relevant MW REST API endpoints

Change #1271913 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] WIP: Create <head> from metadata

https://gerrit.wikimedia.org/r/1271913

Change #1298964 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] ParserOutput: Add @internal raw HTML escape valve for body-only migration

https://gerrit.wikimedia.org/r/1298964