See step 1 in https://www.mediawiki.org/wiki/Parsoid/OutputTransform
Description
Details
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| Back ParserOutput with a ContentHolder rather than rawText+extensiondata | mediawiki/core | master | +357 -294 |
Event Timeline
This is a significant cost in real-life use.
@ssastry: https://performance.wikimedia.org/excimer/profile/623e6e27ff8a6067 shows that the post-ParserOutput transform takes ~9 seconds for Parsoid, and that is because we are repeatedly doing html -> dom -> html on that ginormous doc. The actual transforms take a few hundred milliseconds at most.
@ssastry: That is for the enwiki:Barack_Obama page.
@ssastry: Should not be too hard to do, but I consider this a high-priority blocker for any significant deploys.
@cscott: Yeah, avoiding repeated html->dom->html has been on our to-do list. Storing the data as a page bundle instead of inline attributes makes things a lot faster as well: we can keep the data-* as an array instead of de/serializing back and forth to JSON. (Of course this is even better if data-mw is in the page bundle.)
@cscott: Some of the output transforms should be converted from string to DOM as well, since we can't really avoid a lot of string-to-DOM-and-back conversion if adjacent pipeline stages are written to use different formats. This shouldn't be hard; most of the pipeline stages don't do much.
Current default pipeline is:
- [text] ExtractBody -- uses Html::modifyHtml
- [text] AddRedirectHeader
- [text] RenderDebugInfo
- [dom] ParsoidLocalization
- [text] ExecutePostCacheTransformHooks
- [text] AddWrapperDivClass
- [text] HandleSectionLinks / [dom] HandleParsoidSectionLinks
- [text] HandleTOCMarkers
- [text] DeduplicateStyles -- uses Html::modifyElements
- [text] ExpandToAbsoluteUrls
- [text] HydrateHeaderPlaceholders
- [text] HardenNFC
From @ssastry's stack trace, we're doing the following parses:
- ExtractBody (in Html::modifyElements)
- ParsoidLocalization (including data attributes)
- HandleParsoidSectionLinks (including data attributes)
- DeduplicateStyles (Html::modifyElements)
A Remex parse+serialize in Html::modifyElements takes 1s each; the with-data-attributes parse+serialize takes 3.5s (of which 1.14s is "visitAndStoreDataAttributes" alone, which wouldn't be needed if we just kept the HTML in page bundle form).
1+3.5+3.5+1 = 9s (!)
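To make the arithmetic explicit, here is a toy Python model of the conversion overhead. The timings and stage names are the approximate figures quoted from the profile above; nothing here is real MediaWiki code:

```python
# Approximate cost (seconds) of one full html -> dom -> html round trip,
# from the profile above. Stages that round-trip data-* attributes pay
# an extra de/serialization cost.
PLAIN_ROUNDTRIP = 1.0   # Remex parse + serialize (Html::modifyElements)
WITH_DATA_ATTRS = 3.5   # parse + serialize including data attributes

# The four stages identified in the stack trace as doing a full parse:
stages = {
    "ExtractBody": PLAIN_ROUNDTRIP,
    "ParsoidLocalization": WITH_DATA_ATTRS,
    "HandleParsoidSectionLinks": WITH_DATA_ATTRS,
    "DeduplicateStyles": PLAIN_ROUNDTRIP,
}

total = sum(stages.values())
print(f"total conversion overhead: {total}s")

# If all four stages shared a single DOM (one parse, one serialize,
# with data attributes), the conversion cost collapses to one round trip:
print(f"single shared round trip: {WITH_DATA_ATTRS}s")
```

The model just restates the 1 + 3.5 + 3.5 + 1 = 9s figure, and why a single shared DOM brings it down to roughly 3.5s.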
An optimal ordering of stages, combined with tweaking ExtractBody and DeduplicateStyles to be DOM stages rather than text stages using Html::modifyElements, would bring us to 3.5s. Switching from inline attributes to the page bundle representation would likely save an additional 1-2s by avoiding the de/serialization to JSON.
(Worth noting that parseHtml is 0.9s and XMLSerializer is 0.3s, while Html::modifyElements is 1.0s: the hoops we're jumping through with Remex to avoid actually creating a DOM for a text-mode transformation save about 0.2s each.
Also worth noting that we could improve our latency on "first parse to first view" of this page by keeping around the original DOM generated by Parsoid, instead of serializing it to ParserCache and then re-parsing it.)
@ihurbain notes that we never do [dom] passes on legacy HTML, and that "my vague plan at the back of my head has always kind of been to implement DOM versions of the text transforms and to have a full* DOM pipeline for Parsoid parser output (*caveat for the extension postprocess pass, for which we don't really have control over what's happening there, technically)".
DiscussionTools uses that extension postprocess hook, I believe, but we have T376183: Create new AfterArticleParse hook for Discussion Tools for that.
And while it's true that Html::modifyHtml saves 200ms over doing a "real" HTML parse-and-reserialize, using Html::modifyHtml *twice* is slower (2s) than doing one HTML parse, running both passes on the DOM form, then doing one HTML serialize (1.2s). The legacy parser only does a single Html::modifyHtml at the moment (in DeduplicateStyles), but maintaining the output in DOM form might turn out to be faster even for the legacy parser if 200ms of savings can be found in the DOM-based operations.
I think creating an HTMLHolder or ContentHolder is orthogonal to what I am recommending.
An output transform pipeline should pin its input format and not have stages deal with format conversions (even if that is simply getContentHolderText() or getContentHolderDOM()). If you allow pipeline stages to get content in whatever format they want, you have to enforce good practice and performance by convention and code review. From the POV of blockers, that works for me, but that design is suboptimal. If you instead force pipeline stages to deal with a fixed format, you have a DOMOutputTransformPipeline and a TextOutputTransformPipeline: stage membership is enforced by types, developers and reviewers don't have to worry about convention, the design is self-documenting, and it prevents performance problems and regressions.
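As an illustration of the format-pinned design being proposed, here is a minimal Python sketch. The class names DOMOutputTransformPipeline and TextOutputTransformPipeline come from the comment above; the Dom stand-in and everything else is hypothetical, not actual MediaWiki code:

```python
from typing import Callable

class Dom:
    """Toy stand-in for a real DOM document class."""
    def __init__(self, html: str) -> None:
        self.html = html

class TextOutputTransformPipeline:
    """A pipeline whose stages are pinned to str -> str by type."""
    def __init__(self, stages: list[Callable[[str], str]]) -> None:
        self.stages = stages

    def run(self, text: str) -> str:
        for stage in self.stages:
            text = stage(text)
        return text

class DOMOutputTransformPipeline:
    """A pipeline whose stages are pinned to Dom -> Dom by type."""
    def __init__(self, stages: list[Callable[[Dom], Dom]]) -> None:
        self.stages = stages

    def run(self, dom: Dom) -> Dom:
        for stage in self.stages:
            dom = stage(dom)
        return dom

# A text stage cannot be registered in the DOM pipeline (and vice versa)
# without a type checker flagging it -- membership is self-documenting.
text_pipe = TextOutputTransformPipeline([str.strip, str.lower])
print(text_pipe.run("  SOME HTML  "))
```

The point of the sketch is only that the format contract lives in the pipeline type, not in per-stage convention.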
You can always chain DOM and text pipelines optimally for the cases where you get an HTML string out of the ParserCache and need to emit an HTML string in the response. And your transform stages need to fit into one of those pipelines.
If you really wanted to, you could do ParserCache -> TextPipeline -> html2dom -> DOMPipeline -> dom2html -> TextPipeline -> http response ... it still does exactly one html2dom and one dom2html, and you have three pipelines from which to pick the most optimal place and format (text or DOM) for each output transform. And/or you can have both text and DOM versions of some transformations, for the cases where your input comes from Parsoid as a DOM rather than from ParserCache. Or you skip the first TextPipeline.
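A minimal sketch of that chaining, again purely illustrative: the html2dom/dom2html names follow the comment above, and the toy "DOM" here is just a list of lines, not a real document:

```python
# Chained text -> DOM -> text pipelines with exactly one html2dom and
# one dom2html conversion, no matter how many stages run in each part.
def html2dom(html: str) -> list[str]:
    return html.split("\n")   # toy "parse"

def dom2html(dom: list[str]) -> str:
    return "\n".join(dom)     # toy "serialize"

def run_text(stages, text: str) -> str:
    for stage in stages:
        text = stage(text)
    return text

def run_dom(stages, dom: list[str]) -> list[str]:
    for stage in stages:
        dom = stage(dom)
    return dom

def transform(cached_html, pre_text_stages, dom_stages, post_text_stages):
    text = run_text(pre_text_stages, cached_html)
    dom = html2dom(text)            # the ONLY html -> dom conversion
    dom = run_dom(dom_stages, dom)
    html = dom2html(dom)            # the ONLY dom -> html conversion
    return run_text(post_text_stages, html)
```

However many stages land in each bucket, the conversion count stays fixed at one round trip, which is the performance property being argued for.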
So, the OutputTransformPipeline will actually be a chaining of pipelines ... this is similar to what happens in Parsoid: we have a ParserPipeline which is composed of a TokenHandlerPipeline, a DOMProcessorPipeline, and other stages.
Anyway, some thoughts for consideration.
Yes, but the whole point of ContentHolder was to avoid having to boil the ocean in a single day. We have existing passes out there, and existing users of text-based passes via the extension postprocess hook (including DiscussionTools and ParserOutput serialization/deserialization). By creating an abstraction we can work toward the goal of moving everything to DOM without having to do it all at once. And, as @ihurbain notes, the existing pipelines we have to work with are (unfortunately!) very sensitive to stage order, so "just" moving all the DOM-based passes to the front/back isn't an easy solution.
ContentHolder is an incremental way of getting to an all-DOM promised land, which we all agree is where we'd like to eventually end up.
I am not complaining about the incremental approach. All I am saying is: to deal with the performance concerns, you have to eliminate unnecessary format conversions. That means your passes will (have to) fall into format-aligned buckets, which you can then use to make format a pipeline property.
Change #1163428 had a related patch set uploaded (by C. Scott Ananian; author: Isabelle Hurbain-Palatin):
[mediawiki/core@master] Back ParserOutput with a ContentHolder rather than rawText+extensiondata
Change #1163428 merged by jenkins-bot:
[mediawiki/core@master] Back ParserOutput with a ContentHolder rather than rawText+extensiondata