Page MenuHomePhabricator

ParserOutput::getText() should be removed from ParserOutput
Open, HighPublic

Description

There is a bunch of 'parse rendering' code in ParserOutput::getText() which has no place in (what we hope will be) a plain value object.

Further, getText() operates directly on the HTML as a string; in the Parsoid world it should be operating (and returning) a DOM tree instead.

Related methods: ParserOutput::addWrapperDivClass(), ::clearWrapperDivClass().

Perhaps the "postprocessing" step here belongs somewhere in the Content hierarchy?

Event Timeline

Some discussion copied from slack:

@cscott: The key refactor there is ParserOutput::getText() https://phabricator.wikimedia.org/T293512 which doesn't belong there -- it does a bunch of content post-processing that shouldn't belong in a "dumb array of keys".
@ihurbain: and where would that belong, ultimately?
@cscott: Not sure! The phab task speculates "somewhere in the Content* hierarchy"? But maybe we'll want a Postprocess.php class on par with Parser.php? Or else this goes into OutputPage, which generally contains that sort of thing. That's a big question mark. But it's definitely "not in ParserOutput".

Later:

@cscott: subbu do you have any thoughts on what the various "post process the output of ParserCache" methods will look like, API-wise?
@cscott: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/749284 in particular has a method which wants to post process the raw HTML for indicators stored in the ParserCache
@ssastry: Funny you should ask because I effectively added a closely related agenda item for Monday in the etherpad.
@cscott: If we're going to have a PostProcess.php class, maybe it makes sense to have a PostProcess::getIndicators(ParserOutput $po) which will do the necessary post processing on the indicator output (in this case, just wrapping it in a div)
@cscott: ParserOutput::getText() wants to move there, too, so it might be PostProcess::getContentHtml(ParserOutput $po) or something like that? But I don't particularly love the name.
@ssastry line 67-72 of the etherpad, but we can brainstorm here too.
@cscott: And I suspect that getContentHtml method should take some set of option keys to select "view mode HTML" or "editing mode HTML" or "mobile HTML" or whatever.

[...]

@cscott: As usual (?) I'm working bottom-up here: I have all this code in ParserOutput::getText() as well as this new wrapper functionality in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/749284 that I need to put somewhere.
@cscott: But maybe we should cut the knot and just create a PostProcess.php in core (bikeshed that name!), mark it @internal and experimental, and start hacking on those various pieces.

The indicator postprocessing discussed above ended up being done in OutputPage::addParserOutputMetadata(): https://gerrit.wikimedia.org/r/c/mediawiki/core/+/770666

We also have an l10n pass being written by @ihurbain for Parsoid which would also fall in the "post processing" camp; Parsoid redlink processing as well.

More discussion from slack:

@daniel By the way, I'm inclined to replace ?stash=true with a format, html_for_editing or some such. Or we could have both: html_for_editing has parsoid data, but if you combine it with stash, it has IDs. The stash param would do nothing with normal /html.
@cscott I think my preference is for a "format" option, because I think there will be (eventually) lots of different post processing combinations.
@cscott (Although I'd /like/ to avoid that.)
@cscott for example, "mobile HTML" is a thing
@cscott and "TOC included, yes/no" is already a thing
@cscott basically all the options in ParserOutput::getText() and then some are probably going to show up in the API endpoint at some point.
@cscott although one hopes that they are all coming from the same cached content/ParserOutput and are just various postprocessing done on it
@daniel ParserOutput::getText needs to be factored out urgently, before it accumulates more logic
@cscott Yes, I've even got a phab task for it
@cscott https://phabricator.wikimedia.org/T293512
@cscott any bikeshedding you want to do about the name/location/etc of the future "post processing parserOutput" class/hierarchy would be welcome. I've speculated that the Content hierarcy might be a good place, but I'm not really familiar with everything that lives there now.
@cscott I think my vague idea is that we might have a small number of different "format" options like "read", "edit", "mobile", "discussiontools" (!) which each map internally into a set of postprocessing options/steps like "discardDataParsoid", "no-toc", etc, to avoid exposing the entire postprocessing chain as part of the API.
@daniel Yes, that sounds reasonable.
@cscott The other option is just to allow an extensible number of "transforms" and ask for them explicitly. That is a bit more future-proof, in that it allows you to (say) transition the mobile app to no-toc content by having the latest version of the app explicitly request options=mobile,no-toc or whatever, while not breaking older versions of the app which expect toc inline in the HTML.
@daniel I'd not put the post processing in the content hierarchy. I'd prefer them to be orthogonal. The same post-processing could be applied to output generated from various kinds of content.
@daniel The thought experiment I'd make is "if someone wrote extensive markdown support for mediawiki, would they want this transformation to work on their output"?
@cscott maybe something under MediaWiki\Parser then?
@cscott Red link tagging is a good thought experiment, you'd probably want that to work on markdown.
@cscott and it's specific to mediawiki & requires database access, eg, so fits better in core than in Parsoid.
@daniel I'd want variant conversion to work on markdown too.
@daniel In my mind, these transformations have nothing to do with wikitext. They are html-to-html transformations. They are independent of the content type, and should be oblivious of whatever generates the HTML, or what from.
@cscott "strip infoboxes" is something that happens in mobile HTML, I guess that's something you'd do for markdown as well? It's really an artifact of the fact that we never fully pursued MCR so the infoboxes are still "inline" with the main article content instead of being a separate content type.
@daniel Not sure MCR would remove infoboxes completely from the content. The layout gets tricky if you want to do it generically, especially if there are many kinds of infobox on the same page
@cscott (i'll just say that "page layout" wants to be an MCR type of its own.)
@daniel I'd probably start a new top level namespace for "output transformation".
@cscott MediaWiki\OutputTransform ?
@daniel MediaWiki\OutputTransform sounds good to me
@ssastry ( oh ...lots of dicsussion here ... will catch up later) .. but MediaWiki\html2html is less cumbersome?
@ssastry but, i suppose that ties it to html as a format.
@cscott (it's also a bit tricky by the specific way we use "html" and "dom" in parsoid, where "html" is usually "html as a string" and "dom" is "html as a parsed DOM tree"; to be consistent I'd hope this would be "dom2dom")
@ssastry I will also add another variable into this space ... does a html2html transform always have to live as core code? Are they / can they be "micro"services?
@daniel html2html would also work for me. I want it to be tied to HTML as a format.
@daniel Other kinds of transformations are conceptually different.
@daniel DOMTransform also works for me. Anything that doesn't bind it to wikitext or parsing :slightly_smiling_face:
@ssastry scott dom/html is a detail can be combined in htm2html in my view.
@cscott the word "output" is more-or-less specific in mediawiki: ParserOutput/OutputPage/etc.
@daniel ParserOutputTransform would be nice and specific, but maybe overly so.
@cscott the api question is just "does the set of postprocessing transforms want to be exposed as a limited number of 'format's, or as an infinitely-extensible set of boolean flags"
@daniel for wikidata rdf exports, we invented a system of "flavors" for this purpose
@cscott (subbu doesn't like ParserOutputTransform, but I will submit that it is extremely meaningful to a Mediawiki core developer, and it would be 100% clear that ParserOutput::getText() should live there.)
@daniel I agree that it's both very clear and very ugly 😉
@cscott yes, it shares the "very ugly" (but not the "very clear") with most parts of core. :slightly_smiling_face:
@daniel hehe...
@cscott My gut feeling is that "format" should refer to "cachable or database" things, eg wikitext from the db, html from the cache, future-expensive-to-recompute-derived-content-type.
@cscott and some other word should be used for the set of transformations you can make to that 'format'. I like 'flavor', fwiw.
@cscott format=html,flavor=edit or format=html,flavor=mobile,edit
@cscott those dimensions seem to make intuitive sense

MSantos triaged this task as High priority.Jun 26 2023, 3:12 PM

Change 941523 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] Add HtmlOrDom abstraction to represent HTML as string or as DOM tree

https://gerrit.wikimedia.org/r/941523

Change #1035795 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/core@master] Replace gettext calls with calls to runOutputPipeline

https://gerrit.wikimedia.org/r/1035795