(BIKESHED WARNING: Name of class can/has been bikeshedded.)
Parsoid needs to get access to a number of methods of ParserOutput in core, but the full ParserOutput class has too many dependencies on core to cleanly extract.
In particular, Parsoid needs to do a bunch of book-keeping to track page dependencies, so that Parsoid can be used in the post-edit hook to update these dependency tables.
The proposal is to add a new clean interface in Parsoid, something like:
interface ContentMetadataCollector {
    function addCategory(...);
    function setIndicator(...);
    function setTitleText(...);
    function addLanguageLink(...);
    function addWarning(...);
    function addExternalLink(...);
    function addLink(...);
    function addImage(...);
    function addTemplate(...);
    // etc
}
and have ParserOutput in core implement this interface:
class ParserOutput extends CacheTime implements ContentMetadataCollector { ... }
Not every method in ParserOutput will be added to ContentMetadataCollector! ContentMetadataCollector is intended to be more-or-less "write-only" and none of the getters will be added to ContentMetadataCollector, nor will some of the non-bookkeeping functions (like ParserOutput::isLinkInternal -- why is that even there?).
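As a rough illustration of the "write-only" idea, the sketch below uses a trimmed-down stand-in interface and a stand-in for core's ParserOutput. The names `SketchMetadataCollector`, `SketchParserOutput`, and `recordMetadata`, and all parameter lists, are assumptions made for illustration; the proposal itself leaves the real signatures elided.

```php
<?php
declare(strict_types=1);

// Hypothetical, trimmed-down interface: only "write" methods.
// The parameter lists are assumptions, not the proposed signatures.
interface SketchMetadataCollector {
    public function addCategory(string $category, string $sortKey = ''): void;
    public function addWarning(string $message): void;
}

// Stand-in for core's ParserOutput: it implements the write-only
// interface and keeps its getters, which are NOT part of the interface.
class SketchParserOutput implements SketchMetadataCollector {
    /** @var array<string,string> category name => sort key */
    private array $categories = [];
    /** @var string[] */
    private array $warnings = [];

    public function addCategory(string $category, string $sortKey = ''): void {
        $this->categories[$category] = $sortKey;
    }

    public function addWarning(string $message): void {
        $this->warnings[] = $message;
    }

    // Getters live on the concrete class only.
    public function getCategories(): array {
        return $this->categories;
    }

    public function getWarnings(): array {
        return $this->warnings;
    }
}

// Code typed against the interface can record metadata,
// but has no way to read it back through that type.
function recordMetadata(SketchMetadataCollector $collector): void {
    $collector->addCategory('Example_category', 'sort');
    $collector->addWarning('example warning');
}

$output = new SketchParserOutput();
recordMetadata($output);
```

The point of the split is that Parsoid only ever sees the narrow type, while core keeps reading the collected data through the concrete class.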
Further, all methods related to HTML output (::getText(), ::setTOCHTML(), etc.) are omitted. Parsoid represents its output as a DOM (and related metadata) and uses PageBundle for this purpose. The getText methods are a mess anyway.
Some of the setters might need to be tweaked a bit if they use MW-specific classes; for example ParserOutput::addLink takes a Title as an argument, and the version of that in ContentMetadataCollector will need to be changed to take a dbKey string (like addImage does). [EDIT: @Pchelolo suggests some form of LinkTarget class.]
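One possible shape for such a tweaked setter is sketched below. Everything here is hypothetical: `TitleStandIn` models only two accessors of a Title, and the `addLink( int $namespace, string $dbKey )` signature (including the namespace argument) is an assumption, not a decision recorded in this task.

```php
<?php
declare(strict_types=1);

// Stand-in for a core Title; only the two accessors used here are modelled.
class TitleStandIn {
    private int $namespace;
    private string $dbKey;

    public function __construct(int $namespace, string $dbKey) {
        $this->namespace = $namespace;
        $this->dbKey = $dbKey;
    }

    public function getNamespace(): int {
        return $this->namespace;
    }

    public function getDBkey(): string {
        return $this->dbKey;
    }
}

// Hypothetical collector-side signature: plain scalars, no core classes.
interface LinkCollectorSketch {
    public function addLink(int $namespace, string $dbKey): void;
}

class LinkRecorderSketch implements LinkCollectorSketch {
    /** @var array<int,array<string,bool>> namespace => dbKey => true */
    private array $links = [];

    public function addLink(int $namespace, string $dbKey): void {
        $this->links[$namespace][$dbKey] = true;
    }

    public function getLinks(): array {
        return $this->links;
    }
}

// Core-side helper that unpacks a Title before calling the interface method.
function addLinkFromTitle(LinkCollectorSketch $collector, TitleStandIn $title): void {
    $collector->addLink($title->getNamespace(), $title->getDBkey());
}

$recorder = new LinkRecorderSketch();
addLinkFromTitle($recorder, new TitleStandIn(0, 'Main_Page'));
$recorder->addLink(10, 'Infobox');
```

A LinkTarget-style value object, as suggested in the edit note, would replace the two scalars with a single small type that both core and Parsoid can share.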
When Parsoid is run in standalone mode it will be provided with an implementation of ContentMetadataCollector (src/Config/Api/StandaloneContentMetadataCollector.php) which basically does nothing on every call. We might eventually extend this to record some of the information in standalone mode, but only to allow us to write better tests. For example, the parser test infrastructure contains a mode where it dumps all the categories as plain text; we might implement ContentMetadataCollector::addCategory() to store the information somewhere to support that mode of parser tests. (Adding a proxy ContentMetadataCollector to record information off to the side might also be a way to compare results of the legacy parser and Parsoid in production.)
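The two standalone variants described above could look roughly like this. It is a sketch under assumptions: the interface is cut down to one method, and `NoOpCollectorSketch`, `RecordingProxySketch`, and the `dump()` text format are invented for illustration, not taken from StandaloneContentMetadataCollector.php or the parser test infrastructure.

```php
<?php
declare(strict_types=1);

// Hypothetical one-method slice of the collector interface.
interface CategoryCollectorSketch {
    public function addCategory(string $category, string $sortKey = ''): void;
}

// Standalone mode: accept every call and discard it.
final class NoOpCollectorSketch implements CategoryCollectorSketch {
    public function addCategory(string $category, string $sortKey = ''): void {
        // Intentionally empty.
    }
}

// Proxy that records each call off to the side, then forwards it unchanged.
final class RecordingProxySketch implements CategoryCollectorSketch {
    private CategoryCollectorSketch $inner;
    /** @var string[] */
    private array $log = [];

    public function __construct(CategoryCollectorSketch $inner) {
        $this->inner = $inner;
    }

    public function addCategory(string $category, string $sortKey = ''): void {
        $this->log[] = "cat=$category sort=$sortKey";
        $this->inner->addCategory($category, $sortKey);
    }

    // Plain-text dump, loosely modelled on a "dump all categories" test mode.
    public function dump(): string {
        return implode("\n", $this->log);
    }
}

$proxy = new RecordingProxySketch(new NoOpCollectorSketch());
$proxy->addCategory('Foo', 'foo');
$proxy->addCategory('Bar');
```

Because the proxy implements the same interface as what it wraps, the same class could in principle wrap a real collector in production to capture a side-channel log for comparison.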
When Parsoid is run in integrated mode, the ContentMetadataCollector object it will be given will be a "real" ParserOutput. Hooks and extensions which expect to have access to a real ParserOutput will thus have access to one via typecast, as long as they are being run in integrated mode (which by definition they almost certainly are because the hook/extension mechanism is part of integrated mode).
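PHP has no cast operator for class types, so in practice the "typecast" would be an `instanceof` check that narrows the interface to the concrete class. The sketch below shows that pattern with stand-in classes; `CollectorSketch`, `CoreOutputSketch`, `StandaloneCollectorSketch`, and `describeCollector` are hypothetical names.

```php
<?php
declare(strict_types=1);

interface CollectorSketch {
    public function addWarning(string $message): void;
}

// Stand-in for the standalone, do-nothing collector.
class StandaloneCollectorSketch implements CollectorSketch {
    public function addWarning(string $message): void {
        // Intentionally empty.
    }
}

// Stand-in for core's ParserOutput, with a getter outside the interface.
class CoreOutputSketch implements CollectorSketch {
    /** @var string[] */
    private array $warnings = [];

    public function addWarning(string $message): void {
        $this->warnings[] = $message;
    }

    public function getWarnings(): array {
        return $this->warnings;
    }
}

// An extension hook that works in both modes, and uses the richer
// concrete API only when it is actually available.
function describeCollector(CollectorSketch $collector): string {
    $collector->addWarning('from extension');
    if ($collector instanceof CoreOutputSketch) {
        // Narrowed: getters are reachable here.
        return 'integrated: ' . count($collector->getWarnings()) . ' warning(s)';
    }
    return 'standalone';
}
```

Calling `describeCollector()` with a `CoreOutputSketch` takes the narrowed branch; with a `StandaloneCollectorSketch` it falls through to the interface-only path.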
Note 1: Methods will need to be added to ContentMetadataCollector in Parsoid with care, to avoid breaking ParserOutput in core. It should be straightforward to commit the new method to core first, then update mediawiki-vendor to reference the latest Parsoid with the new method in its ContentMetadataCollector interface. In the worst case we'll have to temporarily rename methods to allow signature changes to be done in multiple steps.
Note 2: There is some limit tracking code in core's Parser class (for example Parser::incrementExpensiveFunctionCount) which properly belongs in ParserOutput/ContentMetadataCollector so that Parsoid can bump the limits as it parses. See also Parser::makeLimitReport. Moving these functions from Parser to ParserOutput can be done as a non-blocking subtask. Once the methods are in ParserOutput they can easily be added to ContentMetadataCollector, and then Parsoid can call them.
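A minimal sketch of what limit tracking on the collector side might look like follows. The method names are borrowed from the Parser methods mentioned above, but the class `LimitTrackerSketch`, the constructor-supplied limit, the boolean return value, and the report format are all assumptions for illustration.

```php
<?php
declare(strict_types=1);

// Hypothetical home for limit tracking once it moves off Parser.
class LimitTrackerSketch {
    private int $limit;
    private int $count = 0;

    public function __construct(int $limit) {
        $this->limit = $limit;
    }

    // Bump the counter; returns true while the count is within the limit.
    public function incrementExpensiveFunctionCount(): bool {
        $this->count++;
        return $this->count <= $this->limit;
    }

    // Invented one-line report format, for illustration only.
    public function makeLimitReport(): string {
        return "Expensive parser function count: {$this->count}/{$this->limit}";
    }
}

$tracker = new LimitTrackerSketch(2);
$results = [
    $tracker->incrementExpensiveFunctionCount(),
    $tracker->incrementExpensiveFunctionCount(),
    $tracker->incrementExpensiveFunctionCount(), // third call exceeds the limit of 2
];
```

With the counter living on the output object, any parser holding the collector can bump it without a reference to core's Parser.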
Note 3: https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/623032 is old but was a first step towards this task.