Convert Parsoid to dependency injection
Open, Low, Public

Description

To avoid tight coupling, classes should not statically call methods from other classes (with the possible exception of trivial utility methods) and should not instantiate objects (unless instantiating objects is their purpose, e.g. factory or dependency injection container classes). Instead, static methods should be turned into non-static ones, and the object holding them should be injected as a dependency; and new objects should be obtained from dependency-injected factory objects. This is the D of the SOLID OOP principles, and it is there to facilitate unit testing, customization and refactoring. See the dependency injection help page and RfC (although those were written for MediaWiki specifically).
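
To make the pattern concrete, here is a minimal before/after sketch (all class and method names are invented for illustration):

```php
<?php

// Before: tight coupling via a static utility call.
class StaticSanitizer {
	public static function cleanUrl( string $url ): string {
		return trim( $url ); // stand-in for real sanitization logic
	}
}

class LegacyLinkHandler {
	public function render( string $href ): string {
		// Static call: cannot be replaced by a mock in a unit test.
		return StaticSanitizer::cleanUrl( $href );
	}
}

// After: the sanitizer is a non-static service injected via the constructor.
class Sanitizer {
	public function cleanUrl( string $url ): string {
		return trim( $url );
	}
}

class LinkHandler {
	/** @var Sanitizer */
	private $sanitizer;

	public function __construct( Sanitizer $sanitizer ) {
		$this->sanitizer = $sanitizer;
	}

	public function render( string $href ): string {
		return $this->sanitizer->cleanUrl( $href );
	}
}
```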

At some point we should convert Parsoid to dependency injection, which raises the question of what the relationship between dependency injection in MediaWiki core and in MediaWiki libraries should be, as libraries should not depend on MediaWiki classes but we probably want to use the same container. Presumably MediaWikiServices should be made compatible with PSR-11 or core should provide a compatibility wrapper, but that still leaves the question of how Parsoid configures its own container and how it overrides it in tests.

Event Timeline

ssastry triaged this task as Medium priority. Apr 15 2019, 8:57 PM

As far as the interface between Parsoid-PHP and MediaWiki, the mental model I've been working with has been that Parsoid-PHP will implement (or will be wrapped to implement) a "Parser" service. It will have a few objects injected/passed implementing Parsoid-defined interfaces (a rough sketch follows the list below), with any MediaWiki services needed in turn being injected into those objects.

  • "Global" configuration would be injected into the Parsoid-PHP service's constructor as an implementation of Parsoid\Config\SiteConfig.
    • This includes configuration settings, and access to standardized auxiliary services such as a PSR-3 logger and an implementation of Liuggio\StatsdClient\Factory\StatsdDataFactoryInterface.
  • Per-parse configuration would be passed to Parsoid-PHP's "parse" methods as an implementation of Parsoid\Config\PageConfig.
  • Data access would be passed to Parsoid-PHP's "parse" methods as an implementation of Parsoid\Config\DataAccess.
    • In MediaWiki Parser.php terms, these two resemble ParserOptions. PageConfig is intended as static configuration, while DataAccess handles fetching data not known to the caller along the lines of ParserOptions's getCurrentRevisionCallback() and the like.
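
A rough sketch of that shape (the Parsoid\Config interface names are the ones described above; the class name, method name, and return type are placeholders):

```php
<?php

use Parsoid\Config\SiteConfig;
use Parsoid\Config\PageConfig;
use Parsoid\Config\DataAccess;

class ParsoidService {
	/** @var SiteConfig */
	private $siteConfig;

	// "Global" configuration is injected once, into the constructor.
	public function __construct( SiteConfig $siteConfig ) {
		$this->siteConfig = $siteConfig;
	}

	// Per-parse configuration and data access arrive per call, so a single
	// service instance can handle many parses.
	public function wikitextToHtml( PageConfig $pageConfig, DataAccess $dataAccess ): string {
		// ... tokenize, build the DOM, serialize to HTML ...
		return '<!DOCTYPE html>...';
	}
}
```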

Yet to be defined are the interfaces for parser hooks (e.g. what Parser.php passes as PPFrame, as well as methods for metadata setting and reentrant parsing) and interfaces for how Parsoid will return the data and metadata resulting from the parse (equivalent to ParserOutput).

On the MediaWiki side, we also have yet to abstract an interface for this "Parser" service that can use both the existing Parser.php and Parsoid-PHP as implementations.

Presumably MediaWikiServices should be made compatible with PSR-11 or core should provide a compatibility wrapper

I note that PSR-11 specifically says "Users SHOULD NOT pass a container into an object so that the object can retrieve its own dependencies. This means the container is used as a Service Locator which is a pattern that is generally discouraged."

As far as the interface between Parsoid-PHP and MediaWiki, the mental model I've been working with has been that Parsoid-PHP will implement (or will be wrapped to implement) a "Parser" service.

This makes sense if Parsoid is an extension. This is what I have been assuming so far; I never took a closer look.

If however Parsoid is a library used by core, this should not even be needed. Code that follows a DI design generally has no need for, and in fact should not have, any knowledge of a DI framework. A DI framework helps an application "wire up" all the services it needs, which are just classes asking for things to be provided to them.

An example (from thin air, without having looked at the code): If Parsoid defines a ParsoidParser class that takes a ParserConfig object and a MessageLookup service and a TemplateLoader service and a Logger service, then it is up to MediaWiki to provide instances of these when constructing a ParsoidParser.
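
On the MediaWiki side, the wiring for that might look roughly like the following (reusing the made-up class names from the example; the adapter classes are equally hypothetical):

```php
<?php

use MediaWiki\Logger\LoggerFactory;
use MediaWiki\MediaWikiServices;

// A ServiceWiring.php-style entry: MediaWiki constructs the hypothetical
// ParsoidParser, providing a concrete instance for each constructor parameter.
return [
	'ParsoidParser' => function ( MediaWikiServices $services ) {
		return new ParsoidParser(
			new MediaWikiParserConfig( $services->getMainConfig() ), // hypothetical adapter
			new MediaWikiMessageLookup(), // hypothetical adapter
			new MediaWikiTemplateLoader( $services->getRevisionStore() ),
			LoggerFactory::getInstance( 'parsoid' )
		);
	},
];
```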

The idea behind DI is that the services just ask for things, and assume they are provided by the application (manually, or using a framework of some sort). This is sometimes called the Gaia principle: don't go looking for things, just ask for them (by requiring them as constructor parameters).

Maybe I'm missing some detail that makes this somehow problematic.

On the MediaWiki side, we also have yet to abstract an interface for this "Parser" service that can use both the existing Parser.php and Parsoid-PHP as implementations.

This is only necessary or useful if Parsoid is to (even optionally) replace the old parser entirely; if as a first step Parsoid exists as an alternative thing in MW core, there is no need for it to be compatible with the traditional parser (yet). A layer that exposes the same interface for both of them can be added later.

All this being said: if Parsoid does need its own container, it can just use the same code that MW core uses. The Wikimedia\Services namespace is already isolated from core, we'd just need to pull it out into a separate repo and publish it on packagist.

If Parsoid is an extension, then it can interact with core's service container. If it's a library, then it MUST NOT interact with core's service container.

A few quick comments:

  • Parsoid is going to be a library that will be pulled into core via composer
  • This phab task is more about post-port (or beyond) work to clean up some code patterns *within* Parsoid around its own components (e.g. various token and DOM transformers) and using the DI pattern internally
  • The plan for the Parsoid <-> core interface is indeed what Brad & Daniel document. Parsoid will implement an interface that will accept objects which will be populated in core (via whatever mechanisms core uses).
  • As for a common parsing interface that wraps both the legacy and Parsoid parsing code, we aren't going to strive for that right away; our immediate goal is to move Parsoid to core. During the unification we can revisit that common interface and figure out to what extent it is achievable / desirable, i.e. whether it is desirable to continue supporting the legacy parser's API in all the ways it is used, vs. arriving at a parsing interface that we think sufficiently abstracts / hides the internals of the parser. We don't have to discuss and debate this now / here ... I am only mentioning it since it came up ... we can work out the specifics and details when we need to.

Thanks for clarifying, @ssastry

This phab task is more about post-port (or beyond) work to clean up some code patterns *within* Parsoid around its own components (e.g. various token and DOM transformers) and using the DI pattern internally

In that case, my point about libraries not needing a DI framework applies. Using DI means not using global/static variables anywhere, and asking for anything you need specifically, in the constructor, and not using "new" except for plain (ideally immutable) value objects.

That of course raises the question what, in the end, will indeed instantiate the classes. But that's not a question the library should try to answer. That's up to the application.

The task description says:

At some point we should convert Parsoid to dependency injection, which raises the question of what the relationship between dependency injection in MediaWiki core and in MediaWiki libraries should be, as libraries should not depend on MediaWiki classes but we probably want to use the same container.

Why do you think you would need any container in a library?

Thanks for clarifying, @ssastry

This phab task is more about post-port (or beyond) work to clean up some code patterns *within* Parsoid around its own components (e.g. various token and DOM transformers) and using the DI pattern internally

In that case, my point about libraries not needing a DI framework applies. Using DI means not using global/static variables anywhere, and asking for anything you need specifically, in the constructor, and not using "new" except for plain (ideally immutable) value objects.

Exactly .. this is the plan ... we already have pipeline objects doing all the construction of objects and passing them to constructors .. but there is still a bunch of code that does component construction internally, which makes unit testing and mocking hard. By injecting those components via constructors, mocking and unit testing become possible. I didn't want to undertake that refactoring (even if simple) during the port, so that we can do this better and uniformly in one shot instead of doing it piecemeal now. Hence this phab task.
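
For instance, a minimal sketch of the pattern (invented names): once the transformer is injected rather than constructed internally, the stage can be unit-tested against a mock.

```php
<?php

use PHPUnit\Framework\TestCase;

// The pipeline stage asks for its transformer instead of constructing it.
interface TokenTransformer {
	public function transform( array $tokens ): array;
}

class TokenPipelineStage {
	/** @var TokenTransformer */
	private $transformer;

	public function __construct( TokenTransformer $transformer ) {
		// Previously something like: $this->transformer = new QuoteTransformer();
		$this->transformer = $transformer;
	}

	public function process( array $tokens ): array {
		return $this->transformer->transform( $tokens );
	}
}

// With the dependency injected, a unit test can substitute a mock.
class TokenPipelineStageTest extends TestCase {
	public function testProcessDelegatesToTransformer() {
		$mock = $this->createMock( TokenTransformer::class );
		$mock->method( 'transform' )->willReturn( [ 'transformed' ] );

		$stage = new TokenPipelineStage( $mock );
		$this->assertSame( [ 'transformed' ], $stage->process( [ 'raw' ] ) );
	}
}
```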

The task description says:

At some point we should convert Parsoid to dependency injection, which raises the question of what the relationship between dependency injection in MediaWiki core and in MediaWiki libraries should be, as libraries should not depend on MediaWiki classes but we probably want to use the same container.

Why do you think you would need any container in a library?

That is for @Tgr.

An example (from thin air, without having looked at the code): If Parsoid defines a ParsoidParser class that takes a ParserConfig object and a MessageLookup service and a TemplateLoader service and a Logger service, then it is up to MediaWiki to provide instances of these when constructing a ParsoidParser.

That's pretty much the model I've been working with, although some of the "services" to be injected are more like adapters than things that would go into MediaWikiServices.

One thing to keep in mind is that (I think) we're not wanting to tie Parsoid-PHP strongly to MediaWiki, we want it to be usable by other codebases that might need to process wikitext as well. So we're using the standardized PSR-3 logging, and interfaces from popular composer libraries like https://packagist.org/packages/liuggio/statsd-php-client, but I avoided depending on MediaWiki-internal things like LinkTarget when I defined the interfaces so far.

So, for example, rather than injecting RevisionStore into Parsoid-PHP directly we're injecting a "DataAccess", where the MediaWiki implementation of DataAccess will use RevisionStore as its backend. We also have a separate DataAccess implementation in Parsoid-PHP that instead hits MediaWiki's action API using curl to fetch the data, like Parsoid-JS does.
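
In code, that split might look roughly like this (heavily simplified: the interface below is a stand-in for the real Parsoid\Config\DataAccess, and fetchPageSource is an invented method):

```php
<?php

// Simplified stand-in for Parsoid\Config\DataAccess; the real interface has
// more methods, and fetchPageSource is invented for illustration.
interface DataAccess {
	public function fetchPageSource( string $title ): ?string;
}

// Used when Parsoid runs inside MediaWiki: backed by core's RevisionStore.
class MediaWikiDataAccess implements DataAccess {
	/** @var \MediaWiki\Revision\RevisionStore */
	private $revisionStore;

	public function __construct( \MediaWiki\Revision\RevisionStore $revisionStore ) {
		$this->revisionStore = $revisionStore;
	}

	public function fetchPageSource( string $title ): ?string {
		$rev = $this->revisionStore->getRevisionByTitle( \Title::newFromText( $title ) );
		return $rev ? $rev->getContent( 'main' )->serialize() : null;
	}
}

// Used standalone: fetches the same data via MediaWiki's action API
// (the real implementation uses curl and proper error handling).
class ApiDataAccess implements DataAccess {
	/** @var string */
	private $apiUrl;

	public function __construct( string $apiUrl ) {
		$this->apiUrl = $apiUrl;
	}

	public function fetchPageSource( string $title ): ?string {
		$url = $this->apiUrl . '?' . http_build_query( [
			'action' => 'query', 'prop' => 'revisions', 'rvprop' => 'content',
			'rvslots' => 'main', 'titles' => $title,
			'format' => 'json', 'formatversion' => 2,
		] );
		$data = json_decode( file_get_contents( $url ), true );
		return $data['query']['pages'][0]['revisions'][0]['slots']['main']['content'] ?? null;
	}
}
```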

This is only necessary or useful if Parsoid is to (even optionally) replace the old parser entirely;

That is the plan as I understand it. We'll likely spend some time in an intermediate state where either parser can be selected via LocalSettings.php (possibly even a beta feature) or the like while we verify the integration of Parsoid-PHP.

if as a first step Parsoid exists as an alternative thing in MW core, there is no need for it to be compatible with the traditional parser (yet). A layer that exposes the same interface for both of them can be added later.

On the other hand, that might result in more work as we'd have to maintain parallel code in Cite, TemplateStyles, and every other parser-hooking extension for both Parser.php and Parsoid-PHP for whatever timeframe the switchover takes rather than just doing a one-time port for each extension from Parser.php to the common interface.

One thing to keep in mind is that (I think) we're not wanting to tie Parsoid-PHP strongly to MediaWiki, we want it to be usable by other codebases that might need to process wikitext as well. So we're using the standardized PSR-3 logging, and interfaces from popular composer libraries like https://packagist.org/packages/liuggio/statsd-php-client, but I avoided depending on MediaWiki-internal things like LinkTarget when I defined the interfaces so far.

Yes, framework isolation is a good thing, even if it means writing a bunch of glue code in adapters.

In my mind, Parsoid should be completely independent of MediaWiki core. One thing that we could consider is putting MediaWiki's data model into a separate library, for use by client code (and libraries). This library would contain things like LinkTarget and TitleValue and UserIdentity and RevisionRecord and PageRecord. Basically, any entity you would find represented in an API result.

This establishes a shared vocabulary and avoids the overhead of translating between contexts. Could be useful. Could also lead to unrelated code getting "polluted" with MediaWiki concepts.

I know @ssastry and I disagree on this, but I'd like for us to move toward a place where Parsoid is an extension. In my ideal world, both the legacy parser and Parsoid could be extensions implementing a Parser interface -- along with (eventually) perhaps a "wikitext 2.0" parser and even maybe a ReStructuredText or markdown parser or whatever. I'd like to reduce the dependencies between mediawiki core and any one particular parser implementation.

That's not to say that Parsoid can't *also* be a composer library. In fact, I'd like to see robust support for reusing Parsoid *as a library* "usable by other codebases that might need to process wikitext as well" as @Anomie says. Markdown is used extensively across the web because there are multiple independent implementations not tied to a particular framework; if we want to encourage broader support of wikitext we need to continue to reduce the entanglements between wikitext and core and allow the creation of parser implementations independent of mediawiki.

That seems to be the aim of dependency injection, after all -- give Parsoid access to core where it needs it, but not in a way that will break if Parsoid is used outside of MediaWiki.

I know @ssastry and I disagree on this, but I'd like for us to move toward a place where Parsoid is an extension. In my ideal world, both the legacy parser and Parsoid could be extensions implementing a Parser interface -- along with (eventually) perhaps a "wikitext 2.0" parser and even maybe a ReStructuredText or markdown parser or whatever.

I disagree with you on most of that too, but all that is not really relevant to this task.

I'd like to reduce the dependencies between mediawiki core and any one particular parser implementation.

This part is relevant, and here I do agree that it would be beneficial to have a clearer interface that could in theory be implemented by extensions.

I know @ssastry and I disagree on this, but I'd like for us to move toward a place where Parsoid is an extension. In my ideal world, both the legacy parser and Parsoid could be extensions implementing a Parser interface -- along with (eventually) perhaps a "wikitext 2.0" parser and even maybe a ReStructuredText or markdown parser or whatever.

I disagree with you on most of that too, but all that is not really relevant to this task.

For the record, I agree with you here, @cscott :) I'd love to see a MediaWiki core that is oblivious of wikitext. Not because I want to kill wikitext, but because I want the flexibility moving it out provides.

I'd like to reduce the dependencies between mediawiki core and any one particular parser implementation.

This part is relevant, and here I do agree that it would be beneficial to have a clearer interface that could in theory be implemented by extensions.

I'm all for it, I just meant to say that it may not be necessary right away.

I know @ssastry and I disagree on this, but I'd like for us to move toward a place where Parsoid is an extension. In my ideal world, both the legacy parser and Parsoid could be extensions implementing a Parser interface -- along with (eventually) perhaps a "wikitext 2.0" parser and even maybe a ReStructuredText or markdown parser or whatever.

I disagree with you on most of that too, but all that is not really relevant to this task.

For the record, I agree with you here, @cscott :) I'd love to see a MediaWiki core that is oblivious of wikitext. Not because I want to kill wikitext, but because I want the flexibility moving it out provides.

Since everyone is "going on the record" :-) I suppose I should as well. I don't disagree with the goal of making MW core oblivious to wikitext or that it would be useful to allow us to plug other markup engines into core. But, I haven't been convinced that the extension mechanism is the way to do it, and for sure, I don't want to make Parsoid an extension prematurely now. Once we have killed the legacy parser and have a parsing interface that we are happy with that abstracts away internals, we can think of what the appropriate mechanisms for introducing other markup engines might be. But, trying to engineer all that now is premature and not a good use of our time.

If however Parsoid is a library used by core, this should not even be needed. Code that follows a DI design generally has no need for, and in fact should not have, any knowledge of a DI framework.

For one thing, Parsoid is a mix of a library and an application; it can be used standalone, with HTTP calls to a remote MediaWiki installation, and we probably want to preserve that (it's used for Kiwix builds, offline editing applications, etc.). Plus we'll need the same functionality for wiring up integration tests, anyway.

But even apart from that, complex libraries do need a DI container. They shouldn't directly obtain the DI container; that's a different thing. After everything is converted to non-static calls, Parsoid will have an object graph with something like two dozen classes; it is unreasonable to expect all users of the library to deal with that (and update it as it changes, which would force us to treat any update that changes anything about object construction as a major one). And there is no reason for most of those internal services to pollute the global service namespace, either. And having to coordinate the deploy of every Parsoid change which touches on object construction with a corresponding MediaWiki change would be a major PITA.

The reasonable approach would be to provide a "helper" DI container which manages the dependencies of the main parser service, but leave it up to the application's main DI container to set up and use that container.
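
As a very rough sketch of what such a helper container could look like (all names invented):

```php
<?php

use Psr\Container\ContainerInterface;

// Stubs standing in for Parsoid internals:
class Tokenizer {
}
class Parser {
	public function __construct( $siteConfig, Tokenizer $tokenizer ) {
	}
}

// The helper container: Parsoid wires its own internal object graph here,
// pulling only the externally provided services (SiteConfig, DataAccess,
// logger, ...) from a PSR-11 container supplied by the application.
class ParsoidServices {
	/** @var ContainerInterface */
	private $app;
	/** @var object[] */
	private $instances = [];

	public function __construct( ContainerInterface $app ) {
		$this->app = $app;
	}

	public function getParser(): Parser {
		if ( !isset( $this->instances['parser'] ) ) {
			// Internal wiring lives here, so a Parsoid release can reorganize
			// its object graph without a coordinated MediaWiki change.
			$this->instances['parser'] = new Parser(
				$this->app->get( 'Parsoid.SiteConfig' ), // from the application
				new Tokenizer()
			);
		}
		return $this->instances['parser'];
	}
}
```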

All this being said: if Parsoid does need its own container, it can just use the same code that MW core uses. The Wikimedia\Services namespace is already isolated from core, we'd just need to pull it out into a separate repo and publish it on packagist.

That works too, but it's nicer for a library not to force a specific DI container on its users; that's why PSR-11 exists. (Less work for us, too.) We could just use the service wiring array format (but with PSR-11 instead of MediaWikiServices as the argument), which is easy enough to integrate into pretty much any DIC system.
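
That could look something like this (service names invented): a wiring file in the same array format as MediaWiki's ServiceWiring.php, but typed against the PSR-11 ContainerInterface so any container can consume it.

```php
<?php

use Psr\Container\ContainerInterface;

// Tokenizer and Parser stand for Parsoid-internal classes.
return [
	'Parsoid.Tokenizer' => function ( ContainerInterface $c ) {
		return new Tokenizer( $c->get( 'Parsoid.SiteConfig' ) );
	},
	'Parsoid.Parser' => function ( ContainerInterface $c ) {
		return new Parser(
			$c->get( 'Parsoid.SiteConfig' ),
			$c->get( 'Parsoid.Tokenizer' )
		);
	},
];
```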

I know @ssastry and I disagree on this, but I'd like for us to move toward a place where Parsoid is an extension.

We want to make it possible to use Parsoid outside of MediaWiki, and we want to keep it well-separated so that people studying it to come up with their own parser implementations have an easier time, so it has to be a library. Whether that library is used by MediaWiki core or an extension is a separate matter, and - once the parser has been turned into a decoupled component that's only accessed via dependency injection, which will be necessary for the Parsoid port anyway - a fairly trivial one. Objects using a service have no knowledge of where/how that service was constructed - that's one of the points of DI.

In my ideal world, both the legacy parser and Parsoid could be extensions implementing a Parser interface -- along with (eventually) perhaps a "wikitext 2.0" parser and even maybe a ReStructuredText or markdown parser or whatever.

Having multiple wikitext parsers and having different parsers for different markups are two entirely different layers of abstraction. The latter is already mostly possible via the content handler mechanism, and the extent to which it is not possible has less to do with the parser and more to do with baked-in assumptions that certain things (like system messages) always use wikitext. I don't see much overlap between wider support for non-wikitext markup and the Parsoid-PHP project.

In my ideal world, both the legacy parser and Parsoid could be extensions implementing a Parser interface -- along with (eventually) perhaps a "wikitext 2.0" parser and even maybe a ReStructuredText or markdown parser or whatever.

Having multiple wikitext parsers and having different parsers for different markups are two entirely different layers of abstraction. The latter is already mostly possible via the content handler mechanism, and the extent to which it is not possible has less to do with the parser and more to do with baked-in assumptions that certain things (like system messages) always use wikitext. I don't see much overlap between wider support for non-wikitext markup and the Parsoid-PHP project.

I agree it's mostly off-topic for/orthogonal to this thread, which should stay focused on the concrete DI mechanism. The only point of connection might be adding $contentType="wikitext/1.0"/$resulttype="html/2.1.0" parameters to some top-level getParserService call, to keep the door open for backwards-incompatible changes to wikitext / etc in the future. Parsoid-as-a-service already has similar versioning ("html/2.1.0") on the parse result, as a content-type header in the request, and as part of its returned document, but that mechanism doesn't work without a REST layer between Parsoid and core.

One could imagine a complete DI conversion allowing multiple Parsoid services to be instantiated, including both "latest parsoid" as well as earlier versions emitting older HTML versions. T114413: Support various conversions in Parsoid's pb2pb endpoint discusses these use cases; so far we've handled them by adding backwards-compatibility code to Parsoid instead of actually running multiple versions. *In theory* DI could let us keep multiple versions of Parsoid separate, although the exact mechanics of this (esp a new namespace for every Parsoid version, which AFAIK would require updating the namespace line in every Parsoid source file) are a bit beyond me.

The only point of connection might be adding $contentType="wikitext/1.0"/$resulttype="html/2.1.0" parameters to some top-level getParserService call, to keep the door open for backwards-incompatible changes to wikitext / etc in the future.

Service calls cannot have parameters (since the service instances get cached); you can either have the DI container return a service factory instead and have that factory decide how to construct the parser given the version requirements, or have one parser and pass in the version along with the data to be parsed. The first is more flexible in that it allows constructing the parser with a different dependency graph for different versions, which makes B/C less of a pain. OTOH, due to templates, the input version at least will be mixed during a parsing process, which means different parser objects will have to work together and share the parser state, which probably makes things more challenging.
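
A sketch of the factory option (names and version strings illustrative):

```php
<?php

// Stubs standing in for Parsoid internals:
class CurrentHtmlSerializer {
}
class LegacyHtmlSerializer {
	public function __construct( string $version ) {
	}
}
class VersionedParser {
	public function __construct( object $serializer ) {
	}
}

// The container caches this factory (a parameterless "service"), and the
// factory decides the dependency graph per requested version.
class ParsoidParserFactory {
	public function getParser( string $contentType, string $resultType ): VersionedParser {
		// Different output versions get different serializers; B/C strategies
		// can be swapped in here without touching the parser core.
		$serializer = $resultType === 'html/2.1.0'
			? new CurrentHtmlSerializer()
			: new LegacyHtmlSerializer( $resultType );
		return new VersionedParser( $serializer );
	}
}

// Usage from application code:
// $factory = $services->get( 'ParsoidParserFactory' );
// $parser = $factory->getParser( 'wikitext/1.0', 'html/2.1.0' );
```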

*In theory* DI could let us keep multiple versions of Parsoid separate, although the exact mechanics of this (esp a new namespace for every Parsoid version, which AFAIK would require updating the namespace line in every Parsoid source file) are a bit beyond me.

The nice approach is to wrangle the architecture until everything that changes between versions aligns with a plane of abstraction and you just need to load a different set of strategies or serializers or whatever into the parser for different input / output versions.

The easy approach is to just snapshot the files and put them into a separate namespace, as you said.

ssastry lowered the priority of this task from Medium to Low. Mar 30 2020, 10:48 PM