
MediaWiki should be able to support multiple parser engines
Open, Low, Public

Description

There should be an interface that allows other parsers to be plugged into the software, e.g. Parsoid (an implementation of the same language in a different way) or, for 3rd parties, Markdown or some other language (implementing the parser interface but providing a completely different language - something that seems to have been attempted in some form before).

Concretely, we'll need to pave a path to allow mobile web to switch to Parsoid as its main parser engine to serve VisualEditor better and allow us to optimise better for slower connections by having more control and semantic meaning behind content. Given the current state of Parsoid, switching mobile to it would have no known negative impact on user experience (for instance, video/audio are not supported by Parsoid, but they already have terrible support on mobile web).

Long term, we'll probably want to switch Vector and our other skins to Parsoid too when Parsoid is more complete. Mobile can help drive this.

Output: We'll need to set up an RFC for this.

Event Timeline

Jdlrobson raised the priority of this task to Low.
Jdlrobson updated the task description.

I spoke to @bd808 and I'm keen to flesh out the details of an RFC. He suggested @Anomie and @Tgr would be useful people to pull in to help with the details of an RFC and help unravel the gotchas with all this. Feel free to edit the description if you think what I'm articulating could be expressed in a better way.

I was toying with this notion in https://www.mediawiki.org/wiki/User:SSastry_%28WMF%29/Notes/Wikitext. Those are half-formed thoughts, so ignore the ruminations about specific syntactical notions there. Adding a link to it FWIW.

There should be an interface that allows other parsers to be plugged into the software, e.g. Parsoid (an implementation of the same language in a different way) or, for 3rd parties, Markdown or some other language (implementing the parser interface but providing a completely different language - something that seems to have been attempted in some form before).

A difficulty is that a "3rd parties markdown or some other language" would probably need to support both parsing and unparsing to be usable with things like VE, i.e. the whole reason Parsoid was written in the first place. And truthfully I suspect having multiple different syntaxes would be extremely confusing to users. "Ok, I more or less figured out wikitext. But this page is doing something completely different, WTF?" "Oh crap, I used wikitext link syntax here, but this is a markdown page so it rendered wrong!" And so on.

As for plugging in Parsoid, you have a bit of a chicken-and-egg problem there since Parsoid still requires the PHP parser and I haven't heard anything about that changing. IMO the ideal situation would be that there would be no need for plugging in Parsoid for functionality at all, only for performance.

Concretely, we'll need to pave a path to allow mobile web to switch to Parsoid as its main parser engine to serve VisualEditor better

How would using Parsoid for page rendering "serve VisualEditor better"?

and allow us to optimise better for slower connections by having more control and semantic meaning behind content.

Wouldn't semantic meaning behind content be better served by focusing on making it easier for content authors to put semantic meaning into it in the first place? This sounds like your goal is just to replace the MobileFrontend pile of HTML hacks with a pile of DOM hacks in a service, instead of working towards getting rid of the need for content hacks for mobile support altogether.

Given the current state of Parsoid mobile switching to it would not have any known negative impacts to user experience

But it would have the negative impact of going in the opposite direction from trying to bring mobile web and core MediaWiki closer together.

Long term, we'll probably want to switch Vector and our other skins to Parsoid too when Parsoid is more complete.

What perceived advantage are you thinking that Vector and other skins would get from Parsoid output?

As for plugging in Parsoid, you have a bit of a chicken-and-egg problem there since Parsoid still requires the PHP parser

*preprocessor* not the core parser or Tidy. Those are not used.

Extensions like Cite (and gallery and any other extension that processes wikitext - T110909) will need native Parsoid implementations. For all other extensions, we use the MediaWiki API as a proxy to give us their HTML.

A difficulty is that a "3rd parties markdown or some other language" would probably need to support both parsing and unparsing to be usable with things like VE, i.e. the whole reason Parsoid was written in the first place. And truthfully I suspect having multiple different syntaxes would be extremely confusing to users. "Ok, I more or less figured out wikitext. But this page is doing something completely different, WTF?" "Oh crap, I used wikitext link syntax here, but this is a markdown page so it rendered wrong!" And so on.

Yes. I'm not saying that we should do that, and we should certainly not do it on the Wikimedia cluster :). My point is that good software is generic. Theoretically, if it wanted to, MediaWiki could support Sass and Stylus as well as Less with minimal changes to the PHP right now. I think this is important, as an admin might want to use Sass but not Less because that's what they are comfortable with. In the process they'd lose certain extensions that depend on Less, but they should be able to make that trade-off.

If a user wants both VisualEditor and Markdown support they'll have to make that happen themselves, as we won't, but I think a third party should be able to set up a MediaWiki with Markdown support rather than MediaWiki wikitext if they wanted to. I'd be proud if our software was written in a way that supports that.

As for plugging in Parsoid, you have a bit of a chicken-and-egg problem there since Parsoid still requires the PHP parser and I haven't heard anything about that changing. IMO the ideal situation would be that there would be no need for plugging in Parsoid for functionality at all, only for performance.

Yup. The chicken and egg got created somehow, though. Not saying it's gonna be easy :)

How would using Parsoid for page rendering "serve VisualEditor better"?

Right now when you click edit on Wikipedia, VisualEditor requests the full HTML of the page via the Parsoid API. If Parsoid is powering the page, it can simply use that with no API request. That's big.

Wouldn't semantic meaning behind content be better served by focusing on making it easier for content authors to put semantic meaning into it in the first place? This sounds like your goal is just to replace the MobileFrontend pile of HTML hacks with a pile of DOM hacks in a service, instead of working towards getting rid of the need for content hacks for mobile support altogether.

@GWicke has some good ideas around this, e.g. using custom tags such as <infobox> for infoboxes and other templates, which would let us mark that content up more accurately. The point of doing this is to remove the MobileFrontend hacks and not use hacks going forward.

But it would have the negative impact of going in the opposite direction from trying to bring mobile web and core MediaWiki closer together.

Not at all. Quite the opposite. This would actually empower us to split out Minerva from MobileFrontend as a separate standalone skin that is agnostic to the content area. Minerva would become agnostic to the parser rather than be coupled to it.

What perceived advantage are you thinking that Vector and other skins would get from Parsoid output?

Mostly the VE change, and I suspect in future allowing us to do things such as defer load components in the page, but I can't say exactly how right now, nor can I make a serious claim until we've worked this out on the mobile site, which has a huge problem with performance on 2G connections right now due to bloated HTML.

See also T112999: Let MediaWiki operate entirely without wikitext, which is very similar. (One possible parser engine: none at all.) That task also outlines some places where wikitext syntax is hard coded into (eg) interface localization, so those things should be made more flexible as well.

How would using Parsoid for page rendering "serve VisualEditor better"?

Right now when you click edit on Wikipedia, VisualEditor requests the full HTML of the page via the Parsoid API. If Parsoid is powering the page, it can simply use that with no API request. That's big.

OTOH, the Parsoid output is also big, as in file size and therefore bandwidth required. But below you write

nor can I make a serious claim until we've worked this out on the mobile site, which has a huge problem with performance on 2G connections right now due to bloated HTML.

which seems to be in direct conflict with sending the whole Parsoid HTML on every page view.

That may also conflict with any DOM transformations applied to the content before it's sent to the mobile browser. For example, the mobile site currently strips navboxes; what happens when VE tries to save HTML that's missing the navboxes?

@GWicke has some good ideas around this, e.g. using custom tags such as <infobox> for infoboxes and other templates, which would let us mark that content up more accurately. The point of doing this is to remove the MobileFrontend hacks and not use hacks going forward.

And what does that have to do with using Parsoid for page rendering? I hope Gabriel's plan for stuff like that isn't Parsoid-only.

But it would have the negative impact of going in the opposite direction from trying to bring mobile web and core MediaWiki closer together.

Not at all. Quite the opposite. This would actually empower us to split out Minerva from MobileFrontend as a separate standalone skin that is agnostic to the content area. Minerva would become agnostic to the parser rather than be coupled to it.

You're saying you can make one part closer by moving a more fundamental part further away. That doesn't sound like an improvement to me.

What perceived advantage are you thinking that Vector and other skins would get from Parsoid output?

Mostly the VE change, and I suspect in future allowing us to do things such as defer load components in the page, but I can't say exactly how right now,

The VE thing doesn't have anything to do with the skin. Handwaving means nothing, and it's not clear how "defer load components in the page" depends on Parsoid either.

As for VE on desktop, sure, that might help if we don't mind inflating page size for many readers to make things slightly faster for relatively few editors. Whether that trade-off is worth it I don't know.

@daniel I'm told you have a similar problem for Wikidata?

@Anomie please read the blocking bug (and its parent and its parent etc). It will hopefully enlighten you on how we expect this to be a performance boost. Essentially we want to move away from monolithic chunks of HTML to more semantically marked-up components.

See also T112999: Let MediaWiki operate entirely without wikitext, which is very similar. (One possible parser engine: none at all.) That task also outlines some places where wikitext syntax is hard coded into (eg) interface localization, so those things should be made more flexible as well.

Yip! This is definitely another angle of this bug and another use case that an RFC should support. Thanks!

Change 242781 had a related patch set uploaded (by Jdlrobson):
WIP: Allow parsers other than default parser

https://gerrit.wikimedia.org/r/242781

^ So this actually wasn't that hard... I built a proof of concept. Being able to show that Markdown can be used in MediaWiki with minimal work was pretty cool. I'm sure this is overly simplistic and I've forgotten a few things... but I'd appreciate some thoughts on the approach.

This task seems to be a mix of two different concerns:

  • have non-wikitext content types support their own parsers (e.g. Markdown) - that does not seem hard, ContentHandler already does most of the necessary abstraction. Replacing wikitext everywhere (e.g. in system messages) might be somewhat tricky, T112999 captures issues with that well.
  • use different parsers for the same content. Specifically for wikitext this would mean creating an abstraction layer on top of the parser, and implementing that either by the Parser class or a REST client which calls out to Parsoid.

Probably better to deal with those separately.

This task seems to be a mix of two different concerns:

  • have non-wikitext content types support their own parsers (e.g. Markdown) - that does not seem hard, ContentHandler already does most of the necessary abstraction. Replacing wikitext everywhere (e.g. in system messages) might be somewhat tricky, T112999 captures issues with that well.
  • use different parsers for the same content. Specifically for wikitext this would mean creating an abstraction layer on top of the parser, and implementing that either by the Parser class or a REST client which calls out to Parsoid.

Probably better to deal with those separately.

I think there is utility to the first kind of use case, but less so for the second kind. I am not convinced that mixing multiple markup styles on the same wiki, let alone page, is useful (for reasons that anomie has already mentioned earlier).

I think Jdlrobson's proof of concept actually helps clarify this: you might have "Parsoid", "PHP" and "Purified/Tidied" parsers for "the same content" (wikitext). These are mostly compatible, but there might still be corner cases. For example, using the Parsoid parser for article content might be a BetaFeature, so that the same wiki uses both parsers depending on user preferences (during a transition/testing/experiment period, for example).

I suggested on the proof-of-concept patch that most of WikitextContent should actually be hoisted into a new superclass, (say "ParsedContent") so that MarkdownContent can be a more-or-less trivial subclass of ParsedContent. This would accommodate @Tgr's different concerns: clearly differentiate Wikitext from non-wikitext content, but still allow multiple *wikitext* parsers. (Also multiple *markdown* parsers, or multiple *wikitext 2.0* parsers, for similar reasons: there are cases where you can't convert from parser A to parser B in one fell swoop, so you might need multiple parsers for the same content type.)
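
For illustration, a minimal sketch of that hierarchy, assuming 2015-era core classes (TextContent, ParserOutput). The names ParsedContent, MarkdownContent, getRenderer() and HypotheticalMarkdownRenderer are invented here and are not taken from the proof-of-concept patch:

```php
<?php
// Hypothetical sketch only: ParsedContent, MarkdownContent and getRenderer()
// are invented names, not code from the actual patch.

abstract class ParsedContent extends TextContent {

	/** Subclasses supply whatever object turns their markup into HTML. */
	abstract protected function getRenderer();

	protected function fillParserOutput( Title $title, $revId,
		ParserOptions $options, $generateHtml, ParserOutput &$output
	) {
		if ( $generateHtml ) {
			$output->setText( $this->getRenderer()->render( $this->getNativeData() ) );
		}
	}
}

class MarkdownContent extends ParsedContent {

	public function __construct( $text ) {
		parent::__construct( $text, 'markdown' ); // hypothetical content model id
	}

	protected function getRenderer() {
		// Placeholder: any Markdown-to-HTML converter could be returned here.
		return new HypotheticalMarkdownRenderer();
	}
}
```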

This task seems to be a mix of two different concerns:

  • have non-wikitext content types support their own parsers (e.g. Markdown) - that does not seem hard, ContentHandler already does most of the necessary abstraction. Replacing wikitext everywhere (e.g. in system messages) might be somewhat tricky, T112999 captures issues with that well.
  • use different parsers for the same content. Specifically for wikitext this would mean creating an abstraction layer on top of the parser, and implementing that either by the Parser class or a REST client which calls out to Parsoid.

Probably better to deal with those separately.

Agreed. Right now the second is the most interesting but I think we can't do anything for the second without thinking about how the first might work.

I'm starting to think the following:

  1. A message should have a content type e.g. wikitext-message
  2. Each ContentHandler has an associated parser for that type. It seems we currently render a JSON file via the PHP parser for example which seems silly - why wouldn't we have a JSON prettifier Parser for that?
  3. Allow different parsers to be plugged in for these content types, e.g. the mobile site should be able to swap the PHP parser out for Parsoid for wikitext.

  1. A message should have a content type e.g. wikitext-message

If we had different message types instead of it being determined at runtime by whether $msg->plain() or $msg->parse() or the like was called, that would make a bit more sense.
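
To make "determined at runtime" concrete, a small example (the message key is arbitrary):

```php
<?php
// Today the output format of a message is chosen by the caller, not by the
// message itself:
$raw    = wfMessage( 'sitenotice' )->plain();  // raw wikitext, unparsed
$parsed = wfMessage( 'sitenotice' )->parse();  // run through the wikitext parser

// A declared content type per message (e.g. "wikitext-message") would move
// that decision from each call site into the message definition.
```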

  2. It seems we currently render a JSON file via the PHP parser for example

False. Even if you're talking about pages of JavaScript code rather than JSON, it's still false. The JS code is run through the wikitext parser to pick up templates, categories, and the like for backwards compatibility, but the output HTML isn't actually used.

  3. e.g. the mobile site should be able to swap the PHP parser out for Parsoid for wikitext.

I still think that's a bad idea for reasons previously stated. No need to go over it again.

  1. A message should have a content type e.g. wikitext-message

That would improve security a lot. I don't see any relation to this task though.

  2. Each ContentHandler has an associated parser for that type.

Currently it's up to Content::fillParserOutput() how to render the content. Text content (such as JSON) just calls htmlspecialchars() - that's basically a very simple parser. So having alternative parsing methods is not actually related to providing Parser implementations.
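
A condensed illustration of that point (simplified, not verbatim core code): for a plain-text content type, "parsing" amounts to escaping and wrapping.

```php
<?php
// Simplified sketch of how a plain-text content type renders itself;
// real TextContent does a bit more, but the shape is the same.
class EscapingTextContent extends TextContent {
	protected function fillParserOutput( Title $title, $revId,
		ParserOptions $options, $generateHtml, ParserOutput &$output
	) {
		if ( $generateHtml ) {
			$html = htmlspecialchars( $this->getNativeData() );
			$output->setText( "<pre>$html</pre>" );
		}
	}
}
```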

Parser does a few things apart from simple parsing:

  • it handles template transclusion
  • it allows extensions to hook in and extend the syntax
  • it allows the page to be marked as depending on other pages

If we need any of that for non-wikitext context (e.g. if we want Markdown that has transclusions and thumbnails like wikitext does) it makes sense to create an abstraction layer on top of Parser and allow content types to select which implementation to use. Otherwise, the existing mechanism seems sufficient.
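
As a rough sketch of what such an abstraction layer could look like (the interface and class names here are hypothetical, not existing core code):

```php
<?php
// Hypothetical abstraction: content types would ask for a renderer instead
// of using $wgParser directly. None of these names exist in core.
interface WikitextRenderer {
	/** @return string HTML rendering of the given wikitext */
	public function render( $wikitext, Title $title, ParserOptions $options );
}

class PhpParserRenderer implements WikitextRenderer {
	public function render( $wikitext, Title $title, ParserOptions $options ) {
		global $wgParser;
		return $wgParser->parse( $wikitext, $title, $options )->getText();
	}
}

// A second implementation could wrap a REST client calling out to Parsoid;
// extension hooks and dependency tracking would still need a home, though.
```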

Again I don't think this is really related to the current task.

  3. Allow different parsers to be plugged in for these content types, e.g. the mobile site should be able to swap the PHP parser out for Parsoid for wikitext.

Again, Parsoid is not an alternative implementation of the current parser - it does not track dependencies and it does not handle parser plugins. It's something that runs "on top" of the current parser (or rather, some parts of the current parser) and produces alternate output. As long as we want to have both parsers for all wikitext pages, that's the reasonable way to go - no need to do those things twice.

So it doesn't really make sense to talk about replacing one parser with the other IMO - both need to run for every page. What you seem to care about is that - depending on the skin etc. - we should replace the output of the PHP parser with the output of Parsoid. I don't think an abstraction of what parsers are is needed or helpful there - just make Content::fillParserOutput() call the Parsoid API and use that instead of the normal parser output.
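
A rough sketch of that idea, assuming a RESTBase-style /page/html endpoint; the class name and URL are made up, and error handling, caching and content-version negotiation are omitted:

```php
<?php
// Hypothetical: fetch Parsoid HTML instead of running the PHP parser.
class ParsoidBackedWikitextContent extends WikitextContent {
	protected function fillParserOutput( Title $title, $revId,
		ParserOptions $options, $generateHtml, ParserOutput &$output
	) {
		if ( !$generateHtml ) {
			return;
		}
		// Assumed endpoint shape; adjust to wherever Parsoid/RESTBase lives.
		$url = 'https://rest.example.org/api/rest_v1/page/html/'
			. urlencode( $title->getPrefixedDBkey() );
		$html = Http::get( $url );
		if ( $html !== false ) {
			$output->setText( $html );
		}
	}
}
```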

  1. A message should have a content type e.g. wikitext-message

That would improve security a lot. I don't see any relation to this task though.

It's related because MediaWiki currently assumes there is only one parser, so when we parse messages it defaults to $wgParser. Enforcing a content type makes this more flexible.

  2. Each ContentHandler has an associated parser for that type.

Currently it's up to Content::fillParserOutput() how to render the content. Text content (such as JSON) just calls htmlspecialchars() - that's a (very simple) parser. So having alternative parsing methods is not actually related to providing Parser implementations.

Content shouldn't care about what's rendering it. That's what I'm getting at. They shouldn't be so tightly coupled. A Content object should just declare what language it is written in. I feel that fillParserOutput might belong better in a Parser class - or maybe a Rendering class if we want to avoid the terminology that might be associated with a Parser. Just as a web page can render in Firefox, Chrome or Internet Explorer, I should be able to choose at the client level how it's rendered and swap out the PHP parser without impacting other things.

So it doesn't really make sense to talk about replacing one parser with the other IMO - both need to run for every page. What you seem to care about is that - depending on the skin etc. - we should replace the output of the PHP parser with the output of Parsoid. I don't think an abstraction of what parsers are is needed or helpful there - just make Content::fillParserOutput() call the Parsoid API and use that instead of the normal parser output.

Yes but I want different clients to choose their parser. I want to render this content in Parsoid for mobile and use the PHP parser for desktop. This is why I am getting at the idea that we need some kind of abstraction and an interface to support this.
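
One way to picture that abstraction, as a hedged sketch: a registry keyed by client, reusing the hypothetical WikitextRenderer interface from the earlier sketch (all names invented):

```php
<?php
// Hypothetical registry letting each client (skin, API consumer, ...) pick
// a renderer, falling back to the default PHP parser.
class RendererRegistry {
	/** @var WikitextRenderer[] keyed by client name */
	private $renderers = array();

	public function register( $clientName, WikitextRenderer $renderer ) {
		$this->renderers[$clientName] = $renderer;
	}

	public function getFor( $clientName ) {
		return isset( $this->renderers[$clientName] )
			? $this->renderers[$clientName]
			: $this->renderers['default'];
	}
}

// e.g. $registry->register( 'default', new PhpParserRenderer() );
//      $registry->register( 'mobile', new ParsoidRenderer() );
```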

Change 246112 had a related patch set uploaded (by Jdlrobson):
Allow 3rd parties to change the Parser used for parsing content

https://gerrit.wikimedia.org/r/246112

Yes but I want different clients to choose their parser. I want to render this content in Parsoid for mobile and use the PHP parser for desktop. This is why I am getting at the idea that we need some kind of abstraction and an interface to support this.

MediaWiki doesn't have a nice modular architecture, but in such an architecture there would be a page view controller which would determine what content type is appropriate and call the matching content service, which would return the content, possibly calling the parser service in the process. The part you want to change is the controller, which right now does not have the concept of multiple content services. The closest match is IMO controller -> Article, content service -> ParserCache, parser service -> Parser.

@Tgr The content service is pretty much what I'm proposing in T107595: [RFC] Multi-Content Revisions. The idea there is that you can have a variety of resources associated with a page/revision, some of which are derived from the primary content and possibly generated on demand.

Change 242781 abandoned by Jdlrobson:
WIP: Allow parsers other than default parser

Reason:
See https://gerrit.wikimedia.org/r/#/c/246112/

https://gerrit.wikimedia.org/r/242781

Change 246112 abandoned by Jdlrobson:
Allow 3rd parties to change the Parser used for parsing content

Reason:
No one cares about 3rd parties :(

https://gerrit.wikimedia.org/r/246112