
MediaWiki should be able to support multiple parser engines
Open, Low, Public

Description

There should be an interface that allows other parsers to be plugged into the software, e.g. Parsoid (an implementation of the same language in a different way) or, for 3rd parties, Markdown or some other language (implementing the parser interface but providing a completely different language - something that seems to have been attempted in some form before).

Concretely, we'll need to pave a path to allow mobile web to switch to Parsoid as its main parser engine to serve VisualEditor better and allow us to optimise better for slower connections by having more control and semantic meaning behind content. Given the current state of Parsoid, switching mobile to it would have no known negative impact on user experience (for instance, video/audio are not supported by Parsoid, but they already have terrible support on mobile web).

Long term, we'll probably want to switch Vector and our other skins to Parsoid too when Parsoid is more complete. Mobile can help drive this.

Output: We'll need to set up an RFC for this.

Event Timeline

Jdlrobson raised the priority of this task to Low.
Jdlrobson updated the task description.

I spoke to @bd808 and I'm keen to flesh out the details of an RFC. He suggested @Anomie and @Tgr would be useful people to pull in to help with the details of an RFC and help unravel the gotchas with all this. Feel free to edit the description if you think what I'm articulating could be expressed in a better way.

I was toying with this notion in https://www.mediawiki.org/wiki/User:SSastry_%28WMF%29/Notes/Wikitext. Those are half-formed thoughts, so ignore the ruminations about specific syntactical notions there. Adding a link to it FWIW.

There should be an interface that allows other parsers to be plugged into the software, e.g. Parsoid (an implementation of the same language in a different way) or, for 3rd parties, Markdown or some other language (implementing the parser interface but providing a completely different language - something that seems to have been attempted in some form before).

A difficulty is that a "3rd parties markdown or some other language" would probably need to support both parsing and unparsing to be usable with things like VE, i.e. the whole reason Parsoid was written in the first place. And truthfully I suspect having multiple different syntaxes would be extremely confusing to users. "Ok, I more or less figured out wikitext. But this page is doing something completely different, WTF?" "Oh crap, I used wikitext link syntax here, but this is a markdown page so it rendered wrong!" And so on.

As for plugging in Parsoid, you have a bit of a chicken-and-egg problem there since Parsoid still requires the PHP parser and I haven't heard anything about that changing. IMO the ideal situation would be that there would be no need for plugging in Parsoid for functionality at all, only for performance.

Concretely, we'll need to pave a path to allow mobile web to switch to Parsoid as its main parser engine to serve VisualEditor better

How would using Parsoid for page rendering "serve VisualEditor better"?

and allow us to optimise better for slower connections by having more control and semantic meaning behind content.

Wouldn't semantic meaning behind content be better served by focusing on making it easier for content authors to put semantic meaning into it in the first place? This sounds like your goal is just to replace the MobileFrontend pile of HTML hacks with a pile of DOM hacks in a service, instead of working towards getting rid of the need for content hacks for mobile support altogether.

Given the current state of Parsoid mobile switching to it would not have any known negative impacts to user experience

But it would have the negative impact of going in the opposite direction from trying to bring mobile web and core MediaWiki closer together.

Long term, we'll probably want to switch Vector and our other skins to Parsoid too when Parsoid is more complete.

What perceived advantage are you thinking that Vector and other skins would get from Parsoid output?

As for plugging in Parsoid, you have a bit of a chicken-and-egg problem there since Parsoid still requires the PHP parser

*preprocessor* not the core parser or Tidy. Those are not used.

Extensions like Cite (and gallery and any other extension that processes wikitext - T110909) will need native Parsoid implementations. For all other extensions, we use the MediaWiki API as a proxy to give us their HTML.

A difficulty is that a "3rd parties markdown or some other language" would probably need to support both parsing and unparsing to be usable with things like VE, i.e. the whole reason Parsoid was written in the first place. And truthfully I suspect having multiple different syntaxes would be extremely confusing to users. "Ok, I more or less figured out wikitext. But this page is doing something completely different, WTF?" "Oh crap, I used wikitext link syntax here, but this is a markdown page so it rendered wrong!" And so on.

Yes. I'm not saying that we should do that, and we should certainly not do it on the Wikimedia cluster :). My point is that good software is generic. Theoretically, if it wanted to, MediaWiki could support Sass and Stylus as well as Less with minimal changes to the PHP right now. I think this is important, as an admin might want to use Sass but not Less because that's what they are comfortable with. In the process they'd lose certain extensions that depend on Less, but they should be able to make that trade-off.

If a user wants both VisualEditor and Markdown support they'll have to make that happen themselves, as we won't, but I think a third party should be able to set up a MediaWiki with Markdown support rather than MediaWiki wikitext if they wanted to. I'd be proud if our software was written in a way that supports that.

As for plugging in Parsoid, you have a bit of a chicken-and-egg problem there since Parsoid still requires the PHP parser and I haven't heard anything about that changing. IMO the ideal situation would be that there would be no need for plugging in Parsoid for functionality at all, only for performance.

Yup. The chicken and egg got created somehow, though. Not saying it's gonna be easy :)

How would using Parsoid for page rendering "serve VisualEditor better"?

Right now when you click edit on Wikipedia, VisualEditor requests the full HTML of the page via the Parsoid API. If Parsoid is powering the page, it can simply use that with no API request. That's big.

Wouldn't semantic meaning behind content be better served by focusing on making it easier for content authors to put semantic meaning into it in the first place? This sounds like your goal is just to replace the MobileFrontend pile of HTML hacks with a pile of DOM hacks in a service, instead of working towards getting rid of the need for content hacks for mobile support altogether.

@GWicke has some good ideas around this, e.g. using custom tags such as <infobox> for infoboxes and other templates, which would let us mark that content up more accurately. The point of doing this is to remove the MobileFrontend hacks and not use hacks going forward.

But it would have the negative impact of going in the opposite direction from trying to bring mobile web and core MediaWiki closer together.

Not at all. Quite the opposite. This would actually empower us to split out Minerva from MobileFrontend as a separate standalone skin that is agnostic to the content area. Minerva would become agnostic to the parser rather than be coupled to it.

What perceived advantage are you thinking that Vector and other skins would get from Parsoid output?

Mostly the VE change, and I suspect in future allowing us to do things such as defer load components in the page, but I can't say exactly how right now, nor can I make a serious claim until we've worked this out on the mobile site, which has a huge problem with performance on 2G connections right now due to bloated HTML.

See also T112999: Let MediaWiki operate entirely without wikitext, which is very similar. (One possible parser engine: none at all.) That task also outlines some places where wikitext syntax is hard coded into (eg) interface localization, so those things should be made more flexible as well.

How would using Parsoid for page rendering "serve VisualEditor better"?

Right now when you click edit on Wikipedia, VisualEditor requests the full HTML of the page via the Parsoid API. If Parsoid is powering the page, it can simply use that with no API request. That's big.

OTOH, the Parsoid output is also big, as in file size and therefore bandwidth required. But below you write

nor can I make a serious claim until we've worked this out on the mobile site, which has a huge problem with performance on 2G connections right now due to bloated HTML.

which seems to be in direct conflict with sending the whole Parsoid HTML on every page view.

That may also conflict with any DOM transformations applied to the content before it's sent to the mobile browser. For example, the mobile site currently strips navboxes; what happens when VE tries to save HTML that's missing the navboxes?

@GWicke has some good ideas around this, e.g. using custom tags such as <infobox> for infoboxes and other templates, which would let us mark that content up more accurately. The point of doing this is to remove the MobileFrontend hacks and not use hacks going forward.

And what does that have to do with using Parsoid for page rendering? I hope Gabriel's plan for stuff like that isn't Parsoid-only.

But it would have the negative impact of going in the opposite direction from trying to bring mobile web and core MediaWiki closer together.

Not at all. Quite the opposite. This would actually empower us to split out Minerva from MobileFrontend as a separate standalone skin that is agnostic to the content area. Minerva would become agnostic to the parser rather than be coupled to it.

You're saying you can make one part closer by moving a more fundamental part further away. That doesn't sound like an improvement to me.

What perceived advantage are you thinking that Vector and other skins would get from Parsoid output?

Mostly the VE change, and I suspect in future allowing us to do things such as defer load components in the page, but I can't say exactly how right now,

The VE thing doesn't have anything to do with the skin. Handwaving means nothing, and it's not clear how "defer load components in the page" depends on Parsoid either.

As for VE on desktop, sure, that might help if we don't mind inflating page size for many readers to make things slightly faster for relatively few editors. Whether that trade-off is worth it I don't know.

@daniel I'm told you have a similar problem for Wikidata?

@Anomie please read the blocking bug (and its parent and its parent etc). It will hopefully enlighten you on how we expect this to be a performance boost. Essentially we want to move away from monolithic chunks of HTML to more semantically marked-up components.

See also T112999: Let MediaWiki operate entirely without wikitext, which is very similar. (One possible parser engine: none at all.) That task also outlines some places where wikitext syntax is hard coded into (eg) interface localization, so those things should be made more flexible as well.

Yip! This is definitely another angle of this bug and another use case that an RFC should support. Thanks!

Change 242781 had a related patch set uploaded (by Jdlrobson):
WIP: Allow parsers other than default parser

https://gerrit.wikimedia.org/r/242781

^ So this actually wasn't that hard... I built a proof of concept. Being able to show that Markdown can be used in MediaWiki with minimal work was pretty cool. I'm sure this is overly simplistic and I've forgotten a few things... but I'd appreciate some thoughts on the approach.

This task seems to be a mix of two different concerns:

  • have non-wikitext content types support their own parsers (e.g. Markdown) - that does not seem hard, ContentHandler already does most of the necessary abstraction. Replacing wikitext everywhere (e.g. in system messages) might be somewhat tricky, T112999 captures issues with that well.
  • use different parsers for the same content. Specifically for wikitext this would mean creating an abstraction layer on top of the parser, and implementing that either by the Parser class or a REST client which calls out to Parsoid.

Probably better to deal with those separately.

This task seems to be a mix of two different concerns:

  • have non-wikitext content types support their own parsers (e.g. Markdown) - that does not seem hard, ContentHandler already does most of the necessary abstraction. Replacing wikitext everywhere (e.g. in system messages) might be somewhat tricky, T112999 captures issues with that well.
  • use different parsers for the same content. Specifically for wikitext this would mean creating an abstraction layer on top of the parser, and implementing that either by the Parser class or a REST client which calls out to Parsoid.

Probably better to deal with those separately.

I think there is utility to the first kind of use case, but less so for the second kind. I am not convinced that mixing multiple markup styles on the same wiki, let alone page, is useful (for reasons that anomie has already mentioned earlier).

I think Jdlrobson's proof of concept actually helps clarify this: you might have "Parsoid", "PHP" and "Purified/Tidied" parsers for "the same content" (wikitext). These are mostly compatible, but there might still be corner cases. For example, using the Parsoid parser for article content might be a BetaFeature, so that the same wiki uses both parsers depending on user preferences (during a transition/testing/experiment period, for example).

I suggested on the proof-of-concept patch that most of WikitextContent should actually be hoisted into a new superclass, (say "ParsedContent") so that MarkdownContent can be a more-or-less trivial subclass of ParsedContent. This would accommodate @Tgr's different concerns: clearly differentiate Wikitext from non-wikitext content, but still allow multiple *wikitext* parsers. (Also multiple *markdown* parsers, or multiple *wikitext 2.0* parsers, for similar reasons: there are cases where you can't convert from parser A to parser B in one fell swoop, so you might need multiple parsers for the same content type.)
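
For illustration, a minimal sketch of that hierarchy, assuming 2015-era core classes (TextContent, ParserOutput). The names ParsedContent, MarkdownContent, getRenderer() and HypotheticalMarkdownRenderer are invented here and are not taken from the proof-of-concept patch:

```php
<?php
// Hypothetical sketch only: ParsedContent, MarkdownContent and getRenderer()
// are invented names, not code from the actual patch.

abstract class ParsedContent extends TextContent {

	/** Subclasses supply whatever object turns their markup into HTML. */
	abstract protected function getRenderer();

	protected function fillParserOutput( Title $title, $revId,
		ParserOptions $options, $generateHtml, ParserOutput &$output
	) {
		if ( $generateHtml ) {
			$output->setText( $this->getRenderer()->render( $this->getNativeData() ) );
		}
	}
}

class MarkdownContent extends ParsedContent {

	public function __construct( $text ) {
		parent::__construct( $text, 'markdown' ); // hypothetical content model id
	}

	protected function getRenderer() {
		// Placeholder: any Markdown-to-HTML converter could be returned here.
		return new HypotheticalMarkdownRenderer();
	}
}
```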

This task seems to be a mix of two different concerns:

  • have non-wikitext content types support their own parsers (e.g. Markdown) - that does not seem hard, ContentHandler already does most of the necessary abstraction. Replacing wikitext everywhere (e.g. in system messages) might be somewhat tricky, T112999 captures issues with that well.
  • use different parsers for the same content. Specifically for wikitext this would mean creating an abstraction layer on top of the parser, and implementing that either by the Parser class or a REST client which calls out to Parsoid.

Probably better to deal with those separately.

Agreed. Right now the second is the most interesting but I think we can't do anything for the second without thinking about how the first might work.

I'm starting to think the following:

  1. A message should have a content type e.g. wikitext-message
  2. Each ContentHandler has an associated parser for that type. It seems we currently render a JSON file via the PHP parser for example which seems silly - why wouldn't we have a JSON prettifier Parser for that?
  3. Allow different parsers to be plugged in for these content types, e.g. the mobile site should be able to swap the PHP parser out for Parsoid for wikitext.

  1. A message should have a content type e.g. wikitext-message

If we had different message types instead of it being determined at runtime by whether $msg->plain() or $msg->parse() or the like was called, that would make a bit more sense.
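
To make "determined at runtime" concrete, a small example (the message key is arbitrary):

```php
<?php
// Today the output format of a message is chosen by the caller, not by the
// message itself:
$raw    = wfMessage( 'sitenotice' )->plain();  // raw wikitext, unparsed
$parsed = wfMessage( 'sitenotice' )->parse();  // run through the wikitext parser

// A declared content type per message (e.g. "wikitext-message") would move
// that decision from each call site into the message definition.
```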

  2. It seems we currently render a JSON file via the PHP parser for example

False. Even if you're talking about pages of JavaScript code rather than JSON, it's still false. The JS code is run through the wikitext parser to pick up templates, categories, and the like for backwards compatibility, but the output HTML isn't actually used.

  3. e.g. the mobile site should be able to swap the PHP parser out for Parsoid for wikitext.

I still think that's a bad idea for reasons previously stated. No need to go over it again.

  1. A message should have a content type e.g. wikitext-message

That would improve security a lot. I don't see any relation to this task though.

  2. Each ContentHandler has an associated parser for that type.

Currently it's up to Content::fillParserOutput() how to render the content. Text content (such as JSON) just calls htmlspecialchars() - that's basically a very simple parser. So having alternative parsing methods is not actually related to providing Parser implementations.
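
A condensed illustration of that point (simplified, not verbatim core code): for a plain-text content type, "parsing" amounts to escaping and wrapping.

```php
<?php
// Simplified sketch of how a plain-text content type renders itself;
// real TextContent does a bit more, but the shape is the same.
class EscapingTextContent extends TextContent {
	protected function fillParserOutput( Title $title, $revId,
		ParserOptions $options, $generateHtml, ParserOutput &$output
	) {
		if ( $generateHtml ) {
			$html = htmlspecialchars( $this->getNativeData() );
			$output->setText( "<pre>$html</pre>" );
		}
	}
}
```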

Parser does a few things apart from simple parsing:

  • it handles template transclusion
  • it allows extensions to hook in and extend the syntax
  • it allows the page to be marked as depending on other pages

If we need any of that for non-wikitext context (e.g. if we want Markdown that has transclusions and thumbnails like wikitext does) it makes sense to create an abstraction layer on top of Parser and allow content types to select which implementation to use. Otherwise, the existing mechanism seems sufficient.
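
As a rough sketch of what such an abstraction layer could look like (the interface and class names here are hypothetical, not existing core code):

```php
<?php
// Hypothetical abstraction: content types would ask for a renderer instead
// of using $wgParser directly. None of these names exist in core.
interface WikitextRenderer {
	/** @return string HTML rendering of the given wikitext */
	public function render( $wikitext, Title $title, ParserOptions $options );
}

class PhpParserRenderer implements WikitextRenderer {
	public function render( $wikitext, Title $title, ParserOptions $options ) {
		global $wgParser;
		return $wgParser->parse( $wikitext, $title, $options )->getText();
	}
}

// A second implementation could wrap a REST client calling out to Parsoid;
// extension hooks and dependency tracking would still need a home, though.
```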

Again I don't think this is really related to the current task.

  3. Allow different parsers to be plugged in for these content types, e.g. the mobile site should be able to swap the PHP parser out for Parsoid for wikitext.

Again, Parsoid is not an alternative implementation of the current parser - it does not track dependencies and it does not handle parser plugins. It's something that runs "on top" of the current parser (or rather, some parts of the current parser) and produces alternate output. As long as we want to have both parsers for all wikitext pages, that's the reasonable way to go - no need to do those things twice.

So it doesn't really make sense to talk about replacing one parser with the other IMO - both need to run for every page. What you seem to care about is that - depending on the skin etc. - we should replace the output of the PHP parser with the output of Parsoid. I don't think an abstraction of what parsers are is needed or helpful there - just make Content::fillParserOutput() call the Parsoid API and use that instead of the normal parser output.
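
A rough sketch of that idea, assuming a RESTBase-style /page/html endpoint; the class name and URL are made up, and error handling, caching and content-version negotiation are omitted:

```php
<?php
// Hypothetical: fetch Parsoid HTML instead of running the PHP parser.
class ParsoidBackedWikitextContent extends WikitextContent {
	protected function fillParserOutput( Title $title, $revId,
		ParserOptions $options, $generateHtml, ParserOutput &$output
	) {
		if ( !$generateHtml ) {
			return;
		}
		// Assumed endpoint shape; adjust to wherever Parsoid/RESTBase lives.
		$url = 'https://rest.example.org/api/rest_v1/page/html/'
			. urlencode( $title->getPrefixedDBkey() );
		$html = Http::get( $url );
		if ( $html !== false ) {
			$output->setText( $html );
		}
	}
}
```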

  1. A message should have a content type e.g. wikitext-message

That would improve security a lot. I don't see any relation to this task though.

It's related because MediaWiki currently assumes there is only one parser, so when we parse messages it defaults to $wgParser. Enforcing a content type makes this more flexible.

  2. Each ContentHandler has an associated parser for that type.

Currently it's up to Content::fillParserOutput() how to render the content. Text content (such as JSON) just calls htmlspecialchars() - that's a (very simple) parser. So having alternative parsing methods is not actually related to providing Parser implementations.

Content shouldn't care about what's rendering it. That's what I'm getting at. They shouldn't be so tightly coupled. A Content object should just declare what language it is written in. I feel that fillParserOutput might belong better in a Parser class - or maybe a Rendering class if we want to avoid the terminology that might be associated with a Parser. Just as a web page can render in Firefox, Chrome or Internet Explorer, I should be able to choose at the client level how it's rendered and swap out the PHP parser without impacting other things.

So it doesn't really make sense to talk about replacing one parser with the other IMO - both need to run for every page. What you seem to care about is that - depending on the skin etc. - we should replace the output of the PHP parser with the output of Parsoid. I don't think an abstraction of what parsers are is needed or helpful there - just make Content::fillParserOutput() call the Parsoid API and use that instead of the normal parser output.

Yes but I want different clients to choose their parser. I want to render this content in Parsoid for mobile and use the PHP parser for desktop. This is why I am getting at the idea that we need some kind of abstraction and an interface to support this.
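
One way to picture that abstraction, as a hedged sketch: a registry keyed by client, reusing the hypothetical WikitextRenderer interface from the earlier sketch (all names invented):

```php
<?php
// Hypothetical registry letting each client (skin, API consumer, ...) pick
// a renderer, falling back to the default PHP parser.
class RendererRegistry {
	/** @var WikitextRenderer[] keyed by client name */
	private $renderers = array();

	public function register( $clientName, WikitextRenderer $renderer ) {
		$this->renderers[$clientName] = $renderer;
	}

	public function getFor( $clientName ) {
		return isset( $this->renderers[$clientName] )
			? $this->renderers[$clientName]
			: $this->renderers['default'];
	}
}

// e.g. $registry->register( 'default', new PhpParserRenderer() );
//      $registry->register( 'mobile', new ParsoidRenderer() );
```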

Change 246112 had a related patch set uploaded (by Jdlrobson):
Allow 3rd parties to change the Parser used for parsing content

https://gerrit.wikimedia.org/r/246112

Yes but I want different clients to choose their parser. I want to render this content in Parsoid for mobile and use the PHP parser for desktop. This is why I am getting at the idea that we need some kind of abstraction and an interface to support this.

MediaWiki doesn't have a nice modular architecture, but in such an architecture there would be a page view controller which would determine what content type is appropriate and call the matching content service, which would return the content, possibly calling the parser service in the process. The part you want to change is the controller, which right now does not have the concept of multiple content services. The closest match is IMO controller -> Article, content service -> ParserCache, parser service -> Parser.

@Tgr The content service is pretty much what I'm proposing in T107595: [RFC] Multi-Content Revisions. The idea there is that you can have a variety of resources associated with a page/revision, some of which are derived from the primary content and possibly generated on demand.

Change 242781 abandoned by Jdlrobson:
WIP: Allow parsers other than default parser

Reason:
See https://gerrit.wikimedia.org/r/#/c/246112/

https://gerrit.wikimedia.org/r/242781

Change 246112 abandoned by Jdlrobson:
Allow 3rd parties to change the Parser used for parsing content

Reason:
No one cares about 3rd parties :(

https://gerrit.wikimedia.org/r/246112