
Let MediaWiki operate entirely without wikitext
Open, Low, Public

Description

When wikitext was created, in 1995, it served a vital function in allowing inexperienced users to easily create and edit pages. In the twenty years since, no standard wikitext has emerged, and since 2004 the stripped-down formatting of Markdown has become the plaintext formatting syntax of choice for much of the web. MediaWiki-style wikitext has failed to be adopted outside our project, and for a decade its mindshare has been declining.

It is time to decouple wikitext from core.

It should be possible to create an HTML-only wiki, with VisualEditor as the primary editing mechanism and no wikitext parsing for typical views and edits. Advanced users could install Parsoid to round-trip from the HTML DOM to wikitext for source editing, translating from wikitext back to the HTML DOM for database storage and display. Eventually new projects may arise to similarly allow round-trip "source" editing in other formats, such as Markdown or a new and refreshed "wikitext 2.0". But simple installations need none of that.

After outlining this vision, we will describe the architectural changes needed to achieve it:

  • [ContentHandler](https://www.mediawiki.org/wiki/Manual:ContentHandler) laid the groundwork for non-wikitext page content, and we must build on it. An HTML-format "Mediawiki DOM" ContentHandler must be written, using DOM methods to separate sections and extract redirects. The "Mediawiki DOM" Content implementation must extract secondary data (links, categories, etc.) directly from the DOM. (Alternatively, page metadata could be stored in a separate JSON "page metadata" attachment, with custom editors provided.)
  • An HTML-based [DifferenceEngine](https://git.wikimedia.org/blob/mediawiki%2Fcore.git/master/includes%2Fdiff%2FDifferenceEngine.php) must be implemented to allow visualizing changes without resorting to wikitext.
  • VisualEditor must be tweaked to fetch Mediawiki DOM directly, bypassing Parsoid; ditto on save.
  • System messages must be associated with a content model, to allow HTML-formatted system messages. Localization workflows need to accommodate non-wikitext messages. Most messages do not need formatting and should probably shift to a "plaintext" content model.
  • The [Sanitizer](https://git.wikimedia.org/blob/mediawiki%2Fcore.git/master/includes%2FSanitizer.php) will need improvement so that it is appropriate to run directly on Mediawiki DOM.
  • Compatibility thunks are also desirable. These would use Parsoid to dynamically generate wikitext from the Mediawiki DOM to allow some legacy extensions and APIs to function.
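
As a rough illustration of the "extract secondary data directly from the DOM" item above, here is a minimal sketch in Python (used only for illustration; MediaWiki core is PHP). It assumes Parsoid-style markup in which wiki links carry `rel="mw:WikiLink"` and categories and redirects appear as `<link>` elements with `rel="mw:PageProp/Category"` and `rel="mw:PageProp/redirect"`. The class and function names are hypothetical and are not part of any MediaWiki API.

```python
from html.parser import HTMLParser


class SecondaryDataExtractor(HTMLParser):
    """Collect links, categories and a redirect target from stored HTML.

    The rel values checked below are assumptions about the stored markup.
    """

    def __init__(self):
        super().__init__()
        self.links = []
        self.categories = []
        self.redirect = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel = (attr.get("rel") or "").split()
        href = attr.get("href") or ""
        # Assumed convention: page targets are written as "./Page_title".
        target = href[2:] if href.startswith("./") else href
        if tag == "a" and "mw:WikiLink" in rel:
            self.links.append(target)
        elif tag == "link" and "mw:PageProp/Category" in rel:
            # Drop the namespace prefix, keeping only the category name.
            self.categories.append(target.split(":", 1)[-1])
        elif tag == "link" and "mw:PageProp/redirect" in rel:
            self.redirect = target


def extract_secondary_data(html):
    """Return links, categories and redirect found in an HTML string."""
    parser = SecondaryDataExtractor()
    parser.feed(html)
    parser.close()
    return {
        "links": parser.links,
        "categories": parser.categories,
        "redirect": parser.redirect,
    }
```

The point of the sketch is that no wikitext parse is involved: the secondary data falls out of a single pass over the stored HTML.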

Perhaps a rough prototype can be demonstrated. The attendees will be able to suggest other areas that might present roadblocks to an HTML-only wiki.

The long-term goal of the Parsoid team is for Parsoid to eventually disappear, replaced by HTML-only wikis and round-trip conversion tools to simpler "source" formats. The main Wikipedia projects will continue to rely on wikitext for a long time yet, but this work would be the first step towards deprecating Parsoid for some users: allowing small wikis to install a monolithic PHP-only MediaWiki core with native HTML storage and visual editing, in the same way Flow has been able to use native HTML storage.

(Based on https://wikimania2015.wikimedia.org/wiki/Submissions/Mediawiki_without_wikitext)

SUMMIT GOALS

  • Agreement with stakeholders on the major implementation tasks above.
  • Input from broader community about wikitext dependencies in our tooling/processes/extensions/gadgets/etc which could be reimagined, could be reimplemented using Parsoid DOM, or require deeper thought.

Event Timeline

cscott raised the priority of this task from to Needs Triage.
cscott updated the task description. (Show Details)
cscott subscribed.

"The long-term goal of the Parsoid team is for Parsoid to eventually disappear" is an overstatement. We want its currently complex implementation to disappear. This requires addressing sources of unnecessary complexity in wikitext (hence wikitext 2.0) and figuring out how the wikitext <-> HTML bridge will continue to be supported .. whether via a stripped-down Parsoid or an in-core simplified parser is to be determined (answer to this last question will also be influenced by how mediawiki itself will evolve).

cscott set Security to None.

Congratulations! This is one of the 52 proposals that made it through the first deadline of the Wikimedia-Developer-Summit-2016 selection process. Please pay attention to the next one:

> By 6 Nov 2015, all Summit proposals must have active discussions and a Summit plan documented in the description. Proposals not reaching this critical mass can continue at their own path out of the Summit.

It is November 6, and this proposal doesn't seem to have much traction; it is not on track. Unless there is a sudden change, I will leave the ultimate decision of pre-scheduling it for the Wikimedia-Developer-Summit-2016 to @RobLa-WMF and the Architecture Committee.

Wikimedia Developer Summit 2016 ended two weeks ago. This task is still open. If the session in this task took place, please make sure 1) that the session Etherpad notes are linked from this task, 2) that followup tasks for any actions identified have been created and linked from this task, 3) to change the status of this task to "resolved". If this session did not take place, change the task status to "declined". If this task itself has become a well-defined action which is not finished yet, drag and drop this task into the "Work continues after Summit" column on the project workboard. Thank you for your help!

Jdforrester-WMF subscribed.

Tagging with VE-ME as this long-term planning task is relevant to our interests.

Jdforrester-WMF renamed this task from Mediawiki without wikitext (sung as: It's an HTML world after all) to Let MediaWiki operate entirely without wikitext. Jun 13 2016, 10:21 PM
Jdforrester-WMF triaged this task as Low priority.
Jdforrester-WMF added a project: Epic.

> Tagging with VE-ME as this long-term planning task is relevant to our interests.

I gather from the "Low" priority, and from this landing in the rather enormous list of tasks tagged VisualEditor-MediaWiki, that the WMF Contributors-Team doesn't have near-term plans to address this. If that's the case, I can understand. There's a lot of exploratory work that needs to be done.

One route I see for achieving this in incremental stages would be:

  1. Turn this task into a roadmap wiki page on MediaWiki.org - We need to give people a clearer idea of where we're heading. The prose in the description should probably be transitioned to mediawiki.org, since MediaWiki is a better tool for collaborating on large swaths of prose than Phabricator is.
  2. Stabilize the Parsoid/MediaWiki DOM spec. Given the way the page moves were performed on this, it's difficult to gauge from the history just how stable the document is. Furthermore, the name of the document isn't stable, and the naming of the current version ("Specs/HTML/1.2.1") doesn't helpfully differentiate it from other important specifications. If we're going to execute well on the rest of this roadmap, this step is really important.
  3. Evangelize the stabilized spec. Once we have something we're reasonably confident we want other people to start building interoperable tools for, then let's get people doing that. For example, it'd be neat if one could use Pandoc to convert Parsoid markup to and from any of the other formats Pandoc already supports (e.g. many Markdown flavors, reStructuredText, AsciiDoc, DokuWiki markup, Emacs Org-Mode, Textile, DocBook, TEI Simple, GNU Texinfo, Groff, MS Word, OpenOffice/LibreOffice ODT, EPUB, OPML, etc.).
  4. Support interoperability with an interested third party. Once we have someone actually interested in using the spec, we'll inevitably find areas where the spec isn't clear. That's ok; the spec is never going to be "finished". The goal here is to help MediaWiki be more useful, by making Parsoid markup a trusted format for archiving. Having two interoperable tools that use the markup language builds necessary trust.
  5. Build a version of MediaWiki that allows for more interoperability. This is far enough out on my theoretical roadmap that I'm going to be vague here. However, I'll suggest one possible outcome: build a version of MediaWiki that uses Pandoc rather than our existing parsing infrastructure as a backend. If Pandoc supported Parsoid HTML interchange, and offered a Markdown editing interface, that would allow us to support much easier migration from a Markdown wiki (e.g. GitHub wikis) to MediaWiki.

This is, of course, just one route. I'm sure there are others. I think the "roadmap wiki page" should probably be a TechCom-RFC. We can use our RFC process as a tool for scaling up our collaboration efforts, and getting more people involved. @cscott, would you be willing to transition this task's description prose to a wiki page on mediawiki.org?

> 1. Stabilize the Parsoid/MediaWiki DOM spec. Given the way the page moves were performed on this, it's difficult to gauge from the history just how stable the document is. Furthermore, the name of the document isn't stable, and the naming of the current version ("Specs/HTML/1.2.1") doesn't helpfully differentiate it from other important specifications. If we're going to execute well on the rest of this roadmap, this step is really important.

The renaming is to more accurately reflect the versioning proposal that was discussed in an RFC. There will be changes to the Parsoid HTML spec, and we will always point the main page to the latest version. But, the changes won't be radical changes. Most of the time, there will be tweaks and sometimes changes like bringing in audio / video support into the spec, or as you noted in point #4, we might need to clarify some areas.

The description also provides an (alternate) incremental path, or at least a series of smaller tasks, to wit:

  • Write HTML-format "Mediawiki DOM" ContentHandler
  • Write HTML-based DifferenceEngine
  • Write DOM-based Sanitizer
  • Associate system messages with a content model
  • Tweak VE to bypass Parsoid when DOM is available from storage
  • Work on localization workflows to handle DOM formats
  • Factor out a "Parser API" that allows pluggable/configurable parsers

The factored-out "Parser API" would also provide a means for eventually unifying our MediaWiki parsers. Once the core parser is untangled from MediaWiki (via the ContentHandler, DifferenceEngine, Sanitizer, and system message work above) we will be able to factor out the existing PHP parser into a library, and provide an alternative library which uses the VirtualRestService to invoke Parsoid (or directly integrates Parsoid via php-embed or v8js). This will let us deprecate the pure PHP parser for WMF production use, while still providing it as a standalone for users who can't run Parsoid for one reason or another.
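
To make the "pluggable/configurable parsers" idea a little more concrete, here is a minimal sketch of what such a registry could look like, written in Python purely for illustration. Every name in it (`register_parser`, `parse`, the backend names) is hypothetical, and the wikitext backend is a toy that handles only bold markup; it stands in for a real parser library and is not one.

```python
import re
from typing import Callable, Dict

# A backend is any callable that turns stored source text into HTML.
ParserBackend = Callable[[str], str]

_BACKENDS: Dict[str, ParserBackend] = {}


def register_parser(name: str, backend: ParserBackend) -> None:
    """Make a backend available under a configurable name."""
    _BACKENDS[name] = backend


def parse(source: str, backend: str) -> str:
    """Render source with the named backend, failing loudly if it is absent."""
    if backend not in _BACKENDS:
        raise ValueError(f"no parser registered under {backend!r}")
    return _BACKENDS[backend](source)


def html_passthrough(source: str) -> str:
    # HTML-only wiki: stored content is already HTML, so nothing is parsed.
    return source


def toy_wikitext(source: str) -> str:
    # Toy stand-in for a real wikitext parser: handles '''bold''' only.
    return re.sub(r"'''(.+?)'''", r"<b>\1</b>", source)


register_parser("html", html_passthrough)
register_parser("toy-wikitext", toy_wikitext)
```

With a seam like this, a simple installation would register only the pass-through backend, while a site that needs wikitext could register a PHP-parser library or a Parsoid-backed one under the same interface.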

> 1. Stabilize the Parsoid/MediaWiki DOM spec. Given the way the page moves were performed on this, it's difficult to gauge from the history just how stable the document is. Furthermore, the name of the document isn't stable, and the naming of the current version ("Specs/HTML/1.2.1") doesn't helpfully differentiate it from other important specifications. If we're going to execute well on the rest of this roadmap, this step is really important.

> The renaming is to more accurately reflect the versioning proposal that was discussed in an RFC.

(/me looks at the current RFCs page and the archive) Which RFC? I wasn't able to find it in my cursory look. It would be very helpful to provide a link to the RFC from https://www.mediawiki.org/wiki/Specs (note, I commented on the talk page) about this.

I'm kinda surprised that we arrived at this way of referring to our specifications from an RFC; if I participated in the discussion, I'm disappointed in myself for not understanding the implications and making this point back then. https://www.mediawiki.org/wiki/Specs/HTML/1.2.1 is a really bad URI to describe what a potential reader should expect.

> There will be changes to the Parsoid HTML spec, and we will always point the main page to the latest version. But, the changes won't be radical changes. Most of the time, there will be tweaks and sometimes changes like bringing in audio / video support into the spec, or as you noted in point #4, we might need to clarify some areas.

Given that we're talking about making this the normative format that all Wikimedia wiki projects will use, the specification, the naming of the specification, and versioning of the specification (i.e. the history of changes) needs to be clearer than this, doesn't it?

> 1. Stabilize the Parsoid/MediaWiki DOM spec. Given the way the page moves were performed on this, it's difficult to gauge from the history just how stable the document is. Furthermore, the name of the document isn't stable, and the naming of the current version ("Specs/HTML/1.2.1") doesn't helpfully differentiate it from other important specifications. If we're going to execute well on the rest of this roadmap, this step is really important.

> The renaming is to more accurately reflect the versioning proposal that was discussed in an RFC.

> (/me looks at the current RFCs page and the archive) Which RFC? I wasn't able to find it in my cursory look. It would be very helpful to provide a link to the RFC from https://www.mediawiki.org/wiki/Specs (note, I commented on the talk page) about this.

> I'm kinda surprised that we arrived at this way of referring to our specifications from an RFC; if I participated in the discussion, I'm disappointed in myself for not understanding the implications and making this point back then. https://www.mediawiki.org/wiki/Specs/HTML/1.2.1 is a really bad URI to describe what a potential reader should expect.

Sorry, mine was a bit of a hasty response as I was running out. The RFC was about content negotiation and needing HTML versions, not about how the pages will be named. However, the naming of specs indirectly falls out since the profile URLs in the Accept headers need to exist.
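
To illustrate why the spec page names fall out of content negotiation, here is a minimal Python sketch of reading a versioned profile URL out of an Accept header. The header layout, the function names, and the semver-style compatibility rule are all assumptions made for illustration; only the `Specs/HTML/1.2.1` URL shape comes from this discussion.

```python
import re
from typing import Optional, Tuple

Version = Tuple[int, int, int]

# Assumed shape: profile="<base>/Specs/<name>/<major>.<minor>.<patch>"
PROFILE_RE = re.compile(
    r'profile="[^"]+/Specs/(?P<name>[\w-]+)/(?P<version>\d+\.\d+\.\d+)"'
)


def parse_profile(accept: str) -> Optional[Tuple[str, Version]]:
    """Return (spec name, version) from an Accept header, or None."""
    match = PROFILE_RE.search(accept)
    if match is None:
        return None
    major, minor, patch = (int(p) for p in match.group("version").split("."))
    return match.group("name"), (major, minor, patch)


def can_serve(requested: Version, available: Version) -> bool:
    # Assumed rule: same major version, and at least the requested version.
    return available[0] == requested[0] and available >= requested
```

Because a client names the spec by URL, each versioned page such as `Specs/HTML/1.2.1` has to exist at exactly that address, which is what constrains the naming.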

> There will be changes to the Parsoid HTML spec, and we will always point the main page to the latest version. But, the changes won't be radical changes. Most of the time, there will be tweaks and sometimes changes like bringing in audio / video support into the spec, or as you noted in point #4, we might need to clarify some areas.

> Given that we're talking about making this the normative format that all Wikimedia wiki projects will use, the specification, the naming of the specification, and versioning of the specification (i.e. the history of changes) needs to be clearer than this, doesn't it?

What would make the history of changes clearer? Right now, there is a 'Changes since ....' section at the top, but this can also be documented separately.

Thanks for the quick responses, Subbu! I understand this a lot better now, and I realize I need to spend more time with it to come fully up to speed.

I now understand the reason behind the relatively terse page names. I'm assuming that the URIs are part of the spec, so they need to be short and unambiguous (e.g., as XML namespaces generally need to be). There's a lot more thinking and studying I should do before going too much deeper in this conversation.

More inline:

> What would make the history of changes clearer? Right now, there is a 'Changes since ....' section at the top, but this can also be documented separately.

I'm not sure yet. The way that you're doing it makes a lot more sense to me than it used to. It's generally not going to make sense to someone browsing mediawiki.org, though. The relationship between these documents and their history aren't really explained:

  • Specs/HTML/1.2.0
  • Specs/HTML/1.2.1
  • Specs/data-parsoid/0.0.2
  • Specs/wikitext/1.0.0

It's possible to make an educated guess if you know the history of Parsoid, but educated guesses like this shouldn't be required.