
[RFC] Spec for future wiki content processing model ("wikitext 2.0")
Closed, Declined (Public)

Description

This is a placeholder / stub for following up on E259 and

https://www.mediawiki.org/wiki/Requests_for_comment/A_Spec_For_Wikitext


Event Timeline

TechCom discussed this briefly at E260. @tstarling is tentatively the shepherd for this, but isn't prepared to push for this. I'm more biased toward pushing for it, and may even be in a position to volunteer to be the assignee. It seems that this is an example of where TechCom could clarify the difference between "owner" and "shepherd".

One reason why a spec seems important: do we have a good way of talking about the challenges posed by language variants (described by T113002) without this? With a spec/framework/something, we have a structure for talking about whether (and how) we can deprecate the LanguageConverter features of wikitext (as described by T113002 option 1), or what we need to do to support LanguageConverter features in VisualEditor (T113002 option 2).

@ssastry, your thoughts on urgency/next steps?

My thoughts are that there needs to be a clearer articulation of the specific goals we want to shoot for; that can help identify what kind of spec is needed and why.

But, if I had to summarize my thinking today, it would be:

  1. We need to move towards a simplified wikitext-processing/wiki-content semantics that retains all of the power of today's wikitext. I sense that there is some degree of agreement here.
  2. We need to find a way to future-proof wikitext semantics of old revisions in the face of the changes above. There are a few different schools of thought here from what I've gathered from the RFC discussion and the parsing-team internal discussion.
    • (a) use HTML storage to evade the problem, i.e. if a rendering for a page is available, why do we care about the wikitext semantics of, say, a 2004 revision?
    • (b) wikitext semantics haven't changed so dramatically since 2002/2003 that we couldn't use today's (or tomorrow's) wikitext parser to continue to render those old revisions with a high degree of faithfulness
    • (c) provide a simplified/reductive reference implementation of today's wikitext (which, to borrow from (b) above, is more or less the wikitext of 2002).
  3. I am increasingly drawn to the idea of (c) since it is independent of HTML storage, i.e. archival needs for rendering are separated from the historical need to actually examine / process a revision in its original source form.
  4. Separate from 1. and 2. above, we need a process / strategy for versioning our way from where we are today to the desired future spec / semantics.

I like the above model because both of the following would be somewhat unfortunate:

  • In the desire to be able to faithfully parse wikitext of all the existing old revisions and provide a single spec that covers all-old-and-future-wikitext, we get hamstrung in our ability to actually evolve the wiki content model.
  • In the desire to evolve the wiki content model, we lose the ability to be able to meaningfully parse / process historical revisions of a page.

Given that we cannot make drastic / radical changes today, I don't see a huge urgency here. But, I think we should definitely make forward progress on getting clarity about goals. I think that is our best chance of figuring out the why and what of a spec here.

RobLa-WMF added a subscriber: daniel.

Subbu and I discussed this yesterday and last week following E259. I'll continue to work on this with him; in particular, I will work on laying out the goals as I see them.

Here are a few of the goals, as I see them, for retroactively providing a spec for Wikitext:

  • Create a human-readable description of how wikitext will be interpreted by computers. Since Wikitext will be human-editable for the foreseeable future, we need to ensure the markup means the same thing to humans and computers, and to explain cases where a good implementation will do counterintuitive things with the markup.
  • Provide an explanation for when implementations differ. We already have many implementations of Wikitext (including MediaWiki's PHP parser and Parsoid), and the two most heavily-used WMF-maintained implementations differ in important ways. Let's help makers of other parser implementations have an easier time interoperating with our implementations.
  • Declare intractable differences between Wikitext implementations as "undefined behavior". With this, we could provide Wikitext authors much clearer guidance about which Wikitext constructs are difficult to implement, without taking the drastic step of breaking those constructs in our primary Wikitext implementation.

Subbu, you mentioned a couple of syntax definitions in E259: the PEG grammar and an EBNF. Are these the best references?

As Subbu points out, the tokenization is a small piece of the puzzle. We ultimately need to understand the author's intended HTML from any given wikitext. A model for a spec that seemed popular in the E259 discussion is an "executable spec" as defined in W3C's HTML5 Recommendation. The term "executable spec" was thrown around in E259 as if it was well-known jargon. Based on my very brief after-meeting research, I fear that the term means different things to different people. Regardless, @daniel proposed that "an improved PEG grammar as proposed by Tim with lots of semi-formal comments" might be a sensible way to avoid a "complete" spec.

Toward the end of E259, Tim proposed "a reductionist spec, which does not necessarily precisely reflect legacy wikitext, would be more useful than a complete spec". A snapshot of the rendering is good enough for most things, though a recording of changes to wikitext byte strings is still critical to keep a record of the author's intent with each edit.

> Subbu, you mentioned a couple of syntax definitions in E259: the PEG grammar and an EBNF. Are these the best references?

Tim mentioned Preprocessor ABNF.

This PEG tokenizer incorporates ideas from the ABNF above. Since this tokenizer is Parsoid-centric, a spec for wikitext 1.0 revisions would need to include a tokenizer that purges these eccentricities.

> As Subbu points out, the tokenization is a small piece of the puzzle.

Indeed. Also note that the tokenization is a "best-faith effort", i.e. there is no guarantee that these tokens will survive all the way to HTML generation in their original form. But, a large proportion of tokens will likely survive in their original form.

> We ultimately need to understand the author's intended HTML from any given wikitext. A model for a spec that seemed popular in the E259 discussion is an "executable spec" as defined in W3C's HTML5 Recommendation. The term "executable spec" was thrown around in E259 as if it was well-known jargon. Based on my very brief after-meeting research, I fear that the term means different things to different people. Regardless, @daniel proposed that "an improved PEG grammar as proposed by Tim with lots of semi-formal comments" might be a sensible way to avoid a "complete" spec.

> Toward the end of E259, Tim proposed "a reductionist spec, which does not necessarily precisely reflect legacy wikitext, would be more useful than a complete spec". A snapshot of the rendering is good enough for most things, though a recording of changes to wikitext byte strings is still critical to keep a record of the author's intent with each edit.

An executable spec in our case would specify (using whatever formalism is at our disposal) how the tokens produced by the tokenizer would be transformed into an HTML string. Since the goal of this spec is merely to aid HTML rendering of old revisions (rather than to aid visual editing, which is a much higher bar), it is possible to define a clear series of token transformations (Parsoid-style, as in Parsoid's internal architecture).

wikitext -> PEG tokenizer
         ---- top-level token stream ----> in-place expansion of templates (via the preprocessor)
         ---- fully expanded token stream ----> [ Series of transformations ] 
         ---- HTML tag soup ----> HTML5 Tree Builder 
         ---- DOM ---->  (optional Tidy-compatibility passes)
         ---- DOM ----> XML/HTML-serializer
         -> HTML output
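
A rough TypeScript sketch of how those stages might compose, just to make the shape of the spec concrete; the interface and function names are invented for illustration and are not Parsoid's actual internals:

```
// Hypothetical stage signatures for the reductive pipeline sketched above.
type Token = { type: string; value: string };

interface WikitextRenderer {
  tokenize(wikitext: string): Token[];                 // PEG tokenizer
  expandTemplates(tokens: Token[]): Promise<Token[]>;  // preprocessor; may need the MediaWiki API
  transformTokens(tokens: Token[]): Token[];           // quotes, lists, indent-pre, links, ...
  buildTree(tagSoup: Token[]): Document;               // HTML5 tree-building algorithm
  serialize(dom: Document): string;                    // (optional Tidy passes +) XML/HTML serialization
}

async function render(spec: WikitextRenderer, wikitext: string): Promise<string> {
  const expanded = await spec.expandTemplates(spec.tokenize(wikitext));
  return spec.serialize(spec.buildTree(spec.transformTokens(expanded)));
}
```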

For example, in Parsoid, an FSM clearly defines how indent-pres are processed, and the transition actions can be clearly specified. I think that if we don't require Parsoid functionality but only accurate rendering, most of the other transformations (handling quotes, lists, paragraphs, link handling) could be similarly defined. Some of these transformations (ex: link rendering) would require wiki config + database state information, but that can be modelled as an API call.
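
To illustrate with indent-pre, a heavily simplified, hypothetical version of such an FSM might look like the sketch below. This is not Parsoid's actual code and it ignores the real complications (blank lines, block-level HTML, tables, template boundaries), but it shows how the transition actions can be pinned down precisely:

```
// Two-state machine: "normal" vs. "in-pre". A non-blank line whose first
// character is a space opens (or continues) a <pre>; any other line closes it.
function renderIndentPre(lines: string[]): string {
  let out = "";
  let inPre = false;
  for (const line of lines) {
    if (line.startsWith(" ") && line.trim() !== "") {
      out += (inPre ? "\n" : "<pre>") + line.slice(1);
      inPre = true;
    } else {
      if (inPre) { out += "</pre>\n"; inPre = false; }
      out += line + "\n";
    }
  }
  if (inPre) out += "</pre>\n";
  return out;
}
```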

The pipeline above is just a sample skeleton since I've conveniently ignored extensions. If you ignore extensions that rely on tag hooks into PHP parser events, you could fit extensions into this model by externalizing the responsibility for spec-ing the output to the extension itself.

But, anyway, something like this would be an executable spec in that someone could follow these rules and code up a wikitext parser. It would even be possible to come up with a reference implementation. But, if someone built a parser using this, it might be fairly slow and incur heavy I/O wait times. A high-performance implementation would not be the point of this reference spec. In addition, this would not be completely faithful to how MediaWiki would render the page, but it would likely be "good enough" (hence a simplified / reductive spec).

So, this is where I circle back to the goal question.

This spec would be "good enough" for someone wishing to parse wikitext and get reasonable-looking output, and for reasoning about it. It would also be "good enough" for someone wishing to render old revisions (as long as they have access to the MediaWiki API for the wiki where the page originated). If we did a good enough job of the spec by resorting to some simplifications for links, we might be able to reduce the reliance on the MediaWiki API to fetching only the source wikitext of templates. At that point, this spec would even be good enough to make sense of wikitext and render it in some intelligible fashion even if MediaWiki itself rotted and became unmaintainable.
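
For concreteness, fetching the source wikitext of a template is a single Action API request, along these lines (the parameters are the standard action=query / prop=revisions ones; the template title is made up, and newer MediaWiki versions prefer adding rvslots=main):

```
// Fetch the current wikitext of a template from a wiki's Action API.
async function fetchTemplateSource(apiBase: string, title: string): Promise<string> {
  const params = new URLSearchParams({
    action: "query",
    prop: "revisions",
    rvprop: "content",
    titles: title,
    format: "json",
    formatversion: "2",
  });
  const res = await fetch(`${apiBase}?${params}`);
  const data = await res.json();
  return data.query.pages[0].revisions[0].content;
}

// e.g. fetchTemplateSource("https://en.wikipedia.org/w/api.php", "Template:Example")
```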

But, such a spec would not be good enough to actually write a working parser as an alternative implementation of Parsoid / the PHP parser that would be usable in a production context. That is a far more formidable task and not necessarily a task worth undertaking, in my opinion.

This spec would also not be good enough to render old revisions exactly as MediaWiki itself might render them, but it would be good enough to get a reasonable sense of the rendering of those revisions. If we expected MediaWiki / wikitext to morph significantly in the future, HTML storage (however that is envisioned) is a more reasonable option for old revisions, but this is a digression.

I suppose there are other points on this spec spectrum, but I think it is important to be a bit more precise and also a bit more realistic about goals for wikitext 1.0 revisions.

The situation would be quite different for wikitext 2.0, where we would try to iron out some of these problems and shortcomings so that alternate implementations are more feasible.

What got me started down the "must have a spec" path was conversations at Wikimedia-Developer-Summit-2016, particularly @cscott's discussions of the challenges that language variants cause with VisualEditor. I think one unifying goal of all of this is:

  • Deploy VisualEditor to all Wikimedia wikis

That may sound unrelated, but here's the relation:

  • English (and many other western European languages) are well supported by VisualEditor, but complicated languages (e.g. those relying on language variants) and other i18n features (e.g. Translate) are difficult to support.
  • In order to fully support all wikis with VisualEditor, we need it to support all of the critical features of Wikitext.
  • In order to support "all of the critical features of Wikitext", we need to define what those are.
    • Is the "translate" tag (T131516) a critical feature of Wikitext?
    • Is language variant conversion (T43716) a critical feature?
    • Are the above features essential, or can the requirements be satisfied some other way?
    • Do we need to replace these features with better alternatives?

In my mind, the definition of "the critical features of Wikitext" is a spec. I think the three bullet points in my earlier comment (T142803#2569623) articulate the goals of a spec, but really, the litmus test can be "can we deploy VisualEditor to all Wikimedia wikis?"

That shouldn't be our singular goal. IETF RFC 7764 (Guidance on Markdown) is an interesting read for anyone implementing a wiki syntax. The people involved in writing that document came from a wide range of employers, and the number of implementations it considers is impressive. We should aspire to get more institutions interested in improving the infrastructure behind our markup.

Sounds good. Where we stand today, Parsoid already does a fair bit of "fit a square peg into a round hole" business to support VE, and you still cannot edit everything you want cleanly in VE. So, to me, your unifying goal is an argument to develop a spec we would like to evolve wikitext towards. Or, maybe I am hearing in your words what I want to hear / want you to say. ;-)

Anyway, I think at this point it would be more useful to have something to dig our teeth into and go from there. Towards that end, I am going to develop these old ideas in https://www.mediawiki.org/wiki/User:SSastry_(WMF)/Notes/Wikitext. I have had a bunch of ideas brewing in my head since that writeup, and if nothing else, this is a good opportunity to put them to paper and see if they look as good on paper as they appear in my head. :-)

As promised, I put together a first draft at https://www.mediawiki.org/wiki/Parsing/Notes/Wikitext_2.0 which can be developed further / converted to an RFC, etc.

ArchCom discussed this a couple of weeks ago in E268, and I agreed to shepherd it.

A few comments:

  • A spec for "wikitext as it is parsed by the PHP parser today" may be infeasible. There is a lot of corner-case behavior due to artifacts in the implementation (cf T54661, doBlockLevels issues, etc) which we probably would consider bugs, not something to be faithfully duplicated by all. *However* some of the content in our projects undoubtedly takes advantage of (or works around) these corner cases, so if "write a spec suitable for future archivists" is your goal, this is where you have to start.
  • A spec for "wikitext as it is parsed by Parsoid today" is a little better; as @ssastry notes above, Parsoid's clean split into tokenizer and parser removes some of the corner cases where bugs hide in the PHP implementation. Unfortunately, Parsoid's tokenizer grammar (necessarily) contains a large amount of executable code; it is pretty far from a "pure grammar". Further, the Parsoid implementation has been steadily growing in complexity over the past three years to be more-and-more bug-compatible with PHP. This may not be a good thing. But it does mean that this would be closer to a "spec suitable for future archivists" if you are willing to deal with some minor breakage in corners of the archived wiki, and there are certainly parts of Parsoid which are more-formally-specified than the current PHP implementation.
    • A subtask here within the Parsoid team is to explore extending the tool we use to build the tokenizer (pegjs) to make it more powerful/efficient and allow us to reduce the amount of executable code in our tokenizer specification. This would make our tokenizer less ad hoc and closer to a formal specification.
  • To encourage wider adoption of wikitext markup (if that is our goal), what you want is probably something different: a spec for "wikitext as it ought to be". This would preserve features of wikitext considered important for human editors (perhaps this would include "single quote sequences for boldface and italics", even though this causes fits for formal grammars) while omitting or simplifying others (indent-pre behavior, doBlockLevels, template mechanisms, T14974). A project like Parsoid (call it "Parsoid2.0") could be written which uses our DOM spec as an intermediary to translate losslessly between "traditional wikitext" and "wikitext 2.0", and the traditional "wikitext" editor could be switched to wikitext 2.0. The goal would be 99% byte-for-byte compatibility with traditional wikitext, while allowing a concise and accurate formal specification. (For example, no executable code in the tokenizer!)
  • An offshoot here may be to embrace the MediaWiki DOM Spec as the formal standard. This may involve locking it down and/or building a formal extension mechanism to accommodate future changes. In some sense this is what we have been doing historically -- letting archive.org store the rendered HTML -- but elevated to an editable standard that preserves all the source information. (For archival purposes, and if the MediaWiki DOM was stored as a native format in our DB, we would want to 'unexpand' templates in the DOM to be closer to the original sources.) We would encourage 3rd-party interoperability with our DOM, not with wikitext itself, although we'd provide tools (like Parsoid) to round-trip from DOM to wikitext if 3rd parties wanted to provide wikitext editing.
    • As @RobLa-WMF notes above, we should also formally specify the DOM rendering of common extensions. We are well on the way toward doing this, with a facility that allows extensions to register their own Parsoid hooks and content models, and existing implementations for the Cite, LST, and Translate extensions (with Gallery in progress). Of these, only Cite and Gallery are included in the formal DOM spec at this time. (In theory, only the extension mechanism should be in the main DOM spec, and each extension would then be able to independently document the contents of their DOM nodes.)
  • Finally, the template mechanism is somewhat orthogonal to the spec, and the PHP parser has a rough division into "preprocessor" and "parser" to reflect this. However in practice templates and wikitext are tightly interdependent (cf T146304). PHP sometimes allows token concatenation across template boundaries. Parsoid never allows this, but does allow template contents to leak out of tree nodes. The {{#balance}} tag is an attempt at enforcing a stricter separation on the output. I think there is consensus that we would *like* the template mechanism to be strictly modular both on the input wikitext token side *and* on the output DOM side.
    • The presence of executable Lua code in templates adds additional complexity to a formal specification of the template mechanism.
    • If we decide to embrace the MediaWiki DOM spec, then we should reformulate the template mechanism (or a new alternate template mechanism, like T114454 or gwicke's angularjs-like proposal) in terms of DOM input and output. Clean separation probably comes "for free" when specifying the transformations on trees (a rough sketch follows this list).
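
A rough, purely hypothetical sketch of what a DOM-in/DOM-out template contract could look like (none of this is an existing API; it only illustrates why balancedness comes for free when the unit of composition is a tree):

```
// A template takes already-parsed argument fragments and returns a single
// well-formed fragment, so its output cannot leak into sibling nodes.
type TemplateArgs = Map<string, DocumentFragment>;

interface DomTemplate {
  name: string;
  expand(args: TemplateArgs, doc: Document): DocumentFragment;
}

// Transclusion then reduces to an ordinary tree operation.
function transclude(target: Element, tpl: DomTemplate, args: TemplateArgs): void {
  const frag = tpl.expand(args, target.ownerDocument!);
  target.replaceWith(frag);  // structural balance is guaranteed by construction
}
```
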
ssastry renamed this task from [RFC] STUB: Creating a spec for legacy wikitext / future wiki content processing model to [RFC] Spec for future wiki content processing model ("wikitext 2.0"). Nov 29 2016, 11:21 PM

We talked about specs at the Parsing Team Offsite in October 2016 and the following text broadly captures the outcome of our discussion:

Develop parser specifications to aid with MediaWiki output interoperability, extension development, template and extension authors, and compliance of (alternate) wikitext implementations.

There are 4 different pieces to this:

1. Output spec: the Parsoid DOM Spec (in place). We'll clean up and update the documentation to make it friendlier; even as it stands, this spec helps with interoperability of MediaWiki output. (A rough illustration of the transclusion markup it defines follows this list.)

2. Implementation compliance spec: Parser tests (in place). We'll clean up the test infrastructure and make it more usable and maintainable.

3. Implementation-neutral extension and Parser API spec: To be developed (enables pluggability of parsers, and lets extensions hook into the parser). This will help extension authors write extensions that can be supported in any implementation of a wikitext parser (vs. being tied to a specific parser's internals).

4. Language (wikitext) spec: To be developed in conjunction with evolving wikitext semantics. We will NOT attempt a spec for wikitext of today.
   * We will develop a base spec for the wikitext markup itself
   * We will develop a DOM fragment composition spec that specifies how template and extension output will compose with the markup of a page
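
To give a flavour of what the output spec in (1) already pins down: transclusion output is marked up with typeof="mw:Transclusion" and a data-mw attribute recording the template call. The sketch below is an approximate TypeScript rendering of that documented JSON shape, not a normative definition:

```
// Approximate shape of the data-mw annotation on transclusion output.
interface TemplateTarget { wt: string; href?: string }   // e.g. { wt: "echo", href: "./Template:Echo" }
interface TemplateParam  { wt?: string; html?: string }  // parameter value as wikitext and/or HTML
interface TransclusionPart {
  template: { target: TemplateTarget; params: Record<string, TemplateParam>; i: number };
}
interface DataMwTransclusion {
  parts: (TransclusionPart | string)[];  // plain strings are literal wikitext between transclusions
}
```
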
kchapman subscribed.

@ssastry is there still interest in writing a spec? TechCom will be more than happy to help write/review if so. Moving to TechCom-RFC backlog for now.

> @ssastry is there still interest in writing a spec? TechCom will be more than happy to help write/review if so. Moving to TechCom-RFC backlog for now.

There is, as stated in T142803#2833509, but like everybody else, we are slammed with more work than we have time for as a small team. Realistically, if we wanted to get this done, we would have to dedicate at least a quarter to just this task (that would include cleaning up and updating our DOM spec, writing an HTML -> wikitext spec, cleaning up our parser test infrastructure so it is a better test spec rather than the mess it is right now - see T111604: Split parser tests into multiple files - and initiating work on the other pieces). But, yes, the TechCom-RFC backlog is the right place for this now.

I am going to mark this declined for now. We will probably revisit some of these discussions in the future, building on these RFC / IRC sessions, wiki pages, and dev summit talks.