This is a placeholder / stub for following up on E259 and
https://www.mediawiki.org/wiki/Requests_for_comment/A_Spec_For_Wikitext
| Status | Assigned | Task |
|---|---|---|
| Resolved | Qgil | T153007 Technical Collaboration annual plan FY2017-18 |
| Resolved | Qgil | T159313 Draft WMF annual plan program about technical events |
| Resolved | Qgil | T149300 Future of the Wikimedia Developer Summit |
| Resolved | Rfarrand | T153996 Wikimedia Developer Summit 2017: Feedback Survey |
| Resolved | Rfarrand | T141926 Wikimedia Developer Summit 2017 |
| Resolved | Qgil | T141938 Prepare a program for Wikimedia Developer Summit 2017 to effectively address current high level movement needs |
| Resolved | cscott | T147602 Facilitate Wikidev'17 main topic "Handling wiki content beyond plaintext" |
| Resolved | cscott | T151950 Wikitext 2.0 Session at Wikidev'17 |
| Declined | ssastry | T142803 [RFC] Spec for future wiki content processing model ("wikitext 2.0") |
TechCom discussed this briefly at E260. @tstarling is tentatively the shepherd for this, but isn't prepared to push for this. I'm more biased toward pushing for it, and may even be in a position to volunteer as the assignee. It seems that this is an example of where TechCom could clarify the difference between "owner" and "shepherd".
One reason why a spec seems important: do we have a good way of talking about the challenges posed by language variants (described by T113002) without one? With a spec/framework/something, we have a structure to talk about whether (and how) we can either deprecate the LanguageConverter features of wikitext (as described by T113002 option 1) or what we need to do to support LanguageConverter features in VisualEditor (T113002 option 2).
@ssastry, your thoughts on urgency/next steps?
My thoughts are that there needs to be a clearer articulation of the specific goals we want to shoot for, which can help identify what kind of spec is needed and why.
But, if I had to summarize my thinking today, it would be:
I like the above model because both of the following would be somewhat unfortunate:
Given that we cannot make drastic / radical changes today, I don't see a huge urgency here. But, I think we should definitely make forward progress on getting clarity over goals. I think that is our best chance of figuring out the why and what of a spec here.
Subbu and I discussed this yesterday and last week following E259. I'll continue to work on this with him; in particular, I will work on laying out the goals as I see them.
Here are a few of the goals as I see them for retroactively providing a spec for wikitext:
Subbu, you mentioned a couple of syntax definitions in E259: the PEG grammar and an EBNF. Are these the best references?
As Subbu points out, the tokenization is a small piece of the puzzle. We ultimately need to understand the author's intended HTML from any given wikitext. A model for a spec that seemed popular in the E259 discussion is an "executable spec" as defined in W3C's HTML5 Recommendation. The term "executable spec" was thrown around in E259 as if it was well-known jargon. Based on my very brief after-meeting research, I fear that the term means different things to different people. Regardless, @daniel proposed "an improved PEG grammar as proposed by Tim with lots of semi-formal comments" might be a sensible way to avoid a "complete" spec.
Toward the end of E259, Tim proposed "a reductionist spec, which does not necessarily precisely reflect legacy wikitext, would be more useful than a complete spec". A snapshot of the rendering is good enough for most things, though a recording of changes to wikitext byte strings is still critical to keep a record of the author's intent with each edit.
Tim mentioned Preprocessor ABNF.
- mw:PEG tokenizer (I just split that page from Parsoid/Internals)
This PEG tokenizer incorporates ideas from the ABNF above. Since this tokenizer is Parsoid-centric, a spec for wikitext 1.0 revisions would need a tokenizer that purges these Parsoid-specific eccentricities.
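To make the tokenizer idea concrete, here is a minimal Python sketch of a reductionist, best-faith tokenizer for just two constructs (`==...==` headings and `[[...]]` wikilinks). The token shapes and function name are invented for this sketch; they do not correspond to Parsoid's actual token types or to the real PEG grammar.

```python
import re

def tokenize(line):
    """Best-faith tokenization of a single line of wikitext.

    Recognizes only two constructs as an illustration: ==...== headings
    (the whole line) and [[target|label]] wikilinks; everything else
    falls through as a plain text token.
    """
    m = re.fullmatch(r'(={2,6})\s*(.*?)\s*\1', line)
    if m:
        # ('heading', level, text)
        return [('heading', len(m.group(1)), m.group(2))]
    tokens = []
    pos = 0
    for m in re.finditer(r'\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]', line):
        if m.start() > pos:
            tokens.append(('text', line[pos:m.start()]))
        # ('wikilink', target, label) -- label defaults to the target
        tokens.append(('wikilink', m.group(1), m.group(2) or m.group(1)))
        pos = m.end()
    if pos < len(line):
        tokens.append(('text', line[pos:]))
    return tokens
```

Note how even this toy version illustrates the "best-faith" caveat above: the heading rule fires on the whole line, so a later transformation stage could still demote or reinterpret the token.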
> As Subbu points out, the tokenization is a small piece of the puzzle.
Indeed. Also note that the tokenization is a "best-faith effort", i.e. there is no guarantee that these tokens will survive all the way to HTML generation in their original form. But, a large proportion of tokens will likely survive in their original form.
> We ultimately need to understand the author's intended HTML from any given wikitext. A model for a spec that seemed popular in the E259 discussion is an "executable spec" as defined in W3C's HTML5 Recommendation. The term "executable spec" was thrown around in E259 as if it was well-known jargon. Based on my very brief after-meeting research, I fear that the term means different things to different people. Regardless, @daniel proposed "an improved PEG grammar as proposed by Tim with lots of semi-formal comments" might be a sensible way to avoid a "complete" spec.

> Toward the end of E259, Tim proposed "a reductionist spec, which does not necessarily precisely reflect legacy wikitext, would be more useful than a complete spec". A snapshot of the rendering is good enough for most things, though a recording of changes to wikitext byte strings is still critical to keep a record of the author's intent with each edit.
An executable spec in our case would specify (using whatever formalism is at our disposal) how the tokens produced by the tokenizer would be transformed into an HTML string. Since the goal of this spec is merely to aid HTML rendering of old revisions (rather than aid visual editing, which is a much higher bar), it is possible to define a clear series of token transformations (Parsoid-style, as in Parsoid's internal architecture).
```
wikitext
  -> PEG tokenizer
  ---- top-level token stream ---->
  in-place expansion of templates (via the preprocessor)
  ---- fully expanded token stream ---->
  [ series of token transformations ]
  ---- HTML tag soup ---->
  HTML5 tree builder
  ---- DOM ---->
  (optional Tidy-compatibility passes)
  ---- DOM ---->
  XML/HTML serializer
  -> HTML output
```
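The staged pipeline above can be sketched as plain function composition. Every stage below is a deliberately trivial stub standing in for the real transformation; the point is only to show the shape an executable spec of this kind could take, not any actual Parsoid behavior.

```python
def run_pipeline(wikitext, stages):
    """Thread a value through an ordered list of pipeline stages."""
    value = wikitext
    for stage in stages:
        value = stage(value)
    return value

# Stub stages standing in for the real pipeline; each maps one
# intermediate representation to the next, as in the diagram above.
stages = [
    lambda wt: wt.split('\n'),                           # tokenizer (stub)
    lambda toks: toks,                                   # template expansion (stub)
    lambda toks: ['<p>%s</p>' % t for t in toks if t],   # token transformations (stub)
    lambda soup: ''.join(soup),                          # tree building + serialization (stub)
]
```

For example, `run_pipeline('a\n\nb', stages)` produces `'<p>a</p><p>b</p>'`. A real spec would replace each lambda with a precisely specified transformation over real token types.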
For example, in Parsoid, an FSM clearly defines how indent-pres are processed, with clearly specifiable transition actions. I think if we don't require full Parsoid functionality but only accurate rendering, most of the other transformations (handling quotes, lists, paragraphs, link handling) could be similarly defined. Some of these transformations (e.g. link rendering) would require wiki config + database state information, but that can be modelled as an API call.
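As a hedged illustration of such an FSM, here is a two-state sketch built on the basic indent-pre rule (a line beginning with a space opens or continues a preformatted block). This is a reductionist toy: real indent-pre handling interacts with tables, templates, and other block constructs, all of which are ignored here, and the function name is invented.

```python
def render_indent_pre(lines):
    """Two-state FSM over a list of lines.

    State 'in_pre' tracks whether we are inside a <pre> block.
    Transition actions: opening a block emits '<pre>', leaving one
    emits '</pre>', and indented lines are emitted with the leading
    space stripped.
    """
    out, in_pre = [], False
    for line in lines:
        if line.startswith(' '):
            if not in_pre:
                out.append('<pre>')   # transition: normal -> pre
                in_pre = True
            out.append(line[1:])      # strip the indent-pre marker space
        else:
            if in_pre:
                out.append('</pre>')  # transition: pre -> normal
                in_pre = False
            out.append(line)
    if in_pre:
        out.append('</pre>')          # close a block open at end of input
    return out
```

The appeal of writing the spec this way is that each transition action is small and individually checkable, which is exactly the property that makes the spec "executable".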
This is just a sample skeleton, since I've conveniently ignored extensions. If one sets aside extensions that rely on tag hooks into PHP parser events, you could fit extensions into this model by externalizing the responsibility for spec-ing the output to the extension itself.
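To sketch what "externalizing the responsibility for spec-ing the output" could look like: each extension declares, for its tag, a pure content-to-HTML function, and the core spec only has to say when that function is invoked, not what it produces. All names here are invented for the sketch; this is not MediaWiki's actual extension API.

```python
# Hypothetical registry mapping extension tag names to render functions.
extensions = {}

def register_extension(tag, render):
    """An extension registers a pure function from tag content to HTML.

    The wikitext spec then only specifies *when* this function is
    called during the pipeline; *what* it returns is the extension's
    own spec responsibility.
    """
    extensions[tag] = render

def expand_extension(tag, content):
    """Delegate an extension tag's content to its registered renderer."""
    render = extensions.get(tag)
    if render is None:
        return content  # unknown tag: pass content through untouched
    return render(content)

# Example: a trivial <poem>-like extension that preserves line breaks.
register_extension('poem', lambda text: text.replace('\n', '<br/>'))
```

Under this model, `expand_extension('poem', 'a\nb')` yields `'a<br/>b'` without the core pipeline knowing anything about poems.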
But, anyway, something like this would be an executable spec in that someone could follow these rules and code up a wikitext parser. It is possible to even come up with a reference implementation. But, if someone built a parser using this, it might be fairly slow and incur heavy I/O wait times. A high-performance implementation would not be the point of this reference spec. In addition, this would not be completely faithful to how MediaWiki would render the page, but it would likely be "good enough" (hence a simplified / reductive spec).
So, this is where I circle back to the goal question.
This spec would be "good enough" for someone wishing to parse wikitext and get reasonable-looking output, and for reasoning about it. It would also be "good enough" for someone wishing to render old revisions (as long as they have access to the MediaWiki API for the wiki where the page originated). If we did a good enough job of the spec by resorting to some simplifications for links, we might be able to reduce the reliance on the MediaWiki API to only fetching source wikitext for templates. At that point, this spec would even be good enough to make sense of wikitext and render it in some intelligible fashion even if MediaWiki itself rotted and became unmaintainable.
But, this would not be good enough to actually write a working parser, as an alternative implementation to Parsoid / the PHP parser, that would be usable in a production context. That is a far more formidable task and not necessarily a task worth undertaking, in my opinion.
This spec would not be good enough to render old revisions exactly as MediaWiki itself might render them, but it would be good enough to get a reasonable sense of the rendering of those revisions. If we expected MediaWiki / wikitext to morph significantly in the future, HTML storage (however that is envisioned) is a more reasonable option for old revisions, but this is a digression.
I suppose there are other points on this spec spectrum, but I think it is important to be a bit more precise and also a bit more realistic about goals for wikitext 1.0 revisions.
The situation would be quite different for wikitext 2.0, where we would try to iron out some of these problems and shortcomings so that alternate implementations are more feasible.
What got me started down the "must have a spec" path was conversations at Wikimedia-Developer-Summit-2016, particularly @cscott's discussions of the challenges that language variants cause with VisualEditor. I think one unifying goal of all of this is:
That may sound unrelated, but here's the relation:
In my mind, the definition of "the critical features of wikitext" is a spec. The three bullet points in my earlier comment (T142803#2569623) I think articulate the goals of a spec, but really, the litmus test can be "can we deploy VisualEditor to all Wikimedia wikis?"
That shouldn't be our singular goal. IETF RFC 7764 (Guidance on Markdown) is an interesting read for anyone implementing a wiki syntax. The people involved in writing that document represented a large number of employers, and the number of implementations considered is impressive. We should aspire to get more institutions interested in improving the infrastructure behind our markup.
Sounds good. Where we stand today, Parsoid does a fair bit of "fit a square peg in a round hole" business towards supporting VE already and you still cannot edit everything you want cleanly in VE. So, to me, your unifying goal is an argument to develop a spec we would like to evolve wikitext towards. Or, maybe I am hearing in your words what I want to hear / want you to say. ;-)
Anyway, I think at this point, it would be more useful to have something to dig our teeth into and go from there. Toward that end, I am going to develop these old ideas in https://www.mediawiki.org/wiki/User:SSastry_(WMF)/Notes/Wikitext. I have had a bunch of ideas brewing in my head since that writeup and, if nothing else, this is a good opportunity to put them to paper and see if they look as good on paper as they appear in my head. :-)
As promised, I put together a first draft at https://www.mediawiki.org/wiki/Parsing/Notes/Wikitext_2.0 which can be developed further / converted to an RFC, etc.
A few comments:
I'm proposing to include this in a merged session: T151950: Wikitext 2.0 Session at Wikidev'17.
We talked about specs at the Parsing Team Offsite in October 2016 and the following text broadly captures the outcome of our discussion:
Develop parser specifications to aid with MediaWiki output interoperability, extension development, template and extension authors, and compliance of (alternate) wikitext implementations. There are 4 different pieces to this:

1. Output spec: Parsoid DOM Spec (in place). We'll clean up and update the documentation to make it more friendly. This spec helps with interoperability of MediaWiki output.
2. Implementation compliance spec: Parser tests (in place). We'll clean up the test infrastructure and make it more usable and maintainable.
3. Implementation-neutral extension and parser API spec: To be developed (enables pluggability of parsers, and extensions to hook into the parser). This will help extension authors write extensions that can be supported in any implementation of a wikitext parser (vs. being tied to a specific parser's internals).
4. Language (wikitext) spec: To be developed in conjunction with evolving wikitext semantics. We will NOT attempt a spec for wikitext of today.
   - We will develop a base markup spec for wikitext markup
   - We will develop a DOM fragment composition spec that specifies how template and extension output will compose with the markup of a page
@ssastry is there still interest in writing a spec? TechCom will be more than happy to help write/review if so. Moving to TechCom-RFC backlog for now.
There is, as stated in T142803#2833509, but like everybody else, we are slammed with more work than we have time for as a small team. Realistically, if we wanted to get this done, we would have to dedicate at least a quarter to just this task (that would include cleaning up and updating our DOM spec, writing an html -> wt spec, cleaning up our parser test infrastructure so it is a better test spec rather than the mess it is right now - see T111604: Split parser tests into multiple files - and initiating work on the other pieces). But, yes, the TechCom-RFC backlog is the right place for this now.
I am going to mark this declined for now. We will probably revisit some of these discussions in the future based on these RFC / IRC sessions, wiki pages, and dev summit talks.