ArchCom-RFC-2016W32-irc-E259.txt

Authored by • RobLa-WMF on Aug 10 2016, 10:02 PM.

Referenced Files

	F4351801: ArchCom-RFC-2016W32-irc-E259.txt
	Aug 10 2016, 10:02 PM

	21:00:09 <Krinkle> #startmeeting ArchCom RFC Meeting W32: A spec for Wikitext
	21:00:11 <wm-labs-meetbot`> Meeting started Wed Aug 10 21:00:09 2016 UTC and is due to finish in 60 minutes. The chair is Krinkle. Information about MeetBot at http://wiki.debian.org/MeetBot.
	21:00:11 <wm-labs-meetbot`> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
	21:00:11 <wm-labs-meetbot`> The meeting name has been set to 'archcom_rfc_meeting_w32__a_spec_for_wikitext'
	21:00:34 <Krinkle> https://phabricator.wikimedia.org/E259
	21:00:48 <Krinkle> #topic Please note: Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE) | Logs: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/
	21:01:01 <legoktm> o/
	21:01:06 <Krinkle> #link https://phabricator.wikimedia.org/E259
	21:01:09 <Krinkle> Hey all
	21:01:24 <robla> o/
	21:01:42 <SMalyshev> hey
	21:01:44 * Krinkle has his first time meetbot experience
	21:01:46 <robla> thanks for chairing Krinkle!
	21:02:21 <subbu> o/
	21:03:09 <Scott_WUaS> (Yay Krinkle :)
	21:03:15 <Krinkle> The main topic for today will be about whether and how we'll proceed with the specification of wikitext. This follows after an essay Subbu wrote on mediawiki.org and a wikitech-l thread.
	21:03:17 <Krinkle> #link https://www.mediawiki.org/wiki/Parsing/Notes/A_Spec_For_Wikitext
	21:03:23 <Krinkle> #link https://lists.wikimedia.org/pipermail/wikitech-l/2016-August/086200.html
	21:04:27 <robla> so, should we have a spec? I say "yes"
	21:05:06 <subbu> as i have argued in that essay, "whether we want a spec cannot be separated from the question of what the goals are of wanting a spec".
	21:05:14 <subbu> and there are different specs for different needs / audiences.
	21:05:15 <SMalyshev> surely we should. It'd be quite time-consuming task though
	21:05:23 <brion> A spec for what, in what context :)
	21:05:55 <Krinkle> One of the problems people would like to see solved in this area is to be able to confidently interact with older content. The status quo is that things change with time, and that rendering is somewhat unpredictable for older revisions (expansion of templates and external links was mentioned).
	21:06:25 <gwicke> specifically, older content stored in wikitext
	21:06:29 <Krinkle> # subbu quotes from essay "whether we want a spec cannot be separated from the question of what the goals are of wanting a spec"
	21:06:33 <Krinkle> #info subbu quotes from essay "whether we want a spec cannot be separated from the question of what the goals are of wanting a spec"
	21:06:39 <brion> We could for instance specify "legacy" Wikitext as it exists, but that doesn't make it an easy system. It'll still be complex
	21:06:49 <robla> a spec is also helpful to make sure that the markup means the same thing to humans and computers
	21:06:58 <SMalyshev> I'd say at least a narrative spec from which a reasonably knowledgeable programmer would be able to implement a parser which renders at least 90% (arbitrary high percentage here) of wikitext properly
	21:07:08 <brion> I suspect we need that for the context of interacting with old archival content, even if what we use for future content changes
	21:07:33 <gwicke> SMalyshev: we pretty much have that already, in the form of a PEG grammar
	21:07:47 <subbu> gwicke, well ... tokenizer, you mean.
	21:08:14 <TimStarling> specifying "legacy" wikitext does not necessarily meet subbu's goals, e.g. "Ease implementation of tools and libraries that need to operate with wikitext directly"
	21:08:15 <tgr> would the spec include stuff like "call the Lua compiler version X with parameters Y"?
	21:08:19 <James_F> Also the concept of "what is wikitext" differs by installation. Are the core parser functions part of the spec? What about the others? What about <gallery> syntax? Etc. Each change we make is generally breaking for people that use it.
	21:08:23 <gwicke> it covers the "easy" parts SMalyshev mentioned
	21:08:24 <SMalyshev> gwicke: PEG grammar alone doesn't seem enough for me... it specifies what can be said, but not what it means?
	21:08:25 <subbu> gwicke, there is no guarantee those tokens will render as they are tokenized.
	21:08:26 <robla> a spec can also help us comment those areas where computers will do counterintuitive things with the markup
	21:08:37 <tgr> if it would, how is that different from "call the MediaWiki parser version X with parameters Y"?
	21:08:43 <Krinkle> #info <tgr> would the spec include stuff like "call the Lua compiler version X with parameters Y"?
	21:08:59 <subbu> gwicke, right. it is useful for sure.
	21:09:11 <Krinkle> This is an interesting end use-case. How to deal with underlying dependencies. Same goes for syntax inside extension tags.
	21:09:33 <brion> I think for future facing needs it's more useful to have a clean document model
	21:09:35 <SMalyshev> tgr: I think "call the Lua compiler" is not what most people mean by spec...
	21:09:49 <legoktm> would extensions that add parser functions / tags have their own specs?
	21:10:01 <brion> For instance if layouts for tables etc were more structured and separate from the tokens for content
	21:10:11 <tgr> SMalyshev: we do rely heavily on Lua scripts, how would you specify around that then?
	21:10:17 <gwicke> I think it's pretty clear that a spec that would fully address the archival use case would be extremely expensive & probably harder to read than an actual implementation
	21:10:19 <subbu> brion, yup .. to me, as well, the most useful argument for a spec is for future needs, i.e. clean up the processing / document model.
	21:10:22 <brion> Cleaner interfaces between components such as templates and Lua modules
	21:10:27 <robla> SMalyshev: it's a reasonable spec-writing crutch to refer to other specs, even if those specs aren't very well specified
	21:10:34 <DanielK_WMDE> legoktm: ideally, yes. at least, they can't really be covered by the main spec...
	21:10:53 <SMalyshev> tgr: have docs for lua scripts... but I think we need to be declarative, not procedural, here
	21:10:57 <Krinkle> We could consider extension behaviour outside the scope of the spec. Extensions would need to either maintain indefinite compatibility, or somehow have required attributes for versioning (e.g. <foo version="">) and then decide to keep older parsers or to transform it somehow internally.
	21:11:00 <subbu> gwicke, +1
	21:11:35 <brion> One thing to consider with extensions is having a clean enough interface that we can actually do that :)
	21:11:42 <Krinkle> #info <gwicke> I think it's pretty clear that a spec that would fully address the archival use case would be extremely expensive & probably harder to read than an actual implementation
	21:11:45 <subbu> i think it might be useful to address the first question: why a spec? before going into discussions about what kind of spec maybe.
	21:11:46 <robla> gwicke: subbu : do you think the implementation should be the spec, then?
	21:11:50 <James_F> Krinkle: Including inside-core 'extension' tags like <gallery>?
	21:11:54 <James_F> Yeah.
	21:11:59 <tgr> SMalyshev: wikinews uses a Lisp interpreter written in Lua for some of its templates
	21:12:00 <brion> For instance the data model of the text passed into a ref tag should be known
	21:12:05 <tgr> have fun documenting that
	21:12:06 <Krinkle> Essentially yes.
	21:12:14 <tgr> as gwicke said, it's just not a realistic goal
	21:12:15 <brion> But the data model of a classic gallery tag is a distinct domain specific language
	21:12:35 <robla> tgr: it's not realistic to invest in if we don't believe the data is very important
	21:12:38 <Krinkle> But I think it's more important for the spec to detail what the impact is of returned html from <gallery> with the rest of the content, more important than the syntax of the text inside <gallery>
	21:12:40 <subbu> robla, for legacy wikitext, i think that is the unfortunate reality.
	21:12:56 <Krinkle> #info <brion> But the data model of a classic <gallery> tag is a distinct domain specific language
	21:12:57 <gwicke> robla: that would imply that a spec would be the only way to preserve data
	21:12:57 <brion> That's something our current system doesn't grok, so we lose the ability to refer to the contents of the ref
	21:13:43 <robla> gwicke: no it doesn't. we would still have our implementation of the spec
	21:13:52 <gwicke> it's far from clear that a spec is a viable way to do that at all, and even less so that it is the only viable way
	21:14:05 <legoktm> I don't think focusing on a spec as the way to solve the old wikitext no longer works properly is a good way to frame this discussion - we should focus on the reasons we need a spec and that it could possibly help with that problem
	21:14:31 <Krinkle> One radical idea would be consider our shortcuts (extensions, templates) merely a way to provide input and create a revision, rather than being the revision itself. Essentially producing something in between that is still minimal and canonical but not end-user/localised/skinned.
	21:14:48 <subbu> I propose we first address the question: why a spec, i.e. what are the goals for writing a spec.
	21:14:50 <Krinkle> that would move expansion out of scope somewhat. (e.g. does ~~~~ need a spec?)
	21:15:13 <Krinkle> #info <subbu> I propose we first address the question: why a spec, i.e. what are the goals for writing a spec.
	21:15:39 * gwicke is curious
	21:15:41 <robla> goal for writing a spec: have a human readable description of how wikitext will be interpreted by computers
	21:15:57 <brion> Goal: to have a consistent way to interpret, edit, and display Wikitext from any era after the spec epoch consistently (? Sample)
	21:16:17 <DanielK_WMDE> some reasons to have a spec: allows a canonical test suite, allows transformation into other markup formats, enables multiple alternative parsers/processors (and editors)
	21:16:35 <robla> brion: I suppose the example I keep trotting out is ANSI C. Very complicated, but very worthwhile
	21:16:35 * tgr finds the MediaWiki parser more human-readable than twenty pages of ABNF
	21:16:36 <DanielK_WMDE> a spec would need to define the semantics of each syntactical element
	21:16:37 <gwicke> DanielK_WMDE: we already have a test suite
	21:16:42 <DanielK_WMDE> just an accepting grammar would be useless
	21:16:58 <gwicke> brion: does your proposal require a spec, or an implementation?
	21:17:00 <TimStarling> subbu's idea of an executable (HTML5-like) spec is interesting to me
	21:17:01 <DanielK_WMDE> gwicke: sure. so?
	21:17:09 <arlolra> maybe there's a parallel to be drawn between wikitext and js's don't break the web model
	21:17:27 <brion> gwicke: depends on what we're specing
	21:17:29 <subbu> goal: write a future-looking spec to evolve the wikitext language and processing model .. which doesn't help with old wikitext of course.
	21:17:31 <DanielK_WMDE> gwicke: is it canonical? in the way that we say if the software breaks a test, this MUST be because the software is wrong?
	21:17:43 <brion> Are we specing Wikitext the character sequence
	21:17:44 <robla> tgr: for C programming, do you think that compilers are easier to read than the ANSI C spec?
	21:17:48 <brion> Or the document model?
	21:17:49 <DanielK_WMDE> i often end up fixing parser tests, not code... because the tests make assumptions, or rely on unspecified behavior
	21:17:52 <gwicke> DanielK_WMDE: pretty much, yes
	21:17:59 <gwicke> considering the amount of stored, existing content
	21:18:07 <subbu> TimStarling, yes .. that is one way html authors probably addressed the problem of old html out there and html compatibility.
	21:18:12 <SMalyshev> brion: both I'd say. Character sequence is probably the easier part :)
	21:18:15 <brion> A document model spec gives us what we and others need to transform our parsed documents into other formats etc
	21:18:19 <TimStarling> for some purposes, it would be nice to be reductionist, write grammars etc., but for archiving and ease of reimplementation it makes more sense to be complete
	21:18:27 <subbu> which is probably not dissimilar to the current problem we have.
	21:18:43 <TimStarling> and that really means specifying algorithms
	21:18:43 <brion> Even if we only have one implementation of the tokenizer/parser
	21:18:43 <DanielK_WMDE> gwicke: the best spec is the stored, existing content, then. if you break it, you are doing it wrong...
	21:18:47 <Krinkle> TimStarling: subbu: an executable spec, would that mean expansion is part of the model, or left to producers of content (e.g. notion of a "template" could be in the data attributes, but not required for consumers to understand)
	21:18:55 <gwicke> that's how tests are set up, and parsoid was tested
	21:19:08 <robla> another goal: interoperability between multiple implementations
	21:19:08 <tgr> robla: no clue about that, but there are several magnitudes in difference between the resources behind C and wikitext, and also the potential userbase, so I don't think it's a useful comparison
	21:19:22 <SMalyshev> DanielK_WMDE: the problem is it's not a constructive spec :) It doesn't give you a way to do it right, only tells you when it's wrong
	21:19:25 <DanielK_WMDE> robla: hell yea
	21:19:32 <subbu> so, let me reframe my goal: an executable spec for "old / legacy wikitext" + clean wikitext processing model as a spec for wikitext 2.0
	21:19:41 <subbu> the former lets you deal with old content.
	21:19:43 <brion> :)
	21:19:50 <subbu> the latter lets you clean up wikitext and move forward. :)
	21:19:58 <DanielK_WMDE> SMalyshev: many things in life are like that ;)
	21:19:58 <brion> I like
	21:20:04 <Krinkle> #info <DanielK_WMDE> just an accepting grammar would be useless
	21:20:14 <robla> subbu: is there an example of a really good executable spec?
	21:20:14 <subbu> DanielK_WMDE, i agree reg. an accept grammar.
	21:20:17 <subbu> html5
	21:20:24 <TimStarling> Krinkle: I think expansion would have to be fully part of the executable spec, including what version of Lua to use etc.
	21:20:37 <brion> Subbu do you imagine a common document model attached to both old and new grammars?
	21:20:40 <DanielK_WMDE> well, "yes" is an accepting grammar
	21:20:47 <DanielK_WMDE> since all text is valid wikitext
	21:20:50 <brion> Heh
	21:20:55 <Krinkle> #info <brion> Are we specing Wikitext the character sequence <brion> Or the document model?
	21:21:03 <TimStarling> this is what subbu means by an executable spec: https://www.w3.org/TR/html5/syntax.html
	21:21:06 * subbu is trying to find a link to the html5 tree building algo.
	21:21:12 <subbu> oh, there TimStarling posted it
	21:21:16 <TimStarling> it's a natural language description of an algorithm
	21:21:25 * robla looks at the link
	21:21:33 <gwicke> many have made the case that wikitext should be treated as a textual UI & not as a storage format
	21:21:52 * robla sees a lot of English in that ;-)
	21:21:53 <gwicke> I tend to agree, it's not a good storage format
	21:21:57 <DanielK_WMDE> i think we need both: characters -> AST -> semantics. We could rely on the HTML spec for some parts of the AST and semantics.
	21:22:18 <subbu> brion, i didn't understand your qn. reg. common document model
	21:22:22 <tgr> and then the spec would throw up its hands any time an extension is involved? what is the use case for that?
	21:22:29 <brion> gwicke: yes, though our main alternative now is HTML which is at the wrong abstraction level :)
	21:22:41 <Krinkle> #info <subbu> let me reframe my goal: an executable spec for "old / legacy wikitext" + clean wikitext processing model as a spec for wikitext 2.0. the former lets you deal with old content.
	21:22:41 <gwicke> brion: is it?
	21:22:46 <tgr> to be able to write an alternative parser that absolutely cannot handle real wikitext as it appears on Wikipedia?
	21:22:48 <Krinkle> #info <subbu> the latter lets you clean up wikitext and move forward.
	21:23:02 <gwicke> brion: which issues do you see in the Parsoid DOM spec?
	21:23:10 <Scott_WUaS> (If there's a choice in spec development between archiving a lot/all and archiving some, I'd be for the former, with some limitations).
	21:23:13 <brion> subbu: I'm thinking like, would old and new parser rules end up creating compatible in memory object representations that could be transformed to one another
	21:23:20 <DanielK_WMDE> robla: "Script data double escaped less-than sign state" <-- do we really want to go there?
	21:23:43 <Krinkle> #info <gwicke> many have made the case that wikitext should be treated as a textual UI & not as a storage format. <gwicke> I tend to agree, it's not a good storage format
	21:23:48 <subbu> brion, ah .. i see. well, parsoid is an example implementation that can bridge between the two.
	21:23:52 <Platonides> I don't think separating old/new wikitext format is workable
	21:24:07 <subbu> Platonides, content-handler.
	21:24:07 <brion> gwicke: in general HTML has too much low level detail: images list several distinct URLs, you have lots of presentation markup, etc. nothing wrong with Parsoid but it makes it inefficient and easy to change in ways that will be weird
	21:24:13 <Platonides> lots of old-page calling new-template
	21:24:24 <brion> I'd love a model that's slightly higher level than HTML
	21:24:25 <Platonides> or in other order
	21:24:28 <SMalyshev> tgr: I think there are degrees of it. Even right now wikitext on one wiki may be not reproducible on other wiki because of missing modules/templates. But we can get at least basic syntax?
	21:24:40 <Platonides> wikitext fragments being passed as parameters...
	21:24:48 <TimStarling> DanielK_WMDE: it's simpler to write that than to try to reduce the existing algorithm to a formal description
	21:24:54 <gwicke> brion: it's trivial to simplify to XML, but then you lose the rendering
	21:25:22 <brion> Yes, you also lose the rendering if your details change like URL structure
	21:25:23 <gwicke> in any case, the purpose of the DOM model is to clearly define what is semantically important, and what is just one way to format it
	21:25:35 <brion> Yep
	21:25:39 <brion> Dom is good :)
	21:25:52 <Scott_WUaS> (If MediaWiki Content Translation became part of this spec writing, or similar, in what ways would it be best to write a spec to facilitate much translation of past resources?)
	21:26:00 <gwicke> otherwise, Parsoid could just have said "here, it's HTML5"
	21:26:13 <gwicke> & not bothered with a DOM sepc
	21:26:17 <Krinkle> brion: We can express a higher model in HTML potentially. Parsoid does that to some extent already when an expansion stage exists (e.g. we could have <mw-image> instead of <img>, the model doesn't need to be renderable in browsers as-is per se)
	21:26:17 <gwicke> *spec
	21:26:22 <brion> Scott_WUaS: persistent ids for content snippets that don't interfere with human readability
	21:26:38 <Krinkle> #info <brion> in general HTML has too much low level detail: images list several distinct URLs, you have lots of presentation markup, etc.
	21:26:42 <Debra> subbu: What were you hoping to accomplish in this session?
	21:26:46 <Scott_WUaS> brion: thanks
	21:26:55 <brion> Krinkle: yes, that's about what I'm thinking :)
	21:26:58 <DanielK_WMDE> hm, has anyone looked at the attempt to build an ANTLR grammar for wikitext? If I remember correctly, the effort came quite far
	21:26:59 <subbu> brion, but to answer your qn. i think using that is an implementation question ... but, the compatibility between old & new would be bridged via the output spec perhaps .. i am thinking aloud here.
	21:27:15 <Krinkle> brion: Okay, so you didn't mean that the higher model should be something that isn't HTML syntax
	21:27:28 <Platonides> DanielK_WMDE: but failed as always
	21:27:30 <gwicke> DanielK_WMDE: there is not much benefit over PEG
	21:27:32 * subbu is personally not interested in the grammar as a spec direction
	21:27:33 <brion> Kringle yeah I'm agnostic to syntax really
	21:27:34 <Platonides> if I remember right
	21:27:36 <brion> Heh autocorrect
	21:27:38 <Krinkle> (it could be browser renderable too, with custom elements nowadays)
	21:27:45 <Platonides> the first 90% is easy
	21:27:49 <DanielK_WMDE> Platonides: sure, it failed to cover the critical last 5%, but perhaps it's a good start. or a good lesson.
	21:27:54 <Platonides> but the last 10% makes you mad
	21:27:59 <DanielK_WMDE> for reference: https://www.mediawiki.org/wiki/Markup_spec/ANTLR/draft
	21:28:01 <TimStarling> note that we don't actually have a full PEG spec
	21:28:14 <Krinkle> #info <SMalyshev> I think there are degrees of it. Even right now wikitext on one wiki may be not reproducible on other wiki because of missing modules/templates.
	21:28:18 <subbu> Debra, I was responding to robla's call which I felt was a good question to delve into about a spec ... to understand where perspectives are wrt to wikitext, spec, old / new wikitext, evolving wikitext, etc.
	21:28:25 <gwicke> a full grammar spec is impossible anyway
	21:28:35 <subbu> Debra, so far, i think it is being met well. as far as I am concerned, not sure what robla thinks. :)
	21:28:41 <Debra> All right.
	21:28:44 <gwicke> unless the grammar is fully turing complete
	21:28:52 <robla> subbu: I think I mostly agree with you (not interested in grammar being the sole focus), but I do think syntax is important
	21:28:52 <Debra> I think general discussion is fine, but sometimes people want more concrete action items from these meetings.
	21:29:04 <TimStarling> we have PEG+stops, and I think it would be interesting to replace stops with an extension to the PEG formalism, instead of being implemented in JS
	21:29:21 <gwicke> stops are just a compression technique
	21:29:26 <TimStarling> since we are probably forking PEG.js anyway
	21:29:30 <gwicke> you can unroll that into a larger grammar
	21:29:47 <TimStarling> yeah, but unrolling defeats the purpose of having the grammar, the purpose is reduction
	21:29:50 <Krinkle> #info <subbu> * is personally not interested in the grammar as a spec direction
	21:29:51 <gwicke> very tedious, but possible
	21:30:21 <Krinkle> #link https://www.mediawiki.org/wiki/Markup_spec/ANTLR/draft
	21:30:32 <subbu> robla, i think wikitext syntax parsing is a solved problem ... the peg grammar with stops is good enough .. TimStarling also pointed me to an ebnf grammar for the php preprocessor .. i think between the two, we have wikitext syntax tokenization covered.
	21:30:56 <DanielK_WMDE> do we have a link to the PEG thing?
	21:30:58 <subbu> but, that syntax spec is not useful for actually understanding wikitext semantics or generating html from it.
	21:31:06 <brion> Full data model is more complex potentially yes!
	21:31:08 <TimStarling> DanielK_WMDE: you won't like it
	21:31:11 <subbu> :)
	21:31:20 <gwicke> DanielK_WMDE: https://github.com/wikimedia/parsoid/blob/master/lib/wt2html/pegTokenizer.pegjs.txt
	21:31:35 <robla> subbu: yup, the latter is way more interesting (understanding wikitext semantics or generating html from it)
	21:31:45 <Krinkle> #info <subbu> syntax spec is not useful for actually understanding wikitext semantics or generating html from it.
	21:31:49 <DanielK_WMDE> gwicke: thanks
	21:31:56 <brion> subbu: does the Parsoid Dom model help in terms of defining some elements?
	21:31:59 <Krinkle> subbu: Exactly, we need to specify not the syntax, but what the tokens mean in relation to each other.
	21:32:00 <DanielK_WMDE> #link https://github.com/wikimedia/parsoid/blob/master/lib/wt2html/pegTokenizer.pegjs.txt
	21:32:12 <brion> Or do we need to go farther with semantic info
	21:32:18 <TimStarling> spec for tokenization in the MW preprocessor: https://www.mediawiki.org/wiki/Preprocessor_ABNF
	21:32:19 <subbu> brion, https://www.mediawiki.org/wiki/Parsing/Notes/A_Spec_For_Wikitext#What_kind_of_specs_can_we_develop.3F addresses your qn i think
	21:32:20 <brion> Eg, here's how a section is ordered
	21:32:35 <DanielK_WMDE> #link https://www.mediawiki.org/wiki/Markup_spec
	21:32:41 <gwicke> I'm personally most interested in developing the DOM spec further, as well as cleaning up the messy semantics around transclusion & templating
	21:33:00 <TimStarling> the awkward thing about ABNF is that precedence is unspecified
	21:33:07 <TimStarling> in PEG, there is a specified precedence
	21:33:10 <Krinkle> #info <TimStarling> spec for tokenization in the MW preprocessor: https://www.mediawiki.org/wiki/Preprocessor_ABNF
	21:33:24 <brion> Subbu: Yes good :)
	21:33:26 <subbu> robla, for semantics ... i think TimStarling is on the right track about an executable spec .. since we have html5 tree builder as a successful model for dealing with legacy formats.
	21:33:33 <SMalyshev> gwicke: wow that file breaks syntax highlighter :)
	21:33:42 <subbu> but, i think if we stuck just there, that would be unfortunate.
	21:33:47 <subbu> if we were ..
	21:33:55 <gwicke> SMalyshev: I think the .txt suffix throws it off
	21:33:56 <robla> (a side note for us to cover in the last few minutes: should we have a Phab task to track the state of Parsing/Notes/A_Spec_For_Wikitext and turn the latter into an RFC)
	21:34:00 <brion> The idea that extensions may need some more description is on point I think.
	21:34:32 <brion> Params for modules/templates/extensions may themselves be Wikitext, or may be just little string tokens
	21:34:47 <Krinkle> subbu: TimStarling: So the executable spec would dictate that when encountering <mw-template> (or {{template) that it include that target in a certain way?
	21:34:52 <brion> If you want a library that greps or search replaces in text, that matters to you
	21:35:25 <subbu> Krinkle, it would define that that token be preprocessed to generate expanded wikitext, for example.
	21:35:35 <subbu> a reference implementation does not have to be a high-performance implementation.
	21:35:37 <brion> Similar to needing to know that some HTML tags are self closing or etc
	21:35:51 <subbu> it can be very slow, but built with the goal of being understandable and easy to grok.
	21:35:57 <DanielK_WMDE> TimStarling: hm... much of the problematic parts are related to tag extensions, parser functions, and other transclusion mechanisms. it seems to be the preprocessor is a tool to separate these from the "wikitext proper" parts. that would perhaps make it easier to write a spec just for these parts.
	21:36:20 <Krinkle> subbu: Is the goal for the spec to allow third-parties to do what MediaWiki does now when viewing an old revision? (e.g. current version of templates) - or do we intend to improve that behaviour as part of this?
	21:36:25 <Platonides> the preprocessor is relatively easy
	21:36:30 <Platonides> "relatively"
	21:36:44 <brion> Heh
	21:36:45 <gwicke> "just do it as this code does"
	21:36:46 <subbu> Krinkle, that is a reasonable goal, yes.
	21:36:47 <DanielK_WMDE> Krinkle: i think an executable spec is called a "parser".
	21:36:55 <Krinkle> (or perhaps even move it out of the spec, by as gwicke mentioned, to consider that only an input method to the model, and never a storage format)
	21:37:33 <robla> the model would still have a canonical disk representation
	21:37:49 <subbu> DanielK_WMDE, maybe .. but, I think it would also extract the most important semantics out of the guts of mediawiki.
	21:38:09 <DanielK_WMDE> a really well written, nicely readable parser could serve as a spec. probably a recursive descent parser.
	21:38:25 <gwicke> DanielK_WMDE: PEG is recursive descent
	21:38:28 <subbu> i.e. how far can you pull the parser out and see how much spaghetti links back into the mediawiki guts you can cut out without breaking the essential interpretation of content.
	21:38:44 <subbu> so, for example, red links, etc. may not be essential in the executable spec.
	21:39:00 <subbu> they are all best viewed as post-parser transformations even for old revisions.
	21:39:07 <subbu> and need not be part of the spec.
	21:39:15 <DanielK_WMDE> gwicke: so is the PEG code sufficiently readable that others could reasonably use it to build their own grammar or parser?
	21:39:27 <robla> the thing that's nice about a natural language version of a spec (as opposed to an executable one) is that it's possible to have an "incomplete" spec that's still useful
	21:39:35 <gwicke> DanielK_WMDE: yes, it even compiles to a tokenizer out of the box
	21:39:55 <gwicke> it's still not easy, but there is only so much complexity you can pretend to not be there
	21:40:24 <robla> focusing on making the spec executable seems like a "nice to have", not a hard and fast requirement. It seems like overengineering
	21:40:48 <DanielK_WMDE> gwicke: can we package the complexity into nice bundles, that can be covered or ignored? a modular spec?
	21:40:57 <Krinkle> In other words, if a third party has an xml dump of all page titles and revisions, can they use this to figure out how to render them? Or do we involve other stuff that wouldn't be in there. References to other pages is doable I guess, but references to other stuff gets more complicated ({{int:}}, {{gender:}}, including special pages, extension tags)
	21:41:08 <DanielK_WMDE> (to an extent, all specs are modular, since they all build on top of pre-established conventions)
	21:41:17 <gwicke> like wikitext-without-italic-and-bold?
	21:41:19 <brion> Mediawiki standard library ;)
	21:41:22 <TimStarling> the PEG grammar would be more readable if the JS event code was separated from the PEG recognizer
	21:41:25 <robla> DanielK_WMDE: yes, exactly (building on other specs)
	21:41:31 <Krinkle> #info <subbu> so, for example, red links, etc. may not be essential in the executable spec.
	21:41:33 <TimStarling> many PEG libraries actually do that
	21:41:57 <Krinkle> #info <robla> the thing that's nice about a natural language version of a spec (as opposed to an executable one) is that it's possible to have an "incomplete" spec that's still useful
	21:42:16 <Krinkle> I think the executable spec would be written in natural language. The HTML5 executable spec is that way.
	21:42:17 <gwicke> an incomplete spec isn't useful for the archival problem
	21:42:30 <brion> Krinkle: I like the idea of a layered spec. Things like extensions and parser functions are an additional layer needed for some uses but not all
	21:42:33 <arlolra> TimStarling: we should do that
	21:43:02 <James_F> gwicke: Well, a parser/etc. for the log wikitext sub-type (bold/italics/links and nothing else) is probably needed.
	21:43:17 <robla> gwicke: I think usefulness is a matter of degrees, not binary. A really complete spec is most useful, but an incomplete spec can be useful.
	21:43:19 <gwicke> also, no links / images, I guess
	21:43:21 <brion> Eg if mediawiki burns in a fire and we have all this data, what do we need to separate out the various levels of data in its place
	21:43:27 <gwicke> as those depend on the target of the link
	21:43:39 <brion> And how can we rearrange and refactor internally to represent those layers more maintainably
	21:43:43 * subbu is trying to imagine software bits burning in a fire ...
	21:43:46 <brion> Hehe
	21:43:46 <Krinkle> If I understand correctly, Parsoid currently considers extension tags as instructions to make a request for generated content (aside from the few ones it implements natively). So it would depend on the availability of an HTTP service.
	21:44:05 <Krinkle> this is extendable, but probably not desirable for the archival use case.
	21:44:21 <gwicke> hey, you can make each extension its own spec
	21:44:36 <subbu> Krinkle, wikitext-native extensions need to have a parsoid-native equivalent.
	21:44:49 <subbu> i mean .. extensions that process wikitext.
	21:44:57 <subbu> ex: ref, gallery
	21:45:18 <James_F> And then there's extensions which don't process wikitext but do depend on it, like <timeline>
	21:45:43 <gwicke> the old cobol folks would have smugly argued that their language was so close to human readable that they wouldn't need to bother writing separate prose
	21:46:10 <robla> gwicke: :-)
	21:46:12 <DanielK_WMDE> perhaps an improved PEG grammar as proposed by Tim with lots of semi-formal comments would be a decent compromise
	21:46:15 <brion> Hehe
	21:46:24 <DanielK_WMDE> it has the advantage that we already have half of it.
	21:46:48 <DanielK_WMDE> it would not cover all layers. the missing layers should have well defined interfaces.
	21:46:51 <subbu> DanielK_WMDE, it only tokenizes right now.
	21:47:05 <DanielK_WMDE> well, that's at least the first layer
	21:47:06 <robla> #info <DanielK_WMDE> perhaps an improved PEG grammar as proposed by Tim with lots of semi-formal comments would be a decent compromise
	21:47:17 <TimStarling> my position on project planning is that a reductionist spec, which does not necessarily precisely reflect legacy wikitext, would be more useful than a complete spec
	21:47:43 <TimStarling> for purposes of archiving, I think HTML+CSS+images, like kiwix, is good enough for most things
	21:47:49 <gwicke> it's more likely to actually happen
	21:47:58 <gwicke> but for archival purposes, it doesn't seem to have much value
	21:48:04 <Krinkle> #info <TimStarling> for purposes of archiving, I think HTML+CSS+images, like kiwix, is good enough for most things
	21:48:09 <TimStarling> although storing wikitext is still essential, to keep a record of user intentions with each edit
	21:48:26 <TimStarling> maybe we should be storing parsoid HTML before and after each VE edit for the same reason
	21:48:29 <Krinkle> For historical review and annotation/blame
	21:48:40 <robla> #info <TimStarling> although storing wikitext is still essential, to keep a record of user intentions with each edit
	21:49:05 <gwicke> we already store Parsoid HTML for each edit
	21:49:10 <gwicke> but only going forward, not for the past
	21:49:19 <DanielK_WMDE> gwicke: can we publish dumps of that?
	21:49:24 <subbu> TimStarling, given that parsoid converts that back to equivalent wikitext, why would you need to store parsoid html before/after each ve edit?
	21:49:40 <gwicke> DanielK_WMDE: yes, subject to some attention from ops
	21:49:47 <brion> Well there's the whole template update issue
	21:49:48 <DanielK_WMDE> subbu: because parsoid changes
	21:49:52 <brion> That too
	21:50:05 <DanielK_WMDE> brion: yes, indeed
	21:50:07 <Krinkle> one consideration with storing expanded canonical form is oversight/deletion.
	21:50:19 <subbu> brion, but, that is a generic wikitext problem, and is not restricted to ve edits.
	21:50:25 <TimStarling> yeah, because parsoid changes, but I'm not going to try to sell you this idea right now
	21:50:26 <brion> Yup
	21:50:49 <gwicke> any spec needs versioning & format upgrades
	21:51:09 <robla> so....this seems worthy of being an RFC going forward, no?
	21:51:11 <Krinkle> #info <gwicke> we already store Parsoid HTML for each edit <DanielK_WMDE> gwicke: can we publish dumps of that? <gwicke> DanielK_WMDE: yes, subject to some attention from ops
	21:51:22 <TimStarling> maybe it should be noted what an abysmally bad job we are doing with historical preservation of edits, for reasons unrelated to a spec
	21:51:36 <DanielK_WMDE> #info the idea of storing (and possibly publishing) parsoid HTML for each revision for archival seems to have some support
	21:51:41 <TimStarling> for example, most of the first 12 months of the history of the project are still missing
	21:51:46 <gwicke> DanielK_WMDE: https://phabricator.wikimedia.org/T133547
	21:52:01 <gwicke> an old dump at https://dumps.wikimedia.org/htmldumps/dumps/
	21:52:37 <robla> TimStarling: I wouldn't call that abysmally bad, but I agree it'd be really fantastic to make the first 12 months more easily accessible
	21:52:41 <TimStarling> 4 of those 12 months only exist in a landfill probably somewhere in Orange County
	21:53:19 <gwicke> faithfully rendering early history would require changes similar to what the Memento project did
	21:53:27 <TimStarling> the other 8 just nobody could be bothered importing
	21:53:28 <subbu> now that we covered one part of the picture ... anyone have thoughts on moving to a future wikitext spec with an improved processing model? :)
	21:53:31 * robla is now depressed after Tim's landfill remark :-( (because he imagines Tim's not wrong)
	21:53:41 <gwicke> and even that would only go half way at best
	21:53:53 <brion> subbu: :)
	21:54:02 <gwicke> but, fortunately early content is also a lot simpler, so in practice not much actual content should be lost
	21:54:18 <brion> I love the idea of specing things like extension input models better
	21:54:22 <Krinkle> #info <subbu> now that we covered one part of the picture ... anyone have thoughts on moving to a future wikitext spec with an improved processing model? :)
	21:54:26 <Platonides> maybe we could start importing that early data
	21:54:46 <Platonides> then we could more optimistically go ahead with the future wikitext
	21:54:47 <gwicke> subbu: I wouldn't call it wikitext spec
	21:54:51 <robla> subbu, it'd be great for you to file a stub Phab task for us to track state of the wiki page, could you do that?
	21:55:08 <gwicke> "wiki content processing spec" or "wiki content model spec"?
	21:55:16 <subbu> gwicke, sure .. <insert-name-here> spec
	21:55:26 * DanielK_WMDE whispers "hooks" @brion
	21:55:30 <brion> subbu: yeah, content model or document model perhaps is the angle
	21:55:41 <gwicke> the page component / composition stuff is aiming in that direction as well
	21:55:56 <subbu> robla, i didn't follow reg. "track state of the wiki page" part .. can you say more?
	21:55:56 <Krinkle> brion: If we go with the route of promoting (expanded, but annotated) html as archival format, that would remove dependencies like extensions. We'd store wikitext for review only (as direct input, no intent to re-parse).
	21:55:56 <robla> we can morph https://www.mediawiki.org/wiki/Parsing/Notes/A_Spec_For_Wikitext into an RFC
	21:55:56 * brion is a pirate and has hooks for hands. Arrrrr!
	21:55:56 <subbu> ah, morphing that into a rfc .. task for that?
	21:56:16 <subbu> if yes, sure. i can.
	21:56:18 <brion> Krinkle: mmmmm, depends on the extension. If it needs js, your life is harder
	21:56:22 <robla> subbu, yup
	21:56:44 <robla> #action subbu file a Phab task to track https://www.mediawiki.org/wiki/Parsing/Notes/A_Spec_For_Wikitext for possible conversion to RFC
	21:56:58 <gwicke> the fun part is that anything new will have to consider existing content
	21:57:01 <Krinkle> brion: enhancement (same for CSS). Presumably the core content wouldn't depend on JS. Interaction and styling are up to the consumer to decide on.
	21:57:11 <Scott_WUaS> (Cheers:)
	21:57:12 <brion> nod
	21:57:15 <Krinkle> If not, that's a bug in the extension :)
	21:57:17 <Platonides> "fun"
	21:57:18 <brion> :)
	21:57:20 <Krinkle> (and we have some)
	21:57:33 <Krinkle> Any last thoughts?
	21:57:56 <gwicke> specs are hard
	21:57:58 <Krinkle> #info subbu to create RFC
	21:58:05 <brion> Specs and models for everyooooooone
	21:58:24 <Krinkle> #info <gwicke> specs are hard
	21:58:29 <robla> :-)
	21:58:30 <brion> True :)
	21:58:36 <Krinkle> #info <brion> Specs and models for everyooooooone
	21:58:38 <Krinkle> #endmeeting

Event Timeline

This is the log from E259: ArchCom RFC Meeting W32: Wikitext (2016-08-10, #wikimedia-office)

daniel mentioned this in E259: ArchCom RFC Meeting W32: Wikitext (2016-08-10, #wikimedia-office).Dec 9 2016, 7:42 AM
