Paste P3815

ArchCom-RFC-2016W32-irc-E259.txt

Authored by RobLa-WMF on Aug 10 2016, 10:02 PM.
21:00:09 <Krinkle> #startmeeting ArchCom RFC Meeting W32: A spec for Wikitext
21:00:11 <wm-labs-meetbot`> Meeting started Wed Aug 10 21:00:09 2016 UTC and is due to finish in 60 minutes. The chair is Krinkle. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:11 <wm-labs-meetbot`> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:11 <wm-labs-meetbot`> The meeting name has been set to 'archcom_rfc_meeting_w32__a_spec_for_wikitext'
21:00:34 <Krinkle> https://phabricator.wikimedia.org/E259
21:00:48 <Krinkle> #topic Please note: Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE) | Logs: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/
21:01:01 <legoktm> o/
21:01:06 <Krinkle> #link https://phabricator.wikimedia.org/E259
21:01:09 <Krinkle> Hey all
21:01:24 <robla> o/
21:01:42 <SMalyshev> hey
21:01:44 * Krinkle has his first time meetbot experience
21:01:46 <robla> thanks for chairing Krinkle!
21:02:21 <subbu> o/
21:03:09 <Scott_WUaS> (Yay Krinkle :)
21:03:15 <Krinkle> The main topic for today will be about whether and how we'll proceed with the specification of wikitext. This follows after an essay Subbu wrote on mediawiki.org and a wikitech-l thread.
21:03:17 <Krinkle> #link https://www.mediawiki.org/wiki/Parsing/Notes/A_Spec_For_Wikitext
21:03:23 <Krinkle> #link https://lists.wikimedia.org/pipermail/wikitech-l/2016-August/086200.html
21:04:27 <robla> so, should we have a spec? I say "yes"
21:05:06 <subbu> as i have argued in that essay, "whether we want a spec cannot be separated from the question of what the goals are of wanting a spec".
21:05:14 <subbu> and there are different specs for different needs / audiences.
21:05:15 <SMalyshev> surely we should. It'd be quite time-consuming task though
21:05:23 <brion> A spec for what, in what context :)
21:05:55 <Krinkle> One of the problems people would like to see solved in this area is to be able to confidently interact with older content. The status quo is that things change with time, and that rendering is somewhat unpredictable for older revisions (expansion of templates and external links was mentioned).
21:06:25 <gwicke> specifically, older content stored in wikitext
21:06:29 <Krinkle> # subbu quotes from essay "whether we want a spec cannot be separated from the question of what the goals are of wanting a spec"
21:06:33 <Krinkle> #info subbu quotes from essay "whether we want a spec cannot be separated from the question of what the goals are of wanting a spec"
21:06:39 <brion> We could for instance specify "legacy" Wikitext as it exists, but that doesn't make it an easy system. It'll still be complex
21:06:49 <robla> a spec is also helpful to make sure that the markup means the same thing to humans and computers
21:06:58 <SMalyshev> I'd say at least a narrative spec from which a reasonably knowledgeable programmer would be able to implement a parser which renders at least 90% (arbitrary high percentage here) of wikitext properly
21:07:08 <brion> I suspect we need that for the context of interacting with old archival content, even if what we use for future content changes
21:07:33 <gwicke> SMalyshev: we pretty much have that already, in the form of a PEG grammar
21:07:47 <subbu> gwicke, well ... tokenizer, you mean.
21:08:14 <TimStarling> specifying "legacy" wikitext does not necessarily meet subbu's goals, e.g. "Ease implementation of tools and libraries that need to operate with wikitext directly"
21:08:15 <tgr> would the spec include stuff like "call the Lua compiler version X with parameters Y"?
21:08:19 <James_F> Also the concept of "what is wikitext" differs by installation. Are the core parser functions part of the spec? What about the others? What about <gallery> syntax? Etc. Each change we make is generally breaking for people that use it.
21:08:23 <gwicke> it covers the "easy" parts SMalyshev mentioned
21:08:24 <SMalyshev> gwicke: PEG grammar alone doesn't seem enough for me... it specifies what can be said, but not what it means?
21:08:25 <subbu> gwicke, there is no guarantee those tokens will render as they are tokenized.
21:08:26 <robla> a spec can also help us comment those areas where computers will do counterintuitive things with the markup
21:08:37 <tgr> if it would, how is that different from "call the MediaWiki parser version X with parameters Y"?
21:08:43 <Krinkle> #info <tgr> would the spec include stuff like "call the Lua compiler version X with parameters Y"?
21:08:59 <subbu> gwicke, right. it is useful for sure.
21:09:11 <Krinkle> This is an interesting end use-case. How to deal with underlying dependencies. Same goes for syntax inside extension tags.
21:09:33 <brion> I think for future facing needs it's more useful to have a clean document model
21:09:35 <SMalyshev> tgr: I think "call the Lua compiler" is not what most people mean by spec...
21:09:49 <legoktm> would extensions that add parser functions / tags have their own specs?
21:10:01 <brion> For instance if layouts for tables etc were more structured and separate from the tokens for content
21:10:11 <tgr> SMalyshev: we do rely heavily on Lua scripts, how would you specify around that then?
21:10:17 <gwicke> I think it's pretty clear that a spec that would fully address the archival use case would be extremely expensive & probably harder to read than an actual implementation
21:10:19 <subbu> brion, yup .. to me, as well, the most useful argument for a spec is for future needs, i.e. clean up the processing / document model.
21:10:22 <brion> Cleaner interfaces between components such as templates and Lua modules
21:10:27 <robla> SMalyshev: it's a reasonable spec-writing crutch to refer to other specs, even if those specs aren't very well specified
21:10:34 <DanielK_WMDE> legoktm: ideally, yes. at least, they can't really be covered by the main spec...
21:10:53 <SMalyshev> tgr: have docs for lua scripts... but I think we need to be declarative, not procedural, here
21:10:57 <Krinkle> We could consider extension behaviour outside the scope of the spec. Extensions would need to either maintain indefinite compatibility, or somehow have required attributes for versioning (e.g. <foo version="">) and then decide to keep older parsers or to transform it somehow internally.
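The versioning idea Krinkle describes (a required attribute such as <foo version=""> that selects which parser handles the tag) could be sketched roughly as below. This is only an illustration under stated assumptions: the RENDERERS registry, the render_versioned_tags helper, and the "foo" renderers are hypothetical names made up for this sketch, not MediaWiki or Parsoid APIs, and the regex ignores nesting and other attributes.

```python
import re

# Hypothetical registry mapping (tag name, version) to a renderer.
# Nothing here is a real MediaWiki API; it only illustrates the dispatch idea.
RENDERERS = {
    ("foo", "1"): lambda body: "<div class='foo-v1'>" + body + "</div>",
    ("foo", "2"): lambda body: "<section class='foo-v2'>" + body + "</section>",
}

# Matches <name version="...">body</name>; deliberately simplistic (no nesting).
TAG_RE = re.compile(r'<(\w+) version="([^"]*)">(.*?)</\1>', re.DOTALL)


def render_versioned_tags(text):
    """Render each versioned tag with the renderer registered for its version."""

    def replace(match):
        name, version, body = match.groups()
        renderer = RENDERERS.get((name, version))
        if renderer is None:
            # Unknown tag or version: leave the source untouched.
            return match.group(0)
        return renderer(body)

    return TAG_RE.sub(replace, text)
```

Keeping the old ("foo", "1") entry registered corresponds to "keep older parsers"; replacing it with a function that rewrites the body into version 2 syntax would correspond to "transform it somehow internally".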
21:11:00 <subbu> gwicke, +1
21:11:35 <brion> One thing to consider with extensions is having a clean enough interface that we can actually do that :)
21:11:42 <Krinkle> #info <gwicke> I think it's pretty clear that a spec that would fully address the archival use case would be extremely expensive & probably harder to read than an actual implementation
21:11:45 <subbu> i think it might be useful to address the first question: why a spec? before going into discussions about what kind of spec maybe.
21:11:46 <robla> gwicke: subbu : do you think the implementation should be the spec, then?
21:11:50 <James_F> Krinkle: Including inside-core 'extension' tags like <gallery>?
21:11:54 <James_F> Yeah.
21:11:59 <tgr> SMalyshev: wikinews uses a Lisp interpreter written in Lua for some of its templates
21:12:00 <brion> For instance the data model of the text passed into a ref tag should be known
21:12:05 <tgr> have fun documenting that
21:12:06 <Krinkle> Essentially yes.
21:12:14 <tgr> as gwicke said, it's just not a realistic goal
21:12:15 <brion> But the data model of a classic gallery tag is a distinct domain specific language
21:12:35 <robla> tgr: it's not realistic to invest in if we don't believe the data is very important
21:12:38 <Krinkle> But I think it's more important for the spec to detail what the impact is of returned html from <gallery> with the rest of the content, more important than the syntax of the text inside <gallery>
21:12:40 <subbu> robla, for legacy wikitext, i think that is the unfortunate reality.
21:12:56 <Krinkle> #info <brion> But the data model of a classic <gallery> tag is a distinct domain specific language
21:12:57 <gwicke> robla: that would imply that a spec would be the only way to preserve data
21:12:57 <brion> That's something our current system doesn't grok, so we lose the ability to refer to the contents of the ref
21:13:43 <robla> gwicke: no it doesn't. we would still have our implementation of the spec
21:13:52 <gwicke> it's far from clear that a spec is a viable way to do that at all, and even less so that it is the only viable way
21:14:05 <legoktm> I don't think focusing on a spec as the way to solve the "old wikitext no longer works properly" problem is a good way to frame this discussion - we should focus on the reasons we need a spec and that it could possibly help with that problem
21:14:31 <Krinkle> One radical idea would be consider our shortcuts (extensions, templates) merely a way to provide input and create a revision, rather than being the revision itself. Essentially producing something in between that is still minimal and canonical but not end-user/localised/skinned.
21:14:48 <subbu> I propose we first address the question: why a spec, i.e. what are the goals for writing a spec.
21:14:50 <Krinkle> that would move expansion out of scope somewhat. (e.g. does ~~~~ need a spec?)
21:15:13 <Krinkle> #info <subbu> I propose we first address the question: why a spec, i.e. what are the goals for writing a spec.
21:15:39 * gwicke is curious
21:15:41 <robla> goal for writing a spec: have a human readable description of how wikitext will be interpreted by computers
21:15:57 <brion> Goal: to have a consistent way to interpret, edit, and display Wikitext from any era after the spec epoch consistently (? Sample)
21:16:17 <DanielK_WMDE> some reasons to have a spec: allows a canonical test suite, allows transformation into other markup formats, enables multiple alternative parsers/processors (and editors)
21:16:35 <robla> brion: I suppose the example I keep trotting out is ANSI C. Very complicated, but very worthwhile
21:16:35 * tgr finds the MediaWiki parser more human-readable than twenty pages of ABNF
21:16:36 <DanielK_WMDE> a spec would need to define the semantics of each syntactical element
21:16:37 <gwicke> DanielK_WMDE: we already have a test suite
21:16:42 <DanielK_WMDE> just an accepting grammar would be useless
21:16:58 <gwicke> brion: does your proposal require a spec, or an implementation?
21:17:00 <TimStarling> subbu's idea of an executable (HTML5-like) spec is interesting to me
21:17:01 <DanielK_WMDE> gwicke: sure. so?
21:17:09 <arlolra> maybe there's a parallel to be drawn between wikitext and js's don't break the web model
21:17:27 <brion> gwicke: depends on what we're specing
21:17:29 <subbu> goal: write a future-looking spec to evolve the wikitext language and processing model .. which doesn't help with old wikitext of course.
21:17:31 <DanielK_WMDE> gwicke: is it canonical? in the way that we say if the software breaks a test, this MUST be because the software is wrong?
21:17:43 <brion> Are we specing Wikitext the character sequence
21:17:44 <robla> tgr: for C programming, do you think that compilers are easier to read than the ANSI C spec?
21:17:48 <brion> Or the document model?
21:17:49 <DanielK_WMDE> i often end up fixing parser tests, not code... because the tests make assumptions, or rely on unspecified behavior
21:17:52 <gwicke> DanielK_WMDE: pretty much, yes
21:17:59 <gwicke> considering the amount of stored, existing content
21:18:07 <subbu> TimStarling, yes .. that is one way html authors probably addressed the problem of old html out there and html compatibility.
21:18:12 <SMalyshev> brion: both I'd say. Character sequence is probably the easier part :)
21:18:15 <brion> A document model spec gives us what we and others need to transform our parsed documents into other formats etc
21:18:19 <TimStarling> for some purposes, it would be nice to be reductionist, write grammars etc., but for archiving and ease of reimplementation it makes more sense to be complete
21:18:27 <subbu> which is probably not dissimilar to the current problem we have.
21:18:43 <TimStarling> and that really means specifying algorithms
21:18:43 <brion> Even if we only have one implementation of the tokenizer/parser
21:18:43 <DanielK_WMDE> gwicke: the best spec is the stored, existing content, then. if you break it, you are doing it wrong...
21:18:47 <Krinkle> TimStarling: subbu: an executable spec, would that mean expansion is part of the model, or left to producers of content (e.g. notion of a "template" could be in the data attributes, but not required for consumers to understand)
21:18:55 <gwicke> that's how tests are set up, and parsoid was tested
21:19:08 <robla> another goal: interoperability between multiple implementations
21:19:08 <tgr> robla: no clue about that, but there are several orders of magnitude of difference between the resources behind C and wikitext, and also the potential userbase, so I don't think it's a useful comparison
21:19:22 <SMalyshev> DanielK_WMDE: the problem is it's not a constructive spec :) It doesn't give you a way to do it right, only tells you when it's wrong
21:19:25 <DanielK_WMDE> robla: hell yea
21:19:32 <subbu> so, let me reframe my goal: an executable spec for "old / legacy wikitext" + clean wikitext processing model as a spec for wikitext 2.0
21:19:41 <subbu> the former lets you deal with old content.
21:19:43 <brion> :)
21:19:50 <subbu> the latter lets you clean up wikitext and move forward. :)
21:19:58 <DanielK_WMDE> SMalyshev: many things in life are like that ;)
21:19:58 <brion> I like
21:20:04 <Krinkle> #info <DanielK_WMDE> just an accepting grammar would be useless
21:20:14 <robla> subbu: is there an example of a really good executable spec?
21:20:14 <subbu> DanielK_WMDE, i agree reg. an accept grammar.
21:20:17 <subbu> html5
21:20:24 <TimStarling> Krinkle: I think expansion would have to be fully part of the executable spec, including what version of Lua to use etc.
21:20:37 <brion> Subbu do you imagine a common document model attached to both old and new grammars?
21:20:40 <DanielK_WMDE> well, "yes" is an accepting grammar
21:20:47 <DanielK_WMDE> since all text is valid wikitext
21:20:50 <brion> Heh
21:20:55 <Krinkle> #info <brion> Are we specing Wikitext the character sequence <brion> Or the document model?
21:21:03 <TimStarling> this is what subbu means by an executable spec: https://www.w3.org/TR/html5/syntax.html
21:21:06 * subbu is trying to find a link to the html5 tree building algo.
21:21:12 <subbu> oh, there TimStarling posted it
21:21:16 <TimStarling> it's a natural language description of an algorithm
21:21:25 * robla looks at the link
21:21:33 <gwicke> many have made the case that wikitext should be treated as a textual UI & not as a storage format
21:21:52 * robla sees a lot of English in that ;-)
21:21:53 <gwicke> I tend to agree, it's not a good storage format
21:21:57 <DanielK_WMDE> i think we need both: characters -> AST -> semantics. We could rely on the HTML spec for some parts of the AST and semantics.
21:22:18 <subbu> brion, i didn't understand your qn. reg. common document model
21:22:22 <tgr> and then the spec would throw up its hands any time an extension is involved? what is the use case for that?
21:22:29 <brion> gwicke: yes, though our main alternative now is HTML which is at the wrong abstraction level :)
21:22:41 <Krinkle> #info <subbu> let me reframe my goal: an executable spec for "old / legacy wikitext" + clean wikitext processing model as a spec for wikitext 2.0. the former lets you deal with old content.
21:22:41 <gwicke> brion: is it?
21:22:46 <tgr> to be able to write an alternative parser that absolutely cannot handle real wikitext as it appears on Wikipedia?
21:22:48 <Krinkle> #info <subbu> the latter lets you clean up wikitext and move forward.
21:23:02 <gwicke> brion: which issues do you see in the Parsoid DOM spec?
21:23:10 <Scott_WUaS> (If there's a choice in spec development between archiving a lot/all and archiving some, I'd be for the former, with some limitations).
21:23:13 <brion> subbu: I'm thinking like, would old and new parser rules end up creating compatible in memory object representations that could be transformed to one another
21:23:20 <DanielK_WMDE> robla: "Script data double escaped less-than sign state" <-- do we really want to go there?
21:23:43 <Krinkle> #info <gwicke> many have made the case that wikitext should be treated as a textual UI & not as a storage format. <gwicke> I tend to agree, it's not a good storage format
21:23:48 <subbu> brion, ah .. i see. well, parsoid is an example implementation that can bridge between the two.
21:23:52 <Platonides> I don't think separating old/new wikitext format is workable
21:24:07 <subbu> Platonides, content-handler.
21:24:07 <brion> gwicke: in general HTML has too much low level detail: images list several distinct URLs, you have lots of presentation markup, etc. nothing wrong with Parsoid but it makes it inefficient and easy to change in ways that will be weird
21:24:13 <Platonides> lots of old-page calling new-template
21:24:24 <brion> I'd love a model that's slightly higher level than HTML
21:24:25 <Platonides> or in other order
21:24:28 <SMalyshev> tgr: I think there are degrees of it. Even right now wikitext on one wiki may be not reproducible on other wiki because of missing modules/templates. But we can get at least basic syntax?
21:24:40 <Platonides> wikitext fragments being passed as parameters...
21:24:48 <TimStarling> DanielK_WMDE: it's simpler to write that than to try to reduce the existing algorithm to a formal description
21:24:54 <gwicke> brion: it's trivial to simplify to XML, but then you lose the rendering
21:25:22 <brion> Yes, you also lose the rendering if your details change like URL structure
21:25:23 <gwicke> in any case, the purpose of the DOM model is to clearly define what is semantically important, and what is just one way to format it
21:25:35 <brion> Yep
21:25:39 <brion> Dom is good :)
21:25:52 <Scott_WUaS> (If MediaWiki Content Translation became part of this spec writing, or similar, in what ways would it be best to write a spec to facilitate much translation of past resources?)
21:26:00 <gwicke> otherwise, Parsoid could just have said "here, it's HTML5"
21:26:13 <gwicke> & not bothered with a DOM sepc
21:26:17 <Krinkle> brion: We can express a higher model in HTML potentially. Parsoid does that to some extent already when an expansion stage exists (e.g. we could have <mw-image> instead of <img>, the model doesn't need to be renderable in browsers as-is per se)
21:26:17 <gwicke> *spec
21:26:22 <brion> Scott_WUaS: persistent ids for content snippets that don't interfere with human readability
21:26:38 <Krinkle> #info <brion> in general HTML has too much low level detail: images list several distinct URLs, you have lots of presentation markup, etc.
21:26:42 <Debra> subbu: What were you hoping to accomplish in this session?
21:26:46 <Scott_WUaS> brion: thanks
21:26:55 <brion> Krinkle: yes, that's about what I'm thinking :)
21:26:58 <DanielK_WMDE> hm, has anyone looked at the attempt to build an ANTLR grammar for wikitext? If I remember correctly, the effort came quite far
21:26:59 <subbu> brion, but to answer your qn. i think using that is an implementation question ... but, the compatibility between old & new would be bridged via the output spec perhaps .. i am thinking aloud here.
21:27:15 <Krinkle> brion: Okay, so you didn't mean that the higher model should be something that isn't HTML syntax
21:27:28 <Platonides> DanielK_WMDE: but failed as always
21:27:30 <gwicke> DanielK_WMDE: there is not much benefit over PEG
21:27:32 * subbu is personally not interested in the grammar as a spec direction
21:27:33 <brion> Kringle yeah I'm agnostic to syntax really
21:27:34 <Platonides> if I remember right
21:27:36 <brion> Heh autocorrect
21:27:38 <Krinkle> (it could be browser renderable too, with custom elements nowadays)
21:27:45 <Platonides> the first 90% is easy
21:27:49 <DanielK_WMDE> Platonides: sure, it failed to cover the critical last 5%, but perhaps it's a good start. or a good lesson.
21:27:54 <Platonides> but the last 10% makes you mad
21:27:59 <DanielK_WMDE> for reference: https://www.mediawiki.org/wiki/Markup_spec/ANTLR/draft
21:28:01 <TimStarling> note that we don't actually have a full PEG spec
21:28:14 <Krinkle> #info <SMalyshev> I think there are degrees of it. Even right now wikitext on one wiki may be not reproducible on other wiki because of missing modules/templates.
21:28:18 <subbu> Debra, I was responding to robla's call which I felt was a good question to delve into about a spec ... to understand where perspectives are wrt to wikitext, spec, old / new wikitext, evolving wikitext, etc.
21:28:25 <gwicke> a full grammar spec is impossible anyway
21:28:35 <subbu> Debra, so far, i think it is being met well. as far as I am concerned, not sure what robla thinks. :)
21:28:41 <Debra> All right.
21:28:44 <gwicke> unless the grammar is fully turing complete
21:28:52 <robla> subbu: I think I mostly agree with you (not interested in grammar being the sole focus), but I do think syntax is important
21:28:52 <Debra> I think general discussion is fine, but sometimes people want more concrete action items from these meetings.
21:29:04 <TimStarling> we have PEG+stops, and I think it would be interesting to replace stops with an extension to the PEG formalism, instead of being implemented in JS
21:29:21 <gwicke> stops are just a compression technique
21:29:26 <TimStarling> since we are probably forking PEG.js anyway
21:29:30 <gwicke> you can unroll that into a larger grammar
21:29:47 <TimStarling> yeah, but unrolling defeats the purpose of having the grammar, the purpose is reduction
21:29:50 <Krinkle> #info <subbu> * is personally not interested in the grammar as a spec direction
21:29:51 <gwicke> very tedious, but possible
21:30:21 <Krinkle> #link https://www.mediawiki.org/wiki/Markup_spec/ANTLR/draft
21:30:32 <subbu> robla, i think wikitext syntax parsing is a solved problem ... the peg grammar with stops is good enough .. TimStarling also pointed me to an ebnf grammar for the php preprocessor .. i think between the two, we have wikitext syntax tokenization covered.
21:30:56 <DanielK_WMDE> do we have a link to the PEG thing?
21:30:58 <subbu> but, that syntax spec is not useful for actually understanding wikitext semantics or generating html from it.
21:31:06 <brion> Full data model is more complex potentially yes!
21:31:08 <TimStarling> DanielK_WMDE: you won't like it
21:31:11 <subbu> :)
21:31:20 <gwicke> DanielK_WMDE: https://github.com/wikimedia/parsoid/blob/master/lib/wt2html/pegTokenizer.pegjs.txt
21:31:35 <robla> subbu: yup, the latter is way more interesting (understanding wikitext semantics or generating html from it)
21:31:45 <Krinkle> #info <subbu> syntax spec is not useful for actually understanding wikitext semantics or generating html from it.
21:31:49 <DanielK_WMDE> gwicke: thanks
21:31:56 <brion> subbu: does the Parsoid Dom model help in terms of defining some elements?
21:31:59 <Krinkle> subbu: Exactly, we need to specify not the syntax, but what the tokens mean in relation to each other.
21:32:00 <DanielK_WMDE> #link https://github.com/wikimedia/parsoid/blob/master/lib/wt2html/pegTokenizer.pegjs.txt
21:32:12 <brion> Or do we need to go farther with semantic info
21:32:18 <TimStarling> spec for tokenization in the MW preprocessor: https://www.mediawiki.org/wiki/Preprocessor_ABNF
21:32:19 <subbu> brion, https://www.mediawiki.org/wiki/Parsing/Notes/A_Spec_For_Wikitext#What_kind_of_specs_can_we_develop.3F addresses your qn i think
21:32:20 <brion> Eg, here's how a section is ordered
21:32:35 <DanielK_WMDE> #link https://www.mediawiki.org/wiki/Markup_spec
21:32:41 <gwicke> I'm personally most interested in developing the DOM spec further, as well as cleaning up the messy semantics around transclusion & templating
21:33:00 <TimStarling> the awkward thing about ABNF is that precedence is unspecified
21:33:07 <TimStarling> in PEG, there is a specified precedence
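The ABNF/PEG difference TimStarling points out can be illustrated with a tiny hand-rolled matcher. In ABNF, an alternation a / b does not say which alternative to prefer when both match; in a PEG, choice is ordered, so the first alternative that matches wins. The sketch below is a toy in Python, not PEG.js and not the Parsoid tokenizer; the helper names literal and ordered_choice are made up for illustration.

```python
def literal(s):
    """Return a matcher that succeeds if the text has s at position pos."""

    def parse(text, pos):
        if text.startswith(s, pos):
            return pos + len(s)  # position after the match
        return None  # no match

    return parse


def ordered_choice(*alternatives):
    """PEG-style choice: try alternatives in order, commit to the first match."""

    def parse(text, pos):
        for alternative in alternatives:
            end = alternative(text, pos)
            if end is not None:
                return end
        return None

    return parse


# Both alternatives can match the start of "'''x"; the order decides the result.
bold_first = ordered_choice(literal("'''"), literal("''"))
italic_first = ordered_choice(literal("''"), literal("'''"))

print(bold_first("'''x", 0))    # consumes three quotes
print(italic_first("'''x", 0))  # consumes only two quotes
```

The same pair of alternatives written as an ABNF alternation would leave that choice to whoever implements the grammar, which is the unspecified precedence mentioned above.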
21:33:10 <Krinkle> #info <TimStarling> spec for tokenization in the MW preprocessor: https://www.mediawiki.org/wiki/Preprocessor_ABNF
21:33:24 <brion> Subbu: Yes good :)
21:33:26 <subbu> robla, for semantics ... i think TimStarling is on the right track about an executable spec .. since we have html5 tree builder as a successful model for dealing with legacy formats.
21:33:33 <SMalyshev> gwicke: wow that file breaks syntax highlighter :)
21:33:42 <subbu> but, i think if we stuck just there, that would be unfortunate.
21:33:47 <subbu> if we *were* ..
21:33:55 <gwicke> SMalyshev: I think the .txt suffix throws it off
21:33:56 <robla> (a side note for us to cover in the last few minutes: should we have a Phab task to track the state of Parsing/Notes/A_Spec_For_Wikitext and turn the latter into an RFC)
21:34:00 <brion> The idea that extensions may need some more description is on point I think.
21:34:32 <brion> Params for modules/templates/extensions may themselves be Wikitext, or may be just little string tokens
21:34:47 <Krinkle> subbu: TimStarling: So the executable spec would dictate that when encountering <mw-template> (or {{template) that it include that target in a certain way?
21:34:52 <brion> If you want a library that greps or search replaces in text, that matters to you
21:35:25 <subbu> Krinkle, it would define that that token be preprocessed to generate expanded wikitext, for example.
21:35:35 <subbu> a reference implementation does not have to be a high-performance implementation.
21:35:37 <brion> Similar to needing to know that some HTML tags are self closing or etc
21:35:51 <subbu> it can be very slow, but built with the goal of being understandable and easy to grok.
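As a rough picture of what "preprocess the token to generate expanded wikitext" could mean in a slow but readable reference implementation, here is a toy expander. It is a sketch under heavy assumptions: it handles only parameterless {{name}} transclusions looked up in a plain dict, and the names expand, TEMPLATE_RE and max_depth are invented for this example. It is not the MediaWiki preprocessor algorithm and ignores parameters, parser functions, and extension tags.

```python
import re

# Matches only the simplest form: {{name}} with no parameters or nesting.
TEMPLATE_RE = re.compile(r"\{\{([^{}|]+)\}\}")


def expand(text, templates, depth=0, max_depth=10):
    """Recursively replace {{name}} with the template's own expanded text.

    `templates` is a hypothetical dict of template name -> wikitext.
    `max_depth` is an arbitrary guard against template loops.
    """
    if depth >= max_depth:
        return text  # give up on deep or looping expansions

    def replace(match):
        name = match.group(1).strip()
        if name not in templates:
            return match.group(0)  # unknown template: leave as written
        return expand(templates[name], templates, depth + 1, max_depth)

    return TEMPLATE_RE.sub(replace, text)


templates = {"greeting": "Hello, {{name}}!", "name": "world"}
print(expand("{{greeting}}", templates))
```

Favouring plain recursion and a dict lookup over any caching keeps the behaviour easy to read, which is the trade-off subbu describes for a reference implementation.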
21:35:57 <DanielK_WMDE> TimStarling: hm... many of the problematic parts are related to tag extensions, parser functions, and other transclusion mechanisms. it seems to me the preprocessor is a tool to separate these from the "wikitext proper" parts. that would perhaps make it easier to write a spec just for these parts.
21:36:20 <Krinkle> subbu: Is the goal for the spec to allow third-parties to do what MediaWiki does now when viewing an old revision? (e.g. current version of templates) - or do we intend to improve that behaviour as part of this?
21:36:25 <Platonides> the preprocessor is relatively easy
21:36:30 <Platonides> "relatively"
21:36:44 <brion> Heh
21:36:45 <gwicke> "just do it as this code does"
21:36:46 <subbu> Krinkle, that is a reasonable goal, yes.
21:36:47 <DanielK_WMDE> Krinkle: i think an executable spec is called a "parser".
21:36:55 <Krinkle> (or perhaps even move it out of the spec by, as gwicke mentioned, considering that only an input method to the model, and never a storage format)
21:37:33 <robla> the model would still have a canonical disk representation
21:37:49 <subbu> DanielK_WMDE, maybe .. but, I think it would also extract the most important semantics out of the guts of mediawiki.
21:38:09 <DanielK_WMDE> a really well written, nicely readable parser could serve as a spec. probably a recursive descent parser.
21:38:25 <gwicke> DanielK_WMDE: PEG is recursive descent
21:38:28 <subbu> i.e. how far can you pull the parser out and see how much spaghetti links back into the mediawiki guts you can cut out without breaking the essential interpretation of content.
21:38:44 <subbu> so, for example, red links, etc. may not be essential in the executable spec.
21:39:00 <subbu> they are all best viewed as post-parser transformations even for old revisions.
21:39:07 <subbu> and need not be part of the spec.
21:39:15 <DanielK_WMDE> gwicke: so is the PEG code sufficiently readable that others could reasonably use it to build their own grammar or parser?
21:39:27 <robla> the thing that's nice about a natural language version of a spec (as opposed to an executable one) is that it's possible to have an "incomplete" spec that's still useful
21:39:35 <gwicke> DanielK_WMDE: yes, it even compiles to a tokenizer out of the box
21:39:55 <gwicke> it's still not easy, but there is only so much complexity you can pretend to not be there
21:40:24 <robla> focusing on making the spec executable seems like a "nice to have", not a hard and fast requirement. It seems like overengineering
21:40:48 <DanielK_WMDE> gwicke: can we package the complexity into nice bundles, that can be covered or ignored? a modular spec?
21:40:57 <Krinkle> In other words, if a third party has an xml dump of all page titles and revisions, can they use this to figure out how to render them? Or do we involve other stuff that wouldn't be in there. References to other pages is doable I guess, but references to other stuff gets more complicated ({{int:}}, {{gender:}}, including special pages, extension tags)
21:41:08 <DanielK_WMDE> (to an extent, all specs are modular, since they all build on top of pre-established conventions)
21:41:17 <gwicke> like wikitext-without-italic-and-bold?
21:41:19 <brion> Mediawiki standard library ;)
21:41:22 <TimStarling> the PEG grammar would be more readable if the JS event code was separated from the PEG recognizer
21:41:25 <robla> DanielK_WMDE: yes, exactly (building on other specs)
21:41:31 <Krinkle> #info <subbu> so, for example, red links, etc. may not be essential in the executable spec.
21:41:33 <TimStarling> many PEG libraries actually do that
21:41:57 <Krinkle> #info <robla> the thing that's nice about a natural language version of a spec (as opposed to an executable one) is that it's possible to have an "incomplete" spec that's still useful
21:42:16 <Krinkle> I think the executable spec would be written in natural language. The HTML5 executable spec is that way.
21:42:17 <gwicke> an incomplete spec isn't useful for the archival problem
21:42:30 <brion> Krinkle: I like the idea of a layered spec. Things like extensions and parser functions are an additional layer needed for some uses but not all
21:42:33 <arlolra> TimStarling: we should do that
21:43:02 <James_F> gwicke: Well, a parser/etc. for the log wikitext sub-type (bold/italics/links and nothing else) is probably needed.
21:43:17 <robla> gwicke: I think usefulness is a matter of degree, not binary. A really complete spec is most useful, but an incomplete spec can be useful.
21:43:19 <gwicke> also, no links / images, I guess
21:43:21 <brion> Eg if mediawiki burns in a fire and we have all this data, what do we need to separate out the various levels of data in its place
21:43:27 <gwicke> as those depend on the target of the link
21:43:39 <brion> And how can we rearrange and refactor internally to represent those layers more maintainably
21:43:43 * subbu is trying to imagine software bits burning in a fire ...
21:43:46 <brion> Hehe
21:43:46 <Krinkle> If I understand correctly, Parsoid currently considers extension tags as instructions to make a request for generated content (aside from the few ones it implements natively). So it would depend on the availability of an HTTP service.
21:44:05 <Krinkle> this is extendable, but probably not desirable for the archival use case.
21:44:21 <gwicke> hey, you can make each extension its own spec
21:44:36 <subbu> Krinkle, wikitext-native extensions need to have a parsoid-native equivalent.
21:44:49 <subbu> i mean .. extensions that process wikitext.
21:44:57 <subbu> ex: ref, gallery
21:45:18 <James_F> And then there's extensions which don't process wikitext but do depend on it, like <timeline>
21:45:43 <gwicke> the old cobol folks would have smugly argued that their language was so close to human readable that they wouldn't need to bother writing separate prose
21:46:10 <robla> gwicke: :-)
21:46:12 <DanielK_WMDE> perhaps an improved PEG grammar as proposed by Tim with lots of semi-formal comments would be a decent compromise
21:46:15 <brion> Hehe
21:46:24 <DanielK_WMDE> it has the advantage that we already have half of it.
21:46:48 <DanielK_WMDE> it would not cover all layers. the missing layers should have well defined interfaces.
21:46:51 <subbu> DanielK_WMDE, it only tokenizes right now.
21:47:05 <DanielK_WMDE> well, that's at least the first layer
21:47:06 <robla> #info <DanielK_WMDE> perhaps an improved PEG grammar as proposed by Tim with lots of semi-formal comments would be a decent compromise
21:47:17 <TimStarling> my position on project planning is that a reductionist spec, which does not necessarily precisely reflect legacy wikitext, would be more useful than a complete spec
21:47:43 <TimStarling> for purposes of archiving, I think HTML+CSS+images, like kiwix, is good enough for most things
21:47:49 <gwicke> it's more likely to actually happen
21:47:58 <gwicke> but for archival purposes, it doesn't seem to have much value
21:48:04 <Krinkle> #info <TimStarling> for purposes of archiving, I think HTML+CSS+images, like kiwix, is good enough for most things
21:48:09 <TimStarling> although storing wikitext is still essential, to keep a record of user intentions with each edit
21:48:26 <TimStarling> maybe we should be storing parsoid HTML before and after each VE edit for the same reason
21:48:29 <Krinkle> For historical review and annotation/blame
21:48:40 <robla> #info <TimStarling> although storing wikitext is still essential, to keep a record of user intentions with each edit
21:49:05 <gwicke> we already store Parsoid HTML for each edit
21:49:10 <gwicke> but only going forward, not for the past
21:49:19 <DanielK_WMDE> gwicke: can we publish dumps of that?
21:49:24 <subbu> TimStarling, given that parsoid converts that back to equivalent wikitext, why would you need to store parsoid html before/after each ve edit?
21:49:40 <gwicke> DanielK_WMDE: yes, subject to some attention from ops
21:49:47 <brion> Well there's the whole template update issue
21:49:48 <DanielK_WMDE> subbu: because parsoid changes
21:49:52 <brion> That too
21:50:05 <DanielK_WMDE> brion: yes, indeed
21:50:07 <Krinkle> one consideration with storing expanded canonical form is oversight/deletion.
21:50:19 <subbu> brion, but that is a generic wikitext problem, and is not restricted to ve edits.
21:50:25 <TimStarling> yeah, because parsoid changes, but I'm not going to try to sell you this idea right now
21:50:26 <brion> Yup
21:50:49 <gwicke> any spec needs versioning & format upgrades
21:51:09 <robla> so....this seems worthy of being an RFC going forward, no?
21:51:11 <Krinkle> #info <gwicke> we already store Parsoid HTML for each edit <DanielK_WMDE> gwicke: can we publish dumps of that? <gwicke> DanielK_WMDE: yes, subject to some attention from ops
21:51:22 <TimStarling> maybe it should be noted what an abysmally bad job we are doing with historical preservation of edits, for reasons unrelated to a spec
21:51:36 <DanielK_WMDE> #info the idea of storing (and possibly publishing) parsoid HTML for each revision for archival seems to have some support
21:51:41 <TimStarling> for example, most of the first 12 months of the history of the project are still missing
21:51:46 <gwicke> DanielK_WMDE: https://phabricator.wikimedia.org/T133547
21:52:01 <gwicke> an old dump at https://dumps.wikimedia.org/htmldumps/dumps/
21:52:37 <robla> TimStarling: I wouldn't call that abysmally bad, but I agree it'd be really fantastic to make the first 12 months more easily accessible
21:52:41 <TimStarling> 4 of those 12 months only exist in a landfill probably somewhere in Orange County
21:53:19 <gwicke> faithfully rendering early history would require changes similar to what the Memento project did
21:53:27 <TimStarling> the other 8 just nobody could be bothered importing
21:53:28 <subbu> now that we covered one part of the picture ... anyone have thoughts on moving to a future wikitext spec with an improved processing model? :)
21:53:31 * robla is now depressed after Tim's landfill remark :-( (because he imagines Tim's not wrong)
21:53:41 <gwicke> and even that would only go half way at best
21:53:53 <brion> subbu: :)
21:54:02 <gwicke> but, fortunately early content is also a lot simpler, so in practice not much actual content should be lost
21:54:18 <brion> I love the idea of specing things like extension input models better
21:54:22 <Krinkle> #info <subbu> now that we covered one part of the picture ... anyone have thoughts on moving to a future wikitext spec with an improved processing model? :)
21:54:26 <Platonides> maybe we could start importing that early data
21:54:46 <Platonides> then we could more optimistically go ahead with the future wikitext
21:54:47 <gwicke> subbu: I wouldn't call it wikitext spec
21:54:51 <robla> subbu, it'd be great for you to file a stub Phab task for us to track state of the wiki page, could you do that?
21:55:08 <gwicke> "wiki content processing spec" or "wiki content model spec"?
21:55:16 <subbu> gwicke, sure .. <insert-name-here> spec
21:55:26 * DanielK_WMDE whispers "hooks" @brion
21:55:30 <brion> subbu: yeah, content model or document model perhaps is the angle
21:55:41 <gwicke> the page component / composition stuff is aiming in that direction as well
21:55:56 <subbu> robla, i didn't follow reg. "track state of the wiki page" part .. can you say more?
21:55:56 <Krinkle> brion: If we go with the route of promoting (expanded, but annotated) html as archival format, that would remove dependencies like extensions. We'd store wikitext for review only (as direct input, no intent to re-parse).
21:55:56 <robla> we can morph https://www.mediawiki.org/wiki/Parsing/Notes/A_Spec_For_Wikitext into an RFC
21:55:56 * brion is a pirate and has hooks for hands. Arrrrr!
21:55:56 <subbu> ah, morphing that into a rfc .. task for that?
21:56:16 <subbu> if yes, sure. i can.
21:56:18 <brion> Krinkle: mmmmm, depends on the extension. If it needs js, your life is harder
21:56:22 <robla> subbu, yup
21:56:44 <robla> #action subbu file a Phab task to track https://www.mediawiki.org/wiki/Parsing/Notes/A_Spec_For_Wikitext for possible conversion to RFC
21:56:58 <gwicke> the fun part is that anything new will have to consider existing content
21:57:01 <Krinkle> brion: enhancement (same for CSS). Presumably the core content wouldn't depend on JS. Interaction and styling are up to the consumer to decide on.
21:57:11 <Scott_WUaS> (Cheers:)
21:57:12 <brion> *nod*
21:57:15 <Krinkle> If not, that's a bug in the extension :)
21:57:17 <Platonides> "fun"
21:57:18 <brion> :)
21:57:20 <Krinkle> (and we have some)
21:57:33 <Krinkle> Any last thoughts?
21:57:56 <gwicke> specs are hard
21:57:58 <Krinkle> #info subbu to create RFC
21:58:05 <brion> Specs and models for everyooooooone
21:58:24 <Krinkle> #info <gwicke> specs are hard
21:58:29 <robla> :-)
21:58:30 <brion> True :)
21:58:36 <Krinkle> #info <brion> Specs and models for everyooooooone
21:58:38 <Krinkle> #endmeeting