ArchCom RFC Meeting W32: Wikitext (2016-08-10, #wikimedia-office)

Hosted by daniel on Aug 10 2016, 9:00 PM - 10:00 PM.

Description

Agenda

This week's office hour: Wikitext! This discussion is intended as a
continuation of the "Loosing the history of our projects to bitrot."
thread on wikitech-l.

Coren stated the work in front of us very well at the start of that thread:

You know, this is actually quite troublesome: as the platform evolves
the older data becomes increasingly hard to use at all - making it
effectively lost even if we kept the bits around. This is a rather
widespread issue in computing as a rule; but I now find myself distressed
at its unavoidable effect on what we've always intended to be a permanent
contribution to humanity.

The thread he started had pretty robust participation on a really
important topic, which seemed to us in ArchCom worth continuing in one
of our weekly office hours. So, after checking with Subbu (in my list
message), that's what ended up as the top candidate.
Subbu did some work to structure the conversation ([A Spec For
Wikitext]) and I did some cleanup of the [Wikitext] page on
mw.org as a possible hub for information on this topic, with
[Talk:Wikitext] providing a durable conversation venue.

Meeting summary

  • LINK: https://phabricator.wikimedia.org/E259 (Krinkle, 21:00:34)
  • Please note: Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE) | Logs: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/ (Krinkle, 21:00:48)
    • LINK: https://phabricator.wikimedia.org/E259 (Krinkle, 21:01:06)
    • LINK: https://www.mediawiki.org/wiki/Parsing/Notes/A_Spec_For_Wikitext (Krinkle, 21:03:17)
    • LINK: https://lists.wikimedia.org/pipermail/wikitech-l/2016-August/086200.html (Krinkle, 21:03:23)
    • subbu quotes from essay "whether we want a spec cannot be separated from the question of what the goals are of wanting a spec" (Krinkle, 21:06:33)
    • <tgr> would the spec include stuff like "call the Lua compiler version X with parameters Y"? (Krinkle, 21:08:43)
    • <gwicke> I think it's pretty clear that a spec that would fully address the archival use case would be extremely expensive & probably harder to read than an actual implementation (Krinkle, 21:11:42)
    • <brion> But the data model of a classic <gallery> tag is a distinct domain specific language (Krinkle, 21:12:56)
    • <subbu> I propose we first address the question: why a spec, i.e. what are the goals for writing a spec. (Krinkle, 21:15:13)
    • <DanielK_WMDE> just an accepting grammar would be useless (Krinkle, 21:20:04)
    • <brion> Are we specing Wikitext the character sequence <brion> Or the document model? (Krinkle, 21:20:55)
    • <subbu> let me reframe my goal: an executable spec for "old / legacy wikitext" + clean wikitext processing model as a spec for wikitext 2.0. the former lets you deal with old content. (Krinkle, 21:22:41)
    • <subbu> the latter lets you clean up wikitext and move forward. (Krinkle, 21:22:48)
    • <gwicke> many have made the case that that wikitext should be treated as a textual UI & not as a storage format. <gwicke> I tend to agree, it's not a good storage format (Krinkle, 21:23:43)
    • <brion> in general HTML has too much low level detail: images list several distinct URLs, you have lots of presentation markup, etc. (Krinkle, 21:26:38)
    • <SMalyshev> I think there are degrees of it. Even right now wikitext on one wiki may be not reproducible on other wiki because of missing modules/templates. (Krinkle, 21:28:14)
    • <subbu> * is personally not interested in the grammar as a spec direction (Krinkle, 21:29:50)
    • LINK: https://www.mediawiki.org/wiki/Markup_spec/ANTLR/draft (Krinkle, 21:30:21)
    • <subbu> syntax spec is not useful for actually understanding wikitext semantics or generating html from it. (Krinkle, 21:31:45)
    • LINK: https://github.com/wikimedia/parsoid/blob/master/lib/wt2html/pegTokenizer.pegjs.txt (DanielK_WMDE, 21:32:00)
    • LINK: https://www.mediawiki.org/wiki/Markup_spec (DanielK_WMDE, 21:32:35)
    • <TimStarling> spec for tokenization in the MW preprocessor: https://www.mediawiki.org/wiki/Preprocessor_ABNF (Krinkle, 21:33:10)
    • <subbu> so, for example, red links, etc. may not be essential in the executable spec. (Krinkle, 21:41:31)
    • <robla> the thing that's nice about a natural language version of a spec (as opposed to an executable one) is that it's possible to have an "incomplete" spec that's still useful (Krinkle, 21:41:57)
    • <DanielK_WMDE> perhaps an improved PEG grammar as proposed by Tim with lots of semi-formal comments would be a decent compromise (robla, 21:47:06)
    • <TimStarling> for purposes of archiving, I think HTML+CSS+images, like kiwix, is good enough for most things (Krinkle, 21:48:04)
    • <TimStarling> although storing wikitext is still essential, to keep a record of user intentions with each edit (robla, 21:48:40)
    • <gwicke> we already store Parsoid HTML for each edit <DanielK_WMDE> gwicke: can we publish dumps of that? <gwicke> DanielK_WMDE: yes, subject to some attention from ops (Krinkle, 21:51:11)
    • the idea that storing (and possibly publishing) parsoid HTML for each revision for archival seems to have some support (DanielK_WMDE, 21:51:36)
    • <subbu> now that we covered one part of the picture ... anyone have thoughts on moving to a future wikitext spec with an improved processing model? :) (Krinkle, 21:54:22)
    • ACTION: subbu file a Phab task to track https://www.mediawiki.org/wiki/Parsing/Notes/A_Spec_For_Wikitext for possible conversion to RFC (robla, 21:56:44)
    • subbu to create RFC (Krinkle, 21:57:58)
    • <gwicke> specs are hard (Krinkle, 21:58:24)
    • <brion> Specs and models for everyooooooone (Krinkle, 21:58:36)
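As an editorial illustration (not code from the meeting): DanielK_WMDE's point that "just an accepting grammar would be useless" follows from the fact that all text is valid wikitext, so a grammar that merely accepts input carries no information; a useful spec has to assign meaning to what it matches. The function names and the heading/list rules below are simplified, hypothetical stand-ins, not MediaWiki's actual behavior:

```python
# Editorial sketch: acceptance vs. meaning. The rules here are
# deliberately simplified assumptions, not MediaWiki's real parser.
import re

def accepts(text: str) -> bool:
    """An accepting grammar for wikitext: every string is valid."""
    return True

# Hypothetical, simplified heading rule: 1-6 '=' on both ends of a line.
HEADING = re.compile(r'^(={1,6})\s*(.*?)\s*\1\s*$')

def classify_line(line: str) -> tuple:
    """Assign a coarse (and simplified) meaning to one line of wikitext."""
    m = HEADING.match(line)
    if m:
        return ('heading', len(m.group(1)), m.group(2))
    if line.startswith('*'):
        return ('list-item', line.lstrip('*').strip())
    return ('text', line)
```

The accepting grammar is one line and tells you nothing; even this toy classifier already has to commit to semantic decisions (how many `=` signs, what to strip), which is where the real spec-writing work lies.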

Meeting ended at 21:58:38 UTC.

People present (lines said)

  • Krinkle (61)
  • brion (59)
  • subbu (56)
  • gwicke (56)
  • DanielK_WMDE (36)
  • robla (35)
  • TimStarling (27)
  • Platonides (13)
  • tgr (10)
  • SMalyshev (10)
  • James_F (5)
  • Scott_WUaS (5)
  • wm-labs-meetbot` (3)
  • Debra (3)
  • legoktm (3)
  • arlolra (2)

Full log

21:00:09 <Krinkle> #startmeeting ArchCom RFC Meeting W32: A spec for Wikitext
21:00:11 <wm-labs-meetbot`> Meeting started Wed Aug 10 21:00:09 2016 UTC and is due to finish in 60 minutes. The chair is Krinkle. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:11 <wm-labs-meetbot`> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:11 <wm-labs-meetbot`> The meeting name has been set to 'archcom_rfc_meeting_w32__a_spec_for_wikitext'
21:00:34 <Krinkle> https://phabricator.wikimedia.org/E259
21:00:48 <Krinkle> #topic Please note: Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE) | Logs: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/
21:01:01 <legoktm> o/
21:01:06 <Krinkle> #link https://phabricator.wikimedia.org/E259
21:01:09 <Krinkle> Hey all
21:01:24 <robla> o/
21:01:42 <SMalyshev> hey
21:01:44 * Krinkle has his first time meetbot experience
21:01:46 <robla> thanks for chairing Krinkle!
21:02:21 <subbu> o/
21:03:09 <Scott_WUaS> (Yay Krinkle :)
21:03:15 <Krinkle> The main topic for today will be about whether and how we'll proceed with the specification of wikitext. This follows after an essay Subbu wrote on mediawiki.org and a wikitech-l thread.
21:03:17 <Krinkle> #link https://www.mediawiki.org/wiki/Parsing/Notes/A_Spec_For_Wikitext
21:03:23 <Krinkle> #link https://lists.wikimedia.org/pipermail/wikitech-l/2016-August/086200.html
21:04:27 <robla> so, should we have a spec? I say "yes"
21:05:06 <subbu> as i have argued in that essay, "whether we want a spec cannot be separated from the question of what the goals are of wanting a spec".
21:05:14 <subbu> and there are different specs for different needs / audiences.
21:05:15 <SMalyshev> surely we should. It'd be quite time-consuming task though
21:05:23 <brion> A spec for what, in what context :)
21:05:55 <Krinkle> One of the problems people would like to see solved in this area is to be able to confidently interact with older content. The status quo is that things change with time, and that rendering is somewhat unpredictable for older revisions (expansion of templates and external links was mentioned).
21:06:25 <gwicke> specifically, older content stored in wikitext
21:06:29 <Krinkle> # subbu quotes from essay "whether we want a spec cannot be separated from the question of what the goals are of wanting a spec"
21:06:33 <Krinkle> #info subbu quotes from essay "whether we want a spec cannot be separated from the question of what the goals are of wanting a spec"
21:06:39 <brion> We could for instance specify "legacy" Wikitext as it exists, but that doesn't make it an easy system. It'll still be complex
21:06:49 <robla> a spec is also helpful to make sure that the markup means the same thing to humans and computers
21:06:58 <SMalyshev> I'd say at least a narrative spec from which a reasonably knowledgeable programmer would be able to implement a parser which renders at least 90% (arbitrary high percentage here) of wikitext properly
21:07:08 <brion> I suspect we need that for the context of interacting with old archival content, even if what we use for future content changes
21:07:33 <gwicke> SMalyshev: we pretty much have that already, in the form of a PEG grammar
21:07:47 <subbu> gwicke, well ... tokenizer, you mean.
21:08:14 <TimStarling> specifying "legacy" wikitext does not necesarily meet subbu's goals, e.g. "Ease implementation of tools and libraries that need to operate with wikitext directly"
21:08:15 <tgr> would the spec include stuff like "call the Lua compiler version X with parameters Y"?
21:08:19 <James_F> Also the concept of "what is wikitext" differs by installation. Are the core parser functions part of the spec? What about the others? What about <gallery> syntax? Etc. Each change we make is generally breaking for people that use it.
21:08:23 <gwicke> it covers the "easy" parts SMalyshev mentioned
21:08:24 <SMalyshev> gwicke: PEG grammar alone doesn't seem enough for me... it specifies what can be said, but not what it means?
21:08:25 <subbu> gwicke, there is no guarantee those tokens will render as they are tokenized.
21:08:26 <robla> a spec can also help us comment those areas where computers will do counterintuitive things with the markup
21:08:37 <tgr> if it would, how is that different from "call the MediaWiki parser version X with parameters Y"?
21:08:43 <Krinkle> #info <tgr> would the spec include stuff like "call the Lua compiler version X with parameters Y"?
21:08:59 <subbu> gwicke, right. it is useful for sure.
21:09:11 <Krinkle> This is an interesting end use-case. How to deal with underlying dependencies. Same goes for syntax inside extension tags.
21:09:33 <brion> I think for future facing needs its more useful to have a clean document model
21:09:35 <SMalyshev> tgr: I think "call the Lua compiler" is not what most people mean by spec...
21:09:49 <legoktm> would extensions that add parser functions / tags have their own specs?
21:10:01 <brion> For instance if layouts for tables etc were mor structured and separate from the tokens for content
21:10:11 <tgr> SMalyshev: we do rely heavily on Lua scripts, how would you specify around that then?
21:10:17 <gwicke> I think it's pretty clear that a spec that would fully address the archival use case would be extremely expensive & probably harder to read than an actual implementation
21:10:19 <subbu> brion, yup .. to me, as well, the most useful argument for a spec is for future needs, i.e. clean up the processing / document model.
21:10:22 <brion> Cleaner interfaces between components such as templates and Lua modules
21:10:27 <robla> SMalyshev: it's a reasonable spec-writing crutch to refer to other specs, even if those specs aren't very well specified
21:10:34 <DanielK_WMDE> legoktm: ideally, yes. at least, they can't really be covered by the main spec...
21:10:53 <SMalyshev> tgr: have docs for lua scripts... but I think we need to be declarative, not procedural, here
21:10:57 <Krinkle> We could consider extension behaviour outside the scope of the spec. Extensions would need to either maintain indefinite compatibility, or somehow have required attributes for versioning (e.g. <foo version="">) and then decide to keep older parsers or to transform it somehow internally.
21:11:00 <subbu> gwicke, +1
21:11:35 <brion> One thing to consider with extensions is having a clean enough interface that we can actually do that :)
21:11:42 <Krinkle> #info <gwicke> I think it's pretty clear that a spec that would fully address the archival use case would be extremely expensive & probably harder to read than an actual implementation
21:11:45 <subbu> i think it might be useful to address the first question: why a spec? before going into discussions about what kind of spec maybe.
21:11:46 <robla> gwicke: subbu : do you think the implementation should be the spec, then?
21:11:50 <James_F> Krinkle: Including inside-core 'extension' tags like <gallery>?
21:11:54 <James_F> Yeah.
21:11:59 <tgr> SMalyshev: wikinews uses a Lisp interpreter written in Lua for some of its templates
21:12:00 <brion> For instance the data model of the text passed into a ref tag should be known
21:12:05 <tgr> have fun documenting that
21:12:06 <Krinkle> Essentially yes.
21:12:14 <tgr> as gwicke said, it's just not a realistic goal
21:12:15 <brion> But the data model of a classic gallery tag is a distinct domain specific language
21:12:35 <robla> tgr: it's not realistic to invest in if we don't believe the data is very important
21:12:38 <Krinkle> But I think it's more important for the spec to detail what the impact is of returned html from <gallery> with the rest of the content, more important than the syntax of the text inside <gallery>
21:12:40 <subbu> robla, for legacy wikitext, i think that is the unfortunate reality.
21:12:56 <Krinkle> #info <brion> But the data model of a classic <gallery> tag is a distinct domain specific language
21:12:57 <gwicke> robla: that would imply that a spec would be the only way to preserve data
21:12:57 <brion> That's something our current system doesn't grok, so we lose the ability to refer to the contents of the ref
21:13:43 <robla> gwicke: no it doesn't. we would still have our implementation of the spec
21:13:52 <gwicke> it's far from clear that a spec is a viable way to do that at all, and even less so that it is the only viable way
21:14:05 <legoktm> I don't think focusing on a spec as the way to solve the old wikitext no longer works properly is a good way to frame this discussion - we should focus on the reasons we need a spec and that it could possibly help with that problem
21:14:31 <Krinkle> One radical idea would be consider our shortcuts (extensions, templates) merely a way to provide input and create a revision, rather than being the revision itself. Essentially producing something in between that is still minimal and canonical but not end-user/localised/skinned.
21:14:48 <subbu> I propose we first address the question: why a spec, i.e. what are the goals for writing a spec.
21:14:50 <Krinkle> that would move expansion out of scope somewhat. (e.g. does ~~~~ need a spec?)
21:15:13 <Krinkle> #info <subbu> I propose we first address the question: why a spec, i.e. what are the goals for writing a spec.
21:15:39 * gwicke is curious
21:15:41 <robla> goal for writing a spec: have a human readable description of how wikitext will be interpreted by computers
21:15:57 <brion> Goal: to have a consistent way to interpret, edit, and display Wikitext from any era after the spec epoch consistently (? Sample)
21:16:17 <DanielK_WMDE> some reasons to have a spec: allows a canonical test suit, allows transformation into other markup formats, enables multiple alternative parsers/processors (and editors)
21:16:35 <robla> brion: I suppose the example I keep trotting out is ANSI C. Very complicated, but very worthwhile
21:16:35 * tgr finds the MediaWiki parser more human-readable than twenty pages of ABNF
21:16:36 <DanielK_WMDE> a spect would need to define the semantics of each syntactical element
21:16:37 <gwicke> DanielK_WMDE: we already have a test suite
21:16:42 <DanielK_WMDE> just an accepting grammar would be useless
21:16:58 <gwicke> brion: does your proposal require a spec, or an implementation?
21:17:00 <TimStarling> subbu's idea of an executable (HTML5-like) spec is interesting to me
21:17:01 <DanielK_WMDE> gwicke: sure. so?
21:17:09 <arlolra> maybe there's a parallel to be drawn between wikitext and js's don't break the web model
21:17:27 <brion> gwicke: depends on what were specing
21:17:29 <subbu> goal: write a future-looking spec to evolve the wikitext language and processing model .. which doesn't help with old wikitext of course.
21:17:31 <DanielK_WMDE> gwicke: is it canonical? in thw way that we say if the software breaks a test, this MUST be because the software is wrong?
21:17:43 <brion> Are we specing Wikitext the character sequence
21:17:44 <robla> tgr: for C programming, do you think that compilers are easier to read than the ANSI C spec?
21:17:48 <brion> Or the document model?
21:17:49 <DanielK_WMDE> i often end up fixing parser tests, not code... because the tests make assumptions, or rely on unspecified behavior
21:17:52 <gwicke> DanielK_WMDE: pretty much, yes
21:17:59 <gwicke> considering the amount of stored, existing content
21:18:07 <subbu> TimStarling, yes .. that is one way html authors probably addressed the problem of old html out there and html compatibility.
21:18:12 <SMalyshev> brion: both I'd say. Character sequence is probably the easier part :)
21:18:15 <brion> A document model spec gives us what we and others need to transform our parsed documents into other formats etc
21:18:19 <TimStarling> for some purposes, it would be nice to be reductionist, write grammars etc., but for archiving and ease of reimplementation it makes more sense to be complete
21:18:27 <subbu> which is probably not dissimilar to the current problem we have.
21:18:43 <TimStarling> and that really means specifying algorithms
21:18:43 <brion> Even if we only have one implementation of the tokenizer/parser
21:18:43 <DanielK_WMDE> gwicke: the best spec is the stored, existing content, then. if you break it, you are doing it wrong...
21:18:47 <Krinkle> TimStarling: subbu: an executable spec, would that mean expansion is part of the model, or left to producers of content (e.g. notion of a "template" could be in the data atrributes, but not required for consumers to understand)
21:18:55 <gwicke> that's how tests are set up, and parsoid was tested
21:19:08 <robla> another goal: interoperability between multiple implementations
21:19:08 <tgr> robla: no clue about that, but there are several magnitudes in difference between the resources behind C and wikitext, and also the potential userbase, so I don't think it's a useful comparison
21:19:22 <SMalyshev> DanielK_WMDE: the problem it's not a constructive spec :) It doesn't giv you a way to do it right, only tells you when it's wrong
21:19:25 <DanielK_WMDE> robla: hell yea
21:19:32 <subbu> so, let me reframe my goal: an executable spec for "old / legacy wikitext" + clean wikitext processing model as a spec for wikitext 2.0
21:19:41 <subbu> the former lets you deal with old content.
21:19:43 <brion> :)
21:19:50 <subbu> the latter lets you clean up wikitext and move forward. :)
21:19:58 <DanielK_WMDE> SMalyshev: many things in life are like that ;)
21:19:58 <brion> I like
21:20:04 <Krinkle> #info <DanielK_WMDE> just an accepting grammar would be useless
21:20:14 <robla> subbu: is there an example of a really good executable spec?
21:20:14 <subbu> DanielK_WMDE, i agree reg. an accept grammar.
21:20:17 <subbu> html5
21:20:24 <TimStarling> Krinkle: I think expansion would have to be fully part of the executable spec, including what version of Lua to use etc.
21:20:37 <brion> Subbu do you imagine a common document model attached to both old and new grammars?
21:20:40 <DanielK_WMDE> well, "yes" is an accepting grammar
21:20:47 <DanielK_WMDE> since all text is valid wikitext
21:20:50 <brion> Heh
21:20:55 <Krinkle> #info <brion> Are we specing Wikitext the character sequence <brion> Or the document model?
21:21:03 <TimStarling> this is what subbu means by an executable spec: https://www.w3.org/TR/html5/syntax.html
21:21:06 * subbu is trying to find a link to the html5 tree building algo.
21:21:12 <subbu> oh, there TimStarling posted it
21:21:16 <TimStarling> it's a natural language description of an algorithm
21:21:25 * robla looks at the link
21:21:33 <gwicke> many have made the case that that wikitext should be treated as a textual UI & not as a storage format
21:21:52 * robla sees a lot of English in that ;-)
21:21:53 <gwicke> I tend to agree, it's not a good storage format
21:21:57 <DanielK_WMDE> i thin we need both: characters -> AST -> semantics. We could rely on the HTML spec for some parts of the AST and semantics.
21:22:18 <subbu> brion, i didn't understand your qn. reg. common document model
21:22:22 <tgr> and then the spec would throw up its hands any time an extension is involved? what is the use case for that?
21:22:29 <brion> gwicke: yes, though our main alternative now is HTML which is at the wrong abstraction level :)
21:22:41 <Krinkle> #info <subbu> let me reframe my goal: an executable spec for "old / legacy wikitext" + clean wikitext processing model as a spec for wikitext 2.0. the former lets you deal with old content.
21:22:41 <gwicke> brion: is it?
21:22:46 <tgr> to be able to write an alternative parser that absolutely cannot handle real wikitext as it appears on Wikipedia?
21:22:48 <Krinkle> #info <subbu> the latter lets you clean up wikitext and move forward.
21:23:02 <gwicke> brion: which issues do you see in the Parsoid DOM spec?
21:23:10 <Scott_WUaS> (If there's a choice in spec development between archiving a lot/all and archiving some, I'd be for the former, with some limitations).
21:23:13 <brion> subbu: I'm thinking like, would old and new parser rules end up creating compatible in memory object representations that could be transformed to one another
21:23:20 <DanielK_WMDE> robla: "Script data double escaped less-than sign state" <-- do we really want to go there?
21:23:43 <Krinkle> #info <gwicke> many have made the case that that wikitext should be treated as a textual UI & not as a storage format. <gwicke> I tend to agree, it's not a good storage format
21:23:48 <subbu> brion, ah .. i see. well, parsoid is an example implementation that can bridge between the two.
21:23:52 <Platonides> I don't think separating old/new wikitext format is workable
21:24:07 <subbu> Platonides, content-handler.
21:24:07 <brion> gwicke: in general HTML has too much low level detail: images list several distinct URLs, you have lots of presentation markup, etc. nothing wrong with Parsoid but it makes it inefficient and easy to change in ways that will be weird
21:24:13 <Platonides> lots of old-page calling new-template
21:24:24 <brion> I'd love a model that's slightly higher level than HTML
21:24:25 <Platonides> or in other order
21:24:28 <SMalyshev> tgr: I think there are degrees of it. Even right now wikitext on one wiki may be not reproducible on other wiki because of missing modules/templates. But we can get at least basic syntax?
21:24:40 <Platonides> wiktitext fragments being passed as parameters...
21:24:48 <TimStarling> DanielK_WMDE: it's simpler to write that than to try to reduce the existing algorithm to a formal description
21:24:54 <gwicke> brion: it's trivial to simplify to XML, but then you lose the rendering
21:25:22 <brion> Yes, you also lose the rendering if your details change like URL structure
21:25:23 <gwicke> in any case, the purpose of the DOM model is to clearly define what is semantically important, and what is just one way to format it
21:25:35 <brion> Yep
21:25:39 <brion> Dom is good :)
21:25:52 <Scott_WUaS> (If MediaWiki Content Translation became part of this spec writing, or similar, in what ways would it be best to write a spec to facilitate much translation of past resources?)
21:26:00 <gwicke> otherwise, Parsoid could just have said "here, it's HTML5"
21:26:13 <gwicke> & not bothered with a DOM sepc
21:26:17 <Krinkle> brion: We can express a higher model in HTML potentially. Parsoid does that to some extent already when an expansion stage exists (e.g. we could have <mw-image> instead of <img>, the model doens't need to be renderable in browsers as-is per se)
21:26:17 <gwicke> *spec
21:26:22 <brion> Scott_WUaS: persistent ids for content snippets that don't interfere with human readability
21:26:38 <Krinkle> #info <brion> in general HTML has too much low level detail: images list several distinct URLs, you have lots of presentation markup, etc.
21:26:42 <Debra> subbu: What were you hoping to accomplish in this session?
21:26:46 <Scott_WUaS> brion: thanks
21:26:55 <brion> Krinkle: yes, that's about what I'm thinking :)
21:26:58 <DanielK_WMDE> hm, has anyone looked at the attempt to build an ANTLR grammar for wikitext? If I remember corectly, the effort came quite far
21:26:59 <subbu> brion, but to answer your qn. i think using that is an implementation question ... but, the compatibility between old & new would be bridged via the output spec perhaps .. i am thinking aloud here.
21:27:15 <Krinkle> brion: OKay, so you didn't mean that the higher model should be something that isn't HTML syntax
21:27:28 <Platonides> DanielK_WMDE: but failed as always
21:27:30 <gwicke> DanielK_WMDE: there is not much benefit over PEG
21:27:32 * subbu is personally not interested in the grammar as a spec direction
21:27:33 <brion> Kringle yeah I'm agnostic to syntax really
21:27:34 <Platonides> if I remember right
21:27:36 <brion> Heh autocorrect
21:27:38 <Krinkle> (it could be browser renderable too, with custom elements nowadays)
21:27:45 <Platonides> the first 90% is easy
21:27:49 <DanielK_WMDE> Platonides: sure, it failed to cover the critical last 5%, but perhaps it's a good start. or a good lesson.
21:27:54 <Platonides> but the last 10% makes you mad
21:27:59 <DanielK_WMDE> for reference: https://www.mediawiki.org/wiki/Markup_spec/ANTLR/draft
21:28:01 <TimStarling> note that we don't actually have a full PEG spec
21:28:14 <Krinkle> #info <SMalyshev> I think there are degrees of it. Even right now wikitext on one wiki may be not reproducible on other wiki because of missing modules/templates.
21:28:18 <subbu> Debra, I was responding to robla's call which I felt was a good question to delve into about a spec ... to understand where perspectives are wrt to wikitext, spec, old / new wikitext, evolving wikitext, etc.
21:28:25 <gwicke> a full grammar spec is impossible anyway
21:28:35 <subbu> Debra, so far, i think it is being met well. as far as I am concerned, not sure what robla thinks. :)
21:28:41 <Debra> All right.
21:28:44 <gwicke> unless the grammar is fully turing complete
21:28:52 <robla> subbu: I think I mostly agree with you (not interested in grammar being the sole focus), but I do think syntax is important
21:28:52 <Debra> I think general discussion is fine, but sometimes people want more concrete action items from these meetings.
21:29:04 <TimStarling> we have PEG+stops, and I think it would be interesting to replace stops with an extension to the PEG formalism, instead of being implemented in JS
21:29:21 <gwicke> stops are just a compression technique
21:29:26 <TimStarling> since we are probably forking PEG.js anyway
21:29:30 <gwicke> you can unroll that into a larger grammar
21:29:47 <TimStarling> yeah, but unrolling defeats the purpose of having the grammar, the purpose is reduction
21:29:50 <Krinkle> #info <subbu> * is personally not interested in the grammar as a spec direction
21:29:51 <gwicke> very tedious, but possible
21:30:21 <Krinkle> #link https://www.mediawiki.org/wiki/Markup_spec/ANTLR/draft
21:30:32 <subbu> robla, i think wikitext syntax parsing is a solved problem ... the peg grammer with stops is good enough .. TimStarling also pointed me to an ebnf grammar for the php preprocessor .. i think between the two, we have wikitext syntax tokenization covered.
21:30:56 <DanielK_WMDE> do we have a link to the PEG thing?
21:30:58 <subbu> but, that syntax spec is not useful for actually understanding wikitext semantics or generaitng html from it.
21:31:06 <brion> Full data model is more complex potentially yes!
21:31:08 <TimStarling> DanielK_WMDE: you won't like it
21:31:11 <subbu> :)
21:31:20 <gwicke> DanielK_WMDE: https://github.com/wikimedia/parsoid/blob/master/lib/wt2html/pegTokenizer.pegjs.txt
21:31:35 <robla> subbu: yup, the latter is way more interesting (understanding wikitext semantics or generaitng html from it)
21:31:45 <Krinkle> #info <subbu> syntax spec is not useful for actually understanding wikitext semantics or generaitng html from it.
21:31:49 <DanielK_WMDE> gwicke: thanks
21:31:56 <brion> subbu: does the Parsoid Dom model help in terms of defining some elements?
21:31:59 <Krinkle> subbu: Exactly, we need to specify not the syntax, but what the tokens mean in relation to each other.
21:32:00 <DanielK_WMDE> #link https://github.com/wikimedia/parsoid/blob/master/lib/wt2html/pegTokenizer.pegjs.txt
21:32:12 <brion> Or do we need to go farther with semantic info
21:32:18 <TimStarling> spec for tokenization in the MW preprocessor: https://www.mediawiki.org/wiki/Preprocessor_ABNF
21:32:19 <subbu> brion, https://www.mediawiki.org/wiki/Parsing/Notes/A_Spec_For_Wikitext#What_kind_of_specs_can_we_develop.3F addresses your qn i think
21:32:20 <brion> Eg, here's how a section is ordered
21:32:35 <DanielK_WMDE> #link https://www.mediawiki.org/wiki/Markup_spec
21:32:41 <gwicke> I'm personally most interested in developing the DOM spec further, as well as cleaning up the messy semantics around transclusion & templating
21:33:00 <TimStarling> the awkward thing about ABNF is that precedence is unspecified
21:33:07 <TimStarling> in PEG, there is a specified precedence
21:33:10 <Krinkle> #info <TimStarling> spec for tokenization in the MW preprocessor: https://www.mediawiki.org/wiki/Preprocessor_ABNF
21:33:24 <brion> Subbu: Yes good :)
21:33:26 <subbu> robla, for semantics ... i think TimStarling is on the right track about an executable spec .. since we have html5 tree builder as a successful model for dealing with legacy formats.
21:33:33 <SMalyshev> gwicke: wow that file breaks syntax highlighter :)
23721:33:42 <subbu> but, i think if we stuck just there, that would be unfortunate.
23821:33:47 <subbu> if we *were* ..
23921:33:55 <gwicke> SMalyshev: I think the .txt suffix throws it off
24021:33:56 <robla> (a side note for us to cover in the last few minutes: should we have a Phab task to track the state of Parsing/Notes/A_Spec_For_Wikitext and turn the latter into an RFC)
24121:34:00 <brion> The idea that extensions may need some more description is on point I think.
24221:34:32 <brion> Params for modules/templates/extensions may themselves be Wikitext, or may be just little string tokens
24321:34:47 <Krinkle> subbu: TimStarling: So the executable spec would dictate that when encountering <mw-template> (or {{template) that it include that target in a certain way?
24421:34:52 <brion> If you want a library that greps or search replaces in text, that matters to you
24521:35:25 <subbu> Krinkle, it would define that that token be preprocessed to generate expanded wikitext, for example.
24621:35:35 <subbu> a reference implementation does not have to be a high-performance implementation.
24721:35:37 <brion> Similar to needing to know that some HTML tags are self closing or etc
24821:35:51 <subbu> it can be very slow, but built with the goal of being understandable and easy to grok.
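[Editorial aside: subbu's slow-but-readable reference implementation could look something like the following sketch. It is entirely hypothetical and covers only a toy subset (headings and `[[links]]`); the point is the style — each rule is a small named function optimized for legibility, not speed.]

```python
# Hypothetical sketch of an "executable spec" style: a deliberately simple,
# readable tokenizer for a toy wikitext subset (headings and [[links]] only).
# Not MediaWiki's actual tokenizer.
import re

def tokenize(line):
    """Tokenize one line: either a heading, or text interleaved with links."""
    # Rule 1: a heading is 2-6 '=' signs, text, and the same number of '='.
    m = re.match(r"^(={2,6})\s*(.*?)\s*\1\s*$", line)
    if m:
        return [("heading", len(m.group(1)), m.group(2))]
    # Rule 2: otherwise, split the line into plain text and [[link]] tokens.
    tokens, pos = [], 0
    for m in re.finditer(r"\[\[(.*?)\]\]", line):
        if m.start() > pos:
            tokens.append(("text", line[pos:m.start()]))
        tokens.append(("link", m.group(1)))
        pos = m.end()
    if pos < len(line):
        tokens.append(("text", line[pos:]))
    return tokens
```

A real reference implementation would have many more such rules, but keeping each one a small self-describing function is what makes it usable as a spec.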
21:35:57 <DanielK_WMDE> TimStarling: hm... much of the problematic parts are related to tag extensions, parser functions, and other transclusion mechanisms. it seems to me the preprocessor is a tool to separate these from the "wikitext proper" parts. that would perhaps make it easier to write a spec just for these parts.
21:36:20 <Krinkle> subbu: Is the goal for the spec to allow third-parties to do what MediaWiki does now when viewing an old revision? (e.g. current version of templates) - or do we intend to improve that behaviour as part of this?
21:36:25 <Platonides> the preprocessor is relatively easy
21:36:30 <Platonides> "relatively"
21:36:44 <brion> Heh
21:36:45 <gwicke> "just do it as this code does"
21:36:46 <subbu> Krinkle, that is a reasonable goal, yes.
21:36:47 <DanielK_WMDE> Krinkle: i think an executable spec is called a "parser".
21:36:55 <Krinkle> (or perhaps even move it out of the spec by, as gwicke mentioned, considering it only an input method to the model, and never a storage format)
21:37:33 <robla> the model would still have a canonical disk representation
21:37:49 <subbu> DanielK_WMDE, maybe .. but, I think it would also extract the most important semantics out of the guts of mediawiki.
21:38:09 <DanielK_WMDE> a really well written, nicely readable parser could serve as a spec. probably a recursive descent parser.
21:38:25 <gwicke> DanielK_WMDE: PEG is recursive descent
21:38:28 <subbu> i.e. how far can you pull the parser out and see how much spaghetti links back into the mediawiki guts you can cut out without breaking the essential interpretation of content.
21:38:44 <subbu> so, for example, red links, etc. may not be essential in the executable spec.
21:39:00 <subbu> they are all best viewed as post-parser transformations even for old revisions.
21:39:07 <subbu> and need not be part of the spec.
21:39:15 <DanielK_WMDE> gwicke: so is the PEG code sufficiently readable that others could reasonably use it to build their own grammar or parser?
21:39:27 <robla> the thing that's nice about a natural language version of a spec (as opposed to an executable one) is that it's possible to have an "incomplete" spec that's still useful
21:39:35 <gwicke> DanielK_WMDE: yes, it even compiles to a tokenizer out of the box
21:39:55 <gwicke> it's still not easy, but there is only so much complexity you can pretend to not be there
21:40:24 <robla> focusing on making the spec executable seems like a "nice to have", not a hard and fast requirement. It seems like overengineering
21:40:48 <DanielK_WMDE> gwicke: can we package the complexity into nice bundles, that can be covered or ignored? a modular spec?
21:40:57 <Krinkle> In other words, if a third party has an xml dump of all page titles and revisions, can they use this to figure out how to render them? Or do we involve other stuff that wouldn't be in there. References to other pages are doable I guess, but references to other stuff get more complicated ({{int:}}, {{gender:}}, including special pages, extension tags)
21:41:08 <DanielK_WMDE> (to an extent, all specs are modular, since they all build on top of pre-established conventions)
21:41:17 <gwicke> like wikitext-without-italic-and-bold?
21:41:19 <brion> Mediawiki standard library ;)
21:41:22 <TimStarling> the PEG grammar would be more readable if the JS event code was separated from the PEG recognizer
21:41:25 <robla> DanielK_WMDE: yes, exactly (building on other specs)
21:41:31 <Krinkle> #info <subbu> so, for example, red links, etc. may not be essential in the executable spec.
21:41:33 <TimStarling> many PEG libraries actually do that
21:41:57 <Krinkle> #info <robla> the thing that's nice about a natural language version of a spec (as opposed to an executable one) is that it's possible to have an "incomplete" spec that's still useful
21:42:16 <Krinkle> I think the executable spec would be written in natural language. The HTML5 executable spec is that way.
21:42:17 <gwicke> an incomplete spec isn't useful for the archival problem
21:42:30 <brion> Krinkle: I like the idea of a layered spec. Things like extensions and parser functions are an additional layer needed for some uses but not all
21:42:33 <arlolra> TimStarling: we should do that
21:43:02 <James_F> gwicke: Well, a parser/etc. for the log wikitext sub-type (bold/italics/links and nothing else) is probably needed.
21:43:17 <robla> gwicke: I think usefulness is a matter of degrees, not binary. A really complete spec is most useful, but an incomplete spec can be useful.
21:43:19 <gwicke> also, no links / images, I guess
21:43:21 <brion> Eg if mediawiki burns in a fire and we have all this data, what do we need to separate out the various levels of data in its place
21:43:27 <gwicke> as those depend on the target of the link
21:43:39 <brion> And how can we rearrange and refactor internally to represent those layers more maintainably
21:43:43 * subbu is trying to imagine software bits burning in a fire ...
21:43:46 <brion> Hehe
21:43:46 <Krinkle> If I understand correctly, Parsoid currently considers extension tags as instructions to make a request for generated content (aside from the few ones it implements natively). So it would depend on the availability of an HTTP service.
21:44:05 <Krinkle> this is extendable, but probably not desirable for the archival use case.
21:44:21 <gwicke> hey, you can make each extension its own spec
21:44:36 <subbu> Krinkle, wikitext-native extensions need to have a parsoid-native equivalent.
21:44:49 <subbu> i mean .. extensions that process wikitext.
21:44:57 <subbu> ex: ref, gallery
21:45:18 <James_F> And then there are extensions which don't process wikitext but do depend on it, like <timeline>
21:45:43 <gwicke> the old cobol folks would have smugly argued that their language was so close to human readable that they wouldn't need to bother writing separate prose
21:46:10 <robla> gwicke: :-)
21:46:12 <DanielK_WMDE> perhaps an improved PEG grammar as proposed by Tim with lots of semi-formal comments would be a decent compromise
21:46:15 <brion> Hehe
21:46:24 <DanielK_WMDE> it has the advantage that we already have half of it.
21:46:48 <DanielK_WMDE> it would not cover all layers. the missing layers should have well defined interfaces.
21:46:51 <subbu> DanielK_WMDE, it only tokenizes right now.
21:47:05 <DanielK_WMDE> well, that's at least the first layer
21:47:06 <robla> #info <DanielK_WMDE> perhaps an improved PEG grammar as proposed by Tim with lots of semi-formal comments would be a decent compromise
21:47:17 <TimStarling> my position on project planning is that a reductionist spec, which does not necessarily precisely reflect legacy wikitext, would be more useful than a complete spec
21:47:43 <TimStarling> for purposes of archiving, I think HTML+CSS+images, like kiwix, is good enough for most things
21:47:49 <gwicke> it's more likely to actually happen
21:47:58 <gwicke> but for archival purposes, it doesn't seem to have much value
21:48:04 <Krinkle> #info <TimStarling> for purposes of archiving, I think HTML+CSS+images, like kiwix, is good enough for most things
21:48:09 <TimStarling> although storing wikitext is still essential, to keep a record of user intentions with each edit
21:48:26 <TimStarling> maybe we should be storing parsoid HTML before and after each VE edit for the same reason
21:48:29 <Krinkle> For historical review and annotation/blame
21:48:40 <robla> #info <TimStarling> although storing wikitext is still essential, to keep a record of user intentions with each edit
21:49:05 <gwicke> we already store Parsoid HTML for each edit
21:49:10 <gwicke> but only going forward, not for the past
21:49:19 <DanielK_WMDE> gwicke: can we publish dumps of that?
21:49:24 <subbu> TimStarling, given that parsoid converts that back to equivalent wikitext, why would you need to store parsoid html before/after each ve edit?
21:49:40 <gwicke> DanielK_WMDE: yes, subject to some attention from ops
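[Editorial aside: the archival idea under discussion — keep wikitext as the record of user intent, plus a rendered-HTML snapshot per revision — could be serialized per revision along these lines. The field names and JSON-lines framing are invented for illustration; this is not an actual Wikimedia dump format.]

```python
# Hypothetical sketch of an archival dump record: one JSON line per revision,
# pairing the wikitext (user intent) with a snapshot of how it rendered at
# the time, so old revisions stay readable even if the parser changes.
import json

def archive_record(rev_id, wikitext, rendered_html):
    """Serialize one revision as a self-describing JSON line for a dump."""
    return json.dumps({
        "revision": rev_id,
        "wikitext": wikitext,      # user intent, possibly re-parseable
        "html": rendered_html,     # rendering snapshot, parser-independent
        "format_version": 1,       # any spec needs versioning (per gwicke)
    }, sort_keys=True)
```

Oversight/deletion (Krinkle's point above) is one reason such expanded snapshots would need the same suppression machinery as the wikitext itself.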
21:49:47 <brion> Well there's the whole template update issue
21:49:48 <DanielK_WMDE> subbu: because parsoid changes
21:49:52 <brion> That too
21:50:05 <DanielK_WMDE> brion: yes, indeed
21:50:07 <Krinkle> one consideration with storing expanded canonical form is oversight/deletion.
21:50:19 <subbu> brion, but, that is a generic wikitext problem, not restricted to ve edits.
21:50:25 <TimStarling> yeah, because parsoid changes, but I'm not going to try to sell you this idea right now
21:50:26 <brion> Yup
21:50:49 <gwicke> any spec needs versioning & format upgrades
21:51:09 <robla> so....this seems worthy of being an RFC going forward, no?
21:51:11 <Krinkle> #info <gwicke> we already store Parsoid HTML for each edit <DanielK_WMDE> gwicke: can we publish dumps of that? <gwicke> DanielK_WMDE: yes, subject to some attention from ops
21:51:22 <TimStarling> maybe it should be noted what an abysmally bad job we are doing with historical preservation of edits, for reasons unrelated to a spec
21:51:36 <DanielK_WMDE> #info the idea that storing (and possibly publishing) parsoid HTML for each revision for archival seems to have some support
21:51:41 <TimStarling> for example, most of the first 12 months of the history of the project are still missing
21:51:46 <gwicke> DanielK_WMDE: https://phabricator.wikimedia.org/T133547
21:52:01 <gwicke> an old dump at https://dumps.wikimedia.org/htmldumps/dumps/
21:52:37 <robla> TimStarling: I wouldn't call that abysmally bad, but I agree it'd be really fantastic to make the first 12 months more easily accessible
21:52:41 <TimStarling> 4 of those 12 months only exist in a landfill probably somewhere in Orange County
21:53:19 <gwicke> faithfully rendering early history would require changes similar to what the Memento project did
21:53:27 <TimStarling> the other 8 just nobody could be bothered importing
21:53:28 <subbu> now that we covered one part of the picture ... anyone have thoughts on moving to a future wikitext spec with an improved processing model? :)
21:53:31 * robla is now depressed after Tim's landfill remark :-( (because he imagines Tim's not wrong)
21:53:41 <gwicke> and even that would only go half way at best
21:53:53 <brion> subbu: :)
21:54:02 <gwicke> but, fortunately early content is also a lot simpler, so in practice not much actual content should be lost
21:54:18 <brion> I love the idea of specing things like extension input models better
21:54:22 <Krinkle> #info <subbu> now that we covered one part of the picture ... anyone have thoughts on moving to a future wikitext spec with an improved processing model? :)
21:54:26 <Platonides> maybe we could start importing that early data
21:54:46 <Platonides> then we could more optimistically go ahead with the future wikitext
21:54:47 <gwicke> subbu: I wouldn't call it wikitext spec
21:54:51 <robla> subbu, it'd be great for you to file a stub Phab task for us to track state of the wiki page, could you do that?
21:55:08 <gwicke> "wiki content processing spec" or "wiki content model spec"?
21:55:16 <subbu> gwicke, sure .. <insert-name-here> spec
21:55:26 * DanielK_WMDE whispers "hooks" @brion
21:55:30 <brion> subbu: yeah, content model or document model perhaps is the angle
21:55:41 <gwicke> the page component / composition stuff is aiming in that direction as well
21:55:56 <subbu> robla, i didn't follow reg. "track state of the wiki page" part .. can you say more?
21:55:56 <Krinkle> brion: If we go with the route of promoting (expanded, but annotated) html as archival format, that would remove dependencies like extensions. We'd store wikitext for review only (as direct input, no intent to re-parse).
21:55:56 <robla> we can morph https://www.mediawiki.org/wiki/Parsing/Notes/A_Spec_For_Wikitext into an RFC
21:55:56 * brion is a pirate and has hooks for hands. Arrrrr!
21:55:56 <subbu> ah, morphing that into a rfc .. task for that?
21:56:16 <subbu> if yes, sure. i can.
21:56:18 <brion> Krinkle: mmmmm, depends on the extension. If it needs js, your life is harder
21:56:22 <robla> subbu, yup
21:56:44 <robla> #action subbu file a Phab task to track https://www.mediawiki.org/wiki/Parsing/Notes/A_Spec_For_Wikitext for possible conversion to RFC
21:56:58 <gwicke> the fun part is that anything new will have to consider existing content
21:57:01 <Krinkle> brion: enhancement (same for CSS). Presumably the core content wouldn't depend on JS. Interaction and styling are up to the consumer to decide on.
21:57:11 <Scott_WUaS> (Cheers:)
21:57:12 <brion> *nod*
21:57:15 <Krinkle> If not, that's a bug in the extension :)
21:57:17 <Platonides> "fun"
21:57:18 <brion> :)
21:57:20 <Krinkle> (and we have some)
21:57:33 <Krinkle> Any last thoughts?
21:57:56 <gwicke> specs are hard
21:57:58 <Krinkle> #info subbu to create RFC
21:58:05 <brion> Specs and models for everyooooooone
21:58:24 <Krinkle> #info <gwicke> specs are hard
21:58:29 <robla> :-)
21:58:30 <brion> True :)
21:58:36 <Krinkle> #info <brion> Specs and models for everyooooooone
21:58:38 <Krinkle> #endmeeting

Recurring Event

Event Series
This event is an instance of E66: ArchCom RFC Meeting Wxx: <topic TBD> (<see "Starts" field>, #wikimedia-office), and repeats every week.

Event Timeline

RobLa-WMF renamed this event from ArchCom RFC Meeting Wxx: <topic TBD> (<see "Starts" field>, #wikimedia-office) to ArchCom RFC Meeting W32: <topic TBD> (2016-08-10, #wikimedia-office).Aug 4 2016, 12:12 AM
RobLa-WMF updated the event description. (Show Details)
RobLa-WMF renamed this event from ArchCom RFC Meeting W32: <topic TBD> (2016-08-10, #wikimedia-office) to ArchCom RFC Meeting W32: Wikitext (2016-08-10, #wikimedia-office).Aug 6 2016, 12:48 AM
RobLa-WMF updated the event description. (Show Details)
RobLa-WMF updated the event description. (Show Details)
daniel renamed this event from ArchCom RFC Meeting W32: Wikitext (2016-08-10, #wikimedia-office) to ArchCom RFC Meeting Wxx: <topic TBD> (<see "Starts" field>, #wikimedia-office).Nov 21 2016, 6:11 PM
daniel changed the host of this event from RobLa-WMF to daniel.
daniel invited: ; uninvited: .
daniel updated the event description. (Show Details)
daniel renamed this event from ArchCom RFC Meeting Wxx: <topic TBD> (<see "Starts" field>, #wikimedia-office) to ArchCom RFC Meeting W32: Wikitext (2016-08-10, #wikimedia-office).