
Structured data side channel for wikitext
Open, Needs Triage · Public

Description

The problem of passing structured data from wikitext to external applications comes up in a wide variety of contexts, and a garden of ugly workarounds has grown around it. These usually consist of encoding the data in some way in the HTML rendered from wikitext, with external applications then parsing it out and restoring the structure. Examples include CommonsMetadata; the various services (Mobile-Content-Service, all kinds of Tool Labs tools) exposing main page / featured content (article/picture of the day, anniversaries, in the news, etc.); article maintenance / warning templates; infoboxes; and using Wiktionary for word translation.

Eventually these issues should be handled by separating wikitext and structured data (e.g. with T107595: [RFC] Multi-Content Revisions), but that's a huge project and will take a while. A quick win that would be possible right now, and would make life easier for developers mining structured data from wikitext (and for editors maintaining the wikitext), would be to create a side channel where wikitext code can output structured data (with a dedicated parser function / Lua method), in a simple hierarchical key-value format. The data could be exposed by the parser and the parse API, and eventually morph into a virtual MCR slot.
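To make the proposed side channel more concrete, here is a minimal Python model of a hierarchical key-value store that wikitext code could write to during a parse. It is only a sketch: the `SideChannel` class, its `emit` method, the dotted-path convention, and the example keys are all assumptions made for illustration, not an existing MediaWiki or Scribunto API.

```python
from typing import Any


class SideChannel:
    """Hypothetical collector for structured data emitted during one parse."""

    def __init__(self) -> None:
        self.data: dict[str, Any] = {}

    def emit(self, path: str, value: Any) -> None:
        """Store value under a dotted path such as 'featured.picture.title'."""
        keys = path.split(".")
        node = self.data
        # Walk (and create) the intermediate levels of the hierarchy.
        for key in keys[:-1]:
            child = node.setdefault(key, {})
            if not isinstance(child, dict):
                raise ValueError(f"{key!r} already holds a scalar value")
            node = child
        node[keys[-1]] = value


# A template or Lua module would call emit() instead of hiding data in HTML.
channel = SideChannel()
channel.emit("featured.picture.title", "Example.jpg")
channel.emit("featured.picture.caption", "An example caption")
channel.emit("featured.article", "Example article")
```

After the parse, `channel.data` is a plain nested dictionary that the parser or the parse API could serialize (for example as JSON) alongside the rendered HTML, so consumers would not need to scrape the markup.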

Event Timeline


I think we should invest 0 time in workarounds when we know the proper solution is MCR and know that we need to invest time in that. The more we ease the pain of not having MCR the harder it is to get it done. It's already hard enough as is.

This is not really a workaround - MCR deals with data storage and access, this would deal with data extraction from wikitext. That's necessary whether we have MCR or not. Blocking it on MCR just makes the situation worse - it is already a big jumble of various groups waiting on each other to start doing something. Breaking off small self-contained chunks from huge projects and starting to work on them is not a bad way to get things moving.

This proposal is selected for the Developer-Wishlist voting round and will be added to a MediaWiki page very soon. To the subscribers, or proposer of this task: please help modify the task description: add a brief summary (10-12 lines) of the problem that this proposal raises, topics discussed in the comments, and a proposed solution (if there is any yet). Remember to add a header with a title "Description," to your content. Please do so before February 5th, 12:00 pm UTC.

FWIW, in T196440#5341715 I floated the idea of a specialized "arglist" data type to be returned by a template, to make certain argument-list manipulations easier and more robust. I think this "structured data" input/output type would work well for that. {{#arglist}} would emit the arguments to the current template in the side channel key-value format, and {{#filter-arglist|....}} would accept that side channel format as input.

(And I agree with @Tgr's response to @Lydia_Pintscher that this issue isn't directly related to MCR. In my mind it is about what "output types" and "input types" are available for templates/extensions during preprocessing. Currently extensions can output a wikitext string or raw HTML, and as parameters they can only have a string (usually interpreted as a wikitext string). This task broadens both the possible output types and the possible input type to include structured data.)
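As a rough illustration of the {{#arglist}} / {{#filter-arglist}} idea, the Python sketch below treats an argument list as a structured key-value value that one function produces and another consumes. The function names, the prefix-based filter, and the sample arguments are hypothetical; the comment above does not specify what parameters {{#filter-arglist}} would take.

```python
def arglist(template_args: dict[str, str]) -> dict[str, str]:
    """Model of {{#arglist}}: return the current template's arguments
    as structured key-value data instead of a flattened wikitext string."""
    return dict(template_args)


def filter_arglist(args: dict[str, str], keep_prefix: str) -> dict[str, str]:
    """Model of {{#filter-arglist}}: accept structured arguments as input.
    Filtering by name prefix is an assumption chosen for this example."""
    return {name: value for name, value in args.items()
            if name.startswith(keep_prefix)}


# Arguments as they might arrive at a wrapper template.
incoming = {"1": "Example", "style-color": "red", "style-width": "20em"}

# Pass only the style-related arguments on to an inner template.
styles = filter_arglist(arglist(incoming), "style-")
```

Because the values stay structured the whole way through, nothing needs to be escaped into a wikitext string and re-parsed between the two steps.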

FWIW, the ParserOutput does have such a mechanism, loosely defined in the ::set/appendExtensionData methods. What is missing is a standard way of exposing that to wikitext.

My inclination at the moment is to say that TemplateData is a good mechanism. Editors already know how to insert templates, and templates form the "microdata" mechanism of wikitext already. If you use {{author-birthdate|1976-09-27}} you're both allowing/enabling alternate styling of that information (including removing it from display entirely) as well as linking the TemplateData of author-birthdate to the parameters you've provided. If author-birthdate says (in its TemplateData) that argument 1 encodes https://www.wikidata.org/wiki/Property:P569 of its subject (the wikidata item associated with {{PAGETITLE}} by default, but presumably could be overridden by an optional second argument), we're most of the way there w/o using anything that editors aren't already familiar with. It's just a matter of hooking up the backend to write the appropriate metadata at parse time. T395968: Semantic template information in TemplateData is the task I've been using to track that basic idea.
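A small Python sketch of how that backend hook might look, assuming TemplateData gained a per-parameter field naming the Wikidata property it encodes. The `property` key, the `extract_statements` helper, and the shape of the returned records are hypothetical; only the "params" layout follows how TemplateData already describes parameters.

```python
# TemplateData-like description; the "property" key is a hypothetical addition.
TEMPLATE_DATA = {
    "author-birthdate": {
        "params": {
            "1": {"type": "date", "property": "P569"},
        }
    }
}


def extract_statements(template: str, args: dict[str, str],
                       subject: str) -> list[dict[str, str]]:
    """Turn one template invocation into property/value records
    for every supplied argument that has a declared property."""
    params = TEMPLATE_DATA.get(template, {}).get("params", {})
    return [
        {"subject": subject, "property": spec["property"], "value": args[name]}
        for name, spec in params.items()
        if "property" in spec and name in args
    ]


# {{author-birthdate|1976-09-27}} used on a page called "Example page".
statements = extract_statements(
    "author-birthdate", {"1": "1976-09-27"}, "Example page"
)
```

Records like these could be written to the parser output at parse time, with the subject defaulting to the current page as described above.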