Structured data side channel for wikitext
Open, Needs TriagePublic


The problem of passing structured data from wikitext to external applications comes up in a wide variety of contexts, and a garden of ugly workarounds has grown around it. These usually encode the data in some way in the HTML rendered from wikitext; external applications then parse it out and restore the structure. Examples include CommonsMetadata; the various services (Mobile-Content-Service, all kinds of Tool Labs tools) exposing main page / featured content (article/picture of the day, anniversaries, in the news, etc.); article maintenance / warning templates; infoboxes; and using Wiktionary for word translation.
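To make the workaround pattern concrete, here is a minimal sketch of the scraping side, assuming a template that wraps each value in an element whose class is `field-<name>`. The class naming scheme and the `FieldScraper` helper are hypothetical and are not how any of the tools listed above actually work.

```python
from html.parser import HTMLParser


class FieldScraper(HTMLParser):
    """Recover key-value pairs from rendered HTML (hypothetical markup convention)."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # name of the field whose text is being collected

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class") or ""
        if css_class.startswith("field-"):
            self._current = css_class[len("field-"):]
            self.fields[self._current] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] += data

    def handle_endtag(self, tag):
        # Any closing tag ends collection, so nested markup truncates the value.
        self._current = None


rendered = (
    '<span class="field-author">Jane Doe</span>'
    '<span class="field-license">CC BY-SA 4.0</span>'
)
scraper = FieldScraper()
scraper.feed(rendered)
print(scraper.fields)
```

Even this small case depends on the template's markup staying stable and flat, which is the fragility the task describes.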

Eventually these issues should be handled by separating wikitext and structured data (e.g. with T107595: [RFC] Multi-Content Revisions), but that's a huge project and will take a while. A quick win that would be possible right now, and would make life easier for developers mining structured data from wikitext (and for editors maintaining the wikitext), would be to create a side channel where wikitext code can output structured data (with a dedicated parser function / Lua method) in a simple hierarchical key-value format. The data could be exposed by the parser and the parse API, and eventually morph into a virtual MCR slot.
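As an illustration only, the side channel could behave roughly like the sketch below. It is written in Python for brevity rather than as MediaWiki code, and `SideChannel`, `emit`, the dotted-path key syntax and the JSON export are all assumptions, not part of the proposal.

```python
import json


class SideChannel:
    """Hypothetical per-parse store for structured data emitted by wikitext."""

    def __init__(self):
        self.data = {}

    def emit(self, path, value):
        """Store value under a dotted path such as 'featured.article.title'."""
        keys = path.split(".")
        node = self.data
        for key in keys[:-1]:
            node = node.setdefault(key, {})
            if not isinstance(node, dict):
                raise ValueError(f"{key!r} already holds a scalar value")
        node[keys[-1]] = value

    def to_json(self):
        """Serialize the collected data, e.g. for exposure through an API."""
        return json.dumps(self.data, sort_keys=True)


# What a parser function or Lua method might do while the page is parsed.
channel = SideChannel()
channel.emit("featured.article.title", "Example article")
channel.emit("featured.article.date", "2017-01-31")
channel.emit("featured.picture.file", "Example.jpg")
print(channel.to_json())
```

A consumer would then read the nested structure directly instead of parsing it back out of the rendered HTML.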

Event Timeline

Tgr created this task.Jan 31 2017, 11:47 PM
Restricted Application added a project: Wikidata.Jan 31 2017, 11:47 PM
Restricted Application added a subscriber: Aklapper.

I think we should invest 0 time in workarounds when we know the proper solution is MCR and know that we need to invest time in that. The more we ease the pain of not having MCR the harder it is to get it done. It's already hard enough as is.

Tgr added a comment.Feb 1 2017, 10:22 PM

This is not really a workaround - MCR deals with data storage and access, while this would deal with data extraction from wikitext. That's necessary whether we have MCR or not. Blocking it on MCR just makes the situation worse - it is already a big jumble of various groups waiting on each other to start doing something. Breaking off small self-contained chunks from huge projects and starting to work on them is not a bad way to get things moving.

This proposal has been selected for the Developer-Wishlist voting round and will be added to a MediaWiki page very soon. To the subscribers, or proposer of this task: please help modify the task description: add a brief summary (10-12 lines) of the problem that this proposal raises, topics discussed in the comments, and a proposed solution (if there is any yet). Remember to add a header titled "Description" to your content. Please do so before February 5th, 12:00 pm UTC.

bearND added a subscriber: bearND.Feb 7 2017, 7:06 PM
Lydia_Pintscher moved this task from incoming to monitoring on the Wikidata board.Dec 18 2017, 3:25 PM
He7d3r added a subscriber: He7d3r.Jan 11 2018, 7:52 PM
cscott added a subscriber: cscott.Aug 8 2019, 8:41 AM

FWIW, in T196440#5341715 I mooted the idea of a specialized "arglist" data type to be returned by a template, to make certain argument-list manipulations easier/more robust. I think this "structured data" input/output type would work well for that. {{#arglist}} would emit the arguments to the current template in the side channel key-value format, and {{#filter-arglist|....}} would accept that side channel format as input.

(And I agree with @Tgr's response to @Lydia_Pintscher that this issue isn't directly related to MCR. In my mind it is about what "output types" and "input types" are available for templates/extensions during preprocessing. Currently extensions can output a wikitext string or raw HTML, and as parameters they can only have a string (usually interpreted as a wikitext string). This task broadens both the possible output types and the possible input type to include structured data.)
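A rough sketch of how the arglist idea might look if structured values could flow between parser functions. It is in Python for illustration; the function names mirror {{#arglist}} and {{#filter-arglist}}, but their signatures, the `exclude` parameter and the `to_wikitext_call` helper are assumptions, since the comment does not specify them.

```python
def arglist(template_args):
    """Hypothetical {{#arglist}}: return the current template's arguments as structured data."""
    return dict(template_args)


def filter_arglist(args, exclude=()):
    """Hypothetical {{#filter-arglist}}: drop the named arguments, keep the rest in order."""
    return {name: value for name, value in args.items() if name not in exclude}


def to_wikitext_call(template, args):
    """Render a structured argument list back into a template invocation string."""
    parts = [template] + [f"{name}={value}" for name, value in args.items()]
    return "{{" + "|".join(parts) + "}}"


# An outer template forwards its arguments to an inner one, minus a debug flag.
received = arglist({"1": "foo", "class": "wide", "debug": "yes"})
forwarded = filter_arglist(received, exclude={"debug"})
print(to_wikitext_call("Inner", forwarded))
```

The point of the sketch is that the argument list stays a structured value until the final step, so no intermediate stage has to re-parse a wikitext string.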