
Add a way to query structured output of wikipage content
Closed, Invalid · Public

Description

It would be really nice to have a built-in way to query page content in a structured form for later use in bots or scripts, without having to rely on regular expressions or write a wikicode parser yourself. What I want is the output of all elements of a page, serialized in a format of your choice. This differs from action=query&prop=x or action=parse: the output must exactly match the layout of the originating wikipage. Example output in XML:

<root>
    <header level="2"> <wikilink target="Main page">Example</wikilink> </header>
    <filelink target="File:Example.png" width="250px" link="Click me">Click the image or <wikilink target="Cookie">eat a cookie</wikilink>!</filelink>
</root>

for a page like this:

== [[Main page|Example]] ==
[[File:Example.png|link=Click me|Click the image or [[Cookie|eat a cookie]]!|250px]]

So I propose adding a new action to get a page's structured view, for example action=query&prop=structure. There should also be a reverse action to convert structured content back into wikicode, and another to perform an edit using a modified structured view.

Some notes:

  • It should be possible to recover the original source from a structured view exactly as it is. No normalization or other changes should be made in between, and all parameters (in templates, magic words, and other elements) should keep their exact order.
  • The example output above is not complete. I believe all plain text should in fact be encapsulated in a separate tag, such as <text>, and line breaks should also be passed into it.
  • Could have a parameter to request particular sections, by name or title.
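
To make the exact-recovery requirement in the notes above concrete, here is a minimal Python sketch of a lossless structured view. It is a toy model, not how MediaWiki or Parsoid work: it only recognizes `[[...]]` nesting, and the names `Text`, `Link`, `parse`, and `serialize` are hypothetical. The point is that every character of the source ends up in exactly one node, so serializing the tree returns the original wikicode unchanged.

```python
import re
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Text:
    """Plain text, stored verbatim (line breaks and pipes included)."""
    raw: str

    def to_wikitext(self) -> str:
        return self.raw


@dataclass
class Link:
    """A [[...]] element; `closed` records whether a matching ]] was seen."""
    children: List[Union["Link", Text]] = field(default_factory=list)
    closed: bool = False

    def to_wikitext(self) -> str:
        inner = "".join(child.to_wikitext() for child in self.children)
        return "[[" + inner + ("]]" if self.closed else "")


def parse(src: str) -> List[Union[Link, Text]]:
    """Build a tree of Text and Link nodes without dropping any character."""
    root: List[Union[Link, Text]] = []
    stack = [root]               # child lists; the last one receives new nodes
    open_links: List[Link] = []  # links that have not been closed yet
    # A capturing group makes re.split keep the delimiters as tokens.
    for token in re.split(r"(\[\[|\]\])", src):
        if token == "":
            continue
        if token == "[[":
            link = Link()
            stack[-1].append(link)
            stack.append(link.children)
            open_links.append(link)
        elif token == "]]" and open_links:
            open_links.pop().closed = True
            stack.pop()
        else:
            # Ordinary text, or a stray "]]" with nothing to close.
            stack[-1].append(Text(token))
    return root


def serialize(nodes: List[Union[Link, Text]]) -> str:
    """Reverse action: turn the structured view back into wikicode."""
    return "".join(node.to_wikitext() for node in nodes)


page = (
    "== [[Main page|Example]] ==\n"
    "[[File:Example.png|link=Click me|Click the image or "
    "[[Cookie|eat a cookie]]!|250px]]"
)
assert serialize(parse(page)) == page
```

A real implementation would also need nodes for headers, templates, link parameters, and so on, but the same rule would apply: each node keeps enough of its raw source to be written back without normalization.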

Event Timeline

KPu3uC_B_Poccuu raised the priority of this task from to Needs Triage.
KPu3uC_B_Poccuu updated the task description.
KPu3uC_B_Poccuu added a project: MediaWiki-API.
KPu3uC_B_Poccuu added a subscriber: KPu3uC_B_Poccuu.
Restricted Application added subscribers: StudiesWorld, Aklapper. · Jan 17 2016, 3:42 PM
KPu3uC_B_Poccuu renamed this task from Add a way to query structured output of a wikipage content to Add a way to query structured output of wikipage content. · Jan 17 2016, 3:43 PM
KPu3uC_B_Poccuu updated the task description.
KPu3uC_B_Poccuu set Security to None.
Anomie added a subscriber: Anomie. · Jan 17 2016, 3:45 PM

It sounds like you want Parsoid. I've long said that the PHP parser should also support the metadata that allows Parsoid to go from HTML back to wikitext, but that's far deeper than a simple API feature request.

Anomie added a project: MediaWiki-Parser.

I don't mind, as long as any client can access it, regardless of the platform it is built on, including JavaScript scripts.

It doesn't suffice. For one, it doesn't seem to satisfy the requirement of being able to recover the exact code.

You may want to look again. The whole point of Parsoid is to be able to losslessly round-trip wikitext→HTML→wikitext, because that's what's necessary to make VE work.

Please use the RESTBase API (e.g. https://rest.wikimedia.org/en.wikipedia.org/v1/?doc for enwiki) to convert between wikitext and HTML. As @Anomie pointed out, VE (and CX and Flow) depend on Parsoid to do the necessary round-tripping in the presence of edits.
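
As a rough illustration of that conversion, the Python sketch below builds (but does not send) a request against the base URL above. The `/transform/{from}/to/{to}` path and the form field named after the source format are assumptions based on the transform endpoints listed in the linked API documentation; check them there before relying on this. `build_transform_request` is a hypothetical helper name.

```python
import urllib.parse
import urllib.request

# Base URL taken from the link above (enwiki).
BASE = "https://rest.wikimedia.org/en.wikipedia.org/v1"


def build_transform_request(source: str, src_format: str, dst_format: str,
                            base: str = BASE) -> urllib.request.Request:
    """Prepare a POST request converting between wikitext and HTML.

    Assumption: the endpoint is {base}/transform/{src}/to/{dst} and the
    content is sent as a form field named after the source format.
    """
    if {src_format, dst_format} != {"wikitext", "html"}:
        raise ValueError("formats must be 'wikitext' and 'html'")
    url = f"{base}/transform/{src_format}/to/{dst_format}"
    body = urllib.parse.urlencode({src_format: source}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Wikitext to HTML; the reverse direction swaps the two format arguments.
request = build_transform_request("== [[Main page|Example]] ==",
                                  "wikitext", "html")
```

Passing the prepared request to `urllib.request.urlopen` would perform the actual call. A bot would fetch the HTML, edit the DOM, and send the result back through the HTML-to-wikitext direction.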

You are asking for an AST / IR for wikitext that also provides all of the features that Parsoid provides right now (everything you request above). Early on, Parsoid chose HTML5 (plus additional metadata and attributes) as this representation. So, please look at the DOM spec (@Arlolra linked to it above).

The following tasks track requests that should cover some of the other things that you are requesting:

  • T114072: <section> tags for MediaWiki sections
  • T111674: Parsoid on serialisation should use the `format` field from TemplateData if available to set the whitespace formatting of new/edited transclusions
  • T104599: Parsoid on serialisation should use the `paramOrder` field from TemplateData if available to set the order of parameters in new/edited transclusions

So, my recommendation is to use the existing API to recover structural information, make edits, and save it back.

I am going to close this ticket as declined later this week.

Separately, if you are familiar with mwparserfromhell, https://doc.wikimedia.org/Parsoid/master/#!/guide/jsapi documents how you can use Parsoid in a similar way. However, this usage is not guaranteed to avoid dirty diffs (for example, normalization of whitespace, quotes, etc.). We could provide this as part of the jsapi, but only if there is sufficient justification to do so.

jayvdb closed this task as Resolved. · Jan 19 2016, 12:42 AM
jayvdb changed the task status from Resolved to Invalid.