
Handling of structured data input in MediaWiki APIs
Closed, Resolved · Public

Description

Sometimes we need a MediaWiki API module to take structured data containing arbitrarily many primitive fields as input. Some examples:

  • T182052: Batch reading list operations: we want to batch things like list creation to improve database performance and reduce bandwidth use and latency. This would involve sending some data structure like [ { "name": "...", "description": "..." }, ... ].
  • T56035: API imageinfo should allow fetching multiple thumbnail sizes: batching thumbnail URL fetches would require submitting filename/height/params triplets.
  • Multi-Content-Revisions will allow secondary content slots which hold complex data (such as some serialization of a deep associative array). This will probably open the structured data floodgate for extensions, and we'll want the API to support that.
  • We are using Wikibase in more and more places; the flexibility of data structure it provides (by limiting operations to semantic triplets) is great for editors but not so great for API clients. Eventually we'll probably want to provide high-level abstractions for reading and writing that data, which again means a need to take structured input. (An old and by now probably superseded discussion about that is T585: Finalize high-level API.)

Given the action API's choice of using the x-www-form-urlencoded input format (i.e. key1=value1&key2=value2&...), the current options for doing this are not great:

  • Use a bunch of multivalue input fields, one per field. So for creating multiple lists the request would be something like listname=First|Second|Third&listdescription=Blah|blah|blah. Besides not being easily human-readable, this deals poorly with optional fields (maybe some lists do not have a description), cannot handle non-string datatypes (null and the empty string are not always the same), or data structures deeper than one level, or any other kind of complex/flexible structure. (It's also problematic for fields that can contain pipe characters. The API has a workaround for that but it's not an elegant one.) Both this encoding and the single-JSON-parameter option below are sketched in code after this list.
  • Use some standard way to encode an arbitrary structure as a string (JSON would be the obvious one) and submit that as a single parameter. This hides the real parameters and loses all the support the API has for parameters (documentation, format checking etc.) A couple of API modules seem to do this anyway (e.g. abusefiltercheckmatch, categorytree, wblexemeaddform, wikispeech, something in Linter).
  • Allow POST requests with an application/json mimetype, read parameters from the JSON body. Mostly the same problems as above; a little saner interaction with debugging tools, probably a little insaner interaction with the parameter handling of the API framework.
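
For concreteness, here is a minimal Python sketch of how the first two options would encode the reading-list example; the parameter names (listname, listdescription, batch) are illustrative, not an existing module's:

```python
import json
from urllib.parse import urlencode

# Hypothetical batch of reading lists, as in the description above.
lists = [
    {"name": "First", "description": "Blah"},
    {"name": "Second", "description": "blah"},
    {"name": "Third"},  # no description: awkward in the multivalue encoding
]

# Option 1: multivalue fields, one per field of the structure.
multivalue = urlencode({
    "listname": "|".join(l["name"] for l in lists),
    "listdescription": "|".join(l.get("description", "") for l in lists),
})
# listname=First%7CSecond%7CThird&listdescription=Blah%7Cblah%7C
# Note that "no description" and "empty description" are indistinguishable.

# Option 2: the whole structure as a single JSON-encoded parameter.
single_param = urlencode({"batch": json.dumps(lists)})
```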

Adding some kind of REST API support to MediaWiki would probably be the ideal solution, but that seems like a fairly large undertaking. My immediate goal with the RfC is to find a pragmatic solution for T182052: Batch reading list operations (which needs to be done short-term) without increasing maintenance burden for the API. Having a shared understanding of how to deal with this problem in the long term would help inform that work.

Event Timeline

mobrovac subscribed.

We have the same need in the JobQueue migration project to be able to use Special:RunSingleJob, cf. T182372: Make Kafka JobQueue use Special:RunSingleJob, where we need to send structured events to it. Gerrit 388486 proposes to handle application/json payloads in WebRequest to that end.

T56035: API imageinfo should allow fetching multiple thumbnail sizes: batching thumbnail URL fetches would require submitting filename/height/params triplets.

There's no need to submit a filename unless you want different-sized thumbnails for each image, rather than multiple thumbnails of the same sizes for every image as that task requests. And if you did go in that direction, it would become a very different module from prop=imageinfo.

But the better solution for that task would be to resolve T66214: Define an official thumb API so imageinfo doesn't have to return sized thumbnail URLs at all. It would instead point to the thumb API endpoint for each image where the client could request whatever thumbnail size and other parameters it wants.

Multi-Content-Revisions will allow secondary content slots which hold complex data (such as some serialization of a deep associative array). This will probably open the structured data floodgate for extensions, and we'll want the API to support that.

This seems like a red herring. Generic write operations on such things will likely operate by submitting a whole edited document for the slot, not merge random bits into the existing data. Generic read operations are fairly likely to be returning the whole source document too rather than trying to parse out a complex data structure that may or may not actually be sensibly representable in all the API's output formats.

If some Content type can work sensibly with write operations on structured data, it'll likely want a specialized API module to handle that. And chances are fairly decent that it won't need complex generalized data structures to do so unless it gets into a batch operations requirement like you're running into.

We are using Wikibase in more and more places; the flexibility of data structure it provides (by limiting operations to semantic triplets) is great for editors but not so great for API clients. Eventually we'll probably want to provide high-level abstractions for reading and writing that data, which again means a need to take structured input.

It's very hard to figure out what we might need for something that's still in the blue-sky planning stages. Your high-level abstraction for reading might turn out to be SPARQL, and you might find that high-level abstractions for writing aren't actually needed at the API level.

Besides not being easily human-readable, [...] or data structures deeper than one level, or any other kind of complex/flexible structure.

That's true enough, although the need to supply complex data structures has historically been pretty rare.

this deals poorly with optional fields (maybe some lists do not have a description),

Only if there's no value that can be supplied to indicate "not provided" that isn't also a valid provided value. In this case, it seems to me that "do not have a description" would be well-represented by the empty string.

cannot handle non-string datatypes (null and the empty string are not always the same),

It can handle non-string datatypes fine. What you're complaining about here is a "string|null" mixed type where there's no value to represent null that can't be confused with a valid string value.

Even then, you could still do it easily enough with a little metadata encoded in the value, much like how you can differentiate a JSON object, array, string, or other scalar by looking at the first character. For example, represent null as empty-string, and strip a single leading "$" from any string. Thus null would be "", empty-string would be "$", the string "$3.50" would be "$$3.50", and the string "foobar" could be either "foobar" or "$foobar".
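
A minimal sketch of that convention (the helper names are hypothetical, not part of the API framework):

```python
def encode(value):
    """Encode an optional string so null survives the round trip:
    null becomes "", and any real string gets a "$" prefix."""
    if value is None:
        return ""
    return "$" + value

def decode(encoded):
    """Reverse of encode(): "" means null; otherwise strip one leading "$"."""
    if encoded == "":
        return None
    if encoded.startswith("$"):
        return encoded[1:]
    return encoded  # strings without a leading "$" may also be sent bare

assert decode(encode(None)) is None
assert decode(encode("")) == ""
assert decode(encode("$3.50")) == "$3.50"
assert decode("foobar") == decode("$foobar") == "foobar"
```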

It's also problematic for fields that can contain pipe characters. The API has a workaround for that but it's not an elegant one.

What's not elegant about it?
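
For context, the workaround being discussed is the action API's alternative multi-value separator: a value that begins with U+001F is split on U+001F instead of the pipe. A minimal client-side sketch of producing such a value:

```python
# Join multi-values for the action API. Values that contain pipes force the
# alternative U+001F (Unit Separator) encoding; the leading U+001F signals
# to the server which separator is in use. (Values may not themselves
# contain U+001F.)
def join_multivalue(values):
    if any("|" in v for v in values):
        return "\x1f" + "\x1f".join(values)
    return "|".join(values)

print(repr(join_multivalue(["plain", "with|pipe"])))
# '\x1fplain\x1fwith|pipe'
```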

  • Use some standard way to encode an arbitrary structure as a string (JSON would be the obvious one) and submit that as a single parameter. This hides the real parameters and loses all the support the API has for parameters (documentation, format checking etc.) A couple of API modules seem to do this anyway (e.g. abusefiltercheckmatch, categorytree, wblexemeaddform, wikispeech, something in Linter).
  • Allow POST requests with an application/json mimetype, read parameters from the JSON body. Mostly the same problems as above; a little saner interaction with debugging tools, probably a little insaner interaction with the parameter handling of the API framework.

As you noted, these are basically the same thing. The main difference is that the latter puts all the parameters into a JSON object rather than just the one complex parameter, and becomes harder to deal with generically in something like ApiSandbox.

BTW, I note that categorytree is being lazy and wikispeech seems like it's trying to be a bit too complicated.

Adding some kind of REST API support to MediaWiki would probably be the ideal solution, but that seems like a fairly large undertaking.

I note REST APIs are typically also bad at batching, since they tend to want to operate on exactly one resource per request. Unless the "one resource" happens to be a predefined list of all the resources you want to deal with in your batch, you're either out of luck or you're actually getting closer to some sort of RPC by submitting a processing instructions document rather than a new resource.

My immediate goal with the RfC is to find a pragmatic solution for T182052: Batch reading list operations (which needs to be done short-term) without increasing maintenance burden for the API. Having a shared understanding of how to deal with this problem in the long term would help inform that work.

Probably go with your second bullet: Define and document a "processing instructions" data structure and submit that as a parameter.
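
For illustration, such a "processing instructions" parameter might look like the following sketch (the module name, the batch parameter, and the operation shape are assumptions, not a settled design):

```python
import json

# Hypothetical "processing instructions" document: each entry describes one
# operation in the batch.
batch = [
    {"op": "create", "name": "First", "description": "Blah"},
    {"op": "create", "name": "Second"},  # optional field simply omitted
    {"op": "delete", "id": 42},
]

params = {
    "action": "readinglists",  # assumed module name for illustration
    "format": "json",
    "batch": json.dumps(batch),  # the whole structure as one parameter
}
```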

Longer term, we'd need some sort of schema language that's machine parseable, well defined, well documented, human readable (or easily converted into something human readable), and has an oojs-ui widget that can be used to edit it without having to write raw JSON/XML/whatever. Then we'd use that to add better validation and documentation to the second bullet.

We have the same need in the JobQueue migration project to be able to use Special:RunSingleJob, cf. T182372: Make Kafka JobQueue use Special:RunSingleJob, where we need to send structured events to it. Gerrit 388486 proposes to handle application/json payloads in WebRequest to that end.

I note that's a very special use case rather than something general to MediaWiki APIs.

As I pointed out in the code review on that patch, going from "we need JSON for one specific use case" to "everything that accepts data via WebRequest should accept JSON, never mind the vastly different semantics" is a huge step. You'd probably do better by handling it in your special page.

Tgr claimed this task.

Probably go with your second bullet: Define and document a "processing instructions" data structure and submit that as a parameter.

Works for me, thanks.

Longer term, we'd need some sort of schema language that's machine parseable, well defined, well documented, human readable (or easily converted into something human readable), and has an oojs-ui widget that can be used to edit it without having to write raw JSON/XML/whatever. Then we'd use that to add better validation and documentation to the second bullet.

JSON Schema, I guess? It's a big mess (a bunch of different drafts, and few clients support the more recent ones), but I can't think of much else. It's used in Swagger, so it's convenient for porting to and from RESTBase, and there are form-generator libraries for it (which can probably be hacked to generate OOUI forms).
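
As a sketch of how that could look, here is the batch structure validated against a JSON Schema using the third-party Python jsonschema package (the schema itself is illustrative, not a spec):

```python
import jsonschema  # third-party "jsonschema" package

# One possible schema for the batched reading-list structure: it documents
# the allowed fields and lets the server validate input generically.
schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "description": {"type": "string"},
        },
        "required": ["name"],
        "additionalProperties": False,
    },
}

jsonschema.validate(
    [{"name": "First", "description": "Blah"}, {"name": "Second"}],
    schema,
)  # raises jsonschema.ValidationError on bad input
```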

In this case, it seems to me that "do not have a description" would be well-represented by the empty string.

On creation, yes. On update, setting the description to the empty string and leaving the description unchanged are both meaningful actions.
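
A minimal sketch of those update semantics (the helper is hypothetical: an absent key means "leave unchanged", while an explicit empty string clears the field):

```python
def apply_update(record, patch):
    # Absent key: leave the field unchanged. Present key (even ""): set it.
    # A flat form parameter cannot express this three-way distinction.
    for key in ("name", "description"):
        if key in patch:
            record[key] = patch[key]
    return record

record = {"name": "First", "description": "Blah"}
apply_update(record, {"name": "Renamed"})    # description kept
apply_update(record, {"description": ""})    # description cleared
```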

I note REST APIs are typically also bad at batching, since they tend to want to operate on exactly one resource per request. Unless the "one resource" happens to be a predefined list of all the resources you want to deal with in your batch, you're either out of luck or you're actually getting closer to some sort of RPC by submitting a processing instructions document rather than a new resource.

In this case I'll just use a static URL and put the list of resources to update in the payload. That's arguably not technically REST, but it fits the semantics intuitively, and since there is no caching on writes there isn't any practical drawback.
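
A sketch of that shape (the endpoint URL and payload are illustrative, not an existing API):

```python
import requests  # assumed HTTP client for this sketch

# Static URL; the resources to update are listed in the JSON payload
# rather than addressed individually in the path.
resp = requests.post(
    "https://example.org/api/lists/batch_update",  # hypothetical endpoint
    json=[
        {"id": 1, "description": "updated"},
        {"id": 2, "description": ""},  # explicitly cleared
    ],
)
resp.raise_for_status()
```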