Page MenuHomePhabricator

RFC: Allow JSON values to be included in the API results
Closed, InvalidPublic

Description

As we use more and more JSON data, in some cases API modules need to return that JSON data as is, without any modifications. For example, action=graph could return the graph data in Vega format, or another module could get the template parameters exactly as written in a template's JSON parameters block.

If api module adds JSON data to the result directly, e.g. $result->addValue( null, 'graph', $jsonData ), the api's format=json will manipulate the result, deleting boolean values, and sometimes causing complicated bugs (T120227). Even though boolean sanitization could be partially solved with formatversion=2, there might still other unexpected result manipulations, causing obscure bugs. Also, for the non-JSON formats like format=xml, the result is mostly useless.

Current workaround

One option is to convert JSON to a string, but this will cause substantial increase in result size (all " need to be escaped as \"), and require additional JSON decoding on the client, complicating client code:

{
  "graph": "{ \"some\":\"data\", \"in\":\"json\", \"bool\": false }"
}
Proposal

I propose that we introduce a new "JSON" value type:

$result->addValue( null, 'graph', $jsonData, ApiResult::FLAG_JSON );

Values added with this flag will be outputted as-is by the JSON formatter.

{
  "graph": { "some":"data", "in":"json", "bool": false }
}

For XML and other formats, the value will be automatically converted into a string. This will make JSON usage much easier on the JSON-aware clients, while still allowing other formats to function correctly.

<api>
  <graph>
      { "some":"data", "in":"json", "bool": false }
  </graph>
</api>

Event Timeline

Yurik raised the priority of this task from to Needs Triage.
Yurik updated the task description. (Show Details)
Yurik added subscribers: Yurik, tstarling, Anomie and 6 others.
Yurik set Security to None.

If you're really wanting to return specific JSON data for a result property, you need to encode that data as a string because we have non-JSON formats. If you're returning a data structure, that's a data structure and not JSON so this shouldn't be needed. While this sometimes-structure-sometimes-string idea would make things slightly easier for JSON-using consumers by allowing them to skip a decode step, it introduces another incompatibility between the different formats that's liable to trip people up.

I also note that, were something like this to be implemented, the PHP data structure implied to be in $jsonData (because it being a JSON string would require a rewrite of the FormatJson class) would suffer from T12887 and related bugs if using associative arrays rather than stdClass objects, might have different escaping applied to string values, and would convert floats like 3.0 to integers, all of which would defeat the stated purpose of exactly preserving JSON data.

And were this to be implemented, the solution would be more appropriately done as a new metadata property like ApiResult::META_BC_BOOLS rather than a flag to ApiResult::addValue().

Reading between the lines here and on IRC, what Yuri is really concerned about is that transformations like the boolean-to-string change in format=json&formatversion=1 will somehow change the data structure output from what he intends, but he doesn't want to set the appropriate metadata by hand or use ApiResult::addMetadataToResultVars() to automatically set it, and further he's looking at a JSON-only world and so is stuck thinking about his data as "JSON" rather than as data separate from the encoding format.

If you're really wanting to return specific JSON data for a result property, you need to encode that data as a string because we have non-JSON formats. If you're returning a data structure, that's a data structure and not JSON so this shouldn't be needed. While this sometimes-structure-sometimes-string idea would make things slightly easier for JSON-using consumers by allowing them to skip a decode step, it introduces another incompatibility between the different formats that's liable to trip people up.

Each format consumer has to be implemented for that specific format. It is never easy to switch XML parsing code to JSON parsing because their structure is so different. Yet, it will offer a very substantial benefit for JSON consumers without costs to others. So I do not think this is an issue for this case.

I also note that, were something like this to be implemented, the PHP data structure implied to be in $jsonData (because it being a JSON string would require a rewrite of the FormatJson class) would suffer from T12887 and related bugs if using associative arrays rather than stdClass objects, might have different escaping applied to string values, and would convert floats like 3.0 to integers, all of which would defeat the stated purpose of exactly preserving JSON data.

The assumption here is that since this is a "value", most likely it is stored as a whole in a database (string), and parsed with FormatJson::parse(). So the result will use stdClass, and wouldn't be affected by empty [] vs {}. As for numbers, JSON biggest consumer, JavaScript, does not differentiate between integers and floats, and neither does JSON, so this is a moot point - either 3 and 3.0 are fine in JSON, and get treated the same. For other languages with strong typing, they will probably use JSON schema, which can specify what type to parse it as.

And were this to be implemented, the solution would be more appropriately done as a new metadata property like ApiResult::META_BC_BOOLS rather than a flag to ApiResult::addValue().

Not sure how that would work - could you give an example?

Reading between the lines here and on IRC, what Yuri is really concerned about is that transformations like the boolean-to-string change in format=json&formatversion=1 will somehow change the data structure output from what he intends, but he doesn't want to set the appropriate metadata by hand or use ApiResult::addMetadataToResultVars() to automatically set it, and further he's looking at a JSON-only world and so is stuck thinking about his data as "JSON" rather than as data separate from the encoding format.

@Anomie, i am dealing with an opaque JSON value whose structure I do not know. From my understanding, for the API I have to specify each field that should be preserved in meta data. Hence I cannot say "do not transform a boolean field called blah", I can only indicate the root of the whole subtree to be preserved.
As for data vs structure - JSON data is used by a number of subsystems like Graph (Vega), GeoJSON, and others. They are mostly implemented in JavaScript, but even if they weren't, they assume all conventions set by JSON for their data. So by now JSON has became another data type just like an integer or a boolean (a number of databases even support it nativelly as field types). So for our usecase, if the code is in JavaScript, it frequently accepts JSON as a result subtree. Thus, to help with the most common usecase, we should provide a simplified way to send data since it can be handled seamlessly.

I think we have a problem here that needs solving. Including JSON as a string in JSON is silly and requires you to make a second parse. I have an example of how "nice" nested data as JSON strings feels: Wikidata. See the HTML of a page there and there is a horror:

"wbEntity":"{\"type\":\"item\",\"id\":\"Q2\",\"labels\":{\"en\":{\"language\":\"en\",\"value\":\"Earth\"},\"fr\":{\"language\":\"fr\",\"value\":\"Terre\"},\"uz\":

@MaxSem, yes, 100% agree, and I think @daniel would too.

The bottom line here is that you want vega-JSON included in api-JSON because they both happen to be types of JSON, but you need to propose complicated special handling because they're not actually the same thing (and before you claim they really are the same thing, note that if they were then you wouldn't have had the problem that led you to filing this bug in the first place).

I -2 the complicated special handling. You can either use the provided function to apply metadata to preserve bools and most underscore-keys when output as api-JSON and hope it remains compatible with vega-JSON, or you can include your vega-JSON as a string and be assured it won't be messed with as long as it's valid UTF-8.

@Anomie, what complicated special handling are you referring to? It will be extremely simple from both the API module perspective and from the API consumer perspective.

I dispute the underlying assumption in this task description:

As we use more and more JSON data, in some cases API modules need to return that JSON data as is, without any modifications. For example, action=graph could return the graph data in Vega format, or another module could get the template parameters exactly as written in a template's JSON parameters block.

Wouldnt it be better to expose new more complex data types using RESTBase?
IMO the action API should be itemized data, rather than data formats are externally specified and validated.

The existing action API oddity is the Export XML which can be wrapped or passed through literally with a special parameter.

@jayvdb, I am all for using RESTBase for this, but RESTBase does not have direct access to the MediaWiki SQL database. Which means that if the needed data is stored there, it has to be returned via api.php. So if we are stuck maintaining two different APIs, we should make both as easy to use as possible, without sacrificing existing functionality. Adding an ability for the internal api infrastructure to add JSON data as a value satisfies both.

RESTbase can convert the JSON-in-JSON/str from the action API back into the original JSON which was wrapped, and then RESTbase end-users never need to see the nastiness.

wrt template's raw JSON template data parameters, has there been a real use-case for that? The raw template JSON needs to be at least partially parsed before it can be included in the JSON API response as real JSON, as it may be invalid JSON. If it is valid JSON, the original can be included as-is, but how is that useful? If it reserialises the parsed JSON, there will be slight changes from the original, such as stripping whitespace, by the reserialisation library (which will subtly change if/when someone decides to change the reserialisation library used). Anyone really wanting a raw version of it is going to load and split the raw wikitext. Anyone wanting to easily grok the template data would prefer the parsing/normalisation is completed by the server and errors described in the API response.

Anyway, as you are proposing that the API response is API JSON with other JSON nested inside it, I dont see a serious problem with it from a client perspective.
I'd prefer that the raw JSON data element is described, by including the JSON schema URL that can be used to validate/inspect it, preferably in the declaration of the parameter and exposed in paraminfo for the action, or included in the actual result . i.e.

{
  "graph": {
    "schema": "https://raw.githubusercontent.com/vega/vega-lite/master/lib/schema.json",
    "data": {"some": "data", "in": "json", "bool": false}
  }
}

Requiring a JSON schema will prevent this new JSON datatype being used as an easy way to avoid returning API-style JSON where it is merely inconvenient. It should only be used where the data is not created within MediaWiki server.

I would be very much opposed to the API returning only the raw Vega JSON without any wrapping, as then API warnings are not possible, and the end-user needs to expect both Vega JSON and API JSON, as the action API may return an error (worst case : PHP exceptions) in API JSON format instead of Vega JSON.

RESTbase can convert the JSON-in-JSON/str from the action API back into the original JSON which was wrapped, and then RESTbase end-users never need to see the nastiness.

Are you proposing that we use RESTbase as an API wrapper? I am not totally against the idea, but it does sound a bit inefficient for RB to get data from SQL via MW just to change the interface. But in any case, creating a RB API wrapper is a separate topic. Here I'm only talking about the current MW API, and to have an ability for storage to return JSON blobs not as strings but as JSON subtrees when the return format is also JSON.

wrt template's raw JSON template data parameters, has there been a real use-case for that? The raw template JSON needs to be at least partially parsed before it can be included in the JSON API response as real JSON, as it may be invalid JSON. If it is valid JSON, the original can be included as-is, but how is that useful? If it reserialises the parsed JSON, there will be slight changes from the original, such as stripping whitespace, by the reserialisation library (which will subtly change if/when someone decides to change the reserialisation library used). Anyone really wanting a raw version of it is going to load and split the raw wikitext. Anyone wanting to easily grok the template data would prefer the parsing/normalisation is completed by the server and errors described in the API response.

You are correct that unserializing/reserializing JSON would change its formatting. But the perfect roundtripping is not usually needed when used via API. Template params usecase is an example I thought would be good to illustrate my point, and I will need to check with @Krinkle on this - like user interface showing some data entry form based on the data it gets from API, etc. Most of my experience for this issue was with graphs and maps --- where user enters some magical json data into wiki markup (either directly, or via a template or Lua expansion), and that json gets stored in the database. When graphoid, kartotherian, or a browser need to render something, they don't care about spacing or unicode normalization of the json, and they do not rely on anything non-standard like preserving the order of the keys (object keys are un-ordered according to the spec). Instead, they can take a subelement of the API response and use it.

Anyway, as you are proposing that the API response is API JSON with other JSON nested inside it, I dont see a serious problem with it from a client perspective.

Yes, that is exactly what I would like.

I'd prefer that the raw JSON data element is described, by including the JSON schema URL that can be used to validate/inspect it, preferably in the declaration of the parameter and exposed in paraminfo for the action, or included in the actual result . i.e.

{
  "graph": {
    "schema": "https://raw.githubusercontent.com/vega/vega-lite/master/lib/schema.json",
    "data": {"some": "data", "in": "json", "bool": false}
  }
}

Requiring a JSON schema will prevent this new JSON datatype being used as an easy way to avoid returning API-style JSON where it is merely inconvenient. It should only be used where the data is not created within MediaWiki server.

Even though we could potentially return the schema of the object, at this point we don't know the schema ourselves - its an arbitrary json value that is only meaningful to the Vega code. There are some talks of making Vega a W3C standard, but the work hasn't even started yet.

I would be very much opposed to the API returning only the raw Vega JSON without any wrapping, as then API warnings are not possible, and the end-user needs to expect both Vega JSON and API JSON, as the action API may return an error (worst case : PHP exceptions) in API JSON format instead of Vega JSON.

@jayvdb, I never proposed this - the API result will be in the same format as before, with the root object containing errors and warnings.

@Anomie, what complicated special handling are you referring to? It will be extremely simple from both the API module perspective and from the API consumer perspective.

From the internal ApiResult perspective. Will you also be wanting to bypass the validation that ApiResult does, and therefore also the size checking? What will need doing to actually make the formatters produce the desired output from the input data? etc.

@Anomie, I could look into that - shouldn't be too complicated. The size check should stay - we can easily calculate it by converting the object into a string (and possibly even caching it for XML formater's output). I am not sure what other validation is needed - all errors would be caught by the json_encode() during the size check.

Change 258196 had a related patch set uploaded (by Yurik):
Allow JSON values to be included in the API results

https://gerrit.wikimedia.org/r/258196

The patch for this feature has been implemented, and it turned out to be very simple. Are there any serious practical objections to this? Does it break any functionality?

@Anomie, the patch is ready and I think I addressed all of your concerns. Are there any other blockers?

No, you haven't addressed all of my concerns. You just keep ignoring the inconvenient ones.

Advantages of this over the existing solutions:

  1. Makes things slightly easier for a JSON-using producer/consumer system over the other solutions:
    1. Versus ApiResult::addMetadataToResultVars, no worry over the few things that can't take care of for you.
    2. Versus returning the oh-so-sensitive supposed-to-be-JSON data as a JSON string, the consumer can skip a JSON decode call.
  2. Versus returning the oh-so-sensitive supposed-to-be-JSON data as a JSON string, a slight size advantage through omitting some backslashes.

Disadvantages of this:

  1. Makes the logic of ApiResult more confusing, and it's already pretty confusing with all the BC stuff we have to support.
  2. Makes the validation of data added to ApiResult more difficult to reason about: bits of the result get validated in different ways depending on this "json" flag.
  3. Introduces an unnecessary incompatibility in the structures of the different API output formats.
  4. Wide open for abuse when people can't be bothered to apply the correct metadata to data that isn't oh-so-sensitive supposed-to-be-JSON data.

IMO the slight advantages come nowhere near outweighing the disadvantages.

RobLa-WMF changed the task status from Open to Stalled.Mar 2 2016, 10:45 PM
RobLa-WMF subscribed.

Per E146

Quick notes from archcom irc triage meeting:

  • keeping in blocked/stalled for now
  • @brion and @daniel are leery of the idea of packing a JSON blob in an XML element text; but also leery of the idea of translating a JSON tree into an XML tree when the end-user probably wanted a domain-specific JSON blob format
  • open questions: need to decide on the JSON-in-XML packing, whether that's desirable or should recommend just using all-JSON
  • open questions: clear use cases for the XML packing etc?

The root of this recurring conflict is the need to support other formats, especially XML.

XML is still used a fair amount, so dropping it generally is not considered an option at this point. I do wonder though if dropping XML support for specific entry points has been considered. When applied to new entry points only, this would avoid breaking existing XML users, while also enabling a gradual move towards better JSON support.

It's not that "JSON-in-XML" is a use case, it's that the API tries to be output-format agnostic, and currently supports output in JSON, XML, and PHP serialize() formats. Ideally the actual data structure output is the same no matter what the format it's being represented in, which is something this proposal wants to change.

As I see it, the question is whether the module is wanting to output a JSON blob or a data structure. In the first case the blob should be encoded as a string in all output formats, while in the second it should be given to ApiResult as a data structure for the API to turn into JSON, XML, or PHP-serialized (whichever one the client asked for). "Both" doesn't seem to be a very sensible answer to me, but that's what this proposal is trying to say.

There's also the possibility of XML-in-JSON, BTW. And XML-in-XML and XML-in-PHP for that matter. That particular example also has the possibility of just dumping the XML without any wrapper, although IMO that's a bit ugly since it loses the ability to receive API warnings and such.

this would avoid breaking existing XML users

Until one of those existing XML users wants to make use of one of your new API modules. Then they're stuck either rewriting everything to use JSON or having most of their code using XML with this one wart using JSON.

And if it's a query submodule that you're talking about, then it's flat-out insane for it to try to require JSON-only. Which would be likely to lead to modules that should be query submodules being done as actions instead.

RobLa-WMF mentioned this in Unknown Object (Event).May 4 2016, 7:33 PM

@Anomie write: It's not that "JSON-in-XML" is a use case, it's that the API tries to be output-format agnostic, and currently supports output in JSON, XML, and PHP serialize() formats. Ideally the actual data structure output is the same no matter what the format it's being represented in, which is something this proposal wants to change.

This is a valid point, but we can still be output format agnostic by allows the ouput data to be polymorphic, allowing values to be either raw strings (encoded according to the output format for embedding it correctly), or explicitly in some known collection-type : for the JSON and PHP output format, it is relatively easy (very minor difference of syntax). But for allowing a XML or JSON or PHP output value to be returned as a preparsed collection-type (matching the output format), there's a need to validate its own internal format, so that encoding it would not be necessary and would not break the "envelope" output format (this is a problem when the ouput is XML and you want to avoid the encoding of XML values to include it directly in the inner content of an outer element of the envelope output XML format, and you'll need to use n explicita namespace switch to avoid collision, and so you'll need to insert an additional XML namespace declaration to the outer element of the envelope output XML format to avoid validation problems for the envelope format itself).

Now if you want to avoid multiple outformat parsers and want transforms from JSON to XML or the reverse, we will first need to decide on a translation layer, basically by adding a specification for the schema (and note that XML schemas are not the same as PHP data schema or JSON data schema, they are not in the same namespace, which can still be specified by an URL, or by using a basic URN consisting in an internal URI prefix for MediaWiki API "mediawiki:" and then an identifier of the supported format (probably the standard MIME types associated to the formats; but PHP data has no standard MIME type and we'll need a private extension such as "mediawiki:text/plain+php").

We don't have a lot of supported output formats, but we need to create a translation layer for each pair of (value-format-type, output-format-type): if some pairs are missing the default encpasulation will convert it to encoded text. As many clients will not be ready to interpret polymorphic value types, we need a flag to accept such transform, and without it the output will need to be raw text and encoded according to the ouput format, without any interpretation. Some values will still be returned as raw text (notably if they return plain HTML, which is not easily encapsulable in XML/JSON/PHP without using encoding and using secondary parsers for the clients, or when the values will return any kind of unexpected value not validating their expected format.

So the complex cases are still when the output value is in HTML, or non-validated XML formats, or when they contain truncated results (so they do not fully validate JSON or PHP data format specifications, with missing closing tags/braces/brackets/parentheses): raw encoded text will be still returned but the client-side secondary parser (if they are used) will detect violations. Normally we could deprecate the client-side use of secondary parsers if format validation is already performed by the server: you'll get the values already parsed as in the output format, or otherwise an encoded raw string, that clients will not necessarily need to interpret, but use only for diagnostic purpose

That is a whole lot of words to propose something even more complicated and less likely to ever happen than the original proposal here.

Aklapper changed the task status from Stalled to Open.May 21 2020, 10:52 AM

Task not stalled ("If a report is waiting for further input (e.g. from its reporter or a third party) and can currently not be acted on") on a meeting in 2016 anymore (see T120380#2082074), hence resetting task status. Feel free to correct.

(Smallprint, as general orientation for task management: If you wanted to express that nobody is currently working on this task, then the assignee should be removed and/or priority could be lowered instead. If work on this task is blocked by another task, then that other task should be added via Edit Related Tasks...Edit Subtasks. If this task is stalled on an upstream project, then the Upstream tag should be added. If this task requires info from the task reporter, then there should be instructions which info is needed. If this task is out of scope and nobody should ever work on this, then task status should have the "Declined" status.)

Closing old RFC that is not yet on to our 2020 process and does not appear to have an active owner. Feel free to re-open with our template or file a new one when that changes.

Jdforrester-WMF changed the task status from Declined to Invalid.Sep 16 2020, 8:15 PM
Jdforrester-WMF subscribed.

Not Declined.