Page MenuHomePhabricator

Empty JSON keys are replaced with "_empty_"
Closed, ResolvedPublic

Description

  1. Edit or create any JSON page like https://en.wikipedia.org/wiki/Special:MyPage/test.json so that it contains
{
    "": "test"
}
  1. Save or preview
  2. The key that is supposed to be an empty string is now instead the string _empty_

Event Timeline

It is also possible that this bug is wanted, to avoid empty dict key in json files.

Can't find any in mediawiki core code, the change is probably done by php (looks to this report about json_decode()). So the question is rather if we want to allow empty dict keys in json pages.

It is also possible that this bug is wanted, to avoid empty dict key in json files.

No. The current behaviour is definitely wrong. Empty keys are allowed in JSON. If empty keys should be discouraged, the code editor should throw a warning about that instead. The input should never be mangled like this under any circumstances.

Legoktm subscribed.

https://3v4l.org/SAHin

This is fixed in PHP 7.1+, so we need some hack/compat code for PHP 7.0 and HHVM.

Krinkle moved this task from Backlog to Wikimedia production on the PHP 7.0 support board.
Krinkle subscribed.

@Legoktm Would this be something to handle at individual consumers? (e.g. EventLogging, Maps), in ContentHandler (e.g. JSON), or lower down at FormatJson?

I think FormatJson. I can't imagine any reason why callers would actually want this behavior.

Some relevant links:

http://php.net/manual/en/migration71.incompatible.php
https://bugs.php.net/bug.php?id=46600
https://marc.info/?l=php-internals&m=145892797606179&w=2

Notable behavior:

This code:

print json_decode( '{"":"test"}' )->{''};

results in "Cannot access empty property" on PHP 7.0, but outputs "test" on PHP 7.2.

These people also grappled with this, although I'm not certain their solution is much help to us: https://www.drupal.org/project/search_api_solr/issues/2970829

Regarding detection, it is impossible (in PHP 7.0) to tell, base strictly on the output of json_decode, if the issue has occurred. This is because multiple inputs produce the same output. The following decodes three different JSON strings:

$jsonStrs = [
	'{"":"test"}',
	'{"_empty_":"test"}',
	'{"\u005F\u0065\u006D\u0070\u0074\u0079\u005F":"test"}'
];
foreach ( $jsonStrs as $str ) {
	print "str: $str<br />";
	var_dump( FormatJson::decode( $str ) );
}

It produces the following output. Notice that the output object is identical for all three input strings.

str: {"":"test"}
object(stdClass)[239]
  public '_empty_' => string 'test' (length=4)

str: {"_empty_":"test"}
object(stdClass)[239]
  public '_empty_' => string 'test' (length=4)

str: {"\u005F\u0065\u006D\u0070\u0074\u0079\u005F":"test"}
object(stdClass)[239]
  public '_empty_' => string 'test' (length=4)

If, for whatever reason, a caller actually passes "_empty_ as a key (or some other representation of it that JSON accepts), that key will appear in the output object, indistinguishable from the "_empty_" key that is automatically created if an actual empty string key is passed. Which leads us to this fun case:

$jsonStrs = [
	'{"":"test1","_empty_":"test2"}',
	'{"_empty_":"test2","":"test1"}'
];
foreach ( $jsonStrs as $str ) {
	print "str: $str<br />";
	var_dump( FormatJson::decode( $str ) );
}

Notice that the only difference between the two strings is ordering. These strings produce:

str: {"":"test1","_empty_":"test2"}
object(stdClass)[239]
  public '_empty_' => string 'test2' (length=5)

str: {"_empty_":"test2","":"test1"}
object(stdClass)[239]
  public '_empty_' => string 'test1' (length=5)

If the input JSON string includes both the key "_empty_" and an actual empty key, the last one wins.

Yuck.

Mitigation is challenging.

Within the limitations described above, we can (mostly) detect if the issue has occurred:

  1. check PHP version (it can only occur if < 7.1)
  2. check for presence of an "_empty_" key in the returned object
  3. maybe use a regex or two on the input to see if it truly appears that a null key was passed rather than the string "_empty_" being passed intentionally as a key.

Not sure yet if the regex part would make things better or worse.

But with that said - suppose we have decided, to the best of our ability, that the issue has occurred. What then?

We can't set an empty property pre-PHP 7.1. The PHP language doesn't allow it. In fact, the whole reason they added empty properties to the language was to handle exactly this situation with json_decode. So what should FormatJson::decode return if the issue occurs? Some options:

  • return null (indicating the value could not be decoded), which sounds a bit nuclear
  • omit that property entirely, removing it from the returned object, which may remove desired data

Both those options eliminate the ability of callers to self-mitigate in whatever way makes the. most sense within their context.

The best idea I have so far is to document this behavior in the function comment, and recommend that callers pass the $assoc parameter as true, to receive an array rather than an object. Arrays can have an empty string as a key, so they don't suffer from the same language issue as an empty property.

If no one has any better ideas, I'm happy to refactor affected callers. Most already seem to request an array rather than an object, so the scope of that change isn't too intimidating.

The best idea I have so far is to document this behavior in the function comment, and recommend that callers pass the $assoc parameter as true, to receive an array rather than an object. Arrays can have an empty string as a key, so they don't suffer from the same language issue as an empty property.

If no one has any better ideas, I'm happy to refactor affected callers. Most already seem to request an array rather than an object, so the scope of that change isn't too intimidating.

I think you should go ahead with this. Document and change affected callers to use $assoc=true.

This issue also affects callers of FormatJson::parse, if FORCE_ASSOC is not specified. I will include those callers in the change.

Unfortunately, there's an additional and competing complication: empty JSON objects.

var_dump( json_encode( json_decode( '{}' ) ) );
=> string '{}'

var_dump( json_encode( json_decode( '{}', true ) ) );
=> string '[]'

This also affects nested empty objects:

var_dump( json_encode( json_decode( '{ "foo": {}, "bar": [] }' ) ) );
=> '{"foo":{},"bar":[]}'

var_dump( json_encode( json_decode( '{ "foo": {}, "bar": [] }', true ) ) );
=> '{"foo":[],"bar":[]}'

This breaks the round trip consistency required by our json pages.

Short of writing our own json parser, I don't see a way to work around both the _empty_ problem and the empty object problem on our .json pages. If we switch to decoding as an associative array to avoid the _empty_ problem, then empty objects are converted to arrays and break round trip consistency. But if we continue decoding as an object, the _empty_ problem persists.

Because the _empty_ problem goes away with PHP 7.1 (and higher), I lean toward just living with the current issue on our .json pages until we upgrade PHP.

I still plan to switch callers where round trip consistency is not required and where decoding as an associative array does not change the public interface of widely used classes. (That work is already in progress.)

Change 472233 had a related patch set uploaded (by BPirkle; owner: BPirkle):
[mediawiki/core@master] Comments, tests, and tweaks for JSON decoding quirks

https://gerrit.wikimedia.org/r/472233

Change 472233 merged by jenkins-bot:
[mediawiki/core@master] Comments, tests, and tweaks for JSON decoding quirks

https://gerrit.wikimedia.org/r/472233