Page MenuHomePhabricator

TemplateData's PHP JSON validation isn't strict enough because WMF cluster's HHVM allows trailing commas
Closed, ResolvedPublic1 Estimated Story Points


Commas at the end of object definitions aren't allowed, and are rejected on ingestion by the TD GUI but not on save by the TD PHP when running on HHVM without this compile flag (Zend runs it fine).

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Jdforrester-WMF renamed this task from TemplateData's PHP JSON validation isn't strict enough to TemplateData's PHP JSON validation isn't strict enough because WMF cluster's HHVM allows trailing commas.Feb 25 2016, 12:39 AM
Jdforrester-WMF assigned this task to ori.
Jdforrester-WMF added a project: HHVM.
Jdforrester-WMF updated the task description. (Show Details)

(Assigning to @ori per in-person discussion; working out that HHVM does this oddly thanks to @Catrope.)

Per the flag is json_utf8_loose. I'm not sure whether the UTF8->UTF16 transform at which this also affects is a bad outcome for us, though?

My gut feeling on this is that everything in MW is already supposed to be UTF-8, and invalid UTF-8 already causes exceptions etc, so not tolerating it in json_decode() should be OK. In fact, I don't think Zend PHP tolerates invalid UTF-8 in JSON either, pretty sure we've already encountered that before.

I started looking at this upstream earlier in connection to T103346: phpunit hhvm failure: LuaSandbox: TextLibraryTests[87]: json decode, invalid values (trailing comma). One tricky thing is that there are actually two JSON stacks, one using the JSONC library (binding at jsonc_parser.cpp), and one using an in-house HHVM one (depending on how it's compiled) (JSON_parser.cpp).

Does anyone know which we're using?

Also, should we merge these two tasks?

The reason for having two JSON stacks has its roots in a now-infamous clause in the JSON license: "The Software shall be used for Good, not Evil." This makes the license non-free in the eyes of Google, GNU, Fedora, Debian, and Red Hat Legal. Douglas Crockford is aware of the amount of pain this has caused and he doesn't care. It's some kind of license griefing. There's a section about this issue on en:Douglas Crockford.

This was not an issue for Facebook, but it was an issue for Debian. Facebook did not want to switch something as fundamental as the JSON parsing code in their core application runtime, so the compromise was to use either JSON or JSON-C, based on a compile-time flag. The person who actually implemented this is none other than our very own @EBernhardson.

The bit (literally!) that seems to be causing us problem is the commented-out code on lines 225-227 of hphp/runtime/ext/json/jsonc_parser.cpp:

//if (!(options & k_JSON_FB_LOOSE)) {
//    json_tokener_set_flags(tok, JSON_TOKENER_STRICT);

What this code does (or rather: would be doing, if it were not commented-out) is set the JSON-C JSON_TOKENER_STRICT flag unless the non-standard JSON_FB_LOOSE flag was set in the $options parameter for json_decode(). From this snippet, I gather that JSON-C's default behavior is nonstrict, which is what we're seeing.

Why is this code commented out? I don't know (or don't remember). Maybe Erik can shed some light.

By the way, it's not just HHVM that suffers from this licensing nightmare; see PHP bug 63520 (watch out for guest appearances by some MediaWiki personages). The third comment from the bottom talks about some of the bugs that this has caused.

I took a look back over the code I submitted but I can't remember right now why that bit of code was commented out. It might have been to do with passing the existing test cases, but that's just a guess. In the 2 years it seems to have fallen out of my head...

My guess is that you were considering the possibility of replacing JSON with JSON-C entirely, by checking how closely JSON-C's behavior in non-strict mode matched JSON w/JSON_FB_LOOSE modifications.

At any rate, I built HHVM with that code un-commented, and saw that indeed it made json_decode() strict on trailing commas and single-quotes, matching Zend's behavior. And all the tests pass. So, unless you think there is a reason not to, we could simply submit that upstream.

This is a dupe of T107132, but since this task is currently active I'm merging T107132 into this task rather than the other way around.

As a temporary workaround, you may be able to do something like this:

function isValidJson( $json ) {
	// Strip unicode escapes representing double quotes and commas
	$json = preg_replace( '/\\\u002[2c]/', '', $json );
	$decoded = json_decode( $json );
	if ( json_last_error() !== JSON_ERROR_NONE ) {
		return false;
	$encoded = json_encode( $decoded );
	return preg_match_all( '/[",]/', $json ) ===
		preg_match_all( '/[",]/', $encoded );

assert( isValidJson( '{"abc": "\u0022\u002c"}' ) );
assert( isValidJson( '{"abc": "def", "ghi": "jkl"}' ) );
assert( isValidJson( '{"a\"bc": "d,ef", "ghi": "jkl"}' ) );
assert( !isValidJson( '{"abc": "def", "ghi": "jkl",}' ), 'trailing comma' );
assert( !isValidJson( "{'abc': 'def', 'ghi': 'jkl'}" ), 'single quotes' );

Did anyone check which stack we're using in production? Jdforrester-WMF's comment is regarding JSON_parser, and Ori's is regarding jsonc_parser.

Also, I had a conversation on IRC (in hhvm on Freenode) with Orvid_, about making the JSON-C parser (and possibly the builit-in) strict. They suggested using a runtime option (INI setting) to control strictness. I haven't followed up, so I don't know if this is the broad consensus.

prod uses jsonc, i ported it from the jsonc package in pecl and upstreamed it to hhvm. Adding an ini is relatively easy and i can put that together if desired

Another json validation issue, it's very hard to see in this diff, but there was an extra closing } in the previous change and it was also allowed to be saved:

Jdforrester-WMF lowered the priority of this task from High to Medium.
Jdforrester-WMF added a subscriber: ori.
Krinkle closed this task as Resolved.EditedOct 7 2019, 10:17 PM

The issue of existing content is further tracked at T214984.

However the issue of feedback on new edits is implicitly solved by T176370 - that is we no longer run HHVM, so we no longer tolerate trailing comma's anymore. Hence we're now consistently "strict enough" as outlined in this task.