Consider the following query: http://localhost/w/api.php?action=query&format=xml&action=expandtemplates&text=%ef%bf%bd%f0%90%80%80%f3%b0%80%8fzzz
It contains 6 characters: U+fffd, U+10000, U+f000f, U+007a, U+007a, and U+007a. In JSON encoding, they should be \ufffd\ud800\udc00\udb80\udc0fzzz (U+10000 and U+f000f lie above U+ffff, so each must be encoded as a surrogate pair).
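For reference, the expected escapes above can be reproduced with a short sketch. It is written in Python purely for illustration (it is not MediaWiki's code), and the names to_surrogates and json_escape are hypothetical. It only handles non-ASCII escaping; quotes, backslashes, and control characters are deliberately left alone.

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def json_escape(text):
    """Escape non-ASCII characters as \\uXXXX (illustrative, not a full JSON encoder)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)  # ASCII passed through unchanged in this sketch
        elif cp <= 0xFFFF:
            out.append("\\u%04x" % cp)
        else:
            high, low = to_surrogates(cp)
            # Both halves need their own \u prefix.
            out.append("\\u%04x\\u%04x" % (high, low))
    return "".join(out)


# The six characters from the query: U+fffd, U+10000, U+f000f, z, z, z
sample = "\ufffd\U00010000\U000f000fzzz"
print(json_escape(sample))  # \ufffd\ud800\udc00\udb80\udc0fzzz
```

For U+f000f the offset is 0xe000f, giving 0xd800 + 0x380 = 0xdb80 and 0xdc00 + 0x00f = 0xdc0f, which matches the expected string.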
If I change the format to jsonfm, the first three characters are instead encoded as \ufffd\ud800dc00\udb80dc0f (followed by zzz): the \u prefix is missing from each low surrogate, so the output cannot be decoded correctly. This should be relatively simple to fix, I think.
If I change the format to json, it's even worse: the first two characters are output correctly as \ufffd\ud800\udc00, but the output stops there. Apparently PHP's built-in json_encode silently mishandles anything over U+1ffff: U+20000-U+3ffff, U+80000-U+bffff, and U+100000-U+10ffff seem to be incorrectly encoded as U+10000-U+1ffff, while U+40000-U+7ffff and U+c0000-U+fffff seem to cause the silent truncation seen here. The only fix I can think of is to detect whether these characters are present and use the fallback code instead.
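A rough sketch of that detect-and-fall-back idea follows. It is in Python only to show the control flow, not the eventual patch. The names needs_fallback and encode_json_string are hypothetical, the two encoders are placeholders passed in by the caller, and the U+1ffff threshold is simply the boundary observed above (a more cautious check could use U+ffff instead).

```python
import json


def needs_fallback(text, limit=0x1FFFF):
    """Return True if any code point lies above `limit`.

    The default is an assumption taken from the observed behaviour:
    code points over U+1ffff are the ones reported as mishandled.
    """
    return any(ord(ch) > limit for ch in text)


def encode_json_string(text, native_encoder, fallback_encoder):
    """Use the native encoder unless the text contains a problem character."""
    if needs_fallback(text):
        return fallback_encoder(text)
    return native_encoder(text)


# Placeholder encoders that record which path was taken. Python's json.dumps
# stands in for both; it does not have the bug described in this report.
calls = []


def native(text):
    calls.append("native")
    return json.dumps(text)


def fallback(text):
    calls.append("fallback")
    return json.dumps(text)


encode_json_string("\ufffd\U00010000zzz", native, fallback)            # U+10000 is fine
encode_json_string("\ufffd\U00010000\U000f000fzzz", native, fallback)  # U+f000f is not
print(calls)  # ['native', 'fallback']
```

The check costs one pass over the string, so the faster built-in encoder would still be used for the vast majority of output.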
I'll see about posting a patch later on.
Version: 1.14.x
Severity: normal