Page MenuHomePhabricator

PHP7's stricter JSON parsing breaks some wiki content
Closed, ResolvedPublic

Description

Summary: HHVM's JSON parsing is lax in some respects: it allows trailing commas in arrays and objects; allows single quotes around strings; ignores trailing text in some cases; and allows control characters U+0001–U+001F to appear unescaped in strings, which includes tab (U+0009, \t), carriage return (U+000D, \r), and linefeed (U+000A, \n).

Templates or modules generating JSON with these errors for things like <mapframe> and <graph> will be accepted under HHVM but will be (correctly) rejected under PHP7. Such templates/modules should be fixed to produce valid JSON.

We've had various tasks over the years pointing out this HHVM laxity, including T74778, T103346, T128029, T184694, and T207523.


Sorry if this is not the right way to go about doing this. I have recently been adding the Highway system OSM map template to many pages without highway maps. This template basically displays a map with the red line being the highway the page is about and the green line being the highways that are part of the same system.
Normally, it renders like this:

image.png (310×313 px, 51 KB)

However, in PHP7 it displays this error:
image.png (24×288 px, 1 KB)

(Plaintext: <mapframe> Couldn't parse JSON: Syntax error).
What confuses me is why this new version of PHP would affect any template or Lua module. Even though this issue is specific to this one template so far, it seems rather odd that the new implementation of PHP would mess with any of the template and module code.

Steps to Reproduce:

  1. Log into https://en.wikipedia.org.
  2. In the "Beta" tab, make sure PHP7 is checked.
  3. Go to a page using the Highway system OSM map template (in this example, the Doto Expressway page).
  4. Purge the page to erase the cache, or preview the page in Edit mode, then you will see this bug.

I am using Firefox 64. This issue has happened on every page with this template whenever I have PHP7 turned on.

Thank you in advance for helping.

Related Objects

Event Timeline

This is probably because HHVM accepts slightly invalid input JSON (trailing commas, maybe other things too), which I guess PHP7 doesn't.

A similar problem occurs on frwiki on page w:fr:Frontière entre le Honduras et le Mexique.

I activated the PHP7 option in beta.

The following error message appears in plaintext on article:

Erreur Lua : mw.text.jsonDecode: Erreur de syntaxe

If I purge the page with PHP7 enabled, this version of the page is cached, and this caching version is also served (with the error message) if not connected (testing in private browsing mode on Firefox).

(sorry for my bad English)

Poking at this carefully: afaict the wikidata query that gets run is https://query.wikidata.org/#SELECT%20%3Fid%20%3Flength%0A%20%20%28if%28%3Fid%20%3D%20wd%3AQ867944%2C%20%27%23C12838%27%2C%20%27%2307c63e%27%29%20as%20%3Fstroke%29%0A%20%20%28concat%28%27Line%20length%3A%20%27%2C%20str%28%3Flength%29%2C%20%27%20km%27%29%20as%20%3Fdescription%29%0A%20%20%28if%28BOUND%28%3Flink%29%2C%0A%20%20%20%20%20%20concat%28%27%5B%5B%27%2C%20substr%28str%28%3Flink%29%2C31%2C500%29%2C%20%27%7C%27%2C%20%3FidLabel%2C%20%27%5D%5D%27%29%2C%0A%20%20%20%20%20%20%3FidLabel%29%0A%20%20%20as%20%3Ftitle%29%0AWHERE%20%7B%20%20%0A%20%20%7B%3Fid%20wdt%3AP16%20wd%3AQ5228578.%7D%0A%20%20OPTIONAL%20%7B%20%3Fid%20wdt%3AP2043%20%3Flength.%20%7D%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%0A%20%20%20%20bd%3AserviceParam%20wikibase%3Alanguage%20%27en%27%20.%0A%20%20%20%20%3Fid%20rdfs%3Alabel%20%3FidLabel%20.%0A%20%20%7D%0A%20%20OPTIONAL%20%7B%3Flink%20schema%3Aabout%20%3Fid.%0A%20%20%3Flink%20schema%3AisPartOf%20%3Chttps%3A%2F%2Fen.wikipedia.org%2F%3E.%7D%0A%7D%20GROUP%20BY%20%3Fid%20%3Flink%20%3FidLabel%20%3Flength%0A
I have gotten this by plugging in Q5228578 for the whole if clause with highway_system_qid and {{wikidata|property|raw|P16}}, Q867944 for {{wikidata|label|raw}}, (grabbed this value by evaluating the templatthe pipe on a preview copy of the Dōtō Expressway article), and by replacing {{!}} with the pipe symbol. But the result gives a list of all the expressways in Japan, not just the one we want. Did I get the query right? Surely this is not what's desired.

@Jdforrester-WMF - That makes sense, so I guess the next step is to look at how to get the JSON and modify the template code to make it output correctly.

@ArielGlenn - The template is supposed to get every expressway in Japan, since it is meant to make a map of the entire system.

Fixed one trailing comma.

But this exposed that the template isn't escaping the string before putting it into json. real line endings are not possible inside JSON strings and should be escaped with \n.
This should be reimplemented using lua and mw.text.jsonEncode()

This is what Special:ExpandTemplates currently produces for this template on that page (after my trailing comma fix).

<mapframe zoom="3" align="center" text="Map" height="200" width="300">{
  "type": "ExternalData",
  "service": "geoline",
  "properties": {
    "stroke-width": 2
  },
  "query": "

SELECT ?id ?length
  (if(?id = wd:Q867944, '#C12838', '#07c63e') as ?stroke)
  (concat('Line length: ', str(?length), ' km') as ?description)
  (if(BOUND(?link),
      concat('[[', substr(str(?link),31,500), '|', ?idLabel, ']]'),
      ?idLabel)
   as ?title)
WHERE {  
  {?id wdt:P16 wd:Q5228578.}
  
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language 'en' .
    ?id rdfs:label ?idLabel .
  }
  OPTIONAL {?link schema:about ?id.
  ?link schema:isPartOf <https://en.wikipedia.org/>.}
} GROUP BY ?id ?link ?idLabel ?length

"}</mapframe>

The Template:OSM Location map is showing "Syntax error" as an output for all instances of its use (for example, 2019 Jolo Cathedral bombings and Alameda, California#Attractions). Reported at Village Pump here and at MediaWiki Talk:Beta Features here.

Screenshot 2019-02-07 at 13.07.38.png (64×381 px, 10 KB)

OSM Location map, is using the graph extension and passes a json blob to it. This json blob has similar errors, including trailing commas, // comments etc..

I also found the following related issues:

*edit* I seem to have fixed this specific case by chasing trailing comma's and an incorrect value
Interestingly enough that seems to indicate that comments are allowed or ignored for some reason...

@TheDJ Thank you for fixing the trailing commas. I did figure it was related to comma placement in some way, but I mostly focused my efforts on commas with spaces before them. I will look at reimplementation, especially since there is still an error produced by the highway systems template in PHP7.

screenshot-en.wikipedia.org-2019.02.07-07-32-33.png (18×512 px, 4 KB)

(Plaintext: <mapframe>: Couldn't parse JSON: Control character error, possibly incorrectly encoded)

Anomie renamed this task from PHP7 migration affecting Lua scripts or templates? to PHP7's stricter JSON parsing breaks some wiki content.Feb 7 2019, 2:28 PM
Anomie updated the task description. (Show Details)
Anomie subscribed.

I've updated the title of the task and added a summary to reflect the underlying issue here.

Sort of brute force fixed this using:
mw.ustring( jsonstring, "%c", " ")

https://en.wikipedia.org/w/index.php?title=Module%3AMapframe&type=revision&diff=882266119&oldid=876305861

Since control characters are useless in both json and sparql i just strip them all and replace with a space char.. Possibly this was only affecting the raw variable... but too tired to figure that out right now ;)

Seems this fixes dynamic maps, but static pre-rendered map is still broken...

Yeah, even in HHVM the thumbnail doesn't render correctly (unless you are in editing mode, because there the thumbnail is actually a smaller dynmap and not an image). Still, thanks for fixing the dynmap.

Yeah, even in HHVM the thumbnail doesn't render correctly (unless you are in editing mode). Still, thanks for fixing the dynmap.

That groupId ticket notes: "In most cases they aren't caused by errors in the system, rather by invalid parameters and that can be handled better."

So possibly, there is still something wrong with the values of some of the keys.. Hard to figure out this way WHAT is wrong though...

Weirdly enough, the pre-rendered map now works in my sandbox on English Wikipedia, but nowhere else with the highway systems template. Maybe something to do with the Wikidata ID entry? My sandbox page doesn't have a Wikidata ID, but the other pages I looked at did.

After removal of that one trailing comma, I have done the sandbox trick and stuffed the wikidata id entry in where it should be (Q867944 in place of (if(?id = wd:blah...) and I get a nice little rendered map with two stroke colors and the whole thing. This was taking a copy of the OSM template and plugging some values directly into it (lat/long, frame width/height, zoom, plain=yes). So uh? I must be overlooking something obvious.

Ah note also that if you click through on the (empty) map in the wp article, you do see a properly rendered full screen map. Huh!

The small empty map on the article is the issue - something is wrong with a parameter, so that small map doesn't get rendered while the full screen one is now properly rendered. That small map basically uses an image of what the map would normally display in order to cut down loading times and avoid accidentally moving the map window to focus elsewhere.

OK, I fixed the thumbnail issue by removing some trailing spaces:

https://en.wikipedia.org/w/index.php?title=Template%3AHighway_system_OSM_map&type=revision&diff=882961558&oldid=882213517

It does take some time for the thumbnail to load, but it works.

Strangely enough, when I tried this change at zhwiki it actually broke the template in a similar matter. I will do some more testing in HHVM and PHP7 and see what happens.

Do we know which of those spaces in particular was the culprit, or if all of them were? Asking just out of curiosity.

Yeah, it looks like the spaces for assigning a Wikidata ID for the highway system property (P16) to show was the issue. It must have caused a syntax issue with the stricter parser.

I changed

{?id wdt:P16 wd:{{#if: {{{highway_system_qid|}}} | {{{highway_system_qid}}} | {{wikidata|property|raw|P16}} }}.}

to

{?id wdt:P16 wd:{{#if:{{{highway_system_qid|}}}|{{{highway_system_qid}}}|{{wikidata|property|raw|P16}}}}.}

(note the spaces in front of and behind each spot for the passed parameter and the Wikidata property template).
Also, I did some testing on enwiki and it seems that the template now works in both HHVM and PHP7. I'll test zhwiki again.

EDIT: Nevermind, its back to being broken on my desktop. How strange... Maybe it might be because I am running Firefox compared to my Chromebook which is running, well... However, the manual query works, so I think that I will still have a good chance at fixing this.

OK, so for some reason the way the highway system parameter is being called is causing issues. If the highway system parameter is specified, it will break, but if I don't supply a parameter and let the Wikidata template find it (it looks for the ID for property P16) it will work. Interesting...

EDIT: Found out that the issue was extra whitespace in the length parameter. I guess there will have to be a lot of spacing changes for many templates to compensate for PHP7, but maybe I will be pleasantly surprised.

Let me add content from << T217103: Extension:Kartographer producing "Couldn't parse JSON" errors when SPARQL includes queryhints >> since this seems a part of the issue raised here and may provide an additional failure-mode example:

There seems to be an issue, believed to be Extension:Kartographer producing "Couldn't parse JSON" errors when SPARQL includes queryhints, described at https://www.wikidata.org/w/index.php?title=Wikidata:Request_a_query&oldid=867570503#Optimizing_a_query_for_the_mapframe_extension

I'm afraid it's all a bit over my head. Grateful for some knowledgeable input.

I'm afraid it's all a bit over my head. Grateful for some knowledgeable input.

Fixed: A trailing comma, and the usage of " inside a json value, prematurely ending the json value.

That's not a fix, so much as a helpful workaround.

If we take the not unreasonable view that Blazegraph should be allowed to specify what is valid syntax for hints on their own engine - see https://wiki.blazegraph.com/wiki/index.php/QueryHints which specifies the user of " - then the parser still has a fail.

Forcing users to learn, magically, from the ether, that non-standard syntax must be offered to the parser and that the query engine makers specification has been disregarded by the parser constructors, is woefully suboptimal.

@Tagishsimon: if you want to put a value containing " insides a JSON string, you need to escape it, as for example \" or \u0022. You'd also need to escape backslashes (\) and various control characters (although en:Module:Mapframe appears to blindly replace all control characters with spaces for you).

@TheDJ just took the more readable route of replacing the " with ', since blazegraph apparently accepts either type of quote.

Interestingly enough \" doesnt actually seem to work, not even when directly embedding the json blob inside a <mapframe> tag. Might be an issue in the kartographer extension.

So I see two ways to move forward:

  • Write a bot to fix bad jsons that was allowed through hhvm until we fully deprecate hhvm.
  • Make mediawiki more strict on accepting json until the migration is over.

I know how to do 1, but I have no idea how we can do 2 while 2 seems a better approach. At the end, it's just a migration issue, no permanent solution is needed.

We should stick to multiline relaxed strings, while regarding superfluous comma and apostrophe delimiters as errors.

Imagine there is some wikitext stored as JSON string:

"Page content:\n==Headline==\n*first item\n*second item\nThe story to be told."

Looks better if formatted as

"Page content:
==Headline==
*first item
*second item
The story to be told"

If I imagine an entire paragraph or page of wikitext needs to be passed as one single line only I am not really happy. The same goes for some kind of formatted data lines within one JSON string value, like assigments key1=value1 each of them written on a single line, to be interpreted later. That should be kept as multiline human readable source format.

All JSON blob will be piped through either

  • rMW/includes/json/FormatJson.php
  • rMW/includes/content/JsonContent.php →FormatJson

and final encoding U+000A→\n etc. within strings may be done immediately before passing blob to PHP7::json_decode().

On prettyprint \n \t might be presented nicely.

For those who are dealing with JSON code within wiki pages as template parameter or within Lua modules I did prepare Module:JSONutil which will show more elaborated messages like these test cases.

It can be used to detect problematic blobs within a wiki right now under HHVM conditions.

Additional note on trailing garbage:

  • Recent PHP apparently stops analysis when opening bracket { has been matched by closing }.
  • PHP7 JSON does not return any result if closing last } is followed by non-whitespace data.

Even more, unbalanced (missing) terminating } at end of data seem to be granted by lazy PHP while strict PHP7 is insisting upon correct assignment.

Apparently // or /* ... */ comments were accepted but not by PHP7, if I observed correctly.

currently, on wikimedia sites, the parsing _is_ lax, despite what the documentation says.

it's been like this for a long time (at least 3 years), and it should not be changed now.

however, a new flag should be added, to ask for _strict_ json syntax check.

also, it seems that there's a rogue server, which somehow has different version of the library, with strict check, so every so often, '[1,2,3,]' will fail parsing, even though it normally succeeds.

(every so often, a lua function that normally succeeds, fails on some pages in hewiki, which case the page to be added to some maintenance cat. when you open the page there's no error, but every so ofteh it "appears" in the cat, and this only happens to pages which cause jsonDecoding of json which should fail with strict rules and pass with lax).

hope this is not too confusing. please see https://www.mediawiki.org/wiki/Extension_talk:Scribunto/Lua_reference_manual#mw.text.jsonDecode for more details, and demonstration of the bug.

peace

@Kipod: You may have a look into JSONutil@dewiki, check the examples provided at JSONutil Test, use the JSONutil.fetch(source,true,"he") function for now rather than mw.text.jsonDecode(source), clean up all obscure syntax from your wiki by maintenance cat step by step (the dubious syntax won’t survive the next decades, one day you will need pretty syntax anyway); and a Hebrew translation at commons:Data is welcome.

Shalom

currently, on wikimedia sites, the parsing _is_ lax, despite what the documentation says.

it's been like this for a long time (at least 3 years), and it should not be changed now.

however, a new flag should be added, to ask for _strict_ json syntax check.

The JSON decoding uses functionality provided by PHP (ultimately via C code), and we're not going to rewrite it from scratch (in what would likely turn out to be a slower and more memory-intensive manner) to preserve buggy behavior.

also, it seems that there's a rogue server, which somehow has different version of the library, with strict check, so every so often, '[1,2,3,]' will fail parsing, even though it normally succeeds.

It's not a "rogue server", it's any request that uses PHP7. See T176370: Migrate to PHP 7 in WMF production.

Do I get this right, to make the following showcase map work without Lua, now there isn't a more readable way than removing all newlines from SPARQL?

https://www.mediawiki.org/wiki/Help:Extension:Kartographer#GeoShapes_via_Wikidata_Query

Do I get this right, to make the following showcase map work without Lua, now there isn't a more readable way than removing all newlines from SPARQL?

https://www.mediawiki.org/wiki/Help:Extension:Kartographer#GeoShapes_via_Wikidata_Query

That was never valid JSON, yes.

TheDJ claimed this task.

i think we dealt with most problematic content here. Don't think there are any further actions.