
JSON-LD selection fails with JSON-LD objects including unescaped control characters in string literals
Closed, Resolved | Public | Bug Report

Description

Steps to replicate the issue (include links if applicable):

What happens?:

JSON-LD selection fails to parse the #2 JSON-LD object, complaining that control characters were found in string literals.

What should have happened instead?:

JSON-LD selection should have been able to parse the problematic JSON-LD object.

Software version (skip for WMF-hosted wikis like Wikipedia):

v2.0.0-alpha.1

Other information (browser name/version, screenshots, etc.):

This happens because there are unescaped newlines in the articleBody property of the corresponding JSON-LD object.
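A minimal reproduction of the failure, assuming a made-up JSON-LD payload (the object contents and the tryParse helper are hypothetical, not taken from the affected page):

```javascript
// Hypothetical payload. The "\n" in this JavaScript literal becomes a raw
// U+000A character inside the JSON string value, which JSON does not allow.
const rawJsonLd = '{"@type": "Article", "articleBody": "First line\nSecond line"}';

// Same payload with the newline escaped as the two characters "\" and "n".
const escapedJsonLd = '{"@type": "Article", "articleBody": "First line\\nSecond line"}';

// Hypothetical helper: wraps JSON.parse so the outcome can be inspected.
function tryParse(text) {
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch (err) {
    return { ok: false, error: err };
  }
}

console.log(tryParse(rawJsonLd).ok);     // false (SyntaxError)
console.log(tryParse(escapedJsonLd).ok); // true
```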

We could escape these control characters, as mentioned here, but I cannot think of an easy way to know whether a control character is inside or outside double quotes (i.e., a string literal). Escaping control characters outside string literals would result in another JSON.parse() error.
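One possible way to tell inside from outside is a single pass over the text that tracks whether the scanner is currently within a string literal. The following is only a sketch under assumptions: the function name escapeControlCharsInStrings is hypothetical, and it relies on double quotes inside strings already being correctly escaped, so badly malformed input could still desynchronize it.

```javascript
// Sketch: escape U+0000..U+001F only when they occur inside a JSON string
// literal, leaving whitespace between tokens untouched.
function escapeControlCharsInStrings(text) {
  let out = "";
  let inString = false; // true between an opening and a closing double quote
  let escaped = false;  // true right after a backslash inside a string

  for (const ch of text) {
    if (!inString) {
      if (ch === '"') inString = true;
      out += ch;
    } else if (escaped) {
      // Character following a backslash: emit unchanged, end the escape.
      escaped = false;
      out += ch;
    } else if (ch === "\\") {
      escaped = true;
      out += ch;
    } else if (ch === '"') {
      inString = false;
      out += ch;
    } else if (ch.codePointAt(0) <= 0x1f) {
      // Raw control character inside a string: replace with a \uXXXX escape.
      out += "\\u" + ch.codePointAt(0).toString(16).padStart(4, "0");
    } else {
      out += ch;
    }
  }
  return out;
}

// Hypothetical input with newlines both between tokens and inside a string.
const brokenJsonLd = '{\n  "articleBody": "a\nb\t\\"c\\"",\n  "n": 1\n}';
const repaired = JSON.parse(escapeControlCharsInStrings(brokenJsonLd));
console.log(repaired.articleBody === 'a\nb\t"c"'); // true
```

Unlike removal, this keeps the original line breaks in the parsed value, at the cost of a hand-written scanner.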

Alternatively, we may simply remove these unescaped control characters altogether.
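A rough sketch of the removal approach, assuming a hypothetical helper named stripControlChars (this is not the fix that was actually implemented, which the report does not show):

```javascript
// Sketch: drop every C0 control character (U+0000..U+001F) before parsing.
// Outside string literals these are either insignificant whitespace (tab,
// LF, CR) or invalid anyway, so removing them there does not change meaning.
function stripControlChars(text) {
  return text.replace(/[\u0000-\u001F]/g, "");
}

// Hypothetical input with a raw newline inside articleBody.
const strippedInput = '{\n  "articleBody": "First line\nSecond line"\n}';
const strippedResult = JSON.parse(stripControlChars(strippedInput));
console.log(strippedResult.articleBody); // "First lineSecond line"
```

Note that the text on either side of a removed newline is joined without a separator. Replacing each control character with a space instead of an empty string would be a variant that keeps words apart, since a space is valid both inside and outside string literals.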

This seems to be what schema.org's validator is doing. Unfortunately, this service seems to be provided by Google and I can't find the source code to see exactly how it does it. There is a similar tool here, but it fails on unescaped control characters within string literals.

Note that both the Firefox and Chrome implementations of JSON.parse fail on control characters 0x00-0x1F only. That is, they do not fail on 0x7F or 0x80-0x9F (see https://en.wikipedia.org/wiki/C0_and_C1_control_codes).
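This can be checked with a small probe like the following sketch (the helper name rejectedCodeUnits is hypothetical). Each range is tested separately because a plain double quote (0x22) or backslash (0x5C) would also break a one-character string literal:

```javascript
// Sketch: for each code unit in [from, to], wrap it in double quotes and
// record the ones that JSON.parse refuses as raw string content.
function rejectedCodeUnits(from, to) {
  const rejected = [];
  for (let code = from; code <= to; code++) {
    try {
      JSON.parse('"' + String.fromCharCode(code) + '"');
    } catch (err) {
      rejected.push(code);
    }
  }
  return rejected;
}

console.log(rejectedCodeUnits(0x00, 0x1f).length); // 32: every C0 control fails
console.log(rejectedCodeUnits(0x7f, 0x7f).length); // 0: DEL is accepted
console.log(rejectedCodeUnits(0x80, 0x9f).length); // 0: C1 controls are accepted
```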

Remember to fix the "Copy JSON-LD" bookmarklet accordingly as well.