JSON-LD selection fails with JSON-LD objects including unescaped control characters in string literals
Closed, Resolved · Public · BUG REPORT

Assigned To

Authored By

	diegodlh
	Sep 22 2022, 1:02 PM

Description

Steps to replicate the issue (include links if applicable):

Use JSON-LD selection on https://www.mediafax.ro/economic/salariatii-care-se-vor-autorecenza-la-recensamantul-din-2021-vor-primi-o-zi-libera-in-plus-19802619

What happens?:

JSON-LD selection fails to parse the #2 JSON-LD object, complaining that control characters were found in string literals.

What should have happened instead?:

JSON-LD selection should have been able to parse the problematic JSON-LD object.

Software version (skip for WMF-hosted wikis like Wikipedia):

v2.0.0-alpha.1

Other information (browser name/version, screenshots, etc.):

This is happening because there are unescaped new lines in the articleBody property of the corresponding JSON-LD object.
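The failure can be reproduced in isolation. The snippet below is a minimal sketch, not the Web2Cit code: the JSON-LD text is a shortened, hypothetical stand-in for the real object, and tryParse is a helper name introduced here for illustration.

```javascript
// Hypothetical, shortened stand-in for the problematic JSON-LD object.
// The "\n" in this JavaScript literal becomes a raw (unescaped) newline
// inside the JSON string literal, which JSON.parse() rejects.
const rawJsonLd =
  '{"@type": "NewsArticle", "articleBody": "First line\nSecond line"}';

// Same object, but with the newline properly escaped as the two
// characters backslash + n inside the JSON text.
const escapedJsonLd =
  '{"@type": "NewsArticle", "articleBody": "First line\\nSecond line"}';

// Illustrative helper: wraps JSON.parse() so the outcome can be inspected.
function tryParse(text) {
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch (error) {
    return { ok: false, error };
  }
}

const rawResult = tryParse(rawJsonLd);
const escapedResult = tryParse(escapedJsonLd);

console.log(rawResult.ok); // false
console.log(rawResult.error instanceof SyntaxError); // true
console.log(escapedResult.ok); // true
```

The exact error message differs between browsers, so the sketch only checks the error type.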

Although we could escape these control characters, as mentioned here, I cannot think of an easy way to tell whether a control character is inside or outside double quotes (i.e., inside a string literal). Escaping control characters outside string literals would simply cause a different JSON.parse() error.
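For reference, one way to tell the two cases apart is to scan the text once while tracking whether the current position is inside a string literal. The following is only a sketch under assumptions: escapeControlCharsInStrings is a hypothetical name, this is not how the issue was actually fixed as far as this report shows, and the scan will lose track if the input also contains stray unescaped double quotes.

```javascript
// Sketch: escape control characters 0x00-0x1F only when they occur
// inside a JSON string literal; whitespace between tokens is untouched.
function escapeControlCharsInStrings(text) {
  let out = '';
  let inString = false; // are we between an opening and closing quote?
  let afterBackslash = false; // was the previous string character a backslash?

  for (const ch of text) {
    const code = ch.codePointAt(0);

    if (inString && code <= 0x1f) {
      // Replace the raw control character with a \uXXXX escape.
      out += '\\u' + code.toString(16).padStart(4, '0');
      afterBackslash = false;
      continue;
    }

    if (inString) {
      if (afterBackslash) {
        afterBackslash = false; // this character is escaped, e.g. \" or \\
      } else if (ch === '\\') {
        afterBackslash = true;
      } else if (ch === '"') {
        inString = false; // closing quote
      }
    } else if (ch === '"') {
      inString = true; // opening quote
    }

    out += ch;
  }
  return out;
}

// Hypothetical input: pretty-printed JSON with a raw newline in a string.
const broken = '{\n  "articleBody": "First line\nSecond \\"quoted\\" line"\n}';
const repaired = JSON.parse(escapeControlCharsInStrings(broken));
console.log(repaired.articleBody === 'First line\nSecond "quoted" line'); // true
```

Unlike removing the characters, this keeps the original line breaks in the parsed value, at the cost of a hand-written scanner.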

Alternatively, we may simply remove these unescaped control characters altogether.
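A minimal sketch of this removal approach follows. The function name stripControlChars and the sample input are assumptions made for illustration; the report does not show the code that was eventually shipped.

```javascript
// Sketch: drop every control character in the range 0x00-0x1F before parsing.
// Outside string literals these can only be insignificant whitespace in
// valid JSON (tab, line feed, carriage return), so removing them is harmless
// there. Inside string literals the characters are lost.
function stripControlChars(text) {
  return text.replace(/[\u0000-\u001F]/g, '');
}

// Hypothetical input: pretty-printed JSON with a raw newline in a string.
const brokenJsonLd = '{\n  "articleBody": "First line\nSecond line"\n}';
const parsed = JSON.parse(stripControlChars(brokenJsonLd));

console.log(parsed.articleBody); // "First lineSecond line"
```

As the output shows, text on either side of a removed newline is joined without a separator. Replacing with a space instead of the empty string would avoid that, but it is a different trade-off from the one described above.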

This seems to be what schema.org's validator does. Unfortunately, that service seems to be provided by Google, and I can't find the source code to see exactly how it does it. There is a similar tool here, but it fails on unescaped control characters within string literals.

Note that both Firefox's and Chrome's implementations of JSON.parse fail only on control characters 0x00-0x1F. They do not fail on 0x7F or 0x80-0x9F (see https://en.wikipedia.org/wiki/C0_and_C1_control_codes).
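This observation can be checked with a short probe. The sketch below is illustrative only (rejectedControlCodes is a hypothetical name): it places each C0 code, 0x7F, and each C1 code unescaped inside a one-character JSON string and records which ones make JSON.parse() throw.

```javascript
// Sketch: find out which control codes JSON.parse() rejects when they
// appear unescaped inside a string literal.
function rejectedControlCodes() {
  const candidates = [];
  for (let code = 0x00; code <= 0x1f; code++) candidates.push(code); // C0
  candidates.push(0x7f); // DEL
  for (let code = 0x80; code <= 0x9f; code++) candidates.push(code); // C1

  const rejected = [];
  for (const code of candidates) {
    const text = '"' + String.fromCharCode(code) + '"';
    try {
      JSON.parse(text);
    } catch (_error) {
      rejected.push(code);
    }
  }
  return rejected;
}

const rejected = rejectedControlCodes();
console.log(rejected.length); // 32
console.log(Math.max(...rejected).toString(16)); // "1f"
```

The result matches the range quoted above, which suggests that any pre-processing only needs to target 0x00-0x1F.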

Remember to fix the "Copy JSON-LD" bookmarklet accordingly, as well.

Event Timeline

diegodlh created this task. Sep 22 2022, 1:02 PM

Restricted Application added a subscriber: Strainu. Sep 22 2022, 1:02 PM

diegodlh moved this task from To do to Doing on the Web2Cit (Grant end) board. Sep 22 2022, 1:02 PM

Fixed in v2.0.0-alpha.2.
