
Non-conforming Turtle syntax in RDF dump
Closed, Resolved (Public)

Description

Hello,

I would like to report a bug in the RDF dump offered by Wikidata. It is great that you offer the data in RDF!
I downloaded the following dump:

wikidata-20160829-all-BETA.ttl

Unfortunately, it is not valid Turtle syntax; parsing it produces an error. The error appears around the entity Q815674, where one of the labels is "\a". This is not accepted in Turtle because of the backslash. The error was very difficult to find and also difficult to eliminate, since the file is 70 GB. I would suggest that in future the file be parsed once and checked for validity before it is published.

Thank you
d063520
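
The pre-publication check suggested above could be sketched roughly as follows. This is a line-by-line heuristic in Python, not a Turtle parser: the names `find_bad_escapes` and `scan` are hypothetical, and the check would also flag the backslash escapes that Turtle permits in prefixed local names. It only assumes that, inside string literals, a backslash may be followed by one of t, b, n, r, f, ", ', \ or by a \uXXXX / \UXXXXXXXX sequence.

```python
import re

# Characters that may follow a backslash in a Turtle string literal (ECHAR).
VALID_ECHAR = set('tbnrf"\'\\')

# A backslash plus what follows it: \uXXXX, \UXXXXXXXX, or one arbitrary character.
# Consuming the following character means "\\a" is read as one valid escape plus "a".
ESCAPE = re.compile(r'\\(u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8}|.?)', re.DOTALL)


def find_bad_escapes(line):
    """Return the escape sequences in one line that Turtle literals do not allow."""
    bad = []
    for match in ESCAPE.finditer(line):
        body = match.group(1)
        # Longer bodies are well-formed \u or \U escapes; single ones must be ECHAR.
        if len(body) > 1 or body in VALID_ECHAR:
            continue
        bad.append(match.group(0))
    return bad


def scan(path):
    """Yield (line number, escape) pairs, streaming so a large dump fits in memory."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            for escape in find_bad_escapes(line):
                yield number, escape
```

Iterating over `scan(...)` with the path of a dump would print candidate line numbers to inspect; a real validation step would still need a proper Turtle parser.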

Event Timeline

For reference: "\a" violates section 2.5.1 of the RDF 1.1 Turtle spec: https://www.w3.org/TR/turtle/#turtle-literals

The relevant code for escaping literals in Turtle is in N3Quoter::escapeLiteral in the Purtle component. A quick test shows that it correctly turns '\a' into '\\a'. Further investigation is needed to confirm and locate the problem and to identify the cause.

daniel triaged this task as High priority. Sep 15 2016, 9:16 AM
daniel added a subscriber: Smalyshev.

@D063520 do you happen to know in which item the invalid label occurs, or what language is associated with the literal?

Confirmed: the Turtle representation of Q815674 contains an unescaped "\a", as can be seen at https://www.wikidata.org/entity/Q815674.ttl. The label "\a" is not what causes the problem; it is correctly escaped to "\\a". The problem is caused by the statement value for P487 (Unicode Character), which contains a literal U+0007 that gets escaped to "\a". That escape sequence is valid for U+0007 in many languages, but apparently not in Turtle, where U+0007 needs to be written as "\u0007".

Relevant Turtle snippet:

	wd:Q815674 a wikibase:Item ;
		.....
		skos:altLabel "Caractère D'appel"@fr,
			"Caractere d'appel"@fr,
			"Bell character"@fr,
			"␇"@de,
			"\\a"@de,        # VALID
			"BEL"@de;
		wdt:P31 wd:Q617945 ;
		wdt:P487 "\a" ;          # INVALID, should be "\u0007"

A quick test with N3Quoter::escapeLiteral shows that indeed, it turns "\x07" into '\a' instead of '\u0007'.
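
For comparison, the escaping that Turtle expects could be sketched like this. It is a Python illustration, not the Purtle code: `escape_turtle_literal` is a hypothetical name, and writing every remaining code point below U+0020 as \uXXXX is an assumption made for the sketch.

```python
# Short escapes used by this sketch; all of them are valid in Turtle literals.
_SHORT = {
    '\\': '\\\\',
    '"': '\\"',
    '\n': '\\n',
    '\r': '\\r',
    '\t': '\\t',
}


def escape_turtle_literal(text):
    """Escape text for use inside a double-quoted Turtle string literal."""
    out = []
    for ch in text:
        if ch in _SHORT:
            out.append(_SHORT[ch])
        elif ord(ch) < 0x20:
            # Other control characters (U+0007 included) get a numeric escape,
            # never a C-style one such as \a or \v.
            out.append('\\u%04X' % ord(ch))
        else:
            out.append(ch)
    return ''.join(out)
```

With this sketch, the label consisting of a backslash and an "a" still becomes "\\a", while the P487 value U+0007 becomes "\u0007".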

Thank you very much Daniel for taking this over and addressing it so fast. Dennis

Interestingly enough, Java seems to be able to handle that. But yes, it looks like \a is wrong, and that's because we're using addcslashes, which may not be what we want there.

This makes me wonder whether we should just disallow code points below U+0020 in all string input.

Well, I'm not sure banning \n is a good idea, especially given that we have something like P2559.

Also, of course, we couldn't represent Q815674 properly then. Not a huge loss, but still. Turtle seems to be fine with properly encoded low code points, so I'm not sure whether we need to. No idea how other UI parts would react to \0, or whether all other code points are safe, though.
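
To make the trade-off concrete, an input check of the kind discussed here might look like the sketch below. The function name `low_code_points` and the default allowlist containing only \n are hypothetical; which code points should actually be permitted is exactly the open question in this thread.

```python
def low_code_points(text, allowed=frozenset("\n")):
    """Return (index, 'U+XXXX') pairs for code points below U+0020 not in `allowed`."""
    return [
        (index, "U+%04X" % ord(ch))
        for index, ch in enumerate(text)
        if ord(ch) < 0x20 and ch not in allowed
    ]
```

With the default allowlist, a multi-line string passes, while the P487 value of Q815674 (a bare U+0007) is reported and would have to be rejected or special-cased.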

@Smalyshev what worries me is that some characters are apparently illegal in RDF, even if we can encode them in N3. https://www.w3.org/TeamSubmission/n3/#escaping says: "Some escapes (\a, \b, \f, \v) should be avoided because the corresponding characters are not allowed in RDF."

Hmm, good point. We need to dig up which characters are allowed, then.

This: https://www.w3.org/TR/2004/REC-rdf-testcases-20040210/#ntrip_strings seems to include all low characters, so maybe it was changed later?

Seems to be working on test.wikidata.org; see https://test.wikidata.org/wiki/Special:EntityData/Q42.ttl, property P664.