
Wikidata nodeID values sometimes start with numbers, causing parsing issues.
Closed, Declined (Public)

Description

Problem:
A user pointed out on the Project Chat (link, permalink) that the current RDF exports are not valid RDF.

Specifically, running the following code (with rdflib 4.2.2):

import rdflib

G = rdflib.Graph()
G.load('https://www.wikidata.org/wiki/Special:EntityData/Q42.rdf')

produces the error

rdflib.exceptions.ParserError: https://www.wikidata.org/wiki/Special:EntityData/Q42.rdf:5125:2: rdf:nodeID value is not a valid NCName: 3d66a9a972a16b3583effd41e5f2aff4

The RDF/XML syntax specification states that an rdf:nodeID value must have type rdf-id. rdf-id is equivalent to the XML NCName production, and an NCName cannot start with a digit.
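To make the rule concrete, here is a minimal sketch of the check that fails. The regular expression is a deliberately simplified, ASCII-only approximation of NCName (the real production also allows many non-ASCII characters), and the function name is hypothetical rather than anything taken from rdflib.

```python
import re

# Simplified, ASCII-only approximation of the XML NCName production:
# the first character must be a letter or underscore; later characters
# may also be digits, hyphens, or periods. Colons are never allowed.
_ASCII_NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def is_ascii_ncname(value: str) -> bool:
    """Return True if value matches the simplified NCName pattern."""
    return _ASCII_NCNAME.fullmatch(value) is not None


# The node ID from the error message starts with a digit, so it is rejected.
print(is_ascii_ncname("3d66a9a972a16b3583effd41e5f2aff4"))   # False
# The same hash with a leading letter passes.
print(is_ascii_ncname("b3d66a9a972a16b3583effd41e5f2aff4"))  # True
```

Hashes that happen to start with a-f already pass this check, which would explain why only some node IDs trigger the parser error.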

Example:

import rdflib
rdflib.Graph().load('https://www.wikidata.org/wiki/Special:EntityData/Q42.rdf?revision=1283437880')

Acceptance criteria:

  • Wikidata's RDF output is valid

Notes:

  • coordinate this change with Query Service team
  • It seems that prepending a letter should fix this issue. See also Lucas' comment.
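The prefixing idea from the notes could look roughly like the sketch below. It is not Wikibase's actual code: the function name and the choice of "n" as the prefix are assumptions made purely for illustration.

```python
def to_node_id(hash_hex: str, prefix: str = "n") -> str:
    """Build an rdf:nodeID value from a blank-node hash.

    Both the function name and the default prefix are hypothetical.
    The prefix must start with an ASCII letter or underscore so that
    the result can never begin with a digit.
    """
    first = prefix[:1]
    if not (first == "_" or (first.isascii() and first.isalpha())):
        raise ValueError("prefix must start with a letter or underscore")
    return prefix + hash_hex


print(to_node_id("3d66a9a972a16b3583effd41e5f2aff4"))
# n3d66a9a972a16b3583effd41e5f2aff4
```

Because the prefix is constant, the IDs stay as stable as the underlying hashes.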

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2020-08-28T00:34:10Z] <wm-bot> <lucaswerkmeister-wmde> deployed 7c97722445 (work around T252731)

If I understand correctly, these node IDs are specifically used for blank nodes. We used to number these sequentially, but then switched to using hashes in the change “generate stable labels for blank nodes” (compare T245541); I assume that’s what caused this issue.

Since we only recently changed these node IDs, and I believe they’re not meant to be meaningful, we can probably change them again – but we should coordinate this with the query service team to ensure the query service updater doesn’t get confused by the new format.

No objections to prefixing a letter or a couple of characters here; the query service munging process can easily be adapted to remove such prefixes when skolemizing the blank nodes.
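As a rough illustration of that adaptation (not the query service's actual munger), the sketch below strips a known prefix before building a skolem IRI. The prefix "n", the example.org base IRI, and the function name are all assumptions; only the /.well-known/genid/ path follows the convention recommended by RDF 1.1 for skolem IRIs.

```python
NODE_ID_PREFIX = "n"  # hypothetical prefix added by the RDF export
SKOLEM_BASE = "https://example.org/.well-known/genid/"  # placeholder base IRI


def skolemize(node_id: str) -> str:
    """Turn a blank-node ID into a skolem IRI, dropping the export prefix."""
    if node_id.startswith(NODE_ID_PREFIX):
        label = node_id[len(NODE_ID_PREFIX):]
    else:
        # Old-format IDs without the prefix are passed through unchanged.
        label = node_id
    return SKOLEM_BASE + label


print(skolemize("n3d66a9a972a16b3583effd41e5f2aff4"))
# https://example.org/.well-known/genid/3d66a9a972a16b3583effd41e5f2aff4
```

Choosing a prefix outside the hexadecimal alphabet (anything other than a-f) keeps the stripping unambiguous, since an unprefixed hash can never start with it.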

I just realized that T244341: Stop using blank nodes for encoding SomeValue and OWL constraints in WDQS would also solve this issue, since we would no longer emit blank nodes at all.

Lydia_Pintscher subscribed.

I'm closing this in favor of T244341 then, and we'll push that forward.