
Incorrect use of blank node to represent unknown and no value in Wikidata
Open, Needs Triage, Public, Bug Report

Description

During the BioHackathon where we are working on subsetting Wikidata, we ran into the issue of blank nodes being used in the RDF of Wikidata to express unknown and no values. Unfortunately, this isn't consistent because blank nodes are also used to express other things such as owl:complementOf (e.g. Q42).

These blank nodes are also problematic for anything that traverses Wikidata node-by-node such as faceted browsers or ShEx validators.

It is not explicitly incorrect to have blank nodes in RDF data, but it is:

  1. inconsistent with the approach that Wikidata has taken (which is to avoid blank nodes)
  2. ambiguous, because in RDF blank nodes do not imply unknown values; they are simply *unidentified* nodes in the graph.

Steps to Reproduce:

SELECT ?p ?o { _:2d22892344b969be376b57170b5e495f ?p ?o }
  • Because a blank node label in a SPARQL query pattern acts as a variable rather than as a reference to that specific node, this will try to get every triple in the database.
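
The effect can be illustrated without a SPARQL endpoint. The following is a toy sketch in Python, not WDQS code: the `match` helper and the sample triples are made up for illustration (the property ID `P184` for the doctoral advisor statement is an assumption), and it models only the one rule that matters here, namely that a blank node label in a query pattern matches any term, just like a variable.

```python
# Toy triple-pattern matcher; illustrative only, not how WDQS is implemented.
# Assumption: terms are plain strings; "?x" is a variable, "_:x" is a blank node label.

def is_wildcard(term):
    # In a SPARQL query pattern, a blank node label matches any term,
    # just like a variable (it simply cannot be projected in SELECT).
    return term.startswith("?") or term.startswith("_:")

def match(pattern, triples):
    """Return the triples matching a single (s, p, o) pattern."""
    results = []
    for triple in triples:
        bindings = {}
        ok = True
        for pat, val in zip(pattern, triple):
            if is_wildcard(pat):
                # The same wildcard must bind to the same value within the pattern.
                if bindings.setdefault(pat, val) != val:
                    ok = False
                    break
            elif pat != val:
                ok = False
                break
        if ok:
            results.append(triple)
    return results

# Made-up sample data: one statement whose value is a blank node in the stored data.
DATA = [
    ("wd:Q313093", "wdt:P184", "_:2d22892344b969be376b57170b5e495f"),
    ("wd:Q313093", "rdfs:label", '"Herman Boerhaave"'),
    ("wd:Q42", "rdfs:label", '"Douglas Adams"'),
]

by_bnode_label = match(("_:2d22892344b969be376b57170b5e495f", "?p", "?o"), DATA)
by_variable = match(("?s", "?p", "?o"), DATA)
```

In this sketch both patterns return all three sample triples, which is why copying a blank node label out of the Turtle output and pasting it into a query does not select that node.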

Remedy:
Invent a system-wide identifier for unknown values and use that Q identifier for every reference to an unknown value.
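
As a rough sketch of what this remedy could look like on exported triples, the snippet below rewrites blank-node values into one shared identifier. Everything here is an assumption made for illustration: `wd:Q_UNKNOWN_VALUE` is a placeholder (no such item exists), the list of value-carrying predicate prefixes is a guess, and triples are modeled as plain string tuples.

```python
# Hypothetical sentinel; no such Wikidata item exists, the name is a placeholder.
UNKNOWN_VALUE = "wd:Q_UNKNOWN_VALUE"

# Assumption: predicates with these prefixes carry statement values.
VALUE_PREFIXES = ("wdt:", "ps:", "pq:", "pr:")

def replace_unknown_values(triples):
    """Replace blank-node objects of statement predicates with the sentinel."""
    rewritten = []
    for s, p, o in triples:
        if o.startswith("_:") and p.startswith(VALUE_PREFIXES):
            o = UNKNOWN_VALUE
        rewritten.append((s, p, o))
    return rewritten

sample = [
    # Unknown value stored as a blank node (property ID assumed).
    ("wd:Q313093", "wdt:P184", "_:2d22892344b969be376b57170b5e495f"),
    # Blank node belonging to an OWL construct; deliberately left untouched.
    ("wdno:P40", "owl:complementOf", "_:b1"),
]
result = replace_unknown_values(sample)
```

One consequence of a single shared identifier is that all unknown values become the same node, so a join on that value would link otherwise unrelated items; any real design would need to decide whether that is acceptable.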

Event Timeline

I don't believe it's the same. T244341: Stop using blank nodes for encoding SomeValue and OWL constraints in WDQS is about the use of blank nodes in the representation of some OWL expressions. The OWL spec is very clear about saying that they have to be blank nodes so the decision was made by the OWL Working Group 15 years ago.

This issue is about the use of blank nodes to represent unknown values. The example above can be seen in https://www.wikidata.org/wiki/Q313093#Q313093$761f04ee-4d7f-d725-522a-85a3077bb47b which says that Herman Boerhaave's second doctoral advisor is an unknown value, but a reader of the Turtle page has no way of knowing that. Creating an entity for Wikidata-unknown-value would unify coding and make life better for queriers.

T244341 also includes the use of blank nodes for unknown values.

Oops, sorry, I failed to notice that.

I propose splitting the issue because the considerations are different and I expect them to have different outcomes:
  • bnodes for OWL constructs are mandated by the OWL specification.
  • bnodes for missing and unknown values are a subjective coding choice.

But the ultimate motivation for that task is to get rid of all blank nodes, wherever they appear, because the new query service updater can’t deal with them. If you say that we have to use blank nodes for something, then we have a big problem – please leave a comment on that task with more details.

(Side note: we don’t use blank nodes for “no value”, and I don’t think we ever did. It would be very inconvenient if FILTER EXISTS { ?item wdt:P40 ?hadAnyChild. } matched “child: no value”.)
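
The inconvenience mentioned in the side note can be made concrete with a small toy model. This is only a sketch under assumptions: triples are plain string tuples, `wd:Q_EXAMPLE` is a placeholder item, and "no value" is assumed to be exported as membership in a `wdno:` class rather than as a `wdt:` triple.

```python
# Toy model of an existence check such as FILTER EXISTS { ?item wdt:P40 ?x }.
# Assumption: triples are plain (subject, predicate, object) string tuples,
# and "wd:Q_EXAMPLE" is a placeholder item, not a real Wikidata entity.

def has_any_value(triples, subject, predicate):
    """True if at least one triple with this subject and predicate exists."""
    return any(s == subject and p == predicate for s, p, _ in triples)

# "No value" as a class membership (wdno:P40), with no wdt:P40 triple at all.
no_value_as_class = [
    ("wd:Q_EXAMPLE", "rdf:type", "wdno:P40"),
]

# Hypothetical alternative: "no value" encoded as a blank node value.
no_value_as_bnode = [
    ("wd:Q_EXAMPLE", "wdt:P40", "_:novalue1"),
]

matches_class_encoding = has_any_value(no_value_as_class, "wd:Q_EXAMPLE", "wdt:P40")
matches_bnode_encoding = has_any_value(no_value_as_bnode, "wd:Q_EXAMPLE", "wdt:P40")
```

Under the class-membership encoding the existence check does not match, while under the hypothetical blank-node encoding an item explicitly recorded as having no child would be reported as having one.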

I'll weigh in on 341 after we reach consensus here on whether our (Andra, Labra, DanBri, myself, ...) interpretation is correct that, at least for http://www.wikidata.org/entity/Q313093.ttl , a blank node represents the unknown value in https://www.wikidata.org/wiki/Q313093#Q313093$761f04ee-4d7f-d725-522a-85a3077bb47b . (Perhaps Andra can speak to "no value" cases.)

I'll weigh in on 341 after we reach consensus here on whether our (Andra, Labra, DanBri, myself, ...) interpretation is correct that, at least for http://www.wikidata.org/entity/Q313093.ttl , a blank node represents the unknown value in https://www.wikidata.org/wiki/Q313093#Q313093$761f04ee-4d7f-d725-522a-85a3077bb47b .

Yes, that’s correct at the moment. (It will eventually change due to T244341.)