
Investigate how blank nodes are used and synced between wikibase and wdqs
Open, Needs Triage · Public

Description

When testing the new merging updater we found that the number of triples dropped significantly (T231411#5674890).
One suggested reason is that the way we sync blank nodes has changed. We already know that some blank nodes are duplicated in the triple store (T231515); because a blank node has no stable identifier across exports, it is hard to keep in sync with the triple store.
This ticket tracks how blank nodes are used in the RDF output from Wikibase, with the goal of making sure that we do not duplicate them during the update process.

Blank nodes used to denote "unknown value" in Wikidata

These blank nodes are never used as a subject.

As the object of a statement qualifier:

s:Q233-a9844587-4029-33bc-7b34-13b0d3c10ed3 a wikibase:Statement,
		wikibase:BestRank ;
	wikibase:rank wikibase:NormalRank ;
	ps:P138 wd:Q10987 ;
	pq:P407 _:genid1 .

from: https://www.wikidata.org/wiki/Special:EntityData/Q233.ttl

(does not seem to be duplicated currently)

As statement values:

s:Q17619314-5cd290f5-4659-e699-74b9-52714a955c62 a wikibase:Statement,
		wikibase:BestRank ;
	wikibase:rank wikibase:NormalRank ;
	ps:P268 _:genid4 ;
	pq:P813 "2016-03-14T00:00:00Z"^^xsd:dateTime ;
	pqv:P813 v:bcddb148b45928cdcf857b69eeb88df9 .

from: https://www.wikidata.org/wiki/Special:EntityData/Q17619314.ttl
(does not seem to be duplicated currently).

However, the same "unknown value" does not seem to be represented by the same blank node when it appears as both the ps: object and the wdt: object (T239397).
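One way to check this on a single item is sketched below. It is an illustration rather than a query from this ticket: it assumes an endpoint that loads the export with blank nodes as shown above, reuses Q17619314 and P268 from the example, and uses the standard Wikidata prefixes.

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>

# Compare the blank node reached through the direct (wdt:) predicate
# with the one reached through the statement (p:/ps:) path.
SELECT ?directValue ?statement ?statementValue
       (sameTerm(?directValue, ?statementValue) AS ?sameNode)
WHERE {
  wd:Q17619314 wdt:P268 ?directValue ;
               p:P268 ?statement .
  ?statement ps:P268 ?statementValue .
  FILTER(isBlank(?directValue) && isBlank(?statementValue))
}
```

If ?sameNode is false for every row, the two positions are backed by different blank nodes. An item with several "unknown value" statements for the same property would also produce false rows, so the result needs to be read per statement.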

Some use cases that use such blank nodes

Blank nodes used to indicate OWL constraints on properties
wdno:P3418 a owl:Class ;
	owl:complementOf _:genid1 .

_:genid1 a owl:Restriction ;
	owl:onProperty wdt:P3418 ;
	owl:someValuesFrom owl:Thing .

from https://www.wikidata.org/wiki/Special:EntityData/P3418.ttl

These are not properly synced and are duplicated (T231515).
But are they really useful in the triple store? These constraints seem to always be the same, and since we use neither an inference engine nor constraint checks, do we really need to sync them?
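The duplication could be measured with a query along these lines. This is a sketch, not the query used in T231515; it assumes that each "no value" class should carry exactly one blank-node restriction, as in the export above.

```sparql
PREFIX owl: <http://www.w3.org/2002/07/owl#>

# Count the blank-node restrictions attached to each "no value" class.
# The export has one per class, so a count above 1 suggests copies
# left behind by the update process.
SELECT ?class (COUNT(?restriction) AS ?copies) WHERE {
  ?class owl:complementOf ?restriction .
  FILTER(isBlank(?restriction))
}
GROUP BY ?class
HAVING (COUNT(?restriction) > 1)
ORDER BY DESC(?copies)
```

An empty result would indicate that no class has more than one restriction attached.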

Find others

We should investigate other uses of blank nodes by extracting all of them from the triple store using this query:

select ?p (count(*)as ?cnt) {
  ?s ?p ?o .
  filter (isBlank(?o))
}
group by ?p

Details

Related Gerrit Patches:
wikidata/query/rdf : master · Increase the max allowed query timeout

Event Timeline

dcausse created this task. Nov 28 2019, 1:39 PM
Restricted Application added a subscriber: Aklapper. Nov 28 2019, 1:39 PM
dcausse updated the task description. Nov 28 2019, 1:40 PM
dcausse added a subscriber: Igorkim78.
dcausse updated the task description. Nov 28 2019, 1:55 PM

Change 553757 had a related patch set uploaded (by DCausse; owner: DCausse):
[wikidata/query/rdf@master] Increase the max allowed query timeout

https://gerrit.wikimedia.org/r/553757

select ?p (count(*)as ?cnt) {
  ?s ?p ?o .
  filter ((!isLiteral(?o))&&(!isUri(?o)))
}
group by ?p

This can be simplified using isBlank, I believe:

SELECT ?p (COUNT(*) AS ?cnt) WHERE {
  ?s ?p ?o.
  FILTER(isBlank(?o))
}
GROUP BY ?p

But in general, are these triples interesting in wdqs? Since they're never used as a subject there is no way to use them directly; it seems the only thing that we can do is to display them (T173248).

I’m not sure what you mean, to be honest… but they’re used in some useful queries, e. g. heads of state and government (“end time” qualifier absent or unknown) or anonymous paintings (unknown creator). You can find more by searching for isBlank on Wikidata project pages (most results use !isBlank to filter out blank nodes, but there are some non-negated uses too).
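A minimal sketch of the "anonymous paintings" pattern, for illustration only: it is not copied from the linked query, and the identifiers P170 (creator) and Q3305213 (painting) are supplied here rather than taken from this ticket. It assumes "unknown value" is stored as a blank node, as described in the task description.

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

# Paintings whose creator is recorded as "unknown value":
# the object of wdt:P170 is a blank node instead of an item.
SELECT ?painting WHERE {
  ?painting wdt:P31 wd:Q3305213 ;   # instance of: painting
            wdt:P170 ?creator .     # creator
  FILTER(isBlank(?creator))
}
LIMIT 20
```

Replacing isBlank with !isBlank gives the more common negated form, which keeps only paintings with a named creator.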

Blank nodes are used to represent "unknown value" in Wikibase. They are also used (completely unrelatedly) in the implementation of the OWL class that describes "no value" properties (since "no value" is not a value, it is implemented with a predicate and a class instead of the usual property predicate).
The class definition itself is useful only for the tools that actually understand OWL semantics. Most tools that query WDQS do not.
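For contrast with the blank-node case, a "no value" claim can be matched through the class alone, without touching the OWL definition. The sketch below assumes the usual Wikibase RDF mapping, in which an item with a best-rank "no value" claim is typed with the corresponding wdno: class; P40 (child) and Q5 (human) are example identifiers that do not come from this ticket.

```sparql
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wdno: <http://www.wikidata.org/prop/novalue/>

# Humans explicitly recorded as having no child: the item is an
# instance of the wdno: class, and no blank node is involved.
SELECT ?item WHERE {
  ?item wdt:P31 wd:Q5 ;   # instance of: human
        a wdno:P40 .      # "no value" for child
}
LIMIT 20
```

The query never reads the owl:complementOf restriction, which is consistent with the point that the class definition only matters to OWL-aware tools.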

dcausse added a comment. Edited Dec 2 2019, 8:06 AM

But in general, are these triples interesting in wdqs? Since they're never used as a subject there is no way to use them directly; it seems the only thing that we can do is to display them (T173248).

I’m not sure what you mean, to be honest… but they’re used in some useful queries, e. g. heads of state and government (“end time” qualifier absent or unknown) or anonymous paintings (unknown creator). You can find more by searching for isBlank on Wikidata project pages (most results use !isBlank to filter out blank nodes, but there are some non-negated uses too).

Thanks for the links, I was not aware of these use cases. What I meant by saying that these triples are not interesting is that I did not think they could be used as filters like that. My understanding of blank nodes is that they are useful for linking triples through a resource that cannot be named. In this case they do not link anything, and we seem to use only the fact that they are "blank" to determine or filter something. Perhaps an approach similar to wdno would have worked here? Anyway, it is certainly not worth reconsidering this at this point, and I'm certainly lacking a lot of context as to why such decisions were made.

I also realize that this ticket is lacking some context; I will amend it to mention the reasons why we are investigating bnodes.

dcausse updated the task description. Dec 2 2019, 8:17 AM
dcausse updated the task description. Dec 2 2019, 8:22 AM

Change 553757 merged by jenkins-bot:
[wikidata/query/rdf@master] Increase the max allowed query timeout

https://gerrit.wikimedia.org/r/553757

We need statistics on how many triples use bnode as an object:
{code}
select ?p (count(*) as ?cnt) {
  ?s ?p ?o .
  filter (isBlank(?o))
}
group by ?p
{code}
and as a subject (if any)
{code}
select ?p (count(*) as ?cnt) {
  ?s ?p ?o .
  filter (isBlank(?s))
}
group by ?p
{code}

dcausse updated the task description. Dec 3 2019, 4:39 PM
dcausse updated the task description.

P9859 contains the output of

select ?p (count(*)as ?cnt) {
  ?s ?p ?o .
  filter (isBlank(?o))
}
group by ?p

run on wdqs2006

I will run the same query with the filter on the subject, as asked. The expectation is to find only the triples of the restrictions referenced by owl:complementOf, around 42K.

select ?p (count(*)as ?cnt) {
  ?s ?p ?o .
  filter (isBlank(?s))
}
group by ?p

output is at P9862

As expected, the only predicates with a blank-node subject are those of the OWL restriction referenced by owl:complementOf (rdf:type, owl:onProperty and owl:someValuesFrom), as exported by Wikibase today:

wdno:P31 a owl:Class ;
	owl:complementOf _:genid1 .

_:genid1 a owl:Restriction ;
	owl:onProperty wdt:P31 ;
	owl:someValuesFrom owl:Thing .

@Igorkim78 could you take a look?