
Semantically define arity of statement -> reference relations
Closed, Resolved (Public)

Description

This is a data model and RDF serialization problem.

The primary use case is for measuring and evaluating "unreferenced statements", a nullary relationship that dominates the data set. (See T117234)

Since there are no attributes/properties in the data model/ontology to represent the arity of statement-to-reference relationships, querying by arity is not currently possible with SPARQL.

See http://www.w3.org/TR/swbp-n-aryRelations/ for recommendations on implementation.

Event Timeline

Christopher raised the priority of this task to Needs Triage.
Christopher updated the task description.
Christopher subscribed.
Restricted Application added subscribers: StudiesWorld, Aklapper.

Arity 0, 1 and "more than one" are reasonably easily accessible, at least for smaller subsets of the data.

For arity 0, the FILTER NOT EXISTS { ... }, MINUS { ... }, or OPTIONAL { ... } FILTER (!bound( ... )) constructions can all be used, at least on smaller subsets, to get a set of items for which something is not true.
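
For instance, a minimal sketch of the FILTER NOT EXISTS form at the statement level, using date of birth (P569) purely as a placeholder property:

PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX p: <http://www.wikidata.org/prop/>

# Statements of one property that have no reference at all (arity 0)
SELECT ?stmt WHERE {
  ?item p:P569 ?stmt .
  FILTER NOT EXISTS { ?stmt prov:wasDerivedFrom ?ref }
}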

Obviously this can be inefficient for a dataset where arity 0 is dominant; but an alternative approach might just be to count the statements that are referenced, and subtract that from the total number of statements.

The number of referenced statements could in principle be obtained from a query like:

PREFIX prov: <http://www.w3.org/ns/prov#>

SELECT (COUNT(DISTINCT ?stmt) AS ?count) WHERE {  
   ?stmt prov:wasDerivedFrom ?ref 
}

though you would need to allow the query more time (and maybe more memory) than the default WDQS service does.

A set of items with exactly one value for a property can be obtained by looking for a second value, and then excluding any item for which one is present:

?item wdt:PROP ?value1
OPTIONAL {
    ?item wdt:PROP ?value2 .
    FILTER (!sameTerm(?value1, ?value2))
}
FILTER (!bound(?value2))
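
Filled out into a complete query, with films and cast member (P161) chosen purely as an illustration, the pattern might look like:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

# Films (Q11424) that have exactly one cast member (P161) value
SELECT ?item WHERE {
  ?item wdt:P31/wdt:P279* wd:Q11424 .
  ?item wdt:P161 ?value1 .
  OPTIONAL {
    ?item wdt:P161 ?value2 .
    FILTER (!sameTerm(?value1, ?value2))
  }
  FILTER (!bound(?value2))
}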

A breakdown of the incidence-frequencies for larger numbers of different values can be obtained by using a sub-select:

SELECT ?nvals (COUNT(DISTINCT(?item)) AS ?count) WHERE {
  {
    SELECT ?item (COUNT(DISTINCT(?value)) AS ?nvals)
    WHERE {
        ?item wdt:PROP ?value
    }  GROUP BY ?item
  }
} GROUP BY ?nvals
ORDER BY ?nvals

Whole-database statistics like total reference counts may be an exception; but for most smaller, specific parts of the dataset, I would have thought it makes more sense to obtain counts of multivaluedness by querying than by actively storing and maintaining a record of the arity in the triplestore.

For example, here is a query to break down the number of films by the number of values of cast member (P161) that we have for them:

PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX wd: <http://www.wikidata.org/entity/> 
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?nvals (COUNT(DISTINCT(?item)) AS ?count) WHERE {
  {
    SELECT ?item (COUNT(DISTINCT(?value)) AS ?nvals)
    WHERE {
        ?item wdt:P31/wdt:P279* wd:Q11424 .
        ?item wdt:P161 ?value
    } GROUP BY ?item
  }
} GROUP BY ?nvals 
ORDER BY ?nvals

... and the number of films with no cast member listed:

PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX wd: <http://www.wikidata.org/entity/> 
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT (COUNT(DISTINCT(?item)) AS ?count) WHERE {
   ?item wdt:P31/wdt:P279* wd:Q11424 .
   FILTER NOT EXISTS {?item wdt:P161 ?value}
}

@Jheald Thank you for your suggestions. What is fairly clear from my research is that counting-type queries over large (or undefined) ranges with an unbound domain are just not possible (without huge resource consumption) when the namespace contains millions and millions of triples. For example, the query

PREFIX prov: <http://www.w3.org/ns/prov#>

SELECT (COUNT(DISTINCT ?stmt) AS ?count) WHERE {  
   ?stmt prov:wasDerivedFrom ?ref 
}

will not work, even with no query timeout. I tried it on http://wdm-rdf.wmflabs.org and it used all of the 8 GB of heap space and crashed Blazegraph. Of course, there are ways to use SPARQL to post-process/filter manageable result sets (in memory) as you suggest, but this does not seem possible for the 800M+ triples in wdq.

By introducing an "arity class property" (like "hasNullReference"), the evaluation over all the data can be achieved with minimal processing overhead, because the query ranges over a boolean value rather than a variable like "all references".
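
To make that concrete: if the ontology carried such a flag (purely hypothetical, it does not exist in the Wikibase ontology today), the evaluation would reduce to a single bounded lookup, something like:

PREFIX wikibase: <http://wikiba.se/ontology#>

# wikibase:hasNullReference is hypothetical, not a real ontology term
SELECT (COUNT(?stmt) AS ?count) WHERE {
  ?stmt wikibase:hasNullReference true .
}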

Quick edit: I ran the following query successfully in 13 min 11 s 476 ms, returning 312,068 results giving the arity of GND (P227) property statements. So it is possible, but really, really slow.

prefix wikibase: <http://wikiba.se/ontology#>
prefix wdt: <http://www.wikidata.org/prop/direct/>
prefix prov: <http://www.w3.org/ns/prov#>
prefix wd: <http://www.wikidata.org/entity/>
prefix p: <http://www.wikidata.org/prop/>

SELECT ?wds (count(distinct(?o)) AS ?ocount) WHERE {
  ?s p:P227 ?wds .
  ?wds a wikibase:Statement .
  OPTIONAL {
    ?wds prov:wasDerivedFrom ?o
  }
} GROUP BY ?wds

Yeah, 13-minute queries are not really the best idea, I'm afraid. Also, ?wds a wikibase:Statement should not have worked on query.wikidata.org, since it strips those triples (see https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#WDQS_data_differences), though it can of course be run against the raw dump. In fact, you do not need the "a statement" part, since only statements ever appear on the right of p: predicates anyway.

I also don't think DISTINCT is needed in the last query, since having the same reference twice is pretty rare, I think. And the OPTIONAL part could maybe be omitted too: if you enumerate all statements and then remove the ones with non-zero counts, you get the ones with zero counts (e.g. the MINUS operator could do it; see the sketch below).
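
For the zero-reference statements themselves, a sketch of that MINUS variant (untested here) might be:

prefix prov: <http://www.w3.org/ns/prov#>
prefix p: <http://www.wikidata.org/prop/>

# P227 statements with no reference at all
SELECT (count(?wds) AS ?count) WHERE {
  ?s p:P227 ?wds .
  MINUS { ?wds prov:wasDerivedFrom ?o }
}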

With these modifications, a query like:

prefix wikibase: <http://wikiba.se/ontology#>
prefix wdt: <http://www.wikidata.org/prop/direct/>
prefix prov: <http://www.w3.org/ns/prov#>
prefix wd: <http://www.wikidata.org/entity/>
prefix p: <http://www.wikidata.org/prop/>

SELECT ?wds (count(?o) AS ?ocount) WHERE {
  ?s p:P227 ?wds .
  ?wds prov:wasDerivedFrom ?o .
} GROUP BY ?wds

runs for me in 26 s. Of course, I may be missing something here.

In general, the query service may not be well suited to queries that require touching the whole database or a significant part of it; they will be slow. Going over 300K+ entities one by one has to take some time.

I am not sure why we would want to list out all the statements. Surely we just want to count them?

The query below runs in 8.4 seconds:

PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX wd: <http://www.wikidata.org/entity/> 
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX p: <http://www.wikidata.org/prop/>

SELECT ?nrefs (COUNT(?wds) AS ?count) WHERE {
  {
    SELECT ?wds (COUNT(DISTINCT(?ref)) AS ?nrefs)
    WHERE {
        ?item p:P227 ?wds .
        ?wds prov:wasDerivedFrom ?ref .
    } GROUP BY ?wds
  }
} GROUP BY ?nrefs 
ORDER BY ?nrefs

while just counting the number of distinct referenced statements runs in 4 seconds.
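
That second query is not shown above, but presumably it is just the P227-restricted version of the earlier whole-database count, along the lines of:

PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX p: <http://www.wikidata.org/prop/>

SELECT (COUNT(DISTINCT ?wds) AS ?count) WHERE {
  ?item p:P227 ?wds .
  ?wds prov:wasDerivedFrom ?ref .
}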

Of course the latter is the exact same query that Christopher found crashed Blazegraph, when he ran it for the full 800 million triples, rather than just ~400,000 statements.

But I note that the problem that actually killed the query was that the heap space got exhausted.

Is this the kind of thing that Blazegraph's "AnalyticQuery" mode is designed to help with?
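
For reference, Blazegraph's analytic mode can apparently be requested per query with a query hint; the exact syntax below is from its documentation as best I recall it, so treat it as an assumption:

PREFIX hint: <http://www.bigdata.com/queryHints#>
PREFIX prov: <http://www.w3.org/ns/prov#>

SELECT (COUNT(DISTINCT ?stmt) AS ?count) WHERE {
  # Ask Blazegraph to use its analytic (native memory) query engine
  hint:Query hint:analytic "true" .
  ?stmt prov:wasDerivedFrom ?ref .
}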

@Jheald Perfect. This works; even with the OPTIONAL added, it runs in 10 seconds. Yeah, outputting the statements is definitely unnecessary and adds a lot of time.

Total results: 5, duration: 10445 ms
nrefs	count
0	39775
1	339700
2	10050
3	382
4	14

Conclusion: Faceting the namespace by property (and avoiding unnecessary output processing) is a practical way to get this data. Thanks again.

Lydia_Pintscher set Security to None.

@Christopher is there anything else needing to be done with this?

@Smalyshev no, I think that this specific issue has been practically resolved.

Smalyshev claimed this task.