
Wikidata Query Service should provide a way to retrieve all items without a statement on a certain wiki
Closed, Resolved · Public

Description

Currently we use autolist2 to get the items in a category, combine that with a query, and work on the intersection. The category might give several hundred items, but the query matches everything with either P31 or P279, so the result is huge (for nlwiki about 1.8 million items). This makes the tool very slow and heavy, and it times out every once in a while. Take for example https://tools.wmflabs.org/autolist/?language=nl&project=wikipedia&category=Motorfietstechniek&depth=0&wdq=&pagepile=&wdqs=SELECT%20%3Fitem%20%0AWHERE%0A%7B%0A%09%3Fsitelink%20schema%3Aabout%20%3Fitem%20.%20%3Fsitelink%20schema%3AinLanguage%20%22nl%22%20%0A%20%20%20%20.%20%7B%20%3Fitem%20wdt%3AP31%20%3Fp31%20%7D%20UNION%20%7B%20%3Fitem%20wdt%3AP279%20%3F279%20%7D%0A%7D&statementlist=P&run=Run&mode_manual=or&mode_cat=and&mode_wdq=not&mode_wdqs=not&mode_find=or&chunk_size=10000
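
For readability, the wdqs parameter of the URL above decodes to roughly the following query (whitespace normalized, variable names kept as in the URL). The mode_wdqs=not parameter appears to tell the tool to subtract this result set from the category items, which is why the whole 1.8 million item set has to be loaded first:

```sparql
# Decoded from the wdqs= parameter of the autolist URL above.
# Matches every item with a Dutch-language sitelink that has
# at least one P31 or P279 statement.
SELECT ?item
WHERE
{
  ?sitelink schema:about ?item .
  ?sitelink schema:inLanguage "nl" .
  { ?item wdt:P31 ?p31 } UNION { ?item wdt:P279 ?279 }
}
```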

Getting pages in category tree... 263 pages found.

Getting corresponding Wikidata items... 263 items found.

Getting WDQS data... 1,871,517 items loaded.

Combining datasets...
After OR : 0 items.
After AND : 263 items.
After NOT : 251 items.
251 items in combination.

Query took 118.65857410431 seconds. 0.5 MB memory used.

The other way around would be better: do a query to get all items that have a sitelink to some wiki, but no statements. Currently this times out. We discussed this on IRC, and one solution is to add a new triple to store the number of statements (and the number of sitelinks while we're at it). That way we can just query for that new triple. For that, http://wikiba.se/ontology-1.0.owl needs to be expanded.

Event Timeline


The sitelink query could probably be made faster by T127574. Right now it's kind of hard to query by sitelink on a specific site.

This works:

SELECT * WHERE {
 ?link schema:about ?item .
 ?link schema:isPartOf <https://en.wikipedia.org/> .
 FILTER NOT EXISTS {
   ?item ?p [] .
   FILTER(?p != rdfs:label && ?p != schema:description && ?p != schema:version && ?p != schema:dateModified && ?p != skos:altLabel)
 }
} LIMIT 100

Unfortunately, it times out without a LIMIT, since there are more than 400K such items.

thiemowmde triaged this task as Lowest priority. Aug 31 2016, 7:31 AM
thiemowmde added a project: patch-welcome.
thiemowmde added a subscriber: Jonas.

This will be possible as soon as T129046 deploys and we reload the data.


Awesome job! Looking forward to the new possibilities.

Smalyshev claimed this task.

Possible now with:

SELECT ?item WHERE {
 ?link schema:about ?item .
 ?link schema:isPartOf <https://en.wikipedia.org/> .
 ?item wikibase:statements 0 .
}

Data is not reloaded yet, so not all items will have it, and T145712 may also be a problem, but the possibility is implemented, so I deem this resolved.
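
As a rough sketch of how the same triple pattern could be used to gauge the size of the backlog for one wiki, the query above can be turned into a count. This is an untested illustration, not something posted in the original thread: the nl.wikipedia.org site URL is swapped in as an example, and the result will be incomplete until the data reload mentioned above has finished.

```sparql
# Illustrative variant (assumption, not from the original thread):
# count the items linked from nl.wikipedia.org that have no statements.
SELECT (COUNT(?item) AS ?count) WHERE {
  ?link schema:about ?item .
  ?link schema:isPartOf <https://nl.wikipedia.org/> .
  ?item wikibase:statements 0 .
}
```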

Thank you very much, @Smalyshev. To complete this ticket, here is the original request rerun with the new query:

https://tools.wmflabs.org/autolist/?language=nl&project=wikipedia&category=Motorfietstechniek&depth=0&wdq=&pagepile=&wdqs=SELECT%20%3Fitem%20WHERE%20%7B%0A%20%3Flink%20schema%3Aabout%20%3Fitem%20.%0A%20%3Flink%20schema%3AisPartOf%20%3Chttps%3A%2F%2Fnl.wikipedia.org%2F%3E%20.%0A%20%3Fitem%20wikibase%3Astatements%200%20.%0A%7D&statementlist=P&run=Run&mode_manual=or&mode_cat=and&mode_wdq=not&mode_wdqs=and&mode_find=or&chunk_size=10000
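
Decoded, the wdqs parameter of this URL is the query from the previous comment with the site changed to nl.wikipedia.org. Note that mode_wdqs is now "and" rather than "not", so the tool appears to intersect the category items with this much smaller result set instead of subtracting a huge one:

```sparql
# Decoded from the wdqs= parameter of the autolist URL above.
# Items with a sitelink on nl.wikipedia.org and zero statements.
SELECT ?item WHERE {
  ?link schema:about ?item .
  ?link schema:isPartOf <https://nl.wikipedia.org/> .
  ?item wikibase:statements 0 .
}
```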

Getting pages in category tree... 263 pages found.

Getting corresponding Wikidata items... 263 items found.

Getting WDQS data... 131 items loaded.

Combining datasets...
After OR : 0 items.
After AND : 2 items.
2 items in combination.

Query took 0.44470715522766 seconds. 0.5 MB memory used.

Much faster!