Page MenuHomePhabricator

Wikidata reference URIs have become too many to search with WDQS SPARQL
Open, LowPublic

Description

The Wikidata property P854 (reference URL) is currently being used 2,305,909 times.

Even just to produce this count takes WDQS over ten seconds.

Attempts to search for URLs containing or beginning with a particular string all fall over having timed out -- so it is impossible to do a general search for uses of a particular source.

Would it make sense to enable something like string indexing on URL-valued items, to try to make this possible again?

Event Timeline

Jheald created this task.Feb 10 2017, 4:18 PM
Restricted Application added a project: Discovery. · View Herald TranscriptFeb 10 2017, 4:18 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Jheald added a comment.EditedFeb 10 2017, 4:41 PM

Some queries:

Count of the number of uses of the property -- takes over 10 seconds

SELECT (COUNT(?ref) AS ?count) WHERE {
  ?ref pr:P854 ?refURL .
}

Attempt to extract some references to Le Figaro website -- fails -- though it does seem to try to be creating a JSON list of something

SELECT ?ref ?refURL WHERE {
  ?ref pr:P854 ?refURL .
  FILTER (CONTAINS(str(?refURL),'lefigaro.fr')) .       
} LIMIT 200

This search for Le Figaro used to work -- back when the WDQS SPARQL service was first enabled.

Attempt to use STRSTARTS to possibly save time in the string searching -- fails -- error console appears to contain a similar broken JSON list as before

SELECT ?ref ?refURL WHERE {
  ?ref pr:P854 ?refURL .
  FILTER (STRSTARTS(str(?refURL),'http://www.lefigaro.fr')) .       
} LIMIT 200

Attempt just to count the number of such instances -- still times out

SELECT (COUNT(?ref) AS ?count) WHERE {
  ?ref pr:P854 ?refURL .
  FILTER (STRSTARTS(str(?refURL),'http://www.lefigaro.fr')) .       
}

Attempt to find a meaningless string, to simulate a website used very few times -- still fails to complete

SELECT ?ref ?refURL WHERE {
  ?ref pr:P854 ?refURL .
  FILTER (STRSTARTS(str(?refURL),'http://www.pqrstuvwxyz')) .       
} LIMIT 200

I don't know if something has changed since then in BlazeGraph or the WDQS set-up, or whether I am seeing this just because the number of uses of P:854 may have increased since that time; but not being able to extract at scale the items that reference particular sources makes the property and its data very much less useful.

For completeness, the P:854 can also be used as a qualifier, principally on property P1343 "described by source", though the number of uses is much much smaller.

This usage can be searched, eg this query works:

SELECT ?item ?itemLabel ?refURL WHERE {
  ?item ?prop ?stmt .
  ?stmt pq:P854 ?refURL .
  FILTER (CONTAINS(str(?refURL),'//www.british-history.ac.uk/vch/')) . 
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
} LIMIT 200

However, that is presumably because there are only 8615 such uses in all. (live count).

For comparison, in September 2015 there were 376,524 references using P:854, and a query to search them could complete in 3.7 seconds -- it was used as an example on the query optimisation page. Such a query now times out.

It should be possible for such a search to be very efficient with proper indexing on the values of P:854. But is such indexing being sidelined by the cast to string from URL type? Would it make sense to hold URLs in both URL and string form, to make indexing more possible?

I'm not sure what you mean by "proper indexing". If you ask to go through 2.3 millions of items, it's probably going to take time, and applying operations to it is probably not covered by index. I'm not sure maybe something can be done for prefix matches (just guessing), but for operations like "contains" I don't see any way around fetching each item and applying the operation to it.

One alternative for such queries can be using LDF endpoint: https://query.wikidata.org/bigdata/ldf?subject=&predicate=http%3A%2F%2Fwww.wikidata.org%2Fprop%2Freference%2FP854&object=
and processing the matches client-side. LDF endpoint is fast, but requires the client to load a lot of data and process it.

Jheald added a comment.EditedFeb 14 2017, 12:07 PM

Hi Smalyshev, thanks for taking the time to get back to me.

The LDF suggestion is a good one -- I was thinking of looking in to it for investigating Commons sitelinks and P373s, both of which are now getting close to the borderline of what a query can cope with without timing out, with little time left for filtering. For these it might be quite useful to have all million downloaded locally. But it seems like overkill, if all one wants to do is a quick query to check on what is referencing a particular site, with maybe only a couple of hundred hits expected, or in the low thousands; but to be able to do quite general analyses on that (and share them with other people) -- eg what classes of items are showing references to this given website. I haven't used LDF, but it seems it would be turning what would be quite a simple query into quite a substantial scripting job. But it's a good suggestion.

Regarding "contains" I do take the point. Full text indexing is (I believe) technically possible using suffix trees, and I think Blazegraph even offer it as an option. As I understand it though, it could increase the storage requirement for text fields by a factor of anything up to 20. I don't know how much of the database is accounted for by text fields, but I could understand a reluctance not to go down that route.

On the other hand, I was a bit disappointed that the query wasn't helped by switching to "strstarts". As I understand it, Blazegraph does do basic indexing on all fields, which is essential for it to be able to rapidly find and retrieve a fully specified value. This is what makes it possible for a look-up for a particular url to be very very fast, eg this query

SELECT ?ref  WHERE {
  ?ref pr:P854 <http://artuk.org/discover/artists/velazquez-diego-15991660>
}

What I was hoping was that that indexing might survive (at least until any joins are done) when one identifies the unjoined values of pr:P 854 to a variable,

?ref pr:P854 ?url

If the indexing did survive, then (at least in principle) one might hope that it might then be accessible by the implementation of STRSTARTS, so that when presented with

STRSTARTS( ?url, 'http://artuk.org/discover/artists')

Blazegraph might be able to use such an index on ?url to rapidly identify the first match for the string; the first non-match after those matches; and then rapidly extract everything in between for its solution set.

Of course with string matching on URLs, there is a complication that STRSTARTS only works on strings, so that one also has to do an str() cast, and therefore to make it work the line above thus has to read

STRSTARTS( str(?url),  'http://artuk.org/discover/artists')

But if the URLs were already indexed using eg a basically alphabetical B-tree, it might be possible to do the cast to string and still preserve an indexing, so it might still be possible to execute the line above with indexing, and for it therefore to still be very fast.

So that's what I had in mind, wondering whether it might be possible for a degree of indexing to somehow be preserved through the query execution, so that the STRSTARTS match might sometimes be very fast.

Of course you're right that for URLs, the domain part of the URL eg figaro.fr or artuk.org would almost always be part of what one would want to filter for, so if it could be possible to filter rapidly on that, then the remaining solution set would very likely be small enough to use conventional string operators. So this could be a great functionality to have, if such a thing could be implemented easily, perhaps as an extra triple on URL objects.

But I can't help thinking that this kind of STRSTARTS request must be something that comes up often enough, in enough different production contexts, that it might be worth Blazegraph's time to see whether some kind of indexing-based acceleration like the above thought-outline could sometimes be possible.

Smalyshev triaged this task as Low priority.Mar 22 2017, 10:31 PM

But if the URLs were already indexed using eg a basically alphabetical B-tree, it might be possible to do the cast to string and still preserve an indexing, so it might still be possible to execute the line above with indexing, and for it therefore to still be very fast.

Probably not likely, since RDF triple members are not indexed directly as text strings, but instead they are converted to internal representation, which is used in the index. However, the internal representation works only for matching/comparison, not for substrings, etc. To do substring operation, the internal representation needs to be converted back to the original string via lookup. So it is not likely index will be very helpful here, unfortunately.

So you're saying, in effect, I should think of the strings being stored as a great big hash table rather than a B-tree, so there's nothing there that can help even STRSTARTS. And of course I know very little about the internals of BlazeGraph, whereas you've actually written modules for it. But I did think BlazeGraph uses B+ trees, which do specifically facilitate rapid in-order traversal and retrieval. But perhaps only to retrieve hashes, not strings?

Yes, it uses B-trees. But what is stored in these trees is not strings. It's not actually hashes, it's term values I think. Not sure about how it is encoded, but I don't see an easy way to do prefix matches there.

Lydia_Pintscher moved this task from incoming to monitoring on the Wikidata board.May 5 2017, 3:29 PM
Nikki added a subscriber: Nikki.Nov 18 2018, 1:40 PM

I had the same problem and solved it by using the API with the generator exturlusage.