Page MenuHomePhabricator

Wikidata reference URIs have become too many to search with WDQS SPARQL
Closed, DeclinedPublic


The Wikidata property P854 (reference URL) is currently being used 2,305,909 times.

Even just to produce this count takes WDQS over ten seconds.

Attempts to search for URLs containing or beginning with a particular string all fall over having timed out -- so it is impossible to do a general search for uses of a particular source.

Would it make sense to enable something like string indexing on URL-valued items, to try to make this possible again?

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Some queries:

Count of the number of uses of the property -- takes over 10 seconds

SELECT (COUNT(?ref) AS ?count) WHERE {
  ?ref pr:P854 ?refURL .

Attempt to extract some references to Le Figaro website -- fails -- though it does seem to try to be creating a JSON list of something

  ?ref pr:P854 ?refURL .
  FILTER (CONTAINS(str(?refURL),'')) .       
} LIMIT 200

This search for Le Figaro used to work -- back when the WDQS SPARQL service was first enabled.

Attempt to use STRSTARTS to possibly save time in the string searching -- fails -- error console appears to contain a similar broken JSON list as before

  ?ref pr:P854 ?refURL .
  FILTER (STRSTARTS(str(?refURL),'')) .       
} LIMIT 200

Attempt just to count the number of such instances -- still times out

SELECT (COUNT(?ref) AS ?count) WHERE {
  ?ref pr:P854 ?refURL .
  FILTER (STRSTARTS(str(?refURL),'')) .       

Attempt to find a meaningless string, to simulate a website used very few times -- still fails to complete

  ?ref pr:P854 ?refURL .
  FILTER (STRSTARTS(str(?refURL),'http://www.pqrstuvwxyz')) .       
} LIMIT 200

I don't know if something has changed since then in BlazeGraph or the WDQS set-up, or whether I am seeing this just because the number of uses of P:854 may have increased since that time; but not being able to extract at scale the items that reference particular sources makes the property and its data very much less useful.

For completeness, the P:854 can also be used as a qualifier, principally on property P1343 "described by source", though the number of uses is much much smaller.

This usage can be searched, eg this query works:

SELECT ?item ?itemLabel ?refURL WHERE {
  ?item ?prop ?stmt .
  ?stmt pq:P854 ?refURL .
  FILTER (CONTAINS(str(?refURL),'//')) . 
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
} LIMIT 200

However, that is presumably because there are only 8615 such uses in all. (live count).

For comparison, in September 2015 there were 376,524 references using P:854, and a query to search them could complete in 3.7 seconds -- it was used as an example on the query optimisation page. Such a query now times out.

It should be possible for such a search to be very efficient with proper indexing on the values of P:854. But is such indexing being sidelined by the cast to string from URL type? Would it make sense to hold URLs in both URL and string form, to make indexing more possible?

I'm not sure what you mean by "proper indexing". If you ask to go through 2.3 millions of items, it's probably going to take time, and applying operations to it is probably not covered by index. I'm not sure maybe something can be done for prefix matches (just guessing), but for operations like "contains" I don't see any way around fetching each item and applying the operation to it.

One alternative for such queries can be using LDF endpoint:
and processing the matches client-side. LDF endpoint is fast, but requires the client to load a lot of data and process it.

Hi Smalyshev, thanks for taking the time to get back to me.

The LDF suggestion is a good one -- I was thinking of looking in to it for investigating Commons sitelinks and P373s, both of which are now getting close to the borderline of what a query can cope with without timing out, with little time left for filtering. For these it might be quite useful to have all million downloaded locally. But it seems like overkill, if all one wants to do is a quick query to check on what is referencing a particular site, with maybe only a couple of hundred hits expected, or in the low thousands; but to be able to do quite general analyses on that (and share them with other people) -- eg what classes of items are showing references to this given website. I haven't used LDF, but it seems it would be turning what would be quite a simple query into quite a substantial scripting job. But it's a good suggestion.

Regarding "contains" I do take the point. Full text indexing is (I believe) technically possible using suffix trees, and I think Blazegraph even offer it as an option. As I understand it though, it could increase the storage requirement for text fields by a factor of anything up to 20. I don't know how much of the database is accounted for by text fields, but I could understand a reluctance not to go down that route.

On the other hand, I was a bit disappointed that the query wasn't helped by switching to "strstarts". As I understand it, Blazegraph does do basic indexing on all fields, which is essential for it to be able to rapidly find and retrieve a fully specified value. This is what makes it possible for a look-up for a particular url to be very very fast, eg this query

  ?ref pr:P854 <>

What I was hoping was that that indexing might survive (at least until any joins are done) when one identifies the unjoined values of pr:P 854 to a variable,

?ref pr:P854 ?url

If the indexing did survive, then (at least in principle) one might hope that it might then be accessible by the implementation of STRSTARTS, so that when presented with

STRSTARTS( ?url, '')

Blazegraph might be able to use such an index on ?url to rapidly identify the first match for the string; the first non-match after those matches; and then rapidly extract everything in between for its solution set.

Of course with string matching on URLs, there is a complication that STRSTARTS only works on strings, so that one also has to do an str() cast, and therefore to make it work the line above thus has to read

STRSTARTS( str(?url),  '')

But if the URLs were already indexed using eg a basically alphabetical B-tree, it might be possible to do the cast to string and still preserve an indexing, so it might still be possible to execute the line above with indexing, and for it therefore to still be very fast.

So that's what I had in mind, wondering whether it might be possible for a degree of indexing to somehow be preserved through the query execution, so that the STRSTARTS match might sometimes be very fast.

Of course you're right that for URLs, the domain part of the URL eg or would almost always be part of what one would want to filter for, so if it could be possible to filter rapidly on that, then the remaining solution set would very likely be small enough to use conventional string operators. So this could be a great functionality to have, if such a thing could be implemented easily, perhaps as an extra triple on URL objects.

But I can't help thinking that this kind of STRSTARTS request must be something that comes up often enough, in enough different production contexts, that it might be worth Blazegraph's time to see whether some kind of indexing-based acceleration like the above thought-outline could sometimes be possible.

But if the URLs were already indexed using eg a basically alphabetical B-tree, it might be possible to do the cast to string and still preserve an indexing, so it might still be possible to execute the line above with indexing, and for it therefore to still be very fast.

Probably not likely, since RDF triple members are not indexed directly as text strings, but instead they are converted to internal representation, which is used in the index. However, the internal representation works only for matching/comparison, not for substrings, etc. To do substring operation, the internal representation needs to be converted back to the original string via lookup. So it is not likely index will be very helpful here, unfortunately.

So you're saying, in effect, I should think of the strings being stored as a great big hash table rather than a B-tree, so there's nothing there that can help even STRSTARTS. And of course I know very little about the internals of BlazeGraph, whereas you've actually written modules for it. But I did think BlazeGraph uses B+ trees, which do specifically facilitate rapid in-order traversal and retrieval. But perhaps only to retrieve hashes, not strings?

Yes, it uses B-trees. But what is stored in these trees is not strings. It's not actually hashes, it's term values I think. Not sure about how it is encoded, but I don't see an easy way to do prefix matches there.

I had the same problem and solved it by using the API with the generator exturlusage.

I suggest closing this since Blazegraph is tech-debt by now and Michael provided a solution.

Gehel claimed this task.
Gehel added a subscriber: Gehel.

closed as suggested @So9q

I'm reopening this because the exturlusage generator is only an option for some cases, it's not a general solution.

As stated in the documentation, the MediaWiki API search in SPARQL is limited to 10,000 results so it's not possible to use it for anything where your query would match more than that.

The external links search is also rather limited, it requires the protocol (it uses "http" if you don't include it) and only accepts wildcards at the beginning of the hostname. You can't, for example, find all URLs starting with mailto:, explicitly containing port 80 or all archived pages from

Substring matching is really not something that should be done in Blazegraph. Alternative solutions have been proposed, let's close this ticket.

Substring matching is really not something that should be done in Blazegraph. Alternative solutions have been proposed, let's close this ticket.

Could you elaborate on what alternative solutions you're referring to?