The Wikidata property `P854` 'reference URL' is currently used on over 62 million references on Wikidata. This introduces challenges if one wants to find, e.g., all statements which are referenced to a particular domain.
(For example, the Library at the London School of Economics is considering changing its preferred form of URL for online theses, and wanted to find all statements referenced to a URL of the current form `etheses.lse.ac.uk`.)
62 million is far too many for a SPARQL query to merely retrieve all uses of the property and then filter for a particular string: such a query has no hope of completing in 60 seconds. An alternative strategy is therefore to use Cirrus search to identify relevant item pages containing the string, and then SPARQL to identify the relevant statements within them.
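The two-step strategy can be sketched as a single query that calls the search API through the `wikibase:mwapi` service and then inspects reference URLs only on the items returned. The Python below just builds the query text so it can be pasted into the query service. It is a minimal sketch, not the queries linked in this report: the helper names (`build_query`, `sparql_string`), the `@SEARCH@` placeholder, the use of the bare domain as the search string, and the restriction to references (ignoring statement and qualifier values) are all assumptions made for illustration.

```python
# Sketch only: builds a SPARQL query combining a CirrusSearch call (via MWAPI)
# with a filter on P854 reference URLs. Names and search string are assumptions.

QUERY_TEMPLATE = """\
SELECT ?item ?statement ?url WHERE {
  # Step 1: ask the search API for item pages matching the string.
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:endpoint "www.wikidata.org" ;
                    wikibase:api "Generator" ;
                    mwapi:generator "search" ;
                    mwapi:gsrsearch "@SEARCH@" ;
                    mwapi:gsrlimit "max" .
    ?item wikibase:apiOutputItem mwapi:title .
  }
  # Step 2: within those items only, look for matching reference URLs.
  ?item ?claim ?statement .
  ?statement prov:wasDerivedFrom ?reference .
  ?reference pr:P854 ?url .
  FILTER(CONTAINS(STR(?url), "@SEARCH@"))
}
"""


def sparql_string(text: str) -> str:
    """Escape backslashes and double quotes for use inside a SPARQL "..." literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_query(url_fragment: str) -> str:
    """Return the query text with the URL fragment substituted in both places."""
    return QUERY_TEMPLATE.replace("@SEARCH@", sparql_string(url_fragment))


if __name__ == "__main__":
    print(build_query("etheses.lse.ac.uk"))
```

Because step 2 only ever sees the items that step 1 returns, any page the search misses is silently absent from the final result.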
Trying this strategy, Cirrus search was found to retrieve 4121 pages (https://w.wiki/5RC3), in which query https://t.co/ovNL2bZov3 finds 40951 statements with such a URL as a direct value, as a qualifier value, or as a reference.
However, it was noticed that the approach was returning no URLs used as references unless they were also present as statement or qualifier values (https://w.wiki/5RB2). This is despite reference-only uses being widespread: for example, query https://w.wiki/5RqQ finds 11,500 further cases where LSE thesis URLs are used as references on items for LSE staff or graduates, without the URLs appearing as statement or qualifier values. None of the pages for these items were being returned by the Cirrus search.
This behaviour was unexpected, and it makes it difficult to reliably find URLs from a particular domain being used as references.
Notes.
1. When called from SPARQL, the mwapi search call is limited to returning a maximum of 10,000 results ([[https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual/MWAPI | MWAPI manual ]]). However, as the call is only actually returning 4121 pages, we should still be well within this limit.
2. There do seem to be at least some pages where Cirrus //can// find a URL used only in a Wikidata reference: for example, this query successfully retrieves a reference to a UNESCO URL https://t.co/Jfwg88oYId . However, seemingly none of the pages using an LSE URL only as a reference is found.