
Cirrus search appears not to be indexing reference URLs from wikidata
Open, Medium, Public, Feature Request

Description

The wikidata property P854 'reference URL' is currently used in over 62 million references on wikidata. This makes it challenging to find, e.g., all statements referenced to a particular domain.

(For example, the Library at the London School of Economics is considering changing its preferred form of URL for online theses, and wanted to find all statements referenced to a URL of the current form etheses.lse.ac.uk.)

62 million uses is far too many for a SPARQL query to simply retrieve every use of the property and then filter for a particular string: such a query has no hope of completing within the 60-second timeout. An alternative strategy is therefore to use Cirrus search to identify the relevant item pages containing the string, and then SPARQL to identify the relevant statements within them.
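The two-step strategy above can be sketched as follows. This is a minimal illustration, assuming the public wikidata.org API and query service endpoints; no request is actually sent, and the item IDs in the SPARQL string are placeholders standing in for the results of step 1:

```python
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"

def search_params(text: str, offset: int = 0, limit: int = 50) -> dict:
    """Parameters for a Cirrus full-text search via the MediaWiki API."""
    return {
        "action": "query",
        "list": "search",
        "srsearch": text,      # e.g. "etheses.lse.ac.uk"
        "srlimit": limit,
        "sroffset": offset,
        "format": "json",
    }

# Step 1: Cirrus search narrows ~100M items down to candidate pages.
search_url = API + "?" + urlencode(search_params("etheses.lse.ac.uk"))

# Step 2: SPARQL (run against query.wikidata.org) filters the candidates
# for statement, qualifier, or reference values matching the string.
# wd:Q1, wd:Q2 stand in for the item IDs returned by step 1.
SPARQL = """
SELECT ?item ?url WHERE {
  VALUES ?item { wd:Q1 wd:Q2 }
  ?item ?p ?url .
  FILTER(CONTAINS(STR(?url), "etheses.lse.ac.uk"))
}
"""
```

In the actual queries linked below, the two steps are combined by calling Cirrus from within SPARQL via the mwapi SERVICE rather than as a separate HTTP request.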

Trying this strategy, Cirrus search was found to retrieve 4121 pages (https://w.wiki/5RC3), in which the query https://t.co/ovNL2bZov3 found 40951 statements with such a URL either as a direct value, as a qualifier value, or as a reference.

However, it was noticed that the approach was returning no reference URLs used only as references, i.e. not also present as statement or qualifier values (https://w.wiki/5RB2). This is despite such uses being widespread -- for example, query https://w.wiki/5RqQ finds 11,500 further cases where LSE thesis URLs are referenced as references from items for LSE staff or graduates, without the URLs appearing as statement or qualifier values. None of the pages for these items were returned by the Cirrus search.

This was unexpected behaviour, and makes it difficult to reliably find URLs from a particular domain being used as references.

Notes.

  1. When called from SPARQL, the mwapi search call is limited to returning a maximum of 10,000 results (see the MWAPI manual). However, as the call is only actually returning 4121 pages, we should still be well within this limit.
  2. There do seem to be at least some pages where Cirrus can find a URL used only in a wikidata reference -- for example, this query successfully retrieves a reference to a Unesco URL: https://t.co/Jfwg88oYId . But seemingly none of the pages using an LSE URL only as a reference is found.
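For context on the limit in note 1: when calling the search API directly (outside SPARQL's mwapi SERVICE), results beyond one batch are fetched via `sroffset` continuation. A minimal helper for that bookkeeping, assuming the standard `list=search` response shape:

```python
def next_offset(response: dict):
    """Return the sroffset for the next batch of list=search results,
    or None when the API reports no continuation."""
    cont = response.get("continue")
    return cont.get("sroffset") if cont else None

# Typical shapes of the relevant fragment of an API response:
# a continuation is signalled by a top-level "continue" member.
assert next_offset({"continue": {"sroffset": 500}}) == 500
assert next_offset({"query": {"search": []}}) is None
```

The direct API route is one way to page through result sets that would otherwise bump against per-call caps, though at 4121 hits the query here is well inside them either way.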

Event Timeline

MPhamWMF changed the subtype of this task from "Bug Report" to "Feature Request". Aug 1 2022, 8:34 PM
MPhamWMF subscribed.


Part of the expectation of an RDF-based system is that it should be easy to retrieve URLs of a particular form.

It's not reasonable to expect the community to index by hand things that lend themselves directly to machine indexing -- such as fragments of URLs, in particular their domain parts.

There's simply no way that trying to index the domains of these URLs in a community-driven way would be as accurate or as comprehensive as automatic indexing -- and it would be pure makework. This is what computers are for.

There is already a mountain of work to do and to fix on wikidata that genuinely requires human judgment; it is hard to imagine the community wasting its time and diverting resources to indexing URLs or extracting domain parts, when this is so straightforward to do, and done so much better, automatically.

Blazegraph (like most triplestores) comes with the option to turn on full-text indexing for URLs. But this was not enabled because, we were told, full-text indexing would be done much more efficiently by Cirrus, which was already activated. It now turns out that this was only true in certain areas.

It's not unreasonable to want to be able to ask how material from a particular source is being used -- and SPARQL should be a perfect tool for such analyses. Being able to retrieve references with URLs of a particular type by full-text indexing would be best. But if that is not going to be possible, can I suggest at least adding extra triples to the triplestore for the domain part of URLs? That would make it possible to pull out the URLs from a particular domain quickly (and to count them almost instantaneously), and would at least open the door to making more complicated requirements possible.
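A sketch of what the suggested domain triples could look like. The `wikibase:referenceDomain` predicate name below is invented purely for illustration -- no such predicate exists in the current RDF dump format -- and the extraction uses Python's standard `urlsplit`:

```python
from urllib.parse import urlsplit

def url_domain(url: str) -> str:
    """Extract the domain (host) part of a URL, lowercased."""
    return urlsplit(url).netloc.lower()

# What the extra triple might look like in Turtle, alongside the
# existing P854 reference triple (predicate name is hypothetical):
#
#   ref:abc123 pr:P854 <http://etheses.lse.ac.uk/123/> ;
#              wikibase:referenceDomain "etheses.lse.ac.uk" .

print(url_domain("http://etheses.lse.ac.uk/123/"))  # etheses.lse.ac.uk
```

With such triples materialised, matching a domain becomes an exact-string lookup rather than a CONTAINS filter over 62 million URL values.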

talk to lydia. part of wd model is not entirely index: values or properties and properties. can't ask for all wiki articles with references from nyt. structured data needed

Oops, please ignore this comment. I was taking notes and this accidentally got saved as a comment, and I can't seem to delete it.