
Allow WCQS large volume connection to MWAPI and Wikidata services
Open, Medium · Public

Description

At the moment the Wikimedia Commons Query Service (WCQS) allows SPARQL queries to look up data on Wikidata using "service <https://query.wikidata.org/sparql>" (see example) or in Commons categories using "service wikibase:mwapi" (see example). Unfortunately, these two services only allow a limited number of records to pass in each direction: the limit for service <https://query.wikidata.org/sparql> is in the thousands or tens of thousands, and the limit for service wikibase:mwapi is 10k (see here).
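For concreteness, here is a minimal sketch (in Python, driving the endpoints over the standard SPARQL protocol) of what such a federated query looks like. The WCQS endpoint URL, the omitted authentication, and the choice of wdt:P180 ("depicts") are illustrative assumptions, not details from this task.

```python
import requests

# Sketch of a WCQS query that federates out to Wikidata via SERVICE.
# Assumptions: the WCQS endpoint URL below, and wdt:P180 as an example
# SDC property; WCQS also requires authentication, omitted here.
QUERY = """
SELECT ?file ?item ?itemLabel WHERE {
  ?file wdt:P180 ?item .                         # SDC "depicts" statement
  SERVICE <https://query.wikidata.org/sparql> {  # federated lookup on Wikidata
    ?item rdfs:label ?itemLabel .
    FILTER(LANG(?itemLabel) = "en")
  }
}
LIMIT 10
"""

resp = requests.post(
    "https://commons-query.wikimedia.org/sparql",  # assumed WCQS endpoint
    data={"query": QUERY},
    headers={
        "Accept": "application/sparql-results+json",
        "User-Agent": "wcqs-federation-sketch/0.1 (example only)",
    },
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["file"]["value"], row["itemLabel"]["value"])
```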

The issue is that we now have 65M files on commons and we should expect similar number of SDC items. I would like to search all the wikidata items in SDC to detect redirects. That would require to send millions of item numbers to https://query.wikidata.org/sparql service. I would also like to find all the files in a category with millions of files that do not have some SDC property. Those 2 use-cases would require much larger limits on number of records sent to those 2 services.
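To make the redirect use case concrete: in Wikidata's RDF model a redirected item carries an owl:sameAs triple pointing at its target, so checking a batch of candidate items looks roughly like the sketch below (the QIDs are placeholders). The problem described above is that the VALUES block, or the equivalent bindings pushed through SERVICE, cannot grow to the millions of items needed.

```python
# Shape of a redirect check for a small batch of QIDs (placeholders).
# Only items that are redirects have an owl:sameAs triple, so only
# redirects produce result rows.
REDIRECT_CHECK = """
SELECT ?item ?target WHERE {
  VALUES ?item { wd:Q111 wd:Q222 wd:Q333 }  # candidates harvested from SDC
  ?item owl:sameAs ?target .
}
"""
```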

Event Timeline

Restricted Application added a subscriber: Aklapper.

While we’d love to support this use case, the limits on volume are there to ensure the stability of the service. Federation on large volumes (via SERVICE or otherwise) is problematic: it increases variability in response time, ties up resources on intermediate servers, and raises both the likelihood of failure and the cost of retries. In general, for a synchronous service like WDQS, we want strong limits on resource consumption.

In the case of working through a large number of items on WDQS (for example, the redirect-detection use case), you would need to find a way to batch those requests into smaller chunks. I don’t know the details of what is involved here, but it sounds like a case where offline processing via the dumps might be a better option (Wikidata / Commons). That would allow for more specialized processing, which would be more efficient than relying on a generic SPARQL endpoint.
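A sketch of the batching approach, assuming the goal is the redirect check shown earlier: split the QIDs into small VALUES chunks and query WDQS directly, instead of pushing millions of bindings through one federated call. Chunk size and pacing are illustrative.

```python
import time
import requests

WDQS = "https://query.wikidata.org/sparql"
HEADERS = {
    "Accept": "application/sparql-results+json",
    "User-Agent": "redirect-check-sketch/0.1 (example only)",
}

def find_redirects(qids, chunk_size=500):
    """Yield (item, target) IRIs for QIDs that are Wikidata redirects."""
    for i in range(0, len(qids), chunk_size):
        values = " ".join(f"wd:{q}" for q in qids[i:i + chunk_size])
        query = f"""
        SELECT ?item ?target WHERE {{
          VALUES ?item {{ {values} }}
          ?item owl:sameAs ?target .   # only redirected items carry this triple
        }}"""
        resp = requests.post(WDQS, data={"query": query}, headers=HEADERS)
        resp.raise_for_status()
        for row in resp.json()["results"]["bindings"]:
            yield row["item"]["value"], row["target"]["value"]
        time.sleep(1)  # pace the batches politely

# Usage: for src, dst in find_redirects(["Q111", "Q222"]): print(src, "->", dst)
```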

I'm keeping this open for now, so that we can continue this discussion as needed.

Those were just two use cases I ran into lately, but a great many of the queries we try to write hit this issue, since they seem to need something from "service https://query.wikidata.org/sparql", and that connection is a very narrow bottleneck. The Commons community is still mostly learning how to use WCQS, but if you look at Commons:SPARQL_query_service/queries/examples, almost all of the examples rely on federation through a service of some sort to connect to Wikidata or to Commons categories. Many of them would require major changes to work around these limits. At the moment the majority of my queries time out while I search for some combination that goes through.

I do a lot of maintenance, so I often search for anomalies or constraint violations of some sort, where you look through 65M files to find the few that have the issue you are looking for. None of those use cases run under the current limitations. I have never worked with data dumps, and using them for most of my queries sounds complicated.
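For what it's worth, the dump route can be fairly mechanical for the redirect case. A sketch, assuming the gzipped Wikidata N-Triples dump (the file name is an assumption): redirects appear in the RDF dumps as owl:sameAs triples, so streaming the file line by line is enough.

```python
import gzip

SAME_AS = "<http://www.w3.org/2002/07/owl#sameAs>"

def redirects_from_dump(path="latest-all.nt.gz"):  # assumed dump file name
    """Yield (source_qid, target_qid) for each redirect triple in the dump."""
    with gzip.open(path, "rt", encoding="utf-8") as dump:
        for line in dump:  # stream: the dump is far too large to load at once
            parts = line.split(" ", 3)
            if len(parts) >= 3 and parts[1] == SAME_AS:
                src = parts[0].strip("<>").rsplit("/", 1)[-1]
                dst = parts[2].strip("<>").rsplit("/", 1)[-1]
                yield src, dst
```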

Gehel triaged this task as Medium priority. · Sep 15 2020, 8:06 AM