
Figure out the future of Wikimedia Commons Query Service (WCQS)
Stalled, Needs Triage, Public

Description

Wikimedia Commons Query Service (WCQS) is a SPARQL endpoint serving Structured Data on Commons. It currently sees a very low level of traffic (<1 query/min, with a few spikes). We suspect that authentication is one of the major reasons for the lack of adoption (see T297995).

We need to figure out what the future of this service is. Operating a service that is mostly unused is not a good use of our resources. We should either fix it, or decommission it.

A few random notes:

  • Removing authentication without putting in place either a robust backend or a robust way to rate limit is likely to create an unmaintainable service
  • Data growth is ~1.5 billion triples/year. This is likely to become a problem in the next 5 years on the current infrastructure.
  • It is unclear what the use cases are for Structured Data on Commons. WCQS seems to be the main way to access that data (outside of dumps and browsing Commons). Different data consumption endpoints might be more useful than a SPARQL endpoint.
  • The experience on WDQS shows that any attempt to scale the service (split, backend replacement) is prone to breaking changes which might cause a lot of frustration and additional work from data-reusers (query rewrite)
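The "robust way to rate limit" mentioned in the first bullet is a well-understood building block. As a purely illustrative sketch (hypothetical code, not anything the Search Platform team runs), a per-client token bucket looks like:

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter (hypothetical sketch, not WMF code)."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        """Return True if a request may pass, consuming one token."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In a real deployment one bucket would be kept per client (per IP or per OAuth identity), which is exactly where the tension between "no authentication" and "identifiable clients to rate-limit" comes from.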

Please add thoughts and use cases below.

Event Timeline

Just remove the damn authentication. T297995 has been open for three years. You have been running a quite stable, but also quite useless, service for the past three years. It's extremely demotivating that I and other volunteers spend time importing and adding structured data, but I can't properly query it. It's the primary reason why I haven't really added a lot of new things over the last few years.

So please be brave and take a decision before the end of the year. Either:

  1. Remove the stupid authentication
  2. Shutdown the useless service

Of course I would prefer option 1.

The more you delay this, the more the community will lose faith in the search team. My expectation is more procrastination; I would like to be wrong.

WCQS seems to be the main way to access that data

There are multiple other ways to access structured data:

Note that other wikis are currently not able to use such data (T238798: Allow access to MediaInfo from wikis other than Commons)

Note that the Search Platform team is only the technical maintainer of this service. We're not in a position to make any decision on how it should be used or how it should evolve. What we can say is that it seems to be a very risky proposition to remove authentication without having a medium-term plan for how to support this service. To make any progress, there needs to be a longer-term strategy.

The community already stated three years ago that an endpoint with authentication would see very little traction because it just won't be used for building tools (because of the authentication hurdle). Running it for three years with the little traffic it has received obviously proves exactly that point.

I agree with Maarten here that either the authentication should be removed or the thing should be taken down. CirrusSearch is nice for external access to the data, but very limited compared to what you can do with SPARQL. Also, the amount of WDQS use shows that there *is* a lot of need for a service like this. But yeah, if the search team has little interest in actually supporting WCQS or figuring out how to make it run without authentication (I would think three years is more than enough to create a long-term strategy), then I'm not really sure how to proceed in creating a path forward for this service.

@Gehel you created this task two months ago. It's clear what the community wants. What are you going to do?

I hadn't commented here before today because it is very demotivating in situations such as this, when you're being asked by the WMF to give the same feedback in 5 different places across the last 3 years and to multiple different WMF teams—even when it has already been made fairly clear that the outcome clearly preferred by the community is not even on the table anyway. I agree with what essentially everyone else has ever said to the WMF on this matter. WCQS is only really useful insofar as it is reliable and accessible by bots over the endpoint. I am someone who has uploaded over 5 million image files to Commons and maintains tens of millions of SDC statements across all of them, and would have liked to have used the query service for very real applications in analyzing and synchronizing the data as part of this workflow. Instead, I long ago had to figure out ways to totally work around it, and now no longer think about using WCQS much. My other main use case is making reports for analytics purposes, but I literally cannot link anyone who is not already a logged-in Wikimedian to results from this service, or they will be blocked from accessing the data. I can't make a third-party dashboard, since I am also prevented from, e.g., querying the data with some JS on another site.

As it stands, this is sadly another case where the WMF seems to have developed a new product 95% of the way, publicly launched it unfinished, and then stopped active development once it received feedback. And was just today told, literal years after it was released out of beta and the SLO was published, that this is still regarded as a beta product with no SLO. Yes, as Multichill mentioned in another ticket, there aren't really any users, but that's because it's not really usable as is. We were made to believe that authentication was necessary to "prevent query abusers from inundating query traffic to the point where other users are unable to access or experience long wait times". In reality, there are now no users, much less abusers, and still the service falls over on its own—and no one even notices for days when it does so. So, this is a case where the solution is worse than the problem.

All of this is beside the point that there never seems to have been any real interest in practical solutions. The "authentication" issue is a misnomer anyway. As we are told by the WMF's documentation:

There is currently no easy way to access the cookie programmatically, and it must be done manually from a web browser after visiting https://commons-query.wikimedia.org.

The wcqsOauth cookie is provided in a Set-Cookie header by WCQS after authenticating the user against mediawiki. Unfortunately, there are no API's that expose this value.

...which is the ultimate problem, in my mind. Not the authentication itself, but that the system both requires authentication and gives the user no way to actually obtain the authentication token. It feels very strongly like no one really cared about or wanted the service to be used from the beginning. The issue is not the authentication alone, but that the service is using authentication to actively block users. So it is hard to take the question seriously right now, because if there were any real interest in the tool being used, with authentication, as it is apparently currently intended, then a way to realistically authenticate would have been developed by now. The situation now is somewhat like blocking IP editing on Wikipedia before you have launched Special:UserLogin, and then seeking comment from the community about the future of editing and why edits are down. Let me just say, I would still very much like a functional version of WCQS. But I can't continue spending my time advocating for it when we are not being engaged with in good faith, but are basically just all having our time wasted.
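For what it's worth, the manual workaround the documentation describes boils down to pasting the browser cookie into your client. A minimal sketch: the cookie name (wcqsOauth) and base URL are from the docs quoted above, while the /sparql endpoint path and parameter names are assumptions to verify:

```python
import urllib.parse
import urllib.request

# The wcqsOauth cookie value must be copied manually from a browser
# session on https://commons-query.wikimedia.org (per the docs quoted above).
WCQS_SPARQL = "https://commons-query.wikimedia.org/sparql"  # endpoint path assumed

def build_request(query, cookie_value):
    """Build an authenticated SPARQL request carrying the manually obtained cookie."""
    data = urllib.parse.urlencode({"query": query}).encode()
    req = urllib.request.Request(WCQS_SPARQL, data=data)
    req.add_header("Cookie", f"wcqsOauth={cookie_value}")
    req.add_header("Accept", "application/sparql-results+json")
    return req

# Usage sketch (network call not shown):
# req = build_request("SELECT * WHERE { ?s ?p ?o } LIMIT 1", "<paste-cookie-here>")
# with urllib.request.urlopen(req) as resp:
#     print(resp.read())
```

The point of the sketch is how awkward the flow is: the one value the code cannot fetch for itself is the cookie, which is exactly the gap the quoted documentation admits.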

At the time of launch it was decided during community consultation that the service would remain in beta until the authentication would be removed. It seems like that consultation was forgotten and here we go again.

Our expectations are low, remove the authentication and if it fails catastrophically you at least tried.

I can't speak for the project itself, but I am following the Biodiversity Heritage Library's efforts to add reliable structured data about depicted species to illustrations on Wikimedia Commons. The Smithsonian has recently invested in @TiagoLubiana as a Wikimedian in Residence, and @Ambrosia10 is also putting a lot of effort into it in her own time.

As per this paper written about the purpose of the effort, this work will be massively useful for the international scientific community that researches biodiversity changes over time and how they are affected by human intervention, climate change, and the like.

But it will only be usable and impactful at scale to the scientific community if we provide the infrastructure to access the data. A SPARQL endpoint is a standardized and very usable way to do so. And it should be without authorization so that scientists and partners can build data pipelines, import and update mechanisms, and other integrations and tools.

Yes, I second what @Spinster shared here: the SPARQL endpoint is extremely important for the BHL-Wiki collaboration.

And I also second @Abbe98 – removing the need for authentication is something that should be done at the very least to inform decisions on this ticket.

If the future maintenance is something that worries the team, it could even be framed as an experiment for, say, 6 months or 1 year.

Data growth is ~1.5 Billion triples / year. This is likely to become a problem in the next 5 years on the current infrastructure.

@Gehel just out of curiosity: what is the current infrastructure? (i.e., what kind of hardware is the service running on, and how does it compare to WDQS?)

Also note that, as is known, Blazegraph is unmaintained (T206560), so the current software stack will by definition have to be replaced within 5 years in any case.

@Gehel just out of curiosity: what is the current infrastructure? (i.e., what kind of hardware is the service running on, and how does it compare to WDQS?)

We are currently using the same server specs as WDQS (which simplifies procurement and operations). Those are Intel Xeon Silver 4215 CPU @ 2.50GHz (16 cores / 32 threads), 128 GB of RAM, 4×2 TB SSD. The CPU is underused at the moment (see grafana), but the current load does not reflect the potential for the service.

It is unclear what the use cases are for Structured Data on Commons. WCQS seems to be the main way to access that data (outside of dumps and browsing Commons). Different data consumption endpoints might be more useful than a SPARQL endpoint.

My main use cases are:

  • downloading the latest data for local use (i.e., fetching all statements from SDC for images with certain properties or values, usually from Finna.fi)
  • tracking which photos already have certain properties
  • in fiwiki, using WCQS for finding images for articles using depicts statements (particularly in Women in Red competition and pages without photos). This is done using federated queries, which can be quite complex.

For access methods, we are using the web UI and Pywikibot's SPARQL functions, which handle some of the authentication complexity automatically.
In the future, I would also like to use the WCQS endpoint for downloading specific data to external tools such as Jena Fuseki. For example, I would use SPARQL to first fetch the latest data from Wikidata, OSM, and WCQS into a local Jena Fuseki instance, and then perform the combining SPARQL queries locally for performance reasons.

I would also like to emphasize that the ability to query the data freely is the key element for SDC use. Changing WCQS to a REST API with a limited interface, for example, would not be a usable solution, and I think it would limit crowdsourcing/enriching tasks in general, not just in our case.
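A federated query of the kind described above (finding images via depicts statements, combined with Wikidata) might look roughly like this. P180 (depicts), P31/Q5 (instance of: human), and P21/Q6581072 (sex or gender: female) are real Wikidata identifiers, but the exact WCQS data model and prefix setup are assumptions to check against the WCQS examples:

```python
# Sketch of a federated query run against WCQS, delegating part of the
# pattern to WDQS via SERVICE. Assumes WCQS pre-declares the usual
# wdt:/wd: prefixes, as WDQS does.
FEDERATED_QUERY = """
SELECT ?file ?item WHERE {
  ?file wdt:P180 ?item .                      # depicts statement stored in SDC
  SERVICE <https://query.wikidata.org/sparql> {
    ?item wdt:P31 wd:Q5 ;                     # instance of: human
          wdt:P21 wd:Q6581072 .               # sex or gender: female
  }
}
LIMIT 100
"""

def has_federation(query):
    """True if the query delegates part of its pattern to another endpoint."""
    return "SERVICE <" in query
```

This is also why the authentication requirement hurts twice: an endpoint that other SPARQL services cannot call anonymously cannot participate as a SERVICE target in federation going the other direction.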

The experience on WDQS shows that any attempt to scale the service (split, backend replacement) is prone to breaking changes which might cause a lot of frustration and additional work from data-reusers (query rewrite)

My vote is that if I need to choose between query rewrites and authentication, I would choose the query rewrites. One of the key elements is being able to do federated queries between SPARQL endpoints, and the current authentication requirement prevents that.

Thanks Gehel for making this an explicit issue.

I agree strongly with @Multichill, @Husky, abbe, dominic, and tiago.

We've worked for years to get structured data in; the community (including the extremely prolific people commenting in this thread) has long wanted auth-free ways to query it. @Dominicbm is right in saying we spent most of the effort getting things 95% done, and at that point it stopped being a priority, and it feels like no one really cared to have the service used. Let's not leave that as the history of the initiative!

Better to remove auth to see how it gets used than to leave it closed waiting for an nth consultation on potential use cases.
The service is in beta; it will still be after such an auth change; everyone understands that queries may have to be rewritten when the backend changes (but this can also be flagged to users).

"We should either fix it, or decommission it."

WCQS is in beta; make clear that the authless service is still in beta, and/or that any queries may have to be rewritten after a future backend migration.

Part of the reason this feels silly to me is that this dataset should be significantly easier to serve than Wikidata.

  • The number of users interested in it is almost certainly going to be smaller. Possibly multiple orders of magnitude lower. This is a specialized dataset vs Wikidata, which is very general. The audience is inherently limited.
  • The query pattern is likely to be much easier to serve. I would expect the average WCQS query to have significantly fewer hops than the average Wikidata query. The size of intermediate results should, on average, be much smaller based on the type of data being stored here. The schema is much more star-like than Wikidata's tangled web.
  • The entire dataset is currently an order of magnitude smaller. It is growing rapidly now, but the expectation should be that the rate of growth will slow down once the backlog of images has had its metadata added.

This should be trivial compared to the challenge that is Wikidata.

As the proverb goes: The safest place for a boat is in the harbour, but that is not what boats are for.

This is so sad! But I agree with the "We should either fix it, or decommission it." It was never going to work with authentication, because you couldn't use it in pages meant for anyone unfamiliar with logging in to it. After three whole years of waiting, I agree with "Our expectations are low, remove the authentication and if it fails catastrophically you at least tried." Assuming it will never be unlocked because of these scalability fears that keep being reported, perhaps we should start talking about how to migrate the data to the "front page" file template to enable multi-language search?

Just as a note: the Intel Xeon Silver 4215 CPU's End of Servicing Updates date is June 30, 2025 (source). It doesn't mean that it would stop working, but my guess is that it will mark the start of its deprecation from the infrastructure.

Assuming it will never be unlocked because of these scalability fears that keep being reported, perhaps we should start talking about how to migrate the data to the "front page" file template to enable multi-language search?

I think the alternative for me would be to start looking into how the community can run the service for themselves, i.e. how to set up a local SPARQL endpoint with a real-time synced copy of the Wikimedia Commons SDC data. Somebody could try to run a big public service, but those who need higher performance could run their own personal endpoints.

From a technical point of view, the question is how to replicate WCQS data to local endpoints. It is not that complex to set up a SPARQL endpoint which updates its data periodically from https://dumps.wikimedia.org/commonswiki/entities/ . The tricky part would be implementing the real-time updates.
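As an illustration of the periodic-update half of this, a sketch that picks the newest dated dump directory from that index (the dated-directory layout and file naming are assumptions to verify against the live dumps page):

```python
import re

DUMPS_INDEX = "https://dumps.wikimedia.org/commonswiki/entities/"

def latest_dump_dir(listing_html):
    """Pick the newest dated subdirectory (YYYYMMDD/) from the index HTML.

    The dated-directory layout is an assumption; verify against the live index.
    """
    dates = re.findall(r'href="(\d{8})/"', listing_html)
    return max(dates) if dates else None

# Sketch of the periodic refresh loop (network parts not run here):
# 1. fetch DUMPS_INDEX and call latest_dump_dir() on the HTML
# 2. download the mediainfo RDF dump from that directory (filename assumed)
# 3. load it into the local endpoint, e.g. with Jena Fuseki's tdbloader
```

Reloading a full dump on a schedule is the easy part, which is why the real-time update stream is where the design effort would go.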

From a technical point of view, the question is how to replicate WCQS data to local endpoints. It is not that complex to set up a SPARQL endpoint which updates its data periodically from https://dumps.wikimedia.org/commonswiki/entities/ . The tricky part would be implementing the real-time updates.

Note that T294133 tracks exposing the RDF update stream from Wikidata. The stream is available. There is some work to be done to implement code to consume that stream (T374939 / T374944).
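On the consumer side, such a stream would presumably be exposed as Server-Sent Events like other Wikimedia streams; a minimal parsing sketch (the stream name and payload format are assumptions pending the linked tasks):

```python
import json

def parse_sse_events(lines):
    """Parse Server-Sent-Events 'data:' lines into JSON payloads.

    A sketch of the consumer side of an RDF update stream; the actual
    stream name and event format are assumptions pending T374939/T374944.
    """
    events = []
    for line in lines:
        if line.startswith("data: "):
            events.append(json.loads(line[len("data: "):]))
    return events

# Usage sketch (network part not run here; stream name is a placeholder):
# resp = requests.get("https://stream.wikimedia.org/v2/stream/<rdf-stream>",
#                     stream=True)
# for event in parse_sse_events(resp.iter_lines(decode_unicode=True)):
#     apply_patch_to_local_store(event)  # hypothetical helper
```

The hard part is not the parsing but applying the patches transactionally so a local endpoint stays consistent with Commons, which is what the open tasks track.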

I had not commented here yet because I have commented elsewhere (some wikipage, I think) before, and others have made my points better than I would (in particular, @Dominicbm has put very well how tiring this periodic asking that goes nowhere is).

Something that has not been mentioned here (I think) is that QLever supports the Wikimedia Commons dataset (although it's 2 months old at the time of writing). Speaking for my tool: at this point, I had pretty much lost all hope for T294893: integraality for Structured Data on Commons?, but T385749: Support qlever endpoint for integraality might be the path forward.

Gehel added a subscriber: ATsay-WMF.

Let's see if the soon to be created Wikidata Platform team (@ATsay-WMF) has a way to move this forward.

Multichill changed the task status from Open to Stalled.Nov 2 2025, 4:00 PM

One can't complain about a low level of queries on WCQS on the one hand while on the other hand blocking federated queries from WDQS.