
Remove authentication from Wikimedia Commons Query Services (WCQS)
Open, High, Public

Description

It was fine for the WCQS beta to have authentication, but the production SPARQL endpoint shouldn't be limited by authentication. Such a shift of policy, with such implications, is not something a WMF team should decide unilaterally. This is something that should go all the way up to the WMF board to decide. So please disable it.

At https://commons.wikimedia.org/wiki/Commons_talk:SPARQL_query_service/Upcoming_General_Availability_release#Mandatory_authentication_considered_harmful Andrew lists why this shouldn't be done.

Regarding this part of the announcement:

    "The biggest change to user behavior will be the requirement for user authentication to use all endpoints." 

To my (@Fuzheado) recollection, this is the only instance of needing to be authenticated to experience the main corpus of Wikimedia content. So this is a major policy shift. There are a number of concerns:

    Endangered species? Are publicly viewable knowledge graphs like this at risk with WCQS locked up behind an authentication system?

    Restricted reading. We've come so far in finally spending the time to model and add millions of statements to Commons (huzzah!) and we finally have a usable query service for it (huzzah!) and then for the last mile, we're restricting access to it by instituting authentication? For a community that has "open by default" as an ethos, it feels like such an "own goal" misstep here.

    Tools implications. I'm thinking about the number of tools, scripts, and utilities that utilize SPARQL queries via Wikidata/WDQS that have given us tremendous capabilities... and the same approach or set of activities cannot be realized for WCQS because of this constraint. We cannot underestimate the headache of having to implement OAuth2 for each and every SPARQL query. I'm also puzzled how a service that has not even launched yet has to be this closed when none of our other APIs and services have started this way. Other tool creators have shared this common concern at this Phabricator thread T290300.

    Public perception. In terms of public outreach, especially for our GLAM-Wiki work, this is hard to swallow and reconcile with what we are evangelizing. As we are asking cultural and heritage partners to open up their collections and to share their metadata, we are doing so with the expectation of showcasing the benefits of open knowledge to the world. Or we thought we were. With this WCQS policy, every mention of "open content" and "open access" will require an asterisk. This will introduce an asymmetry in contributing content and experiencing its benefits.

    Alternative solutions. I am sympathetic to the complex support issues when any service is made available for public access, whether it's the MediaWiki API or a SPARQL endpoint. However, our "open by default" ethos is a core tenet for the movement and for equitable access to knowledge. Like-minded entities like Openverse have found ways to have different tiers of access, while not requiring API keys. We should bend over backward to find "least restrictive" solutions such as throttling or limiting call frequency before we completely block access with mandatory authentication.

Thanks. - Fuzheado

Notes from round table discussion: https://docs.google.com/document/d/13BFQqjfAbzek8pmpLJenyQyxqM1VQiij1lYNjcOCyX0/edit#heading=h.gkr3sreu7vcd

Event Timeline

Multichill renamed this task from Remove authentication from Wikimedia Commons Query Services (WMQS) to Remove authentication from Wikimedia Commons Query Services (WCQS). Dec 18 2021, 5:06 PM

We may introduce two levels of service (authenticated and anonymous) with separate resources and different timeouts (e.g. 15s/60s or 60s/300s).
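
The two-tier idea could look roughly like the sketch below. It is only an illustration: the `ServiceTier` type, the `pick_tier` helper and the cluster names are hypothetical, and it assumes the shorter budget in the 60s/300s example pair is the anonymous one. Nothing here reflects a decided configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceTier:
    name: str             # hypothetical tier label
    cluster: str          # hypothetical name of the backend pool serving this tier
    timeout_seconds: int  # per-query time budget


# Assumption: anonymous users get the shorter budget of the 60s/300s example pair.
ANONYMOUS = ServiceTier("anonymous", "wcqs-public", 60)
AUTHENTICATED = ServiceTier("authenticated", "wcqs-authenticated", 300)


def pick_tier(is_authenticated: bool) -> ServiceTier:
    """Route a request to the tier with its own resources and timeout."""
    return AUTHENTICATED if is_authenticated else ANONYMOUS
```

Keeping the tiers on separate resources means heavy anonymous queries would exhaust only the anonymous pool.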

As mentioned in the WCQS beta 2 announcement, authentication (or the lack of it) on WCQS is an issue that requires further discussion and planning after the beta 2 release. I'll mark this as high priority as a feature request.

Adding security to WCQS might have an unexpected effect. Since it is not possible to write a federated query where the query is submitted to a remote SPARQL endpoint, it is only possible to run federated queries directly on the WCQS, which means that WCQS needs to deal with all the complexity of a query. Removing that login requirement would allow the majority of the complexity to be dealt with at a remote endpoint.
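
To illustrate the direction of federation described above: with the login requirement, a federated query has to be submitted to WCQS itself, which then calls out to the other endpoint through a `SERVICE` clause. The sketch below assumes the public WDQS endpoint URL and the "depicts" property (P180); the helper name and the query shape are illustrative only.

```python
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_federated_query(depicted_item: str, limit: int = 10) -> str:
    """Build a query meant to be run on WCQS that federates out to WDQS.

    depicted_item is a Wikidata item ID such as "Q146".
    """
    return f"""
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?file ?label WHERE {{
  ?file wdt:P180 wd:{depicted_item} .
  SERVICE <{WDQS_ENDPOINT}> {{
    wd:{depicted_item} rdfs:label ?label .
    FILTER(LANG(?label) = "en")
  }}
}}
LIMIT {limit}
""".strip()
```

The reverse direction, a query submitted elsewhere with `SERVICE <https://commons-query.wikimedia.org/sparql>`, is the one that the login requirement blocks.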

Can the password feature on the SDCQC please, please, please, please pretty please be removed/disabled? The SDCQC is an epic feature, but almost useless thanks to the requirement to log in. Basically, Commons remains a data silo on its own.
I keep running into issues where I am building a query that I want to share, reuse in a Jupyter notebook, or run as a federated query from Wikidata. The decision to use OAuth here is really a poor design choice.

FYI, OpenRefine will likely implement a SPARQL importer in the coming months (May-August 2022) through an Outreachy internship. Many OpenRefine users have requested to make it possible to start OpenRefine projects from a SPARQL query.

I have explicitly asked to also investigate if it is even possible for us to start a project from a WCQS query. In our user research (as part of the Wikimedia-funded project to include Wikimedia Commons functionalities in OpenRefine), Wikimedia Commons users have asked for this feature.

Example use case: Wikimedia's LinguaLibre.org has 700k+ files on Commons. We also have a MediaWiki plugin, QueryViz, to display query service responses within our wiki pages such as Help:SPARQL. See the live test at Help:SPARQL/test:

  • Lingua Libre Query Service: works
  • Wikidata Query Service: works
  • Wikimedia Commons Query Service: fails due to current settings.

For #lingua_libre, a possible alternative while still being authenticated: somehow get the wcqsSession token (not sure how) and send it to the WCQS endpoint when sending queries with XHR. This could work, except WCQS does not correctly respond to preflight requests from browsers.

--> OPTIONS https://commons-query.wikimedia.org/sparql?format=json&query=SELECT%20%3Flang%20%3Fiso%20%3FlangLabel%2...
<-- 307 Temporary Redirect (Should be a 204 with the list of acceptable headers, instead it simply redirects to the authentication page)

I do not know if that requires a separate issue. In any case, requiring authentication and not yielding any CORS access controls for XHR authentication severely hinders our capacity to do anything useful with this querying service.
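
For reference, the preflight behaviour described above can be expressed as a small check. This is a simplified sketch of what browsers expect from a CORS preflight response (a 2xx status plus an `Access-Control-Allow-Origin` header that covers the caller), not WCQS code. The function name and the example origin are made up, and it ignores credentialed requests, for which `*` is not accepted.

```python
def preflight_allows(status: int, headers: dict[str, str], origin: str) -> bool:
    """Simplified check of whether a browser would accept an OPTIONS preflight response."""
    # Browsers do not follow redirects for preflights: anything outside 2xx fails.
    if not 200 <= status <= 299:
        return False
    # Header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    allowed = normalized.get("access-control-allow-origin")
    return allowed in ("*", origin)


# The 307 redirect to the authentication page fails the check.
print(preflight_allows(307, {"Location": "/login"}, "https://example.org"))
# A 204 that names the calling origin passes.
print(preflight_allows(204, {"Access-Control-Allow-Origin": "https://example.org"}, "https://example.org"))
```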

Honest question, is anyone even using WCQS right now? https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&var-cluster_name=wcqs&from=now-2d&to=now&viewPanel=18 suggests that it's getting ~2.5 queries per second. WDQS (https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&var-cluster_name=wdqs&from=now-2d&to=now&viewPanel=18) on the other hand is two orders of magnitude higher, ~250q/s.

My impression from speaking with multiple people at the hackathon is that the authentication requirement is a barrier for people to even use the service in the first place, which defeats the point.

    My impression from speaking with multiple people at the hackathon is that the authentication requirement is a barrier for people to even use the service in the first place, which defeats the point.

Exactly! Thanks for pointing this out. We should however be wary that those low numbers don't reflect a lack of potential. Once in a while I try the WCQS and I keep being impressed by its potential, only to be disappointed when trying to use it at scale. We should keep reiterating that introducing the authentication to the SPARQL endpoint was a bad design choice.

    My impression from speaking with multiple people at the hackathon is that the authentication requirement is a barrier for people to even use the service in the first place, which defeats the point.

Yup. I'm not really interested in creating queries if I can't share them with other people who don't / can't / won't go through the authentication hassle, or in building tools around the service when they're not going to be used for the same reason.

I would suggest just removing the authentication for a week and see what happens.

    I would suggest just removing the authentication for a week and see what happens.

I'm curious what you expect to learn during this week. I'm not necessarily against a trial, but to me it seems clear that 1) a number of people who currently aren't using WCQS would if there was no auth check, and 2) Bad/complex/etc. queries will most likely take down WCQS, just like they do for Wikidata. Is a week long enough to get a sense of what either will be like?

But a service with 100% uptime that no one uses is still useless, so I lean towards removing the auth restriction outright and just treating it the same as WDQS. My impression is that we're not introducing a new problem, just extending the same problem that is being explored on the WDQS side and will benefit from the same fix whenever it's ready.

        I would suggest just removing the authentication for a week and see what happens.

    I'm curious what you expect to learn during this week. I'm not necessarily against a trial, but to me it seems clear that 1) a number of people who currently aren't using WCQS would if there was no auth check, and 2) Bad/complex/etc. queries will most likely take down WCQS, just like they do for Wikidata. Is a week long enough to get a sense of what either will be like?

I also don’t think this trial would be very useful – or at least, I probably won’t add WCQS support to my tools if I’ll have to remove it again after only a week.

Looks like it is still 2-3 reqs/s. Is that the target usage rate? A fine metric for a query service is that it is used to its desired load. If underused, it could be made open-auth until this is reached, and then somehow have a conversation with those users re: options for rate limiting or uptime degradation.

On the common need that Husky and others note, of wanting to build tools anyone can use without getting an error: for a given load you could split capacity between a part for open usage and the rest for auth-only usage, on different clusters, so the latter stays available even when the former is brought down. [Then tune the size of each cluster based on need; maybe one can go away completely if not used.]

From some discussions with volunteers at Wikimania, some people are really keen on having the ability to share query results of WCQS/WDQS without auth even if it's stale. Like basically HTML of the result copied and stored somewhere.

A general problem of pressure on query services is that sometimes I just want the cached data to share it, and sometimes I want to run it right now. This is already decently solved in Quarry: sharing a database query result doesn't need an account and doesn't do anything heavy internally (no actual database query is run), but building a new query to share does need the OAuth dance.
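
The Quarry-style split between viewing a stored result and running a fresh query could be sketched as follows. Everything here is hypothetical (the class and method names, the in-memory dict, the `run_query` callback standing in for the real query service); it only illustrates that reading a stored result needs no login and no query execution, while a new run does.

```python
import hashlib
from typing import Callable, Optional


class ResultStore:
    """In-memory stand-in for a store of shareable, possibly stale query results."""

    def __init__(self, run_query: Callable[[str], str]) -> None:
        self._run_query = run_query  # hypothetical hook that actually hits the service
        self._results: dict[str, str] = {}

    def run_and_store(self, query: str, is_authenticated: bool) -> str:
        """Run a new query (login required) and return a shareable key."""
        if not is_authenticated:
            raise PermissionError("running a new query requires authentication")
        key = hashlib.sha256(query.encode("utf-8")).hexdigest()[:12]
        self._results[key] = self._run_query(query)
        return key

    def view(self, key: str) -> Optional[str]:
        """Return a stored result without login and without touching the service."""
        return self._results.get(key)
```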

    some people are really keen on having the ability to share query results of WCQS/WDQS without auth even if it's stale. Like basically HTML of the result copied and stored somewhere.

See also: T104762: Setup sparqly service at https://sparqly.wmflabs.org/ (like Quarry but for SPARQL)

Has there been any recent progress on this? T348269 by one of the preeminent reusers of SDC, in both scale and impact, seems relevant; but that remains untriaged and this one is High but unassigned.

Usage seems surprisingly steady at 2.5-3 q/s, is there any information about how it's currently being used?
cc @Ladsgroup @Spinster @Fuzheado

In T297995#9695199, @Sj wrote:

    Usage seems surprisingly steady at 2.5-3 q/s, is there any information about how it's currently being used?

I think that's just monitoring and health checks (it averages to 1 query per server per second), I'm skeptical anyone is actually using it.

    With some discussions with volunteers in Wikimania, ...

This keeps coming up for me too, at basically every in-person event I attend; T297995#8872458 is from the 2023 Wikimedia Hackathon, it was also brought up during a GLAM discussion at 2023 WikiConference North America, and then again this past weekend at 2024 WikiConference North America.

Aha, thanks @Legoktm. Perhaps we could

a) get stats on non-monitoring usage [to confirm that this is low], and 
b) turn off auth for a month to see how usage changes?

@Ladsgroup @AUgolnikova-WMF what do you think?

[The above comment is removed given some other discussion that is going around on this topic]

Turning authentication off for a month wouldn't exactly incentivize anyone to use it. I certainly wouldn't build something dependent on a service I knew would break in 30 days.

Just as a note: I am using WCQS, and my main use cases are

  • download and track certain properties added to SDC ( Finna id, phash, dhash identifiers, author metadata of the photos, coordinates ... ) for FinnaUploadBot
  • query images for articles without photos for the competitions (federated queries between WDQS and WCQS)
  • stats

I am mostly using Pywikibot for this, since we figured out how to do the WCQS authentication with it (T345342). However, even with Pywikibot the user needs to go manually to https://commons-query.wikimedia.org in a web browser from time to time to keep the Pywikibot authentication working, as the authentication will silently fail after some weeks without doing so.

Also, using Pywikibot doesn't solve the problem of how to run federated SPARQL queries from other endpoints to WCQS (for example, from a local Apache Jena Fuseki or OpenStreetMap's Sophox).

It would be much easier to use if the authentication were removed.