
Find a solution for SPARQL federation that is blocked by stricter user agent policy enforcement
Open, HighPublic

Description

It seems that with the stricter enforcement of the user agent policy we are blocking legitimate SPARQL requests: requests where people write federated SPARQL queries on another SPARQL endpoint and, as part of that, federate with Wikidata Query Service. Setting a compliant user agent doesn't seem to be possible with some (even most?) query backends, unfortunately.

The ability to write federated SPARQL queries that include data from Wikidata is pretty core to our work on building out the Wikibase Ecosystem and taking off load from Wikidata. Is there any solution we can find here?

Here is a request that is blocked but that we believe should be allowed:

curl 'https://sparql.fornpunkt.se/query' -X POST -H 'Accept: application/sparql-results+json,*/*;q=0.9' --data-raw 'query=PREFIX+schema%3A+%3Chttp%3A%2F%2Fschema.org%2F%3E%0APREFIX+oa%3A+%3Chttp%3A%2F%2Fwww.w3.org%2Fns%2Foa%23%3E%0A%0ASELECT+%3Fwikidata+WHERE+%7B%0A++%3Fannotation+oa%3AhasTarget+%3Chttp%3A%2F%2Fkulturarvsdata.se%2Fraa%2Flamning%2F401055fc-e795-4e2c-8e34-c45dfde18e61%3E+%3B%0A++++++++++++++oa%3AhasBody+%3Fbody+.%0A++%3Fbody+schema%3AsubjectOf+%3Ftarget+.%0A%0A++SERVICE+%3Chttps%3A%2F%2Fquery.wikidata.org%2Fsparql%3E+%7B%0A++++%3Ftarget+schema%3Aabout+%3Fwikidata+.%0A++%7D%0A%7D%0A'
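For readability, the URL-encoded body of the curl command above can be decoded with a few lines of stdlib Python (the `raw_body` string below is copied verbatim from the request):

```python
# Decode the URL-encoded SPARQL query from the blocked request above.
# raw_body is the exact --data-raw payload from the curl command.
from urllib.parse import unquote_plus

raw_body = (
    "query=PREFIX+schema%3A+%3Chttp%3A%2F%2Fschema.org%2F%3E%0A"
    "PREFIX+oa%3A+%3Chttp%3A%2F%2Fwww.w3.org%2Fns%2Foa%23%3E%0A%0A"
    "SELECT+%3Fwikidata+WHERE+%7B%0A"
    "++%3Fannotation+oa%3AhasTarget+%3Chttp%3A%2F%2Fkulturarvsdata.se"
    "%2Fraa%2Flamning%2F401055fc-e795-4e2c-8e34-c45dfde18e61%3E+%3B%0A"
    "++++++++++++++oa%3AhasBody+%3Fbody+.%0A"
    "++%3Fbody+schema%3AsubjectOf+%3Ftarget+.%0A%0A"
    "++SERVICE+%3Chttps%3A%2F%2Fquery.wikidata.org%2Fsparql%3E+%7B%0A"
    "++++%3Ftarget+schema%3Aabout+%3Fwikidata+.%0A"
    "++%7D%0A%7D%0A"
)

# Strip the "query=" form-field name, then URL-decode ('+' -> space, %XX -> char).
query = unquote_plus(raw_body[len("query="):])
print(query)
```

Note that the outer request goes to sparql.fornpunkt.se; it is the inner SERVICE hop into WDQS, made by that backend itself, which needs the compliant user agent.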

Upstream issues for adding User-Agent support:

  • Fuseki: apache/jena#3148 (fix shipped in Jena 5.5.0 – sends something like ApacheJena/5.5.0, no customization supported yet)
  • Oxigraph: oxigraph/oxigraph#1456 (Oxigraph currently sends Oxigraph/0.5.0, which is enough to make the request go through; upstream issue is for allowing further customization, but not strictly necessary at the moment)
  • WDQS: N/A; sends User-Agent: Wikidata Query Service; https://query.wikidata.org/ by default, which may be rejected in the future but can be overridden via the $USER_AGENT environment variable. Fixing this for Wikibase Suite is tracked in T405233
  • Virtuoso: TBD?
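For clients that do support custom headers, a policy-style user agent (a descriptive tool name plus contact information) is all that's needed. A minimal stdlib Python sketch; the tool name, URL, and contact address below are placeholder assumptions, not real values:

```python
# Hedged sketch: building a WDQS request with a policy-style User-Agent.
# "ExampleFederator" and the contact details are placeholders.
import urllib.parse
import urllib.request

HEADERS = {
    "Accept": "application/sparql-results+json",
    # Policy-style UA: tool name/version plus a way to contact the operator.
    "User-Agent": "ExampleFederator/1.0 (https://example.org/federator; ops@example.org)",
}

def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Prepare a POST request carrying the SPARQL query as a form-encoded body."""
    data = urllib.parse.urlencode({"query": query}).encode()
    return urllib.request.Request(endpoint, data=data, headers=HEADERS)

# Sending is then just: urllib.request.urlopen(build_request(...)).read()
req = build_request("https://query.wikidata.org/sparql", "ASK {}")
```

When the federating hop is made by a backend such as Fuseki or Virtuoso rather than by your own code, this is the header those backends need to send on outgoing SERVICE requests, which is what the upstream issues above track.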

Event Timeline

The affected SPARQL backends appear to include at least Fuseki and Virtuoso.

Hi @Lydia_Pintscher , SRE can make some exception here. It seems warranted given the status quo in the broader SPARQL ecosystem.

But I want us to get the scope of the exception right:
Is federation traffic always against the /sparql endpoint?
Does federation traffic always set an Accept header like Accept: application/sparql-results[...]?

Thanks!

Change #1183161 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] Exempt query.wikidata.org from U-A policy

https://gerrit.wikimedia.org/r/1183161

Change #1183161 merged by CDanis:

[operations/puppet@production] haproxy: Exempt query.wikidata.org from U-A policy

https://gerrit.wikimedia.org/r/1183161

> Hi @Lydia_Pintscher , SRE can make some exception here. It seems warranted given the status quo in the broader SPARQL ecosystem.
>
> But I want us to get the scope of the exception right:
> Is federation traffic always against the /sparql endpoint?
> Does federation traffic always set an Accept header like Accept: application/sparql-results[...]?
>
> Thanks!

Thank you so much!
For the first, I believe yes. For the second, I don't know. I hope some of the other subscribers can chime in.

> Hi @Lydia_Pintscher , SRE can make some exception here. It seems warranted given the status quo in the broader SPARQL ecosystem.

Hey Chris,

Does this exception have any implication for rate limiting, on the cdn side of things?

AFAIK we currently only rate limit requests in Blazegraph (the WDQS store); would it be an option to shift that to the edge?

Based on my conversation with Amy Tsay, Wikidata Query service team is looking into it. Hence, removing SRE related tags to avoid confusion.
@CDanis - FYI

> Based on my conversation with Amy Tsay, Wikidata Query service team is looking into it. Hence, removing SRE related tags to avoid confusion.

Understood. Wikidata team please advise if/when you'd like the user-agent policy exception revisited.

> Does this exception have any implication for rate limiting, on the cdn side of things?
> AFAIK we currently only rate limit requests in Blazegraph (the WDQS store); would it be an option to shift that to the edge?

As discussed on Slack, probably not.

Thanks, @CDanis. We will follow up on the user-agent exception once the team has had a chance to align internally.

A little late to the party, but does this also affect OpenRefine SPARQL reconciliations? I don't know much about SPARQL in general, but I've been attempting to reconcile data using two different columns in OpenRefine (calling P569@year for birth year in addition to the individual's name), which the program turns into a SPARQL query. Those reconciliation attempts have been giving me 403 errors, while regular one-column reconciliations work just fine.

@Lupascriptix This is likely the same underlying change, but unrelated to this ticket. The OpenRefine team recently made a new release to address some of the issues: https://github.com/OpenRefine/OpenRefine/releases/tag/3.9.5 If that doesn't fix it for you, I recommend opening a ticket there.

> Hi @Lydia_Pintscher , SRE can make some exception here. It seems warranted given the status quo in the broader SPARQL ecosystem.
>
> But I want us to get the scope of the exception right:
> Is federation traffic always against the /sparql endpoint?
> Does federation traffic always set an Accept header like Accept: application/sparql-results[...]?
>
> Thanks!

To clarify: any exception is going to be temporary while you fix your problems. We can extend it if we can narrow its scope down a lot, as @CDanis suggested.

Specifically, I don't consider "most clients don't support setting a user-agent" an insurmountable problem client-side. I would expect this exception to be lifted at the start of 2026; if you want more time, then I think that needs a discussion. For now I'll clearly indicate that this and other exceptions will be removed at the start of 2026 unless an extension is negotiated.

In case people are looking: Wikibase Cloud does set a user agent for federated requests originating from us since T397052#10923980 (Wikibase.Cloud Query Service (<version>)). We're using WDQS under the hood.

Actually, it would probably be a good thing to encourage people not to use the generic Wikidata Query Service; https://query.wikidata.org/ user agent if they are running a separate instance of the Wikibase/Wikidata query service, so that we can differentiate that traffic.

> Specifically, I don't consider "most clients don't support setting a user-agent" an insurmountable problem client-side. I would expect this exception to be lifted at the start of 2026; if you want more time, then I think that needs a discussion. For now I'll clearly indicate that this and other exceptions will be removed at the start of 2026 unless an extension is negotiated.

The platform team is aligned on providing a deadline by which the exception will be lifted. The beginning of 2026 is a good date for now; we will update this ticket once we've had the chance to coordinate with WMDE on implementing a more scalable solution and/or narrowing the scope of the exception.

BTracy-WMF moved this task from Incoming to Analysis on the Wikidata-Query-Service board.

Hi @CDanis , we recently finalized a decision brief on next steps for this issue. We aligned on revoking the exception and working with WDQS users to ensure their federated queries are compliant with the UA policy. To ensure we have enough lead time to communicate the upcoming change, we are requesting that the policy exception remain in place until February 18th, 2026.

Please let me know if this poses any issues on the SRE side.

> Hi @CDanis , we recently finalized a decision brief on next steps for this issue. We aligned on revoking the exception and working with WDQS users to ensure their federated queries are compliant with the UA policy. To ensure we have enough lead time to communicate the upcoming change, we are requesting that the policy exception remain in place until February 18th, 2026.
>
> Please let me know if this poses any issues on the SRE side.

Sorry, I missed this message as I was on PTO. The original agreed date was the start of the year; however, February 18 is OK from our point of view, as long as there aren't further delays.

Thanks, @Joe . We're firmly committed to Feb 18th.

Hello all, I am one of the SREs that supports the Wikidata Query Service.

Beginning at about 1000 UTC today, WDQS came under attack from a scraper. At the height of the attack, the majority of queries to the service returned 4xx or 5xx errors (see the "Max Lag" header on the linked dashboard).

At around 1200 UTC, we implemented a rule to block bot traffic. This significantly dropped error rates and allowed the service to recover. But then we noticed that it was too strict (SPARQL query rate per minute, displayed near the top of the linked dashboard, dropped too sharply), so we relaxed the rule around 1410 UTC. It appears that traffic has recovered back to its normal levels.

If you believe your well-behaved bot is still being throttled excessively, please respond here and let us know. AI scrapers are very aggressive and while our toolkit to deal with them is evolving, sometimes we have to take extreme measures to protect the service. We apologize to anyone who was impacted by this issue.

WDQS is under attack again, and I have re-implemented the stricter rules that block (or at least attempt to block) bot traffic. Apologies once again to anyone who was disrupted.

> WDQS is under attack again, and I have re-implemented the stricter rules that block (or at least attempt to block) bot traffic. Apologies once again to anyone who was disrupted.

Hello, I am using AcgServiceBot/0.1 (https://github.com/Func86/anilist-wikidata) as the User-Agent header, and I noticed my automation workflow (2 requests per hour) has failed with HTTP 403 for the last 4 hours consecutively. After investigation, it seems that setting the Api-User-Agent header to the same value fixed the issue for me. This feels wrong, though; the user agent policy only mentions Api-User-Agent as a workaround for clients that cannot set the standard User-Agent header.

Update: Setting Api-User-Agent only works for me locally, the GitHub workflow still fails with HTTP 403:

Request served via cp4043 cp4043, Varnish XID 637254523
Upstream caches: cp4043 int
Error: 403, Please respect our robots policy https://wikitech.wikimedia.org/wiki/Robot_policy (1e30f7b) at Tue, 27 Jan 2026 14:25:53 GMT

Hello @Func , you are correct, this was not the proper message to send. I applied the block (and messaging) hastily as the service was completely down.

I have loosened up the rules and your bot should no longer be affected. If you or anyone else running a well-behaved bot is still being blocked, please let us know.

Hello again, I just wanted to let everyone know that the previous ruleset was also too strict, and we've loosened it further. Again, please feel free to reach out if you're running a well-behaved bot and are still affected.

> Hi @CDanis , we recently finalized a decision brief on next steps for this issue. We aligned on revoking the exception and working with WDQS users to ensure their federated queries are compliant with the UA policy. To ensure we have enough lead time to communicate the upcoming change, we are requesting that the policy exception remain in place until February 18th, 2026.
>
> Please let me know if this poses any issues on the SRE side.
>
> Sorry, I missed this message as I was on PTO. The original agreed date was the start of the year; however, February 18 is OK from our point of view, as long as there aren't further delays.

Gentle reminder, tomorrow we'll remove the exception to the UA policy.

Change #1240616 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] cache::haproxy: remove exception for query.wikidata.org, automattic

https://gerrit.wikimedia.org/r/1240616

Change #1240616 merged by Vgutierrez:

[operations/puppet@production] cache::haproxy: remove exception for query.wikidata.org, automattic

https://gerrit.wikimedia.org/r/1240616

Hello all.

Looking for solutions to debug my federated queries, I ended up here.

I'm rather new to Wikidata and SPARQL. Last year I wrote some federated queries using Wikidata and UniProtKB in SERVICE clauses, with the main endpoint being my own local KG.

Now the queries don't work.

I'm sending the queries from a Python script with SPARQLWrapper.

I tried adding:

sparql.addCustomHttpHeader(
    "User-Agent",
    "MyApp/1.0 (myuniversitymail@univ-mail.fr)"
)

to no avail.

To be honest, I'm quite lost and not sure what I'm doing. If someone can point me to some documentation or an explanation of why Wikidata blocks queries now, I would appreciate it. I don't really understand the discussion in this thread.

Thanks,
Kind Regards

PS: here is a SPARQL query that I'd like to see working...

curl -X POST "https://cgen-kg-ica.bird.glicid.fr/cgkg4ica/sparql" -H "Accept: application/sparql-results+json" --data-urlencode "query=PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT * WHERE {
  SERVICE <https://query.wikidata.org/sparql> {
    SELECT * WHERE {
      ?wp wdt:P352 ?uniprot .
      }
    LIMIT 10
  }
}"

@Bodral

From what I understand, the problem is that you're setting the UA on the request that goes into your endpoint, but it needs to be set on the request that goes out of your endpoint to WDQS.

So instead of this:
Your script —(UA)→ your endpoint → WDQS
it should be this
Your script → your endpoint —(UA)→ WDQS

What is the software that your endpoint is running on?
It should be configured to send the compliant UA when federating queries outwards.

@Anton.Kokh
It's a Fuseki server with a TDB2 database. I pass a config.ttl file when launching the jar, containing the endpoint name, the path to the TDB2 database, and the allowed operations. Is this where I can specify the new setting?

EDIT: I tried adding

-Djena.http.userAgent="MyApp/1.0 (myuniversitymail@univ-mail.fr)"

when launching Fuseki. The queries stalled a bit longer than before, then a 'Forbidden' message was sent back.

I'm reading in this GitHub thread that I need to update to Jena 5.5; I have 5.3. Did anyone solve this issue by updating?

EDIT2: Updating Java to 21 and Fuseki to 6.0 made the query inside a Wikidata SERVICE clause work, even without adding -Djena.http.userAgent= when launching Fuseki.