
Block traffic from user-agents not honoring our policy
Closed, Resolved · Public

Description

We have a very clear user-agent policy, https://foundation.wikimedia.org/wiki/Policy:Wikimedia_Foundation_User-Agent_Policy, stating that:

  • Requests without a user-agent should be blocked
  • Requests not coming from a browser should properly identify themselves with a distinctive name and a contact email or URL, or otherwise be blocked

We've always been exceedingly liberal with this policy, but it's clearly become unsustainable.

We will progressively rate limit, then block, requests from these generic user-agents. Our goal is to block all traffic from unidentified clients that are not coming from authorized actors, like Toolforge or our internal APIs.

Below is the proposed schedule, limiting first to 10 requests per second per IP, then to 5, then to 1, and finally blocking the traffic completely.

user-agent pattern  | 10 rps/ip | 5 rps/ip | 1 rps/ip | block
No user agent       | -         | -        | Aug 11   | Aug 18
library default     | -         | Aug 11   | Aug 18   | Aug 25
curl/wget CLI       | -         | Aug 11   | Aug 18   | -
external mw-related | Aug 11    | Aug 18   | Aug 25   | Sept 1

Definitions of patterns:

  • No user agent: requests without a user-agent header, or with an empty value for it
  • library default: requests with the default user-agent string for common software libraries like python-requests, curl, okhttp, go-httpclient, etc.
  • external mw-related: requests with user-agent strings set by MediaWiki (like ForeignApiRepo) or by other mw-related software like WDQS Updater

In the specific case of MediaWiki-generated user-agents, we can't completely block them at the moment, because MediaWiki not only uses a non-policy-compliant UA string by default, it also doesn't allow overriding it.
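For client authors, a minimal sketch of a policy-compliant request in Python, using only the stdlib. The tool name and contact address are placeholders, not real values:

```python
import urllib.request

def build_user_agent(tool, version, contact):
    """Compose an identifying User-Agent: '<tool>/<version> (<contact>)'."""
    return f"{tool}/{version} ({contact})"

def fetch(url, user_agent):
    """GET `url` while identifying the client, instead of relying on
    urllib's library-default UA (which falls under the policy above)."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    return urllib.request.urlopen(req, timeout=30)

# Example (placeholder name and contact, not a real tool):
# fetch("https://en.wikipedia.org/w/api.php?action=query&format=json",
#       build_user_agent("ExampleBot", "1.0",
#                        "https://example.org/bot; bot@example.org"))
```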

Details

Related Changes in Gerrit:
Repo              | Branch     | Lines +/-
operations/puppet | production | +9 -5
operations/puppet | production | +6 -2
operations/puppet | production | +4 -1
operations/puppet | production | +1 -1
operations/puppet | production | +6 -2
operations/puppet | production | +8 -2
operations/puppet | production | +5 -2
operations/puppet | production | +5 -2
operations/puppet | production | +1 -1
operations/puppet | production | +6 -4
operations/puppet | production | +5 -3
operations/puppet | production | +2 -2
operations/puppet | production | +4 -4
operations/puppet | production | +2 -1
operations/puppet | production | +17 -3
operations/puppet | production | +1 -1
operations/puppet | production | +11 -3
operations/puppet | production | +8 -11
operations/puppet | production | +2 -2
operations/puppet | production | +5 -7
operations/puppet | production | +2 -2
operations/puppet | production | +62 -47
operations/puppet | production | +48 -6


Event Timeline

hashar subscribed.

This prevents us from updating the CI Jenkins. The updates are done over the API at https://integration.wikimedia.org/ci/ using basic authentication (the Authorization request header). Can this please be exempted until it is figured out? Either by allowing integration.wikimedia.org or by allowing requests with an Authorization header.

It is a blocker for CI and I have filed T403089. Let's follow up on that subtask.

@Joe Could I ask for a two week exemption for diff.wikimedia.org until we have our next sprint with our devs? Right now folks can't log into the community blog to share their stories and updates and I won't have developer time until the second week of September.

Change #1183118 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] haproxy: Temporary UA policy exemption for Automattic

https://gerrit.wikimedia.org/r/1183118

Change #1183118 merged by Vgutierrez:

[operations/puppet@production] haproxy: Temporary UA policy exemption for Automattic

https://gerrit.wikimedia.org/r/1183118

@Joe Could I ask for a two week exemption for diff.wikimedia.org until we have our next sprint with our devs? Right now folks can't log into the community blog to share their stories and updates and I won't have developer time until the second week of September.

@CKoerner_WMF we've added and applied an exemption for requests coming from diff.wikimedia.org backend. Please let us know the phabricator task you're using to track the fix on your side of things, thanks!

Hello everyone, I'm a contributor to the Wikimedia Commons Android App. The Commons app recently started facing some API issues with code 403 and message "Please set a user-agent and respect our robot policy https://w.wiki/4wJS. See also T400119".

See this issue for more context: https://github.com/commons-app/apps-android-commons/issues/6411

We have been sending this UA value for a long time: "User-Agent: Commons/6.0.0-debug-main~f810a2d49 (https://mediawiki.org/wiki/Apps/Commons) Android/13" and we didn't make any changes on our side that might have caused this issue. Could anyone please help with this? Thanks :)

Hello everyone, I'm a contributor to the Wikimedia Commons Android App. The Commons app recently started facing some API issues with code 403 and message "Please set a user-agent and respect our robot policy https://w.wiki/4wJS. See also T400119".

Hi @Rohitverma9625, I'll start investigating. Can you provide any more details on which specific URLs or API calls are affected by this? Is it all of the traffic performed by the app, or only some?

Hi @CDanis, thanks for the quick response. We are facing this issue only for some URLs; other parts fetch data correctly. One of the core features of the app is Nearby, which sends requests to this URL:

GET https://query.wikidata.org/sparql?query=SELECT...
2025-08-29 22:02:12.708 12019-12070 OkHttp fr.free.nrw.commons V ption%20%3Fdescription.%0A%20%20%20%20%3Fclass%20rdfs%3Alabel%20%3FclassLabel.%0A%20%20%7D%0A%7D%0AGROUP%20BY%20%3Fitem%0A&format=json
2025-08-29 22:02:12.708 12019-12070 OkHttp fr.free.nrw.commons V --> END GET
2025-08-29 22:02:12.814 12019-12070 OkHttp fr.free.nrw.commons V <-- 403 https://query.wikidata.org/sparql?query=SELECT%0A%...
2025-08-29 22:02:12.814 12019-12070 OkHttp fr.free.nrw.commons V ption%20%3Fdescription.%0A%20%20%20%20%3Fclass%20rdfs%3Alabel%20%3FclassLabel.%0A%20%20%7D%0A%7D%0AGROUP%20BY%20%3Fitem%0A&format=json (106ms)
2025-08-29 22:02:12.814 12019-12070 OkHttp fr.free.nrw.commons V content-length: 92
2025-08-29 22:02:12.814 12019-12070 OkHttp fr.free.nrw.commons V content-type: text/plain
2025-08-29 22:02:12.815 12019-12070 OkHttp fr.free.nrw.commons V x-analytics:
2025-08-29 22:02:12.815 12019-12070 OkHttp fr.free.nrw.commons V server: HAProxy
2025-08-29 22:02:12.815 12019-12070 OkHttp fr.free.nrw.commons V x-cache: cp5019 int
2025-08-29 22:02:12.815 12019-12070 OkHttp fr.free.nrw.commons V x-cache-status: int-tls
2025-08-29 22:02:12.815 12019-12070 OkHttp fr.free.nrw.commons V Please set a user-agent and respect our robot policy https://w.wiki/4wJS. See also T400119.
2025-08-29 22:02:12.815 12019-12070 OkHttp fr.free.nrw.commons V <-- END HTTP (92-byte body)

This is one more URL that fails with same error:

<-- 403 https://upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Beautiful_Coleus.jpg/330px-Beautiful_Coleus.jpg (103ms)
2025-08-29 22:13:39.568 12343-12431 OkHttp fr.free.nrw.commons V content-length: 92
2025-08-29 22:13:39.568 12343-12431 OkHttp fr.free.nrw.commons V content-type: text/plain
2025-08-29 22:13:39.568 12343-12431 OkHttp fr.free.nrw.commons V x-analytics:
2025-08-29 22:13:39.568 12343-12431 OkHttp fr.free.nrw.commons V server: HAProxy
2025-08-29 22:13:39.568 12343-12431 OkHttp fr.free.nrw.commons V x-cache: cp5027 int
2025-08-29 22:13:39.568 12343-12431 OkHttp fr.free.nrw.commons V x-cache-status: int-tls
2025-08-29 22:13:39.568 12343-12431 OkHttp fr.free.nrw.commons V Please set a user-agent and respect our robot policy https://w.wiki/4wJS. See also T400119.
2025-08-29 22:13:39.568 12343-12431 OkHttp fr.free.nrw.commons V <-- END HTTP (92-byte body)

@Joe Could I ask for a two week exemption for diff.wikimedia.org until we have our next sprint with our devs? Right now folks can't log into the community blog to share their stories and updates and I won't have developer time until the second week of September.

@CKoerner_WMF we've added and applied an exemption for requests coming from diff.wikimedia.org backend. Please let us know the phabricator task you're using to track the fix on your side of things, thanks!

Thank you. I'll keep updating this task as we work on a permanent fix. T403076

Thank you, that helps a lot.

I notice that there's two different HTTP clients used inside the app -- and only one of them sets User-Agent on outbound requests. I really don't know what I'm doing with Kotlin/Android, but, I submitted a draft PR to hopefully fix this: https://github.com/commons-app/apps-android-commons/pull/6415

I notice that there's two different HTTP clients used inside the app -- and only one of them sets User-Agent on outbound requests. I really don't know what I'm doing with Kotlin/Android, but, I submitted a draft PR to hopefully fix this: https://github.com/commons-app/apps-android-commons/pull/6415

Thanks for the PR, I'll test your PR and update you there if it solves the problem. Seems like I missed that file to check for UA header.

Change #1183161 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] Exempt query.wikidata.org from U-A policy

https://gerrit.wikimedia.org/r/1183161

Change #1183161 merged by CDanis:

[operations/puppet@production] haproxy: Exempt query.wikidata.org from U-A policy

https://gerrit.wikimedia.org/r/1183161

Note that returning a plain text error message to a JSON API request is going to make it very likely to get swallowed by a JSON parse failure in the client library and never seen by the end user. This obscured the problem for OpenRefine users (and all other clients using the Wikidata Toolkit).

Additionally, totally blocking requests, including login requests from users who are attempting to authenticate (which would provide their identity) seems VERY aggressive.

Note that returning a plain text error message to a JSON API request is going to make it very likely to get swallowed by a JSON parse failure in the client library and never seen by the end user.

That seems like a bug in the client library? If the response code is 4xx/5xx and the content type isn't JSON, there is no reason to try to parse it as such.

Note that returning a plain text error message to a JSON API request is going to make it very likely to get swallowed by a JSON parse failure in the client library and never seen by the end user.

That seems like a bug in the client library? If the response code is 4xx/5xx and the content type isn't JSON, there is no reason to try to parse it as such.

Good point. It looks like they're not checking the status code before attempting to parse the body.
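A client-side guard along these lines, sketched in Python; `parse_api_response` is a hypothetical helper, not part of any particular library:

```python
import json

def parse_api_response(status, content_type, body):
    """Parse an API response body as JSON only when it plausibly is JSON.

    On an error status or a non-JSON content type (e.g. the text/plain
    403 bodies discussed above), raise with the server's message instead
    of letting json.loads fail opaquely."""
    if status >= 400 or "json" not in (content_type or "").lower():
        raise RuntimeError(f"HTTP {status}: {body.strip()}")
    return json.loads(body)
```

This way the end user sees "HTTP 403: Please set a user-agent ..." rather than a JSON parse failure.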

Re-upping a question I had earlier - will the servers' "Retry-After" header use seconds, or http-date, or potentially either? Of course it would be easy to figure out in my code, but it would still be good to know.

Re-upping a question I had earlier - will the servers' "Retry-After" header use seconds, or http-date, or potentially either? Of course it would be easy to figure out in my code, but it would still be good to know.

Seconds. For example: Retry-After: 60
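Given that, a client can honor the header with something like the following Python sketch; `retry_delay` and its `default`/`cap` values are illustrative assumptions, not server guidance:

```python
def retry_delay(headers, default=1.0, cap=120.0):
    """Seconds to wait before retrying a 429 response.

    Assumes Retry-After carries integer seconds (as confirmed above);
    `default` and `cap` are illustrative client-side choices."""
    try:
        delay = float(headers.get("Retry-After", default))
    except (TypeError, ValueError):
        delay = default
    return min(max(delay, 0.0), cap)
```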

I still have an error in my unit tests from Gitlab CI when trying to access https://upload.wikimedia.org/wikipedia/commons/d/d2/Epichlorhydrin_vzorec.webp

[ERROR]   ImageUtilsTest.testReadImage:29 » IO GET /wikipedia/commons/d/d2/Epichlorhydrin_vzorec.webp => Forbidden -- [content-length:"92", content-type:"text/plain", x-analytics:"", server:"HAProxy", x-cache:"cp1105 int", x-cache-status:"int-tls"] -- Please set a user-agent and respect our robot policy https://w.wiki/4wJS. See also T400119.

The same test works for two other URLs with the same user agent, why?

Change #1183245 had a related patch set uploaded (by Pppery; author: Pppery):

[operations/puppet@production] Varnish: Fix rate limit comment to match code

https://gerrit.wikimedia.org/r/1183245

Change #1183245 merged by Vgutierrez:

[operations/puppet@production] Varnish: Fix rate limit comment to match code

https://gerrit.wikimedia.org/r/1183245

Change #1183618 had a related patch set uploaded (by Slyngshede; author: Slyngshede):

[operations/puppet@production] P:cache::haproxy Allow user-agents with contact information

https://gerrit.wikimedia.org/r/1183618

I still have an error in my unit tests from Gitlab CI when trying to access https://upload.wikimedia.org/wikipedia/commons/d/d2/Epichlorhydrin_vzorec.webp

[ERROR]   ImageUtilsTest.testReadImage:29 » IO GET /wikipedia/commons/d/d2/Epichlorhydrin_vzorec.webp => Forbidden -- [content-length:"92", content-type:"text/plain", x-analytics:"", server:"HAProxy", x-cache:"cp1105 int", x-cache-status:"int-tls"] -- Please set a user-agent and respect our robot policy https://w.wiki/4wJS. See also T400119.

The same test works for two other URLs with the same user agent, why?

It looks like that test for some reason is using the default UA of the HttpClient library: Apache-HttpClient/5.5 (Java/24.0.2)

It looks like that test for some reason is using the default UA of the HttpClient library: Apache-HttpClient/5.5 (Java/24.0.2)

Thank you, I forgot I had specific code for this.

Change #1183618 merged by Slyngshede:

[operations/puppet@production] P:cache::haproxy Allow user-agents with contact information

https://gerrit.wikimedia.org/r/1183618

Change #1183693 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] Revert^2 "P:cache::haproxy Allow user-agents with contact information"

https://gerrit.wikimedia.org/r/1183693

Change #1183693 merged by Vgutierrez:

[operations/puppet@production] Revert^2 "P:cache::haproxy Allow user-agents with contact information"

https://gerrit.wikimedia.org/r/1183693

Re the comment: "Allow user-agents with contact information" - implies blocking UAs with no contact information. Is this referring to a subset of queries? I understood from earlier that a legacy client-side app with a UA modeled on a browser UA would be OK (unless it runs into a rate limit). Still true?

Re the comment: "Allow user-agents with contact information" - implies blocking UAs with no contact information. Is this referring to a subset of queries? I understood from earlier that a legacy client-side app with a UA modeled on a browser UA would be OK (unless it runs into a rate limit). Still true?

You're totally right, that's referring to library-default UAs like python-requests.
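For reference, a hypothetical approximation of that category in Python. The prefixes below are examples of common library defaults, not the actual server-side match rules:

```python
import re

# Common default UA prefixes for HTTP libraries and CLI tools; an
# illustrative approximation, not the real HAProxy/Varnish rules.
_LIBRARY_DEFAULT_UA = re.compile(
    r"^(python-requests/|python-urllib/|curl/|Wget/|okhttp/"
    r"|Go-http-client/|Apache-HttpClient/)",
    re.IGNORECASE,
)

def is_library_default(user_agent):
    """True if the UA looks like an unmodified HTTP-library default."""
    return bool(user_agent) and bool(_LIBRARY_DEFAULT_UA.match(user_agent))
```

A UA that identifies the actual tool, like the Commons app's "Commons/6.0.0 (...)" string above, would not match; the Apache-HttpClient default seen earlier in this thread would.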

Change #1183679 had a related patch set uploaded (by Slyngshede; author: Slyngshede):

[operations/puppet@production] P:cache::haproxy disallow Wikidata Query Service as UA

https://gerrit.wikimedia.org/r/1183679

Change #1184046 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] hiera: Enable JA3N fingerprinting CDN wide

https://gerrit.wikimedia.org/r/1184046

Change #1183679 merged by Slyngshede:

[operations/puppet@production] P:cache::haproxy disallow Wikidata Query Service as UA

https://gerrit.wikimedia.org/r/1183679

Another question about response 429, sorry. Is there a range of values I can expect for Retry-After? AWB already waits at least 5 seconds before retrying after any 4xx response, and I'd like to know if that needs to be updated to honor the returned value.

The upcoming django-allauth release should fix the issues with the mediawiki socialaccount provider as well. See this issue for more.

Yeah getting the swagger spec via curl https://api.wikimedia.org/core/v1/wikipedia/en/search/page?q=earth&limit=10 also no longer works I guess.

I see that no one replied to this comment - I just tried, and FWIW, as long as I don't make more than 1 rps, I can download that page consistently with curl and wget.

EDIT: nevermind, we're not blocking curl or wget anymore.

I will tentatively close this task for now.

What about enforcing this user-agent policy in Beta Cluster (https://www.mediawiki.org/wiki/Beta_Cluster) as well so that developers of apps using OAuth can properly sign their requests in testing stage already?

Change #1190004 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] tls: ban default UAs with forge URLs

https://gerrit.wikimedia.org/r/1190004

Change #1190004 merged by Giuseppe Lavagetto:

[operations/puppet@production] tls: ban default UAs with forge URLs

https://gerrit.wikimedia.org/r/1190004

Change #1192210 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:cache::haproxy: exempt mediawiki.org and /keys from UA policy

https://gerrit.wikimedia.org/r/1192210

Change #1193708 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] Add parsoid ua to ua_wdqs

https://gerrit.wikimedia.org/r/1193708

Change #1193708 merged by Fabfur:

[operations/puppet@production] Add parsoid ua to ua_wdqs

https://gerrit.wikimedia.org/r/1193708

Change #1193782 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/puppet@production] Add ua_internals policy

https://gerrit.wikimedia.org/r/1193782

Change #1193782 merged by Fabfur:

[operations/puppet@production] haproxy: add ua_internals policy

https://gerrit.wikimedia.org/r/1193782

Could I request assistance in verifying that we've properly addressed the user-agent policy requirements for Diff's oAuth implementation? It now returns a user agent in the header.

https://github.com/wikimedia/mediawiki-oauth-client-wordpress-plugin/pull/15/files

This can be tested at the following URL (Use the "Login with Mediawiki" button)

https://blog-wikimedia-org-develop.go-vip.net/wp-login.php

The PR looks right to me; not sure if there's an easy way to verify the header is applied to all requests, other than removing the exemption and seeing if anything bad happens.

This seems to be blocking legitimate web captures by the Internet Archive (see, for example, here). Is this actually the intended outcome?

The Internet Archive is a reliable source for viewing old Wikipedia articles exactly as they appeared at the time of capture, in a way that the Revision History feature does not allow (for example, it does not show old revisions of transcluded templates). It is also used to record the history of some special pages that do not normally have their histories saved on Wikipedia — such as Special:Tags and Special:GadgetUsage.

I'm afraid that the complete blockage of Internet Archive captures of all wiki pages is unfortunate, and may cause other legitimate features and services that rely on such captures to cease functioning.

This seems to be blocking legitimate web captures by the Internet Archive (see, for example, here). Is this actually the intended outcome?

No, the Internet Archive should still be able to access our pages. Looks like they use UA strings that contain "archive.org_bot", e.g. Mozilla/5.0 (compatible; archive.org_bot +http://archive.org/details/archive.org_bot) Zeno/22cbc49 warc/v0.8.90. We could whitelist that, but perhaps we can whitelist them by IP range, which would be safer. The requests are coming from 207.241.225.0/20. I'm also seeing some from 66.249.84.0/23 (a Google proxy); not sure if those are legitimate.
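If an IP-range whitelist were used, the check is straightforward with Python's stdlib `ipaddress` module. The ranges below are the ones mentioned above; `strict=False` is needed because 207.241.225.0/20 has host bits set:

```python
import ipaddress

# Candidate IA source ranges from the comment above; strict=False
# normalizes 207.241.225.0/20 to its network address 207.241.224.0/20.
ARCHIVE_NETS = [
    ipaddress.ip_network(cidr, strict=False)
    for cidr in ("207.241.225.0/20", "66.249.84.0/23")
]

def is_archive_ip(addr):
    """True if `addr` falls inside one of the candidate IA source ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in ARCHIVE_NETS)
```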

(ftr, this is not something I can personally decide or do)

This seems to be blocking legitimate web captures by the Internet Archive (see, for example, here). Is this actually the intended outcome?

The Internet Archive is a reliable source for viewing old Wikipedia articles exactly as they appeared at the time of capture, in a way that the Revision History feature does not allow (for example, it does not show old revisions of transcluded templates). It is also used to record the history of some special pages that do not normally have their histories saved on Wikipedia — such as Special:Tags and Special:GadgetUsage.

I'm afraid that the complete blockage of Internet Archive captures of all wiki pages is unfortunate, and may cause other legitimate features and services that rely on such captures to cease functioning.

I'm not sure what happened with the one example you gave -- perhaps it was produced by an external tool and then uploaded to IA? -- but it is definitely not a complete blockage:
https://web.archive.org/web/20251121123039/http://en.wikipedia.org/wiki/Main_Page
https://web.archive.org/web/20251121123020/https://en.wikipedia.org/wiki/Special:Tags

Below is a graph of traffic seen from Internet Archive by our CDN in the past 30 days. It is broken down by HTTP response code. The large blue timeseries is status code 200.

[attachment: image.png, 427 KB — the traffic graph described above]

Thank you both.

I'm not sure what caused the issue in the example I mentioned in my previous comment (I provided only one, but I noticed a few more blocked captures from Nov 21-22).

Anyway, since subsequent IA captures were apparently successful, I think it's resolved for now. Thanks.

Change #1238751 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] gerrit: allow library-default UAs

https://gerrit.wikimedia.org/r/1238751

Change #1238751 merged by CDanis:

[operations/puppet@production] gerrit.wm.o: allow library-default UAs

https://gerrit.wikimedia.org/r/1238751