
Anubis Broke OpenRefine
Open, Needs TriagePublic8 Estimated Story Points

Description

We have user reports that Anubis also broke OpenRefine. Ideally we would add a user-agent exception for it; unfortunately, we do not know which user agent we would need to exempt. Instead we will try adding a blanket allow for MediaWiki's api.php and see if this resolves the issue. We must also confirm that this doesn't then expose us to a huge amount of traffic.

For adding content to the allowlist, see the relevant Anubis docs.

See the report on the OpenRefine GitHub.

A/C:

  • add an exception for (or strongly downweight) traffic to api.php
  • confirm OpenRefine now works
  • confirm we don't immediately see a huge spike in "bad" api.php traffic (e.g. within 12 hours of deploying)
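For orientation, an allowlist entry for such an exception might look roughly like the sketch below. This is an assumption about the shape of the policy, not a tested config: the entry name and the path regex are placeholders for this deployment, and the exact schema and file location should be checked against the current Anubis docs.

```yaml
# Hypothetical fragment of Anubis's botPolicies.yaml
bots:
  - name: allow-mediawiki-api    # assumed name, purely descriptive
    path_regex: ^/w/api\.php     # assumed MediaWiki API path for this deployment
    action: ALLOW                # skip the proof-of-work challenge entirely
```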

User requests:
https://desk.wikimedia.de/agent/wikimediatickets/swe-wikibase/tickets/details/96405000137968524


Event Timeline

Just to clarify: nothing on the OpenRefine end is broken; Wikibase.cloud simply blocks OpenRefine's API access.

I'm just taking a very short look at this to try and get an idea of how easily we can work around this.

Unfortunately, it looks like identifying traffic from OpenRefine isn't trivial: it might use a number of different user agents, and there doesn't seem to be an exhaustive list/pattern we could add to an allowlist. I would also have expected API paths to get past Anubis without being issued a challenge.

I have checked our logs and I can see requests with a UA like OpenRefine-Wikibase-extension/3.9.5 [TRUNK] (https://openrefine.org) Wikidata-Toolkit/unknown making it past Anubis fine and reaching MediaWiki. I also don't see any instances of openrefine, OpenRefine, etc. in the Anubis logs, so I'm not sure right now what is being issued an Anubis challenge.

We need to investigate which requests are getting caught and how we might prevent them from being challenged. Unfortunately, the current log output we have from Anubis doesn't seem to contain the host, and from a quick look at the Anubis docs I couldn't see how to include it. That probably needs to be our first step.

I've also tried to decipher what other UAs it could be by doing quite a bit of log staring, but this is probably not worth continuing with blindly. The wiki (dance.wikibase.cloud) gets a load of random traffic (probably bots), so there's quite a bit to dig through.

@Abbe98, if you have any idea what UA (or what request path?) we might be accidentally blocking from OpenRefine, that would be super helpful. Sorry that this change is putting work on the OpenRefine crew, but we really felt we had no option but to do something to reduce the amount of bot traffic crushing our infrastructure :)

I was the original reporter on GitHub.
Here is the request from my locally installed OpenRefine on my Arch Linux machine.

There are two issues.

  1. a CORS issue, which I fixed with the CORS Everywhere extension
  2. Anubis is enabled on the API (which means, as you can see below, that it responds with HTML when asked for JSON, which is terrible)

request:

curl 'https://dance.wikibase.cloud/w/api.php?action=wbsearchentities&language=en&search=insta&type=property&format=json&origin=*&uselang=en' \
  --compressed \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:143.0) Gecko/20100101 Firefox/143.0' \
  -H 'Accept: application/json, text/javascript, */*; q=0.01' \
  -H 'Accept-Language: en-US,en;q=0.5' \
  -H 'Accept-Encoding: gzip, deflate, br, zstd' \
  -H 'Origin: http://127.0.0.1:3333' \
  -H 'Sec-GPC: 1' \
  -H 'Connection: keep-alive' \
  -H 'Referer: http://127.0.0.1:3333/' \
  -H 'Sec-Fetch-Dest: empty' \
  -H 'Sec-Fetch-Mode: cors' \
  -H 'Sec-Fetch-Site: cross-site' \
  -H 'TE: trailers'

response

<!doctype html><html lang="en"><head><title>Making sure you&#39;re not a bot!</title><link rel="stylesheet" href="/.within.website/x/xess/xess.min.css?cachebuster=v1.21.3"><meta name="viewport" content="width=device-width, initial-scale=1.0"><meta name="robots" content="noindex,nofollow"><style>
        body,
        html {
            height: 100%;
            display: flex;
            justify-content: center;
            align-items: center;
            margin-left: auto;
            margin-right: auto;
        }

        .centered-div {
            text-align: center;
        }

        #status {
            font-variant-numeric: tabular-nums;
        }

        #progress {
          display: none;
          width: 90%;
          width: min(20rem, 90%);
          height: 2rem;
          border-radius: 1rem;
          overflow: hidden;
          margin: 1rem 0 2rem;
					outline-offset: 2px;
					outline: #b16286 solid 4px;
				}

        .bar-inner {
            background-color: #b16286;
            height: 100%;
            width: 0;
            transition: width 0.25s ease-in;
        }
    	</style><script id="anubis_version" type="application/json">"v1.21.3"
</script><script id="anubis_challenge" type="application/json">{"rules":{"algorithm":"fast","difficulty":4,"report_as":4},"challenge":"3df2d0020110409e1622f670d653d948bf4b0ecafc83e76f299331e69d14bab16e701acb5b0e68c90d034a27906c66a3a54ada527a768129eb82debc4423d8b93d52125338ebc21e523c015fade1756e19046072bde04380e8238ef5f54e87ba6b56a456e7eff1f540d564fcc18732d35e7b6b005bdec403defb479988ea0055b871c169558af48a4bd8f28809159f2b7ed188f33bff8f9189402296c0ce0ebb3b27acb86458e33de69f069d98767f40bb19b263fc038ef52dff6905d7d2172f5e46b05ffc764e4c9d6342e0ee67a89e808621db3c502be603fb53cef16c67d9a5beb5751892344edda65691ab19ecd89332942713ec9eeccafe76ee1b5b4013"}
</script><script id="anubis_base_prefix" type="application/json">""
</script></head><body id="top"><main><h1 id="title" class="centered-div">Making sure you&#39;re not a bot!</h1><div class="centered-div"><img id="image" style="width:100%;max-width:256px;" src="/.within.website/x/cmd/anubis/static/img/pensive.webp?cacheBuster=v1.21.3"> <img style="display:none;" style="width:100%;max-width:256px;" src="/.within.website/x/cmd/anubis/static/img/happy.webp?cacheBuster=v1.21.3"><p id="status">Loading...</p><script async type="module" src="/.within.website/x/cmd/anubis/static/js/main.mjs?cacheBuster=v1.21.3"></script><div id="progress" role="progressbar" aria-labelledby="status"><div class="bar-inner"></div></div><details><summary>Why am I seeing this?</summary><p>You are seeing this because the administrator of this website has set up Anubis to protect the server against the scourge of AI companies aggressively scraping websites. This can and does cause downtime for the websites, which makes their resources inaccessible for everyone.</p><p>Anubis is a compromise. Anubis uses a Proof-of-Work scheme in the vein of Hashcash, a proposed proof-of-work scheme for reducing email spam. The idea is that at individual scales the additional load is ignorable, but at mass scraper levels it adds up and makes scraping much more expensive.</p><p>Ultimately, this is a hack whose real purpose is to give a &#34;good enough&#34; placeholder solution so that more time can be spent on fingerprinting and identifying headless browsers (EG: via how they do font rendering) so that the challenge proof of work page doesn&#39;t need to be presented to users that are much more likely to be legitimate.</p><p>Please note that Anubis requires the use of modern JavaScript features that plugins like JShelter will disable. Please disable JShelter or other such plugins for this domain.</p><p>This website is running Anubis version <code>v1.21.3</code>.</p></details><noscript><p>Sadly, you must enable JavaScript to get past this challenge. 
This is required because AI companies have changed the social contract around how website hosting works. A no-JS solution is a work-in-progress.</p></noscript><div id="testarea"></div></div><footer><div class="centered-div"><p>Protected by <a href="https://github.com/TecharoHQ/anubis">Anubis</a> From <a href="https://techaro.lol">Techaro</a>. Made with ❤️ in 🇨🇦.</p><p>Mascot design by <a href="https://bsky.app/profile/celphase.bsky.social">CELPHASE</a>.</p></div></footer></main></body></html>

As someone noted in the GitHub issue linked above, I suggest excluding API URLs from Anubis and using other means to protect the resources, e.g. rate-limiting and blocking people hammering the API endpoints.

Thanks for the update and for the extra information!

> As someone noted in the GitHub issue linked above, I suggest excluding API URLs from Anubis and using other means to protect the resources, e.g. rate-limiting and blocking people hammering the API endpoints.

From what we observed in the past, rate-limiting wasn't sufficient to protect us from this scraping, though that scraping mostly targeted non-api.php endpoints. The issue was always that we'd be targeted from a very distributed range of IPs.

However, I'd generally agree with the principle that we probably shouldn't reply to API requests with this Anubis HTML blob.

I think it's important to note that this problem is the result of two concurrent issues, though. Anubis is likely offering a challenge because the user agent looks like a browser rather than an API client. To give a concrete example: if I copy-paste the curl above but switch to a meaningful, not-pretending-to-be-a-browser UA (e.g. TArrow Curl Client), then no challenge is served and you just get the data:

curl 'https://dance.wikibase.cloud/w/api.php?action=wbsearchentities&language=en&search=insta&type=property&format=json&origin=*&uselang=en' \
  --compressed \
  -H 'User-Agent: TArrow Curl Client' \
  -H 'Accept: application/json, text/javascript, */*; q=0.01' \
  -H 'Accept-Language: en-US,en;q=0.5' \
  -H 'Accept-Encoding: gzip, deflate, br, zstd' \
  -H 'Origin: http://127.0.0.1:3333' \
  -H 'Sec-GPC: 1' \
  -H 'Connection: keep-alive' \
  -H 'Referer: http://127.0.0.1:3333/' \
  -H 'Sec-Fetch-Dest: empty' \
  -H 'Sec-Fetch-Mode: cors' \
  -H 'Sec-Fetch-Site: cross-site' \
  -H 'TE: trailers'
{"searchinfo":{"search":"insta"},"search":[{"id":"P1","title":"Property:P1","pageid":2,"display":{"label":{"value":"instance of","language":"en"},"description":{"value":"type to which this subject corresponds/belongs. Different from P2 (subclass of); for example: K2 is an instance of mountain; volcano is a subclass of mountain","language":"en"}},"repository":"local","url":"https://dance.wikibase.cloud/wiki/Property:P1","datatype":"wikibase-item","concepturi":"https://dance.wikibase.cloud/entity/P1","label":"instance of","description":"type to which this subject corresponds/belongs. Different from P2 (subclass of); for example: K2 is an instance of mountain; volcano is a subclass of mountain","match":{"type":"label","language":"en","text":"instance of"}}],"success":1}%

What's interesting to me is that I do see requests from a UA dance.cloud recon service (wikibaseopenrefine reconciliation service) that also happily get results with no Anubis challenge; I wonder if this really is the problematic request or if it is something else.

From the Wikibase.cloud end, we could always allow all traffic to api.php (which is already covered by some rate limits that bad actors have been getting around by using a huge pool of IPs).

From the client's end, they could also set a more meaningful UA and would then be unlikely to be challenged.

I think a possible next step would be to allow a temporary exception for api.php, but eventually remove it and encourage clients to present a better UA. While we don't currently require people to follow a UA policy similar to the WMF's, we probably should and will in the future. We'll then need to look at the mechanics of doing that; currently we don't have any adjustments to the policies.
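On the client side, sending a descriptive UA in the spirit of the WMF User-Agent policy is straightforward. A minimal sketch using Python's standard library; the tool name, version, URL, and contact address are illustrative placeholders, not real values:

```python
import urllib.request

# Illustrative descriptive User-Agent: tool name/version plus contact details,
# in the spirit of the WMF User-Agent policy (all values are placeholders).
UA = "MyWikibaseTool/1.0 (https://example.org/mywikibasetool; maintainer@example.org)"

def build_request(url: str, user_agent: str = UA) -> urllib.request.Request:
    """Build a GET request that carries the descriptive User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})
```

A client identifying itself like this looks nothing like a browser, so (per the curl experiment above) it would be unlikely to be served an Anubis challenge.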

Tarrow updated the task description.
Anton.Kokh set the point value for this task to 8.Nov 5 2025, 2:25 PM

I'm not sure if I'm having the same problem. My script is using Wikibase Integrator and was working a few weeks ago. This is my error message:

Exception has occurred: JSONDecodeError
Expecting value: line 1 column 1 (char 0)
simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

  File "/home/olea/git/crm4wb/ttl2wb.py", line 163, in <module>
    login_instance = wbi_login.Clientlogin(user=vars['MW_ADMIN_NAME'], password=vars['MW_ADMIN_PASS'])
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

For user-agent I've been using this:

wbi_config['USER_AGENT'] = 'ttl2wb.py/0.0 (https://gitlab.wikimedia.org/olea/wikibase-bootstrap)'

:-?
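A JSONDecodeError at line 1, column 1 is what you'd expect if the API returned HTML (such as the Anubis challenge page shown above) instead of JSON. A minimal, hypothetical guard that surfaces this failure mode more clearly; the function name and error message are illustrative, not part of any library:

```python
import json

def parse_api_response(body: str) -> dict:
    """Parse an api.php response body, raising a clearer error when an
    HTML page (e.g. an Anubis challenge) came back instead of JSON."""
    if body.lstrip()[:1] == "<":
        raise RuntimeError(
            "Got HTML instead of JSON -- possibly an Anubis challenge page; "
            "check which User-Agent the client is sending."
        )
    return json.loads(body)
```

Wrapping responses like this at least turns the opaque "Expecting value" error into something actionable.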

Thanks @Tarrow for working on this!

There are two points that confuse me:

  • why is Anubis enabled on an API endpoint at all? My understanding is that Anubis is designed to be enabled only on web pages, not API endpoints, since we cannot expect API clients to solve Anubis challenges. I imagine that the high traffic generated by LLM harvesters is primarily targeting web pages, not API endpoints (which are much harder to harvest without knowledge of the application and its API), meaning that API endpoints don't really need this sort of protection anyway. So (to me) it seems to be a mistake to enable Anubis on any API endpoint whatsoever.
  • rejecting requests on the basis of a browser-looking User-Agent: isn't that a problem for web applications which make cross-origin calls to the Wikibase API? The Wikibase API (at least as served on Wikibase.Cloud) includes an access-control-allow-origin: * header in its responses. I understand this as "it's okay to call this API from the frontend of other web apps, served under a different domain", which implies that the corresponding requests will carry the user agent of the browser rendering that app. I see that the WMF guidelines recommend using the Api-User-Agent header for that purpose, but it's unclear to me to what extent that's standard (would Anubis recognize it?).

In the interest of enabling people to write tools for Wikibase which work "reasonably out of the box" for any Wikibase, it would probably help if those guidelines were available from Wikibase itself instead of being specific to deployments like WMF or Wikibase.Cloud (but of course there is always going to be variability in the deployments seen in the wild…)

...

  • why is Anubis enabled on an API endpoint at all?

I think the brutally honest answer is that we enabled it everywhere because that was the lowest-effort thing to "try". As often happens, it appeared "good enough": we were delighted that most of the API clients we saw were not served challenges by Anubis, and we stopped thinking much more about it. Clearly we need to look at it some more and evaluate whether this blanket policy makes any sense at all. In addition to crazy (i.e. non-sensible) requests to pages, we did also see crazy requests to API endpoints, but we ought to evaluate the load and the pros and cons of that.

  • rejecting requests on the basis of a browser-looking User-Agent: isn't that a problem for web applications which make cross-origin calls to the Wikibase API? The Wikibase API (at least as served on Wikibase.Cloud) includes a access-control-allow-origin: * header in its responses. I understand this as "it's okay to call this API from the frontend of other.

Yes, this is a problem, and we've now had a few reports of it. I suspect that a very crude workaround would be to have the end user visit the Wikibase first and then have the web app make the cross-origin request with withCredentials.

In the interest of enabling people to write tools for Wikibase which work "reasonably out of the box" for any Wikibase, it would probably help if those guidelines were available from Wikibase itself instead of being specific to deployments like WMF or Wikibase.Cloud

Good idea; I think this could add some clarity, particularly around what user agents, rate limits, etc. we would like people to follow.