Use HTTP Cookie to legitimate BETA cluster access
Closed, Declined | Public | Feature

Description

Feature summary:

  • Make the BETA cluster beta.wmflabs.org available whenever a human being is apparently trying to access a page.
  • Use an HTTP cookie that is not obvious to external bots.

Use case(s):

  • Anybody attempting even to read a page on the BETA cluster, not only to edit one, may now be blocked as a defence against AI crawlers.
  • Bots, most probably sent to feed AI LLMs, are accessing pages at a tremendous rate, some 100,000 attempts per day per bot.
  • A quick and easy way to distinguish bots from humans is needed; otherwise the BETA cluster can no longer be used.

Benefits:

  • Everyone who develops things via BETA testing benefits.
  • Currently most of them are excluded.
  • Some days ago I learnt that Vodafone devices are blocked. There are 400 million contracts around the planet; this is ridiculous.
  • AI crawlers are bounced back by many websites because of their intrusive behaviour. They now use botnets, hiding behind regular end-user accounts with compromised devices.
  • See T420833 and T226688 etc.

Security through obscurity:

  • Naturally, this approach is not bullet-proof.
  • However, crawlers are trying to read from all websites around the planet. Their implementation is universal, not WMF-specific.
  • On BETA there are only a few meaningless test pages, used for Lua, JavaScript, CSS or advanced template development. It is not a large site and has no interesting content. No human crawler developer is likely to supply such cookies for this rare, particular domain.

Last remedy:

  • The common approaches have become useless. IP blocking hurts legitimate users more than it identifies bad guys.
  • The IP ranges used belong to regular general-purpose ISPs.
  • User agents pretend to be contemporary human browsers such as Firefox, Edge, Android etc.
  • There is no other way left to distinguish good from bad attempts.
  • Even if it works for only a couple of months, the first bot showing up with the cookie can be blocked again by changing the cookie name, or by changing the value to two or 42 or whatever.
  • Since no other defence is known and this approach does not take much implementation effort, it should be tried.

Cookie details:

name:    GoodFriend
value:   1
domain:  beta.wmflabs.org
path:    /
expires: Thu, 31 Dec 2026 23:59:59 GMT
  • In a first stage, only the existence of such a cookie may be tested.
  • If compromised, the value might be changed and the new value advertised quietly among developers.
  • Users might enter the cookie manually in browser forms.
  • Alternatively, they could run JavaScript in the page's console:
document.cookie = "GoodFriend=1; domain=beta.wmflabs.org; path=/; expires=Thu, 31 Dec 2026 23:59:59 GMT";
  • Note that the mw object is not available; therefore neither mw.loader.using() nor mediawiki.cookie can be used.
  • The expiry date is at the user's discretion.
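
The first-stage test described above (existence only, with an optional value check for a later stage) could be sketched on the server side roughly as follows. This is only an illustration under assumptions: the function names parseCookieHeader and passesCookieGate are hypothetical, and nothing here reflects how the Beta Cluster blocking is actually implemented.

```javascript
// Hypothetical helper: parse a Cookie request header into a Map of name -> value.
function parseCookieHeader(header) {
  const cookies = new Map();
  for (const part of (header || "").split(";")) {
    const eq = part.indexOf("=");
    if (eq === -1) continue; // skip fragments without "name=value" shape
    const name = part.slice(0, eq).trim();
    if (name) cookies.set(name, part.slice(eq + 1).trim());
  }
  return cookies;
}

// Hypothetical gate. With expectedValue === null only the existence of the
// cookie is tested (first stage); pass a string to also require that value,
// e.g. after the value has been rotated.
function passesCookieGate(header, name = "GoodFriend", expectedValue = null) {
  const cookies = parseCookieHeader(header);
  if (!cookies.has(name)) return false;
  return expectedValue === null || cookies.get(name) === expectedValue;
}
```

Rotating the secret then only means calling the gate with a different name or expectedValue; requests that still send the old cookie fail the check.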

Advertising:

  • The blocking message page mentions wikitech:Beta/Blocked. No hint should be given there, in case AI or human developers detect and follow that URL.
  • mw: or WP:BETA might communicate the technical details.
  • The enWP technical village pump, or a similar page at Commons, might give developers a tiny hint recommending that they look at this section.
  • User talk pages might receive a note via mass mail, based upon known BETA development accounts or bd808 requests.
  • As long as no human developer here tells the unknown bot maintainers, who have no feedback pages, it is unlikely that they will learn to provide that particular cookie.

Event Timeline

The canonical address used by Beta Cluster is beta.wmcloud.org since last year (T289318).

Setting cookies on .beta.wmcloud.org is not valid, in the same way that setting a cookie on .org or .co.uk is not valid. beta.wmcloud.org is included in the Public Suffix List (PSL) for security reasons and for consistency with production.

  • Open https://en.wikipedia.beta.wmcloud.org/
  • Run document.cookie = "GoodFriend=1; domain=beta.wmcloud.org; path=/; expires=Thu, 31 Dec 2026 23:59:59 GMT";
  • Reload
  • Observe in DevTools/Network that the request did not send a GoodFriend cookie, and in DevTools/Storage that there is no GoodFriend cookie.
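
The observation above follows from the public-suffix rule: a Domain attribute that names a public suffix is not accepted for a cookie set from a subdomain. A minimal sketch of that rule, assuming a tiny hand-written suffix set (real browsers consult the full Public Suffix List; the helper name and the sample set are illustrative only):

```javascript
// Illustrative subset only; browsers use the full Public Suffix List.
// "beta.wmcloud.org" is listed because the comment above states it is in the PSL.
const SAMPLE_PUBLIC_SUFFIXES = new Set(["org", "co.uk", "beta.wmcloud.org"]);

// Hypothetical helper: true if a cookie Domain attribute names a public suffix,
// ignoring case and an optional leading dot.
function isPublicSuffixDomain(domainAttr, suffixes = SAMPLE_PUBLIC_SUFFIXES) {
  const normalized = domainAttr.trim().toLowerCase().replace(/^\./, "");
  return suffixes.has(normalized);
}
```

Under this sketch, isPublicSuffixDomain("beta.wmcloud.org") is true, which matches the missing cookie in the steps above, while a family domain such as "wikipedia.beta.wmcloud.org" is not a suffix.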

We may also want to use Max-Age in the snippet, so that it doesn't need to be updated in documentation and is always valid for a year from when the snippet is run.

Something like the following should work, e.g. after opening https://en.wikipedia.beta.wmcloud.org/:

document.cookie = "GoodFriend=1; domain=.wikipedia.beta.wmcloud.org; path=/; Max-Age=31536000";

It would of course need to be set separately for each top-level family. You can also only do so while visiting that family, because browsers do not allow planting cookies on domains beyond the current or parent domain.
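
To avoid editing the snippet by hand for each family, the domain could be derived from the current hostname. The following is a rough sketch under the assumption that hostnames have the form <wiki>.<family>.beta.wmcloud.org; the helper names familyCookieDomain and goodFriendCookie are hypothetical.

```javascript
// Hypothetical helper: derive the per-family cookie domain from a hostname.
// Assumes hostnames of the form <wiki>.<family>.beta.wmcloud.org.
function familyCookieDomain(hostname, base = "beta.wmcloud.org") {
  const suffix = "." + base;
  if (!hostname.endsWith(suffix)) return null; // not a Beta Cluster host
  const labels = hostname.slice(0, -suffix.length).split(".");
  const family = labels[labels.length - 1]; // label directly left of the base
  if (!family) return null;
  return "." + family + suffix;
}

// Build the cookie string with Max-Age (one year by default), as suggested above.
function goodFriendCookie(hostname, maxAgeSeconds = 31536000) {
  const domain = familyCookieDomain(hostname);
  if (domain === null) return null;
  return `GoodFriend=1; domain=${domain}; path=/; Max-Age=${maxAgeSeconds}`;
}

// Usage in the browser console (only assigns when a domain could be derived):
//   const c = goodFriendCookie(location.hostname);
//   if (c) document.cookie = c;
```

The cookie still has to be set once per family, while visiting a wiki of that family.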

This does have the caveat of not covering upload.wikimedia.beta.wmcloud.org. Perhaps we can exclude that domain and always allow it on its own, because it doesn't involve expensive webserver/database resources. And, without access to a wiki page to discover image URLs, this is unlikely to get targeted directly?

The canonical address used by Beta Cluster is beta.wmcloud.org since last year (T289318).

Sorry, but I have had no access to BETA since March 2025. My regular environment is connected via a general ISP with more than 10 million customers in Europe and frequently changing dynamic IP ranges, all of them blocked.

  • Several weeks ago I succeeded in connecting my desktop environment via smartphone and slow Bluetooth, but after one week my mobile provider was blocked as well.
  • I gave up.
  • I discontinued both software testing and development one year ago.

I do hate IP blocking. It is an approach of the last century. In 1999 you knew that you were blocking a particular department of a certain university. Nowadays all mobiles and most desktop sites use universal providers, including privacy-protecting VPNs offered by browsers, security tools and the ISPs. It has become pointless to identify a bad guy via one constant IP address.

bd808 subscribed.

I agree that IP range blocks are a crude implement. I don't see any viable use of an undocumented cookie however as a partial workaround.

I also see no evidence that @PerfektesChaos has submitted an unblock request for any IP ever in Beta Cluster using the method documented at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Blocked_help#HTTP_403_Forbidden. I would encourage them to do so and let us attempt to provide relief using the current mechanisms available.

My ISP uses frequently changing dynamic ranges. I do not have one particular IP range that I could ask to have unblocked for an entire week.

Recently I learnt that Vodafone end users are blocked as well. In Germany they use a different IP range every hour; every time you renew a connection by resetting the router or switching into airplane mode, you get a new IP, which might be in the same range or in an entirely different world.

IPv4 addresses are scarce for general ISPs, which use a large pool of ranges leased by the company. These ranges are shifted even between countries. Because of the BETA blocking I kept track of my IP in 2025. An IP I used in Germany in April was reported by whois as located in Spain in September.

The concept of IP blocking assumes that one person has the same IP address, or at least keeps the same range, for days, weeks and months. That is a view of the last century; when assigning IPv4 addresses to large communities, ISPs are running out of free ones. They release the IPv4 address assigned to a connection and reuse it immediately for somebody else as soon as the connection is terminated. If you show up in the network again, you get a new IPv4 address from the pool of ranges currently available in your country or region. If they are running out of ranges in one region, the ISP will shift some ranges to another country served by this company.

There is a no-open-proxy policy at WMF. This has become pointless. Vodafone is one of the three largest open proxies in the world. They have 400 million customers. They all share the same ID, VodafoneInternational, but their IPv4 ranges move around the planet. This is the perfect open proxy. All 400 million share the same pool of ranges.