
Find a way for pywikibot GitHub Actions to avoid IP range blocks of Microsoft Azure hosted runners
Open, Needs Triage, Public, Feature

Description

Pywikibot runs a relatively large set of post-merge tests via GitHub Actions. Some of these tests use Beta Cluster wikis as their target for end-to-end testing of various features.

The efforts to exclude unwanted bots from T393487: 2025 tracking task for Beta Cluster (deployment-prep) traffic overload protection (blocking unwanted crawlers) have recently blocked some of the IP addresses used by GitHub Actions. GitHub Actions uses Microsoft Azure to host many (all?) of its runners. There are over 5000 (!) IP ranges listed at https://api.github.com/meta that GitHub Actions might make requests from.

Some potential options:

  • Allowlist 5000+ CIDR ranges and keep that list updated.
  • Set up self-hosted GitHub runners for use by https://github.com/wikimedia/ organization projects.
  • Add a SOCKS5 proxy to the appropriate test suites to tunnel traffic to an exit that is unlikely to be blocked.
  • Migrate all of these tests to a CI platform that is "in-house" (Zuul or GitLab CI) and unlikely to be blocked.


Event Timeline

I guess my first question is if these tests could run from Wikimedia infrastructure rather than GitHub Actions.

We could probably use self-hosted runners on WMF infrastructure:

I am not able to set it up myself, I guess, but I am willing to support as much as I can if this is an appropriate solution. Background: previously we ran these tests on Travis and AppVeyor. Both test matrices were ported to GitHub Actions due to T296371 and T368192. Unlike the Jenkins CI tests (which use en-wiki and an IP user only), these tests use a wider variety of sites, Python releases, operating systems, and test users, and passing them helps verify that the code is ready to be published as the next stable release. There are 128 jobs running on GitHub.

Porting these tests to Jenkins looks much more difficult to me, and I have no idea whether or how this would be possible.

The fundamental challenge today is that we only have IP-range-based blocking set up for the Beta Cluster, without any currently documented way to bypass a range block via a request header, authentication, etc.

I found that tests take about five times as long running on Beta as before, or as on other sites (when we were lucky and the runner's IP was not blocked), and I understand the measure. But it is cumbersome to restart the failing jobs every time in the hope of reaching an unblocked IP. Blocking IPs cannot be a long-term solution, and you also have to ask what to do if sites other than Beta are affected. So there should be some bypass mechanism for trusted CI traffic, via headers, tokens, or maxlag-style throttling. But you know that better than I do.

I guess my first question is if these tests could run from Wikimedia infrastructure rather than GitHub Actions.

We could probably use self-hosted runners on WMF infrastructure:

That might be possible. One of the challenges would be finding folks to monitor and keep these runners working. This is probably not an impossible challenge, but it won't be trivial either.

Porting these tests to Jenkins looks much more difficult to me, and I have no idea whether or how this would be possible.

Moving to tests run by Zuul + Jenkins would probably be possible, but also awkward at the moment. The Continuous-Integration-Infrastructure (Zuul upgrade) project is working towards changing a lot of things in that CI pipeline, so the work would likely need an initial implementation and then a follow-up project to move from tests described with Jenkins Job Builder to the Ansible replacement.

Yet another option might be figuring out how to mirror the pywikibot code to gitlab.wikimedia.org and then using the self-service CI pipelines there to run your tests. We currently have both locally hosted and externally hosted gitlab-runners. We do not, however, have Windows or macOS runners, which at least a few of the pywikibot GitHub Actions use.

Blocking IPs cannot be a long-term solution, and you also have to ask what to do if sites other than Beta are affected. So there should be some bypass mechanism for trusted CI traffic, via headers, tokens, or maxlag-style throttling. But you know that better than I do.

IP blocking is likely here to stay. We are fundamentally facing the same problem as production wikis trying to block LTA-type vandals. The compounding issue here is that it is not just edit traffic that is causing us problems, but read traffic as well. The production wikis are having the same core problem with aggressive scraper bots (https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-the-operations-of-the-wikimedia-projects/), but they are tended by more people and are also getting a focused project to add more automated traffic management. Unfortunately, I doubt much of that work will be applicable to the Beta Cluster wikis due to staffing and technology constraints.

The proxy idea started with @thcipriani while discussing the self-hosted runner concept:

[16:55]  <thcipriani> I think keeping it alive shouldn't be too bad, but it'll be yet another thing to keep up-to-date for a strange one-off. I wonder if there's some kind of opensource cloudflare tunnel kinda thing that would be simpler here. Something to make the request look like it's coming from inside the house when folks hit a specific url. That would be less surface area than a github runner.
[16:57]  <thcipriani> maybe not tho
[17:02]  <    bd808> In theory a SOCKS5 proxy or similar would be possible. that might even be something that there is an existing Action for.
[17:04]  <    bd808> https://github.com/marketplace/actions/ssh-socks-action

This is probably the quickest option to test as a possible solution. It would basically need a new Developer account to act as the service account for the ssh access, an ssh keypair, and the new Developer account being added to the Cloud VPS bastion project to give it a place to terminate the SOCKS5 tunnel.

Proxy usage sounds promising. It is already supported by the requests package:
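For example, a sketch of how requests would route traffic through the tunnel (assuming requests[socks] is installed and a tunnel is already listening on a hypothetical local port 1080):

```python
# Hypothetical local tunnel endpoint; "socks5h" (note the trailing "h")
# makes DNS resolution happen on the proxy side of the tunnel.
PROXIES = {
    "http": "socks5h://127.0.0.1:1080",
    "https": "socks5h://127.0.0.1:1080",
}

def fetch(url):
    """Fetch a URL through the SOCKS5 tunnel (needs requests[socks])."""
    import requests  # deferred so the tunnel config can be inspected without it
    return requests.get(url, proxies=PROXIES, timeout=30)
```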

Nice. The steps to test this out then are probably something like:

  • @Xqt makes a Developer account specifically to act as the credential holder that can build an ssh SOCKS5 tunnel from GitHub Actions to bastion.wmcloud.org.
  • @Xqt adds an ssh public key to that new Developer account and keeps track of the associated private key for the GitHub Actions configuration.
  • @Xqt asks @bd808 to make the new Developer account a member of the bastion project so it can ssh in.
  • @bd808 does the needful
  • @Xqt figures out how to add configuration to the GitHub Actions to establish an ssh tunnel from the Action runner to bastion.wmcloud.org. A pure CLI way to do this would be something like ssh -o StrictHostKeyChecking=accept-new -f -N -D 127.0.0.1:1080 -i $PRIVATE_KEY_FILE $USER@bastion.wmcloud.org
    • -o StrictHostKeyChecking=accept-new: Accept offered host key for any host not already in the known hosts file
    • -f: Background ssh process after connecting
    • -N: Do not exec a remote command
    • -D 127.0.0.1:1080: Create a SOCKS5 proxy listening on 127.0.0.1:1080 and terminated on the ssh connected host
    • -i $PRIVATE_KEY_FILE: Use the private key in $PRIVATE_KEY_FILE
  • @Xqt adds the needed equivalent of export HTTPS_PROXY="socks5h://127.0.0.1:1080" to the GitHub Actions to tell requests to proxy traffic through the tunnel and do DNS resolution on the proxy termination side so that the internal network IPs are contacted when traffic flows over the tunnel. There are some weird things that might happen if the DNS is done outside the Cloud VPS network. Public IPv4 addresses in Cloud VPS work in ways that are sometimes confusing.
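The tunnel setup from the steps above could be sketched like this (hypothetical account name and key path; in practice a GitHub Actions step would likely run the ssh command directly):

```python
import os
import subprocess

def ssh_tunnel_command(key_file, user, host="bastion.wmcloud.org", port=1080):
    """Build the ssh invocation for a backgrounded SOCKS5 tunnel."""
    return [
        "ssh",
        "-o", "StrictHostKeyChecking=accept-new",  # accept unknown host keys on first use
        "-f",                                      # background after authenticating
        "-N",                                      # no remote command, tunnel only
        "-D", f"127.0.0.1:{port}",                 # local SOCKS5 listener
        "-i", key_file,                            # service account's private key
        f"{user}@{host}",
    ]

def open_tunnel(key_file, user, port=1080):
    """Start the tunnel and point requests at it via the environment."""
    subprocess.run(ssh_tunnel_command(key_file, user, port=port), check=True)
    # "socks5h" makes requests resolve DNS on the bastion side of the tunnel
    os.environ["HTTPS_PROXY"] = f"socks5h://127.0.0.1:{port}"
```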

Safely storing and using the ssh private key from GitHub Actions is something that @Xqt should research as part of this too.

Ah, this is not an HTTP/HTTPS proxy but SOCKS5, and requests needs requests[socks], which is PySocks. I hope this still works, because that package has been unmaintained for 8 years.
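A quick sanity check that the SOCKS dependency is actually importable (PySocks installs as the socks module):

```python
import importlib.util

# requests[socks] pulls in PySocks, which installs as the "socks" module;
# without it, requests cannot handle socks5h:// proxy URLs.
has_socks = importlib.util.find_spec("socks") is not None
print("SOCKS support available:", has_socks)
```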

@Xqt I see passing tests upstream marked as using "wpbeta" (e.g. https://github.com/wikimedia/pywikibot/actions/runs/18721896853/job/53396228584). Does this mean that things are working now? If so, can this task be updated with info about what y'all had to change, and resolved?

@bd808 Some of those failing tests are marked to be skipped (1) due to T399367, but code coverage obviously shows that they pass (2). So yes, it is working again now. I think this issue can be closed.