Wikimedia varnish rules no longer exempt all Cloud VPS/Toolforge IPs from rate limits (HTTP 429 response)
Closed, Resolved, Public

Description

The wikimedia_nets Varnish ACL includes the 10.68.0.0/16 range used by the old nova-network OpenStack region, but it does not include the new 172.16.0.0/21 range that is used by the "eqiad1-r" OpenStack region that now hosts the majority of Cloud VPS traffic. This is also starting to affect Toolforge tools as they migrate from the old Ubuntu Trusty grid to the new Debian Stretch grid.


Original report:

Kiwix runs scrapers for Wikimedia wikis on 5 instances in Wikimedia Cloud VPS:
https://tools.wmflabs.org/nagf/?project=mwoffliner

We have been doing so for years, but we recently migrated our (mostly manual) system to a fully automated one called the Zimfarm. The Zimfarm is based on Docker and basically schedules and runs Docker containers to scrape the wikis. These containers run an instance of mwoffliner, a dedicated MediaWiki scraper that talks to the MediaWiki API to retrieve the necessary information and content.

Unfortunately, and for an unknown reason, it seems that we now face a lot of HTTP 429 responses from the Wikimedia API. This basically stops us from scraping the wikis properly. I don't know why we have this problem now: on our side the software is the same, and I assume nothing has changed in the configuration on the Wikimedia side. Maybe this is due to the migration to the new VM OS and/or the new VPS datacenter we had to do in December?

Does someone have an explanation? A solution?

Our ticket: https://github.com/openzim/mwoffliner/issues/496

Event Timeline

Not sure this is VPS related.

On a different note, is it impossible to do this from the dumps?

Legoktm subscribed.

HTTP 429 is rate limiting... https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429

Since these are calls to api.php, I think the rate limit is controlled/imposed by the Traffic team.

herron triaged this task as Medium priority. Jan 10 2019, 10:24 PM

On a different note, is it impossible to do this from the dumps?

As I recall Kiwix needs the Parsoid output of the pages rather than the raw wikitext, so the current dumps won't help them.

Hi, I am the dev of Zimfarm (the system automating the scrape process). I can run the scraper at home successfully (I am in the Boston area), but the same command on VPS errors out with 429.

Could the change to coming from a 172 address have affected ratelimit whitelisting?

The 429 response is definitely a rate limit on the Wikimedia side. It is not obvious to me, looking at the upstream nodejs source, what rate limiting, if any, the application applies internally. There are a whole lot of variables to consider in why the same bulk scraping code would work when run from one network location and fail when run from another, absent any settings changes. It would be helpful if someone could instrument the code and determine the actual rate of page fetches that is failing. The code itself should actually expect and account for 429 responses. The correct response to a 429 result would be to pause and then return to scraping at a reduced velocity.
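For illustration only, here is a minimal sketch of that pattern in Python (mwoffliner itself is Node.js, so this is not the project's code; it is just an assumption-level example of pausing on 429, honoring Retry-After when present, and backing off before resuming at a reduced velocity):

import time

import requests


def fetch_with_backoff(url, params=None, max_retries=5, base_delay=2.0):
    """Fetch a URL, pausing and retrying at reduced velocity on HTTP 429."""
    delay = base_delay
    for attempt in range(max_retries):
        resp = requests.get(url, params=params)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp
        # Honor Retry-After if the server sent a numeric value,
        # otherwise fall back to exponential backoff.
        retry_after = resp.headers.get("Retry-After")
        try:
            wait = float(retry_after)
        except (TypeError, ValueError):
            wait = delay
        time.sleep(wait)
        delay *= 2
    raise RuntimeError("Still rate limited after %d attempts: %s" % (max_retries, url))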

Could the change to coming from a 172 address have affected ratelimit whitelisting?

Yes. The VCL code that performs the rate limiting is in modules/varnish/templates/text-frontend.inc.vcl.erb and includes a comment that "all WMF IPs (including labs)" are excluded from the cluster_fe_ratelimit test. This exclusion is done by checking for inclusion in the 'wikimedia_nets' ACL, which is defined in modules/varnish/templates/vcl/wikimedia-common.inc.vcl.erb. The CIDR ranges there are gathered from scope.lookupvar('::network::constants::all_networks_lo'), which is defined in modules/network/manifests/constants.pp as flatten([$all_networks, '127.0.0.0/8', '::1/128']). $all_networks is flatten([$external_networks, '10.0.0.0/8']). That's where the new eqiad1-r Cloud VPS range ('172.16.0.0/21') is missed.

Could the change to coming from a 172 address have affected ratelimit whitelisting?

Almost certainly. Which means the most prudent course of action is for the app to lower the rate of requests.

Yes. The VCL code that performs the rate limiting is in modules/varnish/templates/text-frontend.inc.vcl.erb and includes a comment that "all WMF IPs (including labs)" are excluded from the cluster_fe_ratelimit test. This exclusion is done by checking for inclusion in the 'wikimedia_nets' ACL, which is defined in modules/varnish/templates/vcl/wikimedia-common.inc.vcl.erb. The CIDR ranges there are gathered from scope.lookupvar('::network::constants::all_networks_lo'), which is defined in modules/network/manifests/constants.pp as flatten([$all_networks, '127.0.0.0/8', '::1/128']). $all_networks is flatten([$external_networks, '10.0.0.0/8']). That's where the new eqiad1-r Cloud VPS range ('172.16.0.0/21') is missed.

For what it's worth, this is being changed in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/475714/, which is related. It will not solve this issue (the actual solution is to respect the 429s); just noting it for completeness' sake.

Right. I'm not up to speed on where all related changes are, but from VCL's point of view its definition of wikimedia_nets was meant to include labs, whereas its nearly identical wikimedia_trust is meant to exclude labs.

Right. I'm not up to speed on where all related changes are, but from VCL's point of view its definition of wikimedia_nets was meant to include labs, whereas its nearly identical wikimedia_trust is meant to exclude labs.

If I have understood the end intent correctly, with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/475714/ this now happens, as the new variable is realm-based, allowing wikimedia_nets to include WMCS space when in WMCS (but not in production). wikimedia_trust now no longer includes WMCS space when in production, but does include it when in WMCS, which I assume is fine.

There's a confusing set of things going on here, and it's going to need fixups on both the network/data/data.yaml side and the VCL side. Just to recap the historical situation for clarity:

The network data's $external_networks had all WMF public spaces (prod and labs):

network::external:
- 91.198.174.0/24
- 208.80.152.0/22
- 2620:0:860::/46
- 198.35.26.0/23
- 185.15.56.0/22
- 2a02:ec80::/32
- 2001:df2:e500::/48
- 103.102.166.0/24

Then the network module also defined these constants derived from the above:

$all_networks = flatten([$external_networks, '10.0.0.0/8'])
$all_networks_lo = flatten([$all_networks, '127.0.0.0/8', '::1/128'])

Then the VCL defines two ACLs, wikimedia_nets and wikimedia_trust, based on the above:
(note that wikimedia_trust manually excludes the labs 10.68/16 subnet at the VCL level)

acl wikimedia_nets {
<% scope.lookupvar('::network::constants::all_networks_lo').each do |entry|
        subnet, mask = entry.split("/", 2)
-%>
        "<%= subnet %>"/<%= mask %>;
<% end -%>
}
acl wikimedia_trust {
<% scope.lookupvar('::network::constants::all_networks_lo').each do |entry|
        subnet, mask = entry.split("/", 2)
-%>
        "<%= subnet %>"/<%= mask %>;
<% end -%>
<% if @realm == "production" -%>
        ! "10.68.0.0"/16; # temporary hack, do not treat labs like production
<% end -%>
}

wikimedia_nets is used functionally for two things: as a list of networks that external 3rd parties aren't allowed to set X-F-F headers for (so they can't fake they're prod or labs), and as a list of places we don't ratelimit requests from (prod and labs sources, even for production caches). In the production realm, it has to include all the address spaces of both production and labs.

wikimedia_trust, on the other hand, is the set of server IPs in production that we trust to do trust-requiring things, and thus it shouldn't include labs space when evaluated in production. In the production realm, it should list only production address spaces (which it does an imperfect job of today: it excludes the 10.68/16 instance IPs, but not the labs public ranges like it probably should've).
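To make the gap concrete, here is a rough Python model (illustrative only, not the actual puppet/VCL data structures) showing that a client in the new eqiad1-r range matched none of the old wikimedia_nets inputs and therefore fell through to the frontend rate limit:

import ipaddress

# network::external plus the 10.0.0.0/8 and loopback ranges described above.
old_wikimedia_nets = [ipaddress.ip_network(n) for n in [
    "91.198.174.0/24", "208.80.152.0/22", "2620:0:860::/46",
    "198.35.26.0/23", "185.15.56.0/22", "2a02:ec80::/32",
    "2001:df2:e500::/48", "103.102.166.0/24",
    "10.0.0.0/8", "127.0.0.0/8", "::1/128",
]]

eqiad1_r = ipaddress.ip_network("172.16.0.0/21")  # new Cloud VPS instance range

covered = any(
    net.version == eqiad1_r.version and eqiad1_r.subnet_of(net)
    for net in old_wikimedia_nets
)
print(covered)  # False: eqiad1-r clients were not exempt from rate limiting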

I'm not sure I fully understand the technical explanation. Is the problem confirmed? If yes, what is the plan to solve it?

I'm not sure I fully understand the technical explanation. Is the problem confirmed? If yes, what is the plan to solve it?

Yes, we believe the issue has been confirmed. However, regardless of any other corrective action we might take, I would strongly advise amending the application to respect the 429/rate limit answers and implementing some backoff mechanism in mwoffliner in order not to violate the threshold.

bd808 renamed this task from Difficulties to create offline version of Wikipedia because of HTTP 429 response to Wikimedia varnish rules no longer exempt all Cloud VPS/Toolforge IPs from rate limits (HTTP 429 response). Jan 21 2019, 8:12 PM
bd808 updated the task description. (Show Details)
bd808 added subscribers: Nemo_bis, faidon, Cyberpower678, aborrero.

@bd808 just invited me here. Ever since the Cloud VPS migration, Cyberbot has been hitting that rate limit rather frequently, despite its query rates being nowhere near the allowed rate limit, unless I have a rogue process somewhere that I'm not seeing. This only tells me that whatever IP range I am sharing is being blanket blocked, blocking everything behind it.

Cyberbot is using ancient code and is due for an overhaul, so at this point reworking its current code just to idle and wait for an opportunity to edit, while everyone else is slamming the API (probably not even realizing they're contributing to the rate limit), seems a bit unreasonable. With that being said, Cyberbot does perform some valued tasks on Enwiki, and the new framework I am building for it will have more graceful error handling. Right now, however, these issues have caused my bots to suffer rather serious random malfunctions, including not even being able to respect their own run pages.

Cyberpower678 raised the priority of this task from Medium to High. Jan 21 2019, 8:25 PM

I'm also boldly raising the priority, as from what I gather I'm likely one of many, many of whom aren't even aware of what's happening and why.

Change 488445 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] WIP: Move evaluation of wikimedia_trust/nets to puppet

https://gerrit.wikimedia.org/r/488445

This patch doesn't seem to actually fix the issue; it just restructures some of the code, seemingly to make it easier to exempt Cloud VPS/Toolforge IPs?

Change 488516 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] varnish: Add new WMCS IP space as trusted

https://gerrit.wikimedia.org/r/488516

I've added the capability to the varnish puppet code to augment the wikimedia_trust and wikimedia_nets constructs, followed by a patch adding the new WMCS IP space to wikimedia_nets in order to exempt that IP space from rate limiting. @BBlack lemme know what you think.

Question, when will this patch go live?

Question, when will this patch go live?

It needs to go through some review in case the approach is wrong. I'll try to expedite that.

Change 488445 merged by Alexandros Kosiaris:
[operations/puppet@production] Move evaluation of wikimedia_trust/nets to puppet

https://gerrit.wikimedia.org/r/488445

Change 489999 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Revert "Revert "Move evaluation of wikimedia_trust/nets to puppet""

https://gerrit.wikimedia.org/r/489999

Change 489999 merged by Alexandros Kosiaris:
[operations/puppet@production] Revert "Revert "Move evaluation of wikimedia_trust/nets to puppet""

https://gerrit.wikimedia.org/r/489999

Change 488516 merged by Alexandros Kosiaris:
[operations/puppet@production] varnish: Add new WMCS IP space as trusted

https://gerrit.wikimedia.org/r/488516

Change has been deployed across the fleet. The WMCS IP space 172.16.0.0/12 should now be exempt from the rate limiting rules. @Cyberpower678, @Kelson could you please confirm?
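As a quick illustrative sanity check (not part of the deployed change), the exempted 172.16.0.0/12 does cover the eqiad1-r instance range from the task description:

import ipaddress

wmcs_space = ipaddress.ip_network("172.16.0.0/12")  # newly exempted WMCS space
eqiad1_r = ipaddress.ip_network("172.16.0.0/21")    # eqiad1-r instance range
print(eqiad1_r.subnet_of(wmcs_space))  # True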

Note that when T209011 is concluded, the exemption will have to be amended to contain the public IP space for WMCS.

Not hitting Varnish error messages anymore. Cyberbot's operation also appears to be stable now.

@akosiaris Problem seems to be fixed from our end. Thank you very much.

bd808 assigned this task to akosiaris.