
429 Too Many Requests hit despite throttling to 100 req/sec
Open · Normal · Public

Description

Sometime recently (I want to say it's recent), https://tools.wmflabs.org/massviews sometimes gets 429 responses from the pageviews API. Each request is separated by 10ms, which should mean it would never exceed the 100 req/sec limit, as indicated at https://wikimedia.org/api/rest_v1/#!/Pageviews_data/get_metrics_pageviews_per_article_project_access_agent_article_granularity_start_end.

Did something change recently? Or perhaps I'm doing something wrong?

For reference, here is the code I use to add rate-limiting: https://github.com/MusikAnimal/pageviews/blob/b90732a6e3329b3caaf89337237463c21dc5ec00/javascripts/shared/pv.js#L1369-L1402. fn here would be the promise to actually make the request to the API.
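For illustration, here is a minimal sketch of that approach (not the actual pv.js code; the names are placeholders): start at most one request every 10ms, regardless of how long each individual request takes, so the client never exceeds ~100 req/sec.

```
// Sketch only: wrap a promise-returning function so at most one call
// starts per `intervalMs`, with in-flight requests allowed to overlap.
function rateLimit(fn, intervalMs) {
  const queue = [];
  let timer = null;

  const drain = () => {
    if (!queue.length) {
      clearInterval(timer);
      timer = null;
      return;
    }
    const { args, resolve, reject } = queue.shift();
    fn(...args).then(resolve, reject);
  };

  return (...args) => new Promise((resolve, reject) => {
    queue.push({ args, resolve, reject });
    if (!timer) {
      drain();                               // fire the first request immediately
      timer = setInterval(drain, intervalMs); // then one per interval
    }
  });
}

// Usage sketch: 10ms spacing ≈ 100 req/sec.
const throttledGet = rateLimit(url => fetch(url).then(res => res.json()), 10);
```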

Event Timeline

Restricted Application added a project: Analytics. · Apr 2 2019, 8:57 AM
Restricted Application added a subscriber: Aklapper.
fdans added a subscriber: fdans. · Apr 4 2019, 5:16 PM

@MusikAnimal is this report coming from users getting 429? Are you getting the errors yourself?

fdans triaged this task as Normal priority. · Apr 4 2019, 5:17 PM
fdans moved this task from Incoming to Operational Excellence on the Analytics board.

@MusikAnimal is this report coming from users getting 429? Are you getting the errors yourself?

Yes, it was reported at meta:Talk:Pageviews Analysis. The user only sees "Error occurred when querying Pageviews API - Unknown". That error is not at all uncommon when you give Massviews a large set of pages, but whenever I checked in the past it was always the 404 gotcha: the pages were all obscure ones that evidently hadn't been viewed since the pageviews API was introduced (so a 404 in that case means 0 pageviews). When I investigated the aforementioned report, I saw that for some pages the response was 429. The reporter says that when they try a second time, the pageviews for those pages are fetched successfully, which makes sense when reading the gotcha for 429:

429 throttling
Client has made too many requests and it is being throttled. This will happen if the storage cannot keep up with the request ratio from a given IP. Throttling is enforced at the storage layer, meaning that if you request data we have in cache (cause other client has requested it earlier) there is no throttling. Throttling will be enabled late May 2016.

So on the second try, the API can serve from cache for most of the pages. It only has to pull from storage for the pages that got a 429 on the first run.

The problem here is that I am conforming to the 100 req/sec throttling, which in theory means we shouldn't get 429s in the first place.

Nuria added a subscriber: Nuria. · Apr 8 2019, 4:50 PM

So on the second try, the API can serve from cache for most of the pages. It only has to pull from storage for the pages that got a 429 on the first run.

This is not a correct assumption: requests that received a 429 are not fetched from storage; rather, they are rejected before being processed.

A second run (likely) succeeds because "some" requests were processed on the first run (they did not get 429s) and are thus now cached, so the request batch that hits storage is smaller for the same client-side request. Makes sense?

I am conforming to the 100 req/sec throttling

Throttling happens per IP. A user with two tabs open in your case can send 100 reqs per sec per tab, correct? If so, while it is good that some rate-limiting code exists in the tool, it is easy to bypass.

So on the second try, the API can serve from cache for most of the pages. It only has to pull from storage for the pages that got a 429 on the first run.

This is not a correct assumption: requests that received a 429 are not fetched from storage; rather, they are rejected before being processed.
A second run (likely) succeeds because "some" requests were processed on the first run (they did not get 429s) and are thus now cached, so the request batch that hits storage is smaller for the same client-side request. Makes sense?

Yes, that is what I meant. On the second try we only pull from storage for pages that received a 429 on the first run. The remaining pages are cached, as you say.

I am conforming to the 100 req/sec throttling

Throttling happens per IP. A user with two tabs open in your case can send 100 reqs per sec per tab, correct? If so, while it is good that some rate-limiting code exists in the tool, it is easy to bypass.

I can't speak for the user who reported the error, but in my testing I was only using one tab. The example I used for testing was https://tools.wmflabs.org/massviews/?platform=all-access&agent=user&source=category&target=https%3A%2F%2Fsv.wikipedia.org%2Fwiki%2FKategori%3ANaturreservat_i_Sverige&start=2018-10-01&end=2019-03-31&subjectpage=0&subcategories=1&sort=views&direction=1&view=list (check the network log in the developer console). You may or may not actually get 429s; if you don't, I suppose you could wait however long it takes for the cache to expire and try again. There will probably be some 404s in there too (zero pageviews).

MusikAnimal updated the task description. · Jul 9 2019, 3:24 AM

@Nuria @fdans Now I see "HyperSwitch request rate limit exceeded" (before it was 429s without a message), despite making no more than the maximum 100 req/sec. This starts happening only after so many thousand requests are made in succession. It seems like it's some sort of DDoS prevention, because every request returns 429, when at least some should go through if it was only enforcing the 100 req/sec limit. In the case of Massviews, we could be querying for up to 20,000 pages, which comes out to about 3.3 minutes straight of making requests in 10ms intervals.

The issue has been going on since sometime in early 2019, perhaps earlier. Before then Massviews was able to run without any errors (apart from 404s).

How can I avoid the 429s?

Nuria added a comment. · Jul 10 2019, 5:55 AM

when at least some should go through if it was only enforcing the 100 req/sec limit.

Let's see: rate limiting is enforced per IP for the public APIs. Once you go over the limit of what we think is sustainable, your IP will be throttled for a bit (enforcement does not stop the moment you stop making connections, but a bit after), so there is no guarantee that 100 of your connections per second will make it through once you have gone above that limit. These (to be clear) are connections from the browser, correct?

This starts happening only after so many thousand requests are made in succession.

Right, the volume of requests hitting the server must be >100 reqs per sec; that can happen even if your client is sending requests close to that limit but a bit below it.

I think this is the right code, please see: https://github.com/wikimedia/limitation
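For intuition, the behaviour described above is roughly what a per-IP token bucket produces. The sketch below is purely a generic illustration of that shape, not the actual wikimedia/limitation implementation: once the bucket is drained, every request is rejected until enough tokens accumulate again, which is consistent with seeing nothing but 429s for a stretch after exceeding the sustained rate.

```
// Illustrative only; NOT the wikimedia/limitation code. One bucket per client IP.
class TokenBucket {
  constructor(ratePerSec, burst) {
    this.rate = ratePerSec; // tokens refilled per second, e.g. 100
    this.burst = burst;     // maximum tokens that can accumulate
    this.tokens = burst;
    this.last = Date.now();
  }

  // Returns true if the request is allowed, false if it would get a 429.
  allow() {
    const now = Date.now();
    const elapsedSec = (now - this.last) / 1000;
    this.tokens = Math.min(this.burst, this.tokens + elapsedSec * this.rate);
    this.last = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false; // bucket empty: keep rejecting until it refills
  }
}
```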

Thanks. What I might do is pause for a bit when I hit the first 429, then resume making requests. Or I could just increase the timeout between requests. Both seem like hacky, sub-par solutions; I can try to dig through wikimedia/limitation to see what the exact logic is and go by it, to ensure my tool goes as fast as it can. Massviews is used for GLAM, outreach, etc., where by nature there will be thousands of pages to look up.

These (to be clear) are connections from the browser, correct?

Yes, all from the browser.

I've been hitting this problem consistently with the Massviews tool (only using 1 tab). I wonder if slightly tweaking the 10ms pause would fix it. Maybe we could try changing it to 12ms and see if that makes the difference, as currently we're surfing right on the edge of the throttle.

Nuria added a comment. · Jul 23 2019, 9:04 PM

Given that for links like the one above (see a couple of comments up) this tool makes 5000 requests from one tab (see the network panel in Chrome), it is unlikely to work even if you "space" requests a bit more. To work best, the tool needs an entirely different API that is, say, category-based rather than page-based.

In the absence of an API more tailored to your use case, you can manage the queue of requests. For example, you can send N requests (the browser will multiplex) and, when the first one gets a 429, stop, message the user in the UI, and continue some time after. The user will then get data in stages.

you can send N requests (the browser will multiplex) and, when the first one gets a 429, stop, message the user in the UI, and continue some time after. The user will then get data in stages

Yeah, that's basically my idea; I'm going to implement a retry handler that pauses after each 429, ensuring every page is accounted for. This is what we recently had to do for the Popular Pages bot.
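Roughly what I have in mind is something like the sketch below (names are placeholders, not the eventual Massviews code): on a 429, wait (honouring the Retry-After header when present) and retry the same request, so every page is eventually fetched.

```
// Sketch of a retry handler for 429s, assuming a plain fetch-based request.
async function fetchWithRetry(url, maxRetries = 5, defaultWaitMs = 1000) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await fetch(url);
    if (res.status !== 429) {
      return res; // success, a 404 (zero pageviews), or another error handled upstream
    }
    // Some 429s include a Retry-After header (in seconds); fall back to a fixed pause.
    const retryAfter = res.headers.get('Retry-After');
    const waitMs = retryAfter ? parseFloat(retryAfter) * 1000 : defaultWaitMs;
    await new Promise(resolve => setTimeout(resolve, waitMs));
  }
  throw new Error(`Still throttled after ${maxRetries} retries: ${url}`);
}
```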

However, I'd like to reiterate that this wasn't an issue some months ago. For however many years it's been running, we were able to query at 100 req/sec without worry. I still think there might be some issue with the API's throttling logic, because again we are abiding by the advertised rate limit but are still getting 429s.

Nuria added a comment. · Jul 23 2019, 9:20 PM

because again we are abiding by the advertised rate limit but are still getting 429s.

Ok, maybe we need to look at this a bit more, but in any case the best way to approach this massive number of requests is in stages.

This can be tricky to diagnose because we don't really know what, if any, upstream changes were made to HyperSwitch. Do you have a more accurate idea of when you started seeing this? Was it when you made the task, at the beginning of April this year?

MusikAnimal added a comment. (Edited) · Jul 25 2019, 10:23 PM

This can be tricky to diagnose because we don't really know what, if any, upstream changes were made to HyperSwitch. Do you have a more accurate idea of when you started seeing this? Was it when you made the task, at the beginning of April this year?

Probably April, or at least sometime in 2019... I'm not sure :( This situation has certainly grown worse in recent months.

I'll note that most of the 429s have an empty response body, with a Retry-After header of 1 second. More recently (late June / early July), the 429s *sometimes* have a JSON response with the error message "HyperSwitch request rate limit exceeded", and no Retry-After header. So it seems like there are two things at play here. I'm just fairly certain there weren't 429s at all for most of Massviews' life. When I developed it, I made sure it ran at 100 req/sec, and I only ever saw 404s (meaning zero pageviews).

Hopefully this is helpful. If it means anything, the Popular Pages bot for instance (which also does mass querying) still goes impressively fast despite having to pause for the 429s. Massviews just doesn't have the same kind of retry handler implemented, which I'm going to add. When I do, I suspect it will be of satisfactory speed for the users. My point with this task is that those 429s weren't a problem (or as much of a problem) before with the current, long-standing Massviews implementation, and we're not exceeding 100 req/sec.

Thanks for looking into it!

So, looked into code history more carefully. There's literally one code change in AQS in 2019, and it doesn't touch pageviews handling at all. npm saw fit to update some of the repository references for kad, swagger-ui, and json-stable-stringify. I suppose we could look into those, but that would be pretty crazy bad luck. I think the logical next place to look is the layer in front of AQS; the problem is 99% likely to be from there. Pinging @Pchelolo to see if this sounds familiar. Petr, basically we're seeing a lot more 429s since around April 2019, and we see two different kinds:

  • an empty response body with a Retry-After header of 1 second
  • a JSON response with the error message "HyperSwitch request rate limit exceeded" and no Retry-After header

Searching the hyperswitch repo shows this has been in place for 3 years. Any idea what changed around it? It's possible it's behaving as designed; I'm just trying to understand exactly what's going on so we can maybe set expectations.

NOTE: I'm getting this weird deja-vu feeling like I bothered Petr about this before, sorry if I forgot something obvious.