Description
Users are seeing 429 errors for ~9 partners' logos. Presumably this is due to a large number of remote clients being lumped together behind NAT. An increase in the limit for wikipedialibrary.wmflabs.org/media/.* should have a minimal impact on our site performance & reliability.
Event Timeline
Can you open up one of those error URLs and screenshot the response so we can see the error page?
Can you copy the response headers for that request out of dev tools and post them? Clean out any session data first!
Could you clarify what you mean by this? The URLs for the individual logos (e.g. https://wikipedialibrary.wmflabs.org/media/Screenshot_2021-07-17_at_21-49-27_eLibrary_Klett-Cotta_Verlag.png) work fine, they just don't load at https://wikipedialibrary.wmflabs.org/users/my_library/.
> Can you copy the response headers for that request out of dev tools and post them? Clean out any session data first!
How do I do this? :)
Here's a quick how-to for getting to them:
https://code2care.org/howto/see-http-request-response-headers-google-chrome
You should be able to copy the info out from the headers panel.
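If dev tools are fiddly, an alternative (my suggestion, not part of the how-to above) is to fetch just the response headers from the command line. Note this won't carry your session cookies, so it may not reproduce exactly what a logged-in browser sees:

```
# Do a normal GET but print only the response headers:
# -s silences the progress meter, -D - writes the headers to stdout,
# -o /dev/null discards the body.
curl -s -D - -o /dev/null \
  'https://wikipedialibrary.wmflabs.org/media/Screenshot_2021-07-17_at_21-49-27_eLibrary_Klett-Cotta_Verlag.png'
```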
> Could you clarify what you mean by this? The URLs for the individual logos (e.g. https://wikipedialibrary.wmflabs.org/media/Screenshot_2021-07-17_at_21-49-27_eLibrary_Klett-Cotta_Verlag.png) work fine, they just don't load at https://wikipedialibrary.wmflabs.org/users/my_library/.
You have the right idea, but it sounds like it's not going to work.
content-length: 571
content-type: text/html
date: Wed, 16 Nov 2022 13:41:02 GMT
server: nginx/1.18.0
Worth noting that when I opened an incognito window to get this the error only occurred after I refreshed the page a few times.
Helpful! This verifies that we need to adjust our nginx rate limiter. This should be a very straightforward fix.
Now that Cloud VPS does its own rate limiting, the nginx rate limiter is completely redundant. The solution here is to stop rate limiting.
https://github.com/WikipediaLibrary/TWLight/pull/1160
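For anyone reading along later, a rough sketch of what "stop rate limiting" amounts to in nginx terms; this is illustrative only, with hypothetical zone names and paths, and is not the contents of that PR:

```
# Hypothetical sketch, not taken from PR 1160: disabling nginx's own limiter
# just means removing the limit_req_zone / limit_req directives.

# limit_req_zone $binary_remote_addr zone=twlight:10m rate=10r/s;   # removed

location /media/ {
    # limit_req zone=twlight burst=20 nodelay;   # removed: Cloud VPS now rate limits upstream
    alias /app/media/;                           # hypothetical path
}
```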
Interestingly this still appears to be happening for me after a few cache-clearing refreshes - I'm still seeing 429 responses from a number of logos.
I forgot that nginx-only changes don't get picked up by our deployment script. I just manually nudged it. Try again?
Hmm. Can you open one of those error image URLs in a new tab and let me know if the error page is a library error page? Maybe post a screenshot?
Not an error, loads as expected, e.g. https://wikipedialibrary.wmflabs.org/media/MERKUR_2021_Band_75_Heft_866.png
It seems to happen every time I do a cache-clearing refresh, but not if I do a normal refresh.
I think that's the web proxy rate limit. We're on nginx 1.23.4, so the nginx/1.18.0 error response above isn't coming from our instance. We'll need to request some adjustment to our web proxy.
WMCS does not currently have a facility to alter the shared proxy rate limit per backend host or request path. The current rate limit that @taavi set up with https://gerrit.wikimedia.org/r/c/operations/puppet/+/830932 is intended to be 100 requests per second per IP with burst=100 and nodelay. See https://www.nginx.com/blog/rate-limiting-nginx/ for more on the underlying mechanism.
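For readers unfamiliar with the mechanism, a rate limit with those parameters looks roughly like the following in nginx. This is a sketch of the general pattern from the linked blog post, not the actual puppet-managed proxy configuration; zone and upstream names are made up:

```
# Sketch of a 100 requests/second per-IP limit with burst=100 and nodelay.
limit_req_zone $binary_remote_addr zone=per_ip:10m rate=100r/s;
limit_req_status 429;                       # return 429 (as seen in this task) instead of the default 503

server {
    location / {
        # burst=100 queues up to 100 requests above the steady rate;
        # nodelay serves that burst immediately rather than pacing it out.
        limit_req zone=per_ip burst=100 nodelay;
        proxy_pass http://backend;          # placeholder upstream
    }
}
```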
@Samwalton9 can you give me an exact URL/set of URLs to access with a cold cache that triggers the 100 rps limit for you? I'm not specifically interested in the image URLs, but the embedding pages that load enough images to hit the limit. I would like to attempt to determine if nginx is working the way we expect it to or not.
We really should customize the error pages from the front proxy to make debugging layered nginx responses easier for everyone.
Log in at https://wikipedialibrary.wmflabs.org/, then at https://wikipedialibrary.wmflabs.org/users/my_library/ you should find that the final handful of logos in the 'Available collections' tab don't load.
@Samwalton9 are you also seeing it at https://wikipedialibrary.wmflabs.org/partners/ ?
I can't reproduce with Firefox. All of the images load, although it does take a few seconds.
Have you considered lazy loading those images instead?
PR https://github.com/WikipediaLibrary/TWLight/pull/1161
I've also pushed this to staging, but we'll need to wait a bit for that to deploy.
I was able to trigger some 429 responses there, but only by doing multiple hard refreshes (⌃+⇧+r in Firefox) of the page in a row to ramp up the backend request rate.
Hopefully that change will help reduce the potential for triggering the rate limit. The tests I did to reproduce the problem were against an instance that did not yet have the loading="lazy" attribute on the images, so each hard reload requested ~70 images from the backend just as fast as the HTML could be parsed and the requests dispatched. I think that, with the viewport size I'm using, lazy loading would have immediately requested only ~7 images (a 90% reduction). Even very large viewports should see a significant reduction in the image request rate with that change.
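Concretely, native lazy loading is just an attribute on the image tag. A hedged illustration follows; the real template markup is in PR 1161, and the URL here is only an example taken from this thread:

```
<!-- Illustration only; not copied from the TWLight templates. -->
<!-- loading="lazy" lets the browser defer fetching logos that are outside the viewport, -->
<!-- so opening the page requests only the handful of images that are initially visible. -->
<img src="/media/MERKUR_2021_Band_75_Heft_866.png" alt="Partner logo" loading="lazy">
```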