
MWoffliner scrapes slowed down by Thumbor failure throttling 429s
Open, Medium · Public · BUG REPORT

Description

The following request always returns an HTTP 429 error response:

$ curl -I "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dc/Yeongnamroad2.png/220px-Yeongnamroad2.png"
HTTP/2 429 
date: Mon, 28 Mar 2022 08:34:43 GMT
server: Varnish
x-cache: cp3065 int
x-cache-status: int-front
server-timing: cache;desc="int-front", host;desc="cp3065"
strict-transport-security: max-age=106384710; includeSubDomains; preload
report-to: { "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
nel: { "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}
permissions-policy: interest-cohort=()
set-cookie: WMF-Last-Access=28-Mar-2022;Path=/;HttpOnly;secure;Expires=Fri, 29 Apr 2022 00:00:00 GMT
content-type: text/html; charset=utf-8
content-length: 1843

This happens whatever system I launch the HTTP request from. This error code should be returned only when too many requests are sent from the requesting system, but here it seems to appear in place of an HTTP 5xx error.

This is a really serious problem for the openZIM/Kiwix project, because the scraper MWoffliner slows down each time it gets this error code, to accommodate the backend. If this kind of error keeps coming back, a normal scrape turns into an endless one!
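To illustrate, here is roughly what a well-behaved client does on a 429 per RFC 6585 (a simplified sketch, not MWoffliner's actual code): back off and retry. When the 429 never goes away, as with the URL above, the loop never terminates for that thumbnail.

# Simplified sketch (not MWoffliner's actual logic) of standard 429 handling:
# back off exponentially and retry. A permanent 429 makes this loop endless.
url="https://upload.wikimedia.org/wikipedia/commons/thumb/d/dc/Yeongnamroad2.png/220px-Yeongnamroad2.png"
delay=1
while true; do
  status=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  [ "$status" != "429" ] && break
  sleep "$delay"
  delay=$((delay * 2))   # exponential back-off
done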

Event Timeline

429 is returned when the thumbnail hits one of four ratelimits (see https://wikitech.wikimedia.org/wiki/Thumbor#Throttling). That includes a ratelimit on requests for thumbnails that have recently failed to generate. 429 is used because the request should not be retried at that point. This task is largely a duplicate of T175512.

Would it help if a Retry-After: 3600 header was set on failure throttling?
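If such a header were added, a client could use it to tell failure throttling apart from ordinary rate limiting and skip the affected thumbnail instead of slowing the whole scrape down. A rough sketch (the Retry-After header here is the proposal above, not current behaviour; header parsing kept minimal):

# Rough sketch: distinguish the two 429 cases by the (proposed) Retry-After header
url="https://upload.wikimedia.org/wikipedia/commons/thumb/d/dc/Yeongnamroad2.png/220px-Yeongnamroad2.png"
status=$(curl -s -o /dev/null -D /tmp/headers.txt -w '%{http_code}' "$url")
if [ "$status" = "429" ]; then
  retry_after=$(awk 'tolower($1) == "retry-after:" {print $2}' /tmp/headers.txt | tr -d '\r')
  if [ -n "$retry_after" ]; then
    echo "Failure-throttled thumbnail: skip it and retry after ${retry_after}s"
  else
    echo "Rate-limited: slow the scrape down"
  fi
fi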

AntiCompositeNumber renamed this task from Unjustified HTTP 429 responses lead to "endless" Wikipedia scrapes to MWoffliner scrapes slowed down by Thumbor failure throttling 429s. Mar 28 2022, 2:37 PM

The actual failure for this thumbnail is

ImageMagickException: Failed to convert image convert: IDAT: invalid distance too far back `/tmp/tmp5SM7vX' @ error/png.c/MagickPNGErrorHandler/1628.

which is easily fixed with pngfix. I have done so; the file is now thumbnailed properly.
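For reference, the repair was along these lines (a sketch only: pngfix ships with libpng, its exact flags may vary between versions, and the local file name is just for illustration):

$ pngfix Yeongnamroad2.png                                  # reports the broken IDAT/zlib stream
$ pngfix --out=Yeongnamroad2.fixed.png Yeongnamroad2.png    # writes a repaired copy to re-upload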

> 429 is returned when the thumbnail hits one of four ratelimits (see https://wikitech.wikimedia.org/wiki/Thumbor#Throttling). That includes a ratelimit on requests for thumbnails that have recently failed to generate. 429 is used because the request should not be retried at that point. This task is largely a duplicate of T175512.

@AntiCompositeNumber Thank you for the clarification. It does indeed seem related to T175512, even if my ticket is specifically about the client-facing HTTP error code (and not about the human-readable message, like T175512). I have read the comment https://phabricator.wikimedia.org/T175512#3595459 in T175512 and tend to disagree: if something goes wrong in the backend, the error code should be in the HTTP 5xx range (HTTP 4xx errors are for client errors). But even if we agree on that, what is certain is that a random end user must not get such an error after a single request (so: maybe OK within the Wikimedia infrastructure, where forwarded requests are aggregated, but not OK to forward it to any client).

Regarding your workaround, I have no strong opinion. How should we distinguish between a legitimate HTTP 429 (too many requests coming from the scraper) and an illegitimate one? All of this seems a bit weak by design.

herron triaged this task as Medium priority. Mar 28 2022, 4:27 PM

> But even if we agree on that, what is certain is that a random end user must not get such an error after a single request

Of the 4 Thumbor throttles, only 1 is per-IP address. The other three are based on the original file (failure or concurrency) or filetype. RFC 6585 explicitly does not define how users or requests should be counted. We've also used 429 with "too many" being 1 elsewhere in the Wikimedia infrastructure, though that's largely been replaced with 403s for media at least. Using 503 (with or without Retry-After) would be an option, but I don't really see it as necessarily better than 429.

> Of the 4 Thumbor throttles, only 1 is per-IP address. The other three are based on the original file (failure or concurrency) or filetype. RFC 6585 explicitly does not define how users or requests should be counted. We've also used 429 with "too many" being 1 elsewhere in the Wikimedia infrastructure, though that's largely been replaced with 403s for media at least. Using 503 (with or without Retry-After) would be an option, but I don't really see it as necessarily better than 429.

I'm sorry to write it so bluntly, but IMO several things are being done wrong here:

  • Using a 4xx error range for a server-side error
  • Using 1 as "too many" for a 429 scenario
  • Forwarding internal-purpose response details to the final HTTP client (thereby losing their pertinence)

The result is that the HTTP client is fooled by the HTTP error code. The backend misleads it into believing it is doing something wrong when it actually is not. The HTTP client therefore has no way to know what is wrong and to act accordingly, based on a normal, common understanding of the RFCs. Or do you see one?