
Thumbor's use of the `expensive` poolcounter queue can break rendering formats
Open, High, Public

Description

This issue is hard to debug because we have only seen it during production outages. The `expensive` key is a configuration option that we use to maintain a poolcounter queue, so that we are not overwhelmed by too many requests for expensive file operations on formats such as STL and PDF. This is necessary because we have seen DoS-like behaviour when too many files are accessed by scrapers and similar clients.
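As a rough illustration of the kind of per-key throttle described above, here is a minimal in-process sketch. It is not Thumbor's or poolcounter's actual code: the class `KeyedThrottle`, its parameters (`workers`, `maxqueue`, `timeout`), and every status string other than `TIMEOUT` are assumptions made for illustration.

```python
import threading
import time


class KeyedThrottle:
    """Hypothetical in-process stand-in for a poolcounter-style per-key queue."""

    def __init__(self, workers, maxqueue, timeout):
        self.workers = workers    # max concurrent holders per key
        self.maxqueue = maxqueue  # max callers allowed to wait per key
        self.timeout = timeout    # seconds a caller waits before giving up
        self._cond = threading.Condition()
        self._held = {}     # key -> number of slots currently held
        self._waiting = {}  # key -> number of callers currently queued

    def acquire(self, key):
        deadline = time.monotonic() + self.timeout
        with self._cond:
            # Fast path: a slot is free for this key.
            if self._held.get(key, 0) < self.workers:
                self._held[key] = self._held.get(key, 0) + 1
                return "LOCKED"
            # Refuse to queue if too many callers are already waiting.
            if self._waiting.get(key, 0) >= self.maxqueue:
                return "QUEUE_FULL"
            self._waiting[key] = self._waiting.get(key, 0) + 1
            try:
                # Wait until a slot frees up or the deadline passes.
                while self._held.get(key, 0) >= self.workers:
                    remaining = deadline - time.monotonic()
                    if remaining <= 0:
                        return "TIMEOUT"
                    self._cond.wait(remaining)
                self._held[key] = self._held.get(key, 0) + 1
                return "LOCKED"
            finally:
                self._waiting[key] -= 1

    def release(self, key):
        with self._cond:
            if self._held.get(key, 0) == 0:
                return "NOT_LOCKED"
            self._held[key] -= 1
            self._cond.notify_all()  # wake waiters so one can take the slot
            return "RELEASED"
```

In this model a caller would take a slot for a key such as `"expensive-stl"` (a made-up key name) before rendering and release it afterwards, ideally in a `try`/`finally`. If holders of a key never call `release`, every later `acquire` for that key returns `TIMEOUT` once the wait expires, while other keys are unaffected.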

However, we have had to disable this feature in situations like T376534 and T372470#10203863. Under some conditions, locks accumulate and all requests to poolcounter result in TIMEOUT for a particular file. The exact conditions are unclear, but it appears to happen after the feature has been enabled for a longer period of time. This could be related to the issues alluded to in T338297, or to some other issue. So far we have *only* seen this happen for the expensive queue, but that is not entirely surprising: we are much more likely to hit edge-case behaviour for a key locked as often as a file format than for a single IP address or a single file. It should therefore be understood as a larger issue with our poolcounter implementation rather than something specific to the expensive queue.

We need the expensive limiting feature in place for the service's safety, but the current implementation is too broken to leave enabled: it ends up causing outages, which negatively impact the community for long periods of time.

T376538 is an attempt to limit the impact of this bug, although that change is necessary in general regardless.