There have been pages from thumbor.svc (and recoveries shortly afterwards) lately, digging further this seems to be caused by an influx of long-running requests to thumbor locking up all available instances, recapping:
- Influx of "heavy" requests hits thumbor
- All thumbor instances are busy processing the request
- nginx can't find a free thumbor instance where to send requests and emits "502 bad gateway"
- pybal depools the affected thumbor machine, if too many machines happen to be busy the depool threshold is hit
- At the same time also icinga checks thumbor.svc and gets back a 502 from the remaining machines that can't be depooled anymore
- Things eventually recover, either because the long-running requests are successful and thumbs are now in swift or fail and the failed request rate limit kicks in