We've had a few cascading failures recently on the upload caches @ eqsin. The basic shape of the event (although not all details are perfectly clear!) is:
- A very large number of external requests spam the cluster for a particular image filename. In the known examples, this was due to a popular mobile app loading the image for all clients on startup.
- The image's size in bytes is greater than the upload cluster's large_objects_cutoff, which means the frontend caches do not attempt to cache it (they set up a hit-for-pass object once the first response comes through with size information) and instead rely on the (much larger) backend cache layer to handle it.
- All the frontends forward the requests through to one backend-layer cache (because the backend layer is chashed on URI), without coalescing (because pass mode doesn't coalesce), overwhelming it.
- When the first frontend becomes unresponsive, pybal depools it from service, which increases the pressure on the remaining frontends.
- This continues (as we remove more frontends, the per-frontend pressure just increases) until half the cluster is depooled thanks to the depool_threshold of 0.5, leaving us in a pretty broken state!
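The feedback loop above can be sketched as a toy model: a fixed external request rate is split across the pooled frontends, each depool raises the per-frontend share, and pybal stops depooling at its threshold. All numbers here (load, capacity, node count) are made up for illustration; real failure thresholds are much fuzzier than a single rps cutoff.

```python
def cascade(total_rps, per_node_capacity_rps, n_frontends, depool_threshold=0.5):
    """Return (pooled_count, per_node_rps) at each step of the cascade."""
    pooled = n_frontends
    # pybal never depools below this many nodes (the depool_threshold of 0.5):
    min_pooled = int(n_frontends * depool_threshold)
    history = [(pooled, total_rps / pooled)]
    # While the per-node share exceeds capacity, another frontend fails and
    # is depooled, which only raises the share on the survivors.
    while pooled > min_pooled and total_rps / pooled > per_node_capacity_rps:
        pooled -= 1
        history.append((pooled, total_rps / pooled))
    return history
```

With e.g. 60k rps against six frontends that each top out around 9k rps, the model depools straight down to the threshold of 3 nodes, each now facing 20k rps: half the cluster is out, and the survivors are worse off than before any depool happened.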
We suspect a number of design- and/or code-level issues and/or performance oversights are in play here. These are some of the ones on our minds, across various software layers, in roughly decreasing order of importance:
- The current value of large_objects_cutoff is too small for commonly-linked images. This is not easily analyzed or fixed without looking at some other complex considerations about the total cache size, object size distributions, and varnish's eviction behavior and the ways it can fail, so we'll come back to this elsewhere in the ticket, I think. The current temporary mitigation touches on this factor, but in a conservative and limited way.
- While chashing to the backend layer is extremely useful for increasing the effective total size of our backend cache (we get the sum of all backend cache nodes' disk storage as an effective size), it's problematic under heavy, focused load situations like this. In such a situation we'd be better served by randomizing the requests to all backends just for the hot image in question, if we have a way to make that distinction reasonably accurately in code! The text cluster actually does randomize all pass-mode requests for this reason, including those passed by its own large_objects_cutoff limit, but the situation is very different there...
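A minimal sketch of the proposed distinction (backend names are hypothetical, and the real selection would happen in VCL via a hash director, not Python): chash on the URI by default, but scatter a known-hot passed object across all backends.

```python
import hashlib
import random

BACKENDS = ["cp5007", "cp5008", "cp5009", "cp5010"]  # hypothetical node names

def pick_backend(uri, is_hot_passed_object=False):
    if is_hot_passed_object:
        # Hot + pass-mode: the object can't be frontend-cached anyway, so
        # spread the load across all backends instead of focusing it on
        # the single node the chash would pick.
        return random.choice(BACKENDS)
    # Normal case: hash on the URI so each object lives on exactly one
    # backend, making the effective cache size the sum of all nodes' disks.
    digest = int(hashlib.md5(uri.encode()).hexdigest(), 16)
    return BACKENDS[digest % len(BACKENDS)]
```

The hard part, of course, is the `is_hot_passed_object` predicate: detecting "hot" reasonably accurately and cheaply in the request path is exactly the open question above.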
- In a case like this where the volume of external traffic is driving node failures, rather than internal failures (e.g. hardware failures, or random daemon crashes unrelated to heavy traffic), auto-depooling nodes only ever makes things worse, not better. The current pybal behavior is to depool down to its depool threshold and then hold there until the situation improves. In our scenario, this means that we very quickly cut the cluster's capacity in half and then leave it that way until either human intervention or natural recovery.
- The IPVS-layer hashing of traffic to frontend caches is also suspect, in that it may be distributing requests unevenly when many of the overwhelming volume of clients are on some of the same mobile networks, and this may persist or get worse as pybal removes more nodes from the set. We're currently using sh on the client IP only (not the port, because including the port would defeat our assumption that all traffic for one IP goes to one frontend, which we rely on for e.g. ratelimiting). AFAICS this devolves to an extremely simplistic scheme: the currently-pooled nodes are filled into a 256-entry array in order, repeatedly, like (node0, node1, node2, ..., node0, ...) until it's full, and then the array slot for a given client IP is chosen by something like xoring the four bytes of the client IP together into a single byte, multiplying that by a constant (mod 2^8), and taking that slot of the table.
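That scheme, as described, can be modeled in a few lines (the multiplier constant here is illustrative, not the kernel's actual one). The key property is that the whole client IP collapses into a single byte before the table lookup, so there are at most 256 distinct buckets, and IPs whose bytes XOR to the same value always collide:

```python
def build_sh_table(pooled_nodes, table_size=256):
    # Pooled nodes are written into the table in order, repeatedly,
    # (node0, node1, ..., node0, ...) until all slots are filled.
    return [pooled_nodes[i % len(pooled_nodes)] for i in range(table_size)]

def sh_pick(table, client_ip):
    # XOR the four bytes of the client IP into a single byte, multiply by
    # a constant mod 2^8, and use the result as the table slot.
    # (167 stands in for the real multiplier, which we'd need to confirm
    # in the ip_vs_sh source.)
    b = 0
    for octet in client_ip.split("."):
        b ^= int(octet)
    return table[(b * 167) % len(table)]
```

Two clients whose IP bytes XOR to the same value land on the same frontend no matter what, so a mobile carrier's NAT pools can easily concentrate onto a handful of slots, and since 256 isn't a multiple of the pooled-node count after depools, the in-order table fill adds its own skew on top.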
- We've long intended to share the ATS backend cache layer between both upload and text nodes, which would double their scalability in situations like this, but this hasn't happened yet and isn't trivial to turn on yet, either.
- Our current TLS terminator config uses a single port to forward traffic to the varnish frontend cache on the same machine. Our previous nginx-based terminator used 8 ports, specifically to avoid problems at our connection volume with TIME_WAIT sockets and 16-bit port number exhaustion on these local inter-daemon connections. This may be a factor in this scenario as well, as connection parallelism is unusually high during this kind of incident.
- Also lost in the TLS terminator transition was our NUMA optimization. While the OS-level bits for the NIC are set up to distribute RX IRQs only to the NUMA node the card is directly attached to, our current TLS terminator (which receives all of the direct traffic from the NIC) isn't bound to run only on that same NUMA node the way our previous nginx solution was. This is probably a minor inefficiency, but fixing it can't hurt and is relatively easy.
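A back-of-the-envelope for why the single forwarding port matters, assuming Linux defaults (ephemeral port range 32768-60999, 60s TIME_WAIT); the numbers are rough, not measured on our hosts:

```python
EPHEMERAL_PORTS = 60999 - 32768 + 1  # default net.ipv4.ip_local_port_range
TIME_WAIT_SECS = 60                  # how long a closed connection holds its tuple

def sustainable_conn_rate(forwarding_ports):
    # The local connection tuple is (src ip, src port, dst ip, dst port);
    # with both IPs fixed, each destination port supports one connection per
    # ephemeral source port, and each closed connection parks its tuple in
    # TIME_WAIT, so new-connection churn is bounded by:
    return EPHEMERAL_PORTS * forwarding_ports / TIME_WAIT_SECS
```

On these assumptions a single port sustains only ~470 new local connections per second before port exhaustion, while 8 ports raise that ceiling eightfold, which matters exactly when an incident drives connection parallelism (and thus churn) way up.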
Separately from all of the software-level concerns, we could stand to expand the cache hardware footprint in eqsin. eqsin and ulsfo are currently set up with 6 nodes per cluster (text and upload, for a total of 12 nodes), while all other sites are set up as 8 nodes per cluster, giving them 33% more capacity in general. At today's eqsin traffic levels, it doesn't make sense to have a reduced configuration there, and honestly we probably shouldn't ever deploy fewer than 8 nodes per cluster in general, because things get too problematic too quickly with planned and unplanned node depools and the impact of initial runtime depools by pybal, etc. In the long run we might even expand on this for all sites at their next refreshes, but whether and how we do that is well outside the scope of this ticket.
Solution discussions for all of the above to branch out in this ticket or subtickets...