
cp_upload @ eqsin cascading failures, February 2021
Open, High, Public


We've had a few cascading failures recently on the upload caches @ eqsin. The basic shape of the event (although not all details are perfectly clear!) is:

  1. A very large number of external requests spam the cluster for a particular image filename. In the known examples, this was due to a popular mobile app loading the image for all of its clients on startup.
  2. The image's size in bytes is greater than the upload cluster's large_objects_cutoff, which means the frontend caches do not attempt to cache it (they set up a hit-for-pass object once the first response comes through and reveals the object's size) and rely on the (much larger) backend cache layer to handle it.
  3. All the frontends forward the requests through to one backend-layer cache (because the backend layer is chashed on URI), without coalescing (because pass mode doesn't coalesce), overwhelming it.
  4. When the first frontend becomes unresponsive, pybal depools it from service, which increases the pressure on the remaining frontends.
  5. This continues (as we remove more frontends, the per-frontend pressure just increases) until half the cluster is depooled thanks to the depool_threshold of 0.5, leaving us in a pretty broken state!
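The dynamics of steps 4-5 can be sketched numerically. Below is a toy model (all numbers are illustrative placeholders, not measured values) of how per-frontend load grows as pybal depools nodes down to its depool_threshold:

```python
# Toy model of the cascade described above: a fixed external request
# rate is spread over the remaining pooled frontends; each overwhelmed
# frontend fails its healthchecks and is depooled, until pybal's
# depool_threshold halts further depooling. All numbers here are
# illustrative placeholders, not measured values.

def simulate_cascade(total_rps, n_frontends, per_node_capacity, depool_threshold=0.5):
    pooled = n_frontends
    min_pooled = int(n_frontends * depool_threshold)
    while pooled > min_pooled:
        per_node_load = total_rps / pooled
        if per_node_load <= per_node_capacity:
            return pooled, per_node_load  # the cluster stabilizes here
        pooled -= 1  # pybal depools the unresponsive frontend
    # depool_threshold reached: the survivors stay pooled but overloaded
    return pooled, total_rps / pooled

# With these hypothetical numbers, the cluster collapses down to the
# threshold (3 of 6 nodes) and each survivor carries more than twice
# its capacity -- the "pretty broken state" described above:
print(simulate_cascade(total_rps=60000, n_frontends=6, per_node_capacity=9000))
```

Note that depooling never helps in this model: removing a node only raises the per-node load on the rest, which is the point of issue 3 below.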

We suspect a number of different design- and/or code-level issues and/or performance oversights are in play here. These are some of the ones on our mind at various software layers, in roughly decreasing order of importance:

  1. The current value of large_objects_cutoff is too small for commonly-linked images. This is not easily analyzed or fixed without looking at some other complex considerations about the total cache size, object size distributions, and varnish's eviction behavior and the ways it can fail, so we'll come back to this in more detail elsewhere in the ticket. The current temporary mitigation touches on this factor, but in a conservative and limited way.
  2. While the behavior of chashing to the backend layer is extremely useful for increasing the effective total size of our backend cache (we get the sum of all backend cache nodes' disk storage as an effective size), it's problematic under heavy, focused load situations like this. In such a situation we'd be better served by randomizing the requests across all backends just for the hot image in question, if we have a way to make that distinction reasonably accurately in code! The text cluster actually does randomize all pass-mode requests for this reason, including those passed by its own large_objects_cutoff limit, but the situation is very different there...
  3. In a case like this, where the volume of external traffic is driving node failures, rather than internal failures (e.g. hardware failures, or random daemon crashes unrelated to heavy traffic), auto-depooling nodes only ever makes things worse, not better. The current pybal behavior is to depool down to its depool threshold and then hold there until the situation improves. In our scenario, this means that we very quickly cut the cluster's capacity in half and then leave it that way until either human intervention or natural recovery.
  4. The IPVS-layer hashing of traffic to frontend caches is also suspect: it may distribute requests unevenly when much of the overwhelming volume of clients sits on the same few mobile networks, and this may persist or worsen as pybal removes more nodes from the set. We're currently using sh on the client IP only (not the port, because hashing on the port would defeat our assumption that all traffic for one IP goes to one frontend, which we rely on for e.g. ratelimiting), and AFAICS this devolves to an extremely simplistic scheme: the currently-pooled nodes are filled into a 256-entry array in order, repeatedly (node0, node1, node2, ..., node0, ...) until it's full, and then the array slot for a given client IP is chosen by something like XORing the four bytes of the client IP together into a single byte, multiplying that by a constant (mod 2^8), and taking that slot of the table.
  5. We've long intended to share the ATS backend cache layer between the upload and text nodes, which would roughly double their scalability in situations like this, but this hasn't happened yet and isn't trivial to turn on, either.
  6. Our current TLS terminator config uses a single port to forward traffic to the varnish frontend cache on the same machine. Our previous nginx-based terminator used 8 ports, specifically to avoid the problems our connection volume creates with TIME_WAIT sockets and 16-bit port number exhaustion on these local inter-daemon connections. This may be a factor in this scenario as well, since connection parallelism is unusually high during this kind of incident.
  7. Also lost in the TLS terminator transition was our NUMA optimization. While the OS-level bits for the NIC are set up to distribute RX IRQs only to the NUMA node the card is directly attached to, our current TLS terminator (which receives all of the direct traffic from the NIC) isn't bound to run only on that same NUMA node the way our previous nginx solution was. This is probably a minor inefficiency, but fixing it can't hurt and is relatively easy.
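The simplistic sh scheme described in item 4 can be sketched as follows. This is only a simplified model of what's described above, not the real ip_vs_sh kernel code (the multiplier constant and node names here are illustrative stand-ins):

```python
# Simplified model of the IPVS source-hash ("sh") scheme described in
# item 4: the pooled nodes fill a 256-slot table round-robin, and a
# client IP is folded into a single byte, then multiplied by a constant
# mod 2^8 to pick a slot. The real ip_vs_sh kernel code differs in
# detail; this only illustrates how coarse the distribution can be.
import ipaddress

TABLE_SIZE = 256
MULTIPLIER = 0xB1  # stand-in for the kernel's constant (illustrative)

def build_table(pooled_nodes):
    # (node0, node1, node2, ..., node0, ...) repeated until full
    return [pooled_nodes[i % len(pooled_nodes)] for i in range(TABLE_SIZE)]

def slot_for(client_ip):
    b = ipaddress.ip_address(client_ip).packed
    folded = b[0] ^ b[1] ^ b[2] ^ b[3]  # xor the four bytes into one
    return (folded * MULTIPLIER) % TABLE_SIZE

def frontend_for(client_ip, table):
    return table[slot_for(client_ip)]

# Carrier-grade NAT collapses many mobile clients onto a few egress
# IPs, so all of that carrier's traffic lands on at most that many
# frontends, regardless of how many are pooled (hostnames are made up):
table = build_table(["cp5001", "cp5002", "cp5003", "cp5004", "cp5005", "cp5006"])
nat_ips = ["198.51.100.7", "198.51.100.23", "198.51.100.99"]
for ip in nat_ips:
    print(ip, "->", frontend_for(ip, table))
```

This also shows why depools make it worse: rebuilding the table with fewer nodes remaps many clients at once, and the hot NAT IPs still concentrate on whichever few frontends their slots now point at.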

Separately from all of the software-level concerns, we could stand to expand the cache hardware footprint in eqsin. eqsin and ulsfo are currently set up with 6 nodes per cluster (text and upload, for a total of 12 nodes per site), while all other sites are set up with 8 nodes per cluster, giving them 33% more capacity in general. At today's eqsin traffic levels it doesn't make sense to have a reduced configuration there, and honestly we probably shouldn't ever deploy fewer than 8 nodes per cluster anywhere, because things get too problematic too quickly with planned and unplanned node depools, the impact of the initial runtime depools by pybal, etc. In the long run we might even expand on this for all sites at their next refreshes, but whether and how we do that is well outside the scope of this ticket.

Solutions discussions for all the above to branch out in this ticket or subtickets...

Event Timeline

CDanis created this task. Tue, Feb 16, 2:45 PM
Restricted Application added a subscriber: Aklapper. Tue, Feb 16, 2:45 PM

Change 664581 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] Allow upload frontends to cache objects up to 4MiB

Change 664581 merged by CDanis:
[operations/puppet@production] Allow upload frontends to cache objects up to 4MiB

Mentioned in SAL (#wikimedia-operations) [2021-02-16T15:12:12Z] <cdanis> previous message was re: T274888

Mentioned in SAL (#wikimedia-operations) [2021-02-16T15:27:00Z] <cdanis> re-enabling Puppet on cp-upload@eqsin to deploy Iab4d211 T274888

MoritzMuehlenhoff triaged this task as High priority. Wed, Feb 17, 10:23 AM

Change 666838 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable parent proxies in cp5006

Change 666838 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable parent proxies in cp5006

Mentioned in SAL (#wikimedia-operations) [2021-02-25T08:15:26Z] <vgutierrez> restart ats-tls on cp5006 to enable parent proxies support - T274888

Change 666871 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Limit NUMA nodes usage on ats-tls

BBlack updated the task description. (Show Details) Thu, Feb 25, 3:12 PM

Updates on where we're at on some of the pain points above, in terms of solution analysis:

  1. large_objects_cutoff and all related things - the key thing that makes tuning this hard is varnish's nuke_limit, which caps the number of objects it will evict to make room for one new object (without any regard to relative sizes). If it can't make room within nuke_limit evictions, the request fails. If there's a large disparity between the minimum and maximum object sizes stored in the cache, large objects could commonly fail because too many small objects need evicting to make room. Setting nuke_limit arbitrarily large can also hurt (thrashing of cache contents for one-hit-wonder large objects, slow responses), and setting large_objects_cutoff much larger without a nuke_limit increase causes failures. The current eqsin-only mitigation is working, but it will fail for images even slightly larger than those in the last incident. We've also had more-dynamic policies in the past based on T144187, where instead of a fixed cutoff, the probability of an object entering the cache decreases as its size increases. This lets us say something like "don't normally cache rare large objects, but do eventually cache them if they're popular enough". However, the tuning of the parameters for that is now quite outdated, and it was also based purely on improving object and/or byte hit-rates, not on surviving overwhelming events. The previous tuning gave a near-zero probability of cache entry for the sizes in question in our recent scenarios. Still, we could perhaps use a similar solution tuned for this problem rather than for hitrates (same exponential curve, but with admission probabilities that suit the cases we care about here). I'm still digging into this.
  2. pass_random stuff - Still needs digging to determine whether there's a reasonable solution here to avoid focusing hot traffic on a single backend in the upload case; possibly pending on what the solution above looks like.
  3. pybal depooling strategy - In the slightly longer term, I think it would help to add a different failure mode to pybal in which healthcheck depools can't persistently down a whole group of nodes (something more like "once we reach the depool threshold based on realtime healthchecks, assume the individual nodes are not at fault and re-enable them all"). In the very short term, though, we should probably at least raise the depool threshold to maybe 0.75-ish, which I believe works out to one node out for the 6-node sites and two nodes for the 8-node sites. It's problematic for this parameter's purposes that there's no distinction between manual depools in confctl and health depools, though.
  4. sh hashing - I think @CDanis already worked on some patches to transition us to maglev hashing a quarter or two ago, but it wasn't ever deployed to all production LVSes. Need to take a look at the state of affairs on that and see if we're at a point where we can safely deploy it (also, re-check assumptions about whether it relies on a weight=0 depooling strategy?).
  5. Backend caches cluster sharing - Haven't looked into this yet.
  6. Multi-socket TLS->varnish-fe - @Vgutierrez has some related patchwork ongoing in this ticket.
  7. TLS NUMA binding - @Vgutierrez has some related patchwork ongoing in this ticket.
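As a concrete illustration of the exponential-admission idea in item 1: a size-based admission probability lets a popular large object enter the cache after a few misses, while rare large objects usually pass. The base/rate parameters below are placeholders for illustration, not the T144187 tuning or any production values:

```python
# Sketch of a size-based probabilistic cache admission policy, in the
# spirit of the exponential curve referenced above: objects at or below
# a base size are always admitted, and the admission probability decays
# exponentially with size past that base. The base/rate parameters are
# placeholders, not a production tuning.
import math
import random

def admission_probability(size_bytes, base=256 * 1024, rate=4 * 1024 * 1024):
    """P(admit) = 1 for sizes <= base, exp(-(size - base) / rate) above it."""
    if size_bytes <= base:
        return 1.0
    return math.exp(-(size_bytes - base) / rate)

def should_admit(size_bytes, rng=random):
    # On each cache miss, flip a size-weighted coin to decide whether
    # the fetched object enters the cache (vs. hit-for-pass behavior).
    return rng.random() < admission_probability(size_bytes)

# With these placeholder parameters, a 4 MiB image has roughly a 39%
# chance of admission per miss, so under overwhelming request volume it
# enters the cache almost immediately (expected misses ~ 1/p ~ 2.6),
# while a one-hit-wonder of the same size usually stays uncached.
print(admission_probability(4 * 1024 * 1024))
```

The tuning question from item 1 then becomes choosing base/rate so that the sizes seen in these incidents get a usefully high per-miss probability without re-introducing nuke_limit thrashing from one-hit-wonder large objects.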
BBlack mentioned this in Unknown Object (Task). Thu, Feb 25, 5:52 PM
BBlack updated the task description. (Show Details) Thu, Feb 25, 7:01 PM
> sh hashing - I think @CDanis already worked on some patches to transition us to maglev hashing a quarter or two ago, but it wasn't ever deployed to all production LVSes. Need to take a look at the state of affairs on that and see if we're at a point where we can safely deploy it (also, re-check assumptions about whether it relies on a weight=0 depooling strategy?).

This was always something of a back-burner thing for me, and it looks like I only got as far as building a trivially-backported ipvsadm and uploading it to buster-wikimedia -- which only ever actually got deployed on pybal-test2001.

That being said, it ought to be almost trivial to roll this out everywhere (although of course you want to do it slowly, just in case).

I've spun out T275809 to go into some depth on the #1 item above about large_objects_cutoff.

Change 667121 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable parent proxies on ats-tls at upload@eqsin

Change 667121 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable parent proxies on ats-tls at upload@eqsin