Chaos Engineering - Stop for x hours one or more mc10xx memcached shards
Closed, Resolved (Public)

Description

Since we are now using a Gutter pool, it would be great to set up a controlled outage to verify the following:

  • since mc-gp100[1-3] (the gutter pool hosts) are running Buster with Memcached 1.5.x, it would be great to check that the settings are ok and that it behaves as we expect (memcached changed a lot between 1.4.x, the version we run on the shards, and 1.5.x)
  • mcrouter's behavior during a long outage

The idea is to verify that our infrastructure is now resilient to shards going down without live testing it on a Sunday morning :)

If this test goes as expected, we could think about the re-image of one memcached shard to Buster!

Event Timeline

I think we should run 3 different tests, and I would run them for 1 host first.

  • Stop memcached completely
  • Drop all packets directed to port 11211
  • Drop a percentage of incoming and outgoing packets
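
The three failure modes listed above could be sketched roughly as follows. This is a minimal, hypothetical helper that only builds the command lines and prints them; it does not run anything. The systemd unit name `memcached`, the choice of the `INPUT`/`OUTPUT` chains, and the 10% drop probability are assumptions made for illustration, not the exact rules used in this test.

```python
import shlex

MEMCACHED_PORT = 11211  # port named in the task


def stop_service_cmd(unit="memcached"):
    """Test 1: stop the daemon entirely (unit name is an assumption)."""
    return ["systemctl", "stop", unit]


def drop_all_cmd(port=MEMCACHED_PORT):
    """Test 2: silently drop every TCP packet sent to the memcached port."""
    return ["iptables", "-A", "INPUT", "-p", "tcp",
            "--dport", str(port), "-j", "DROP"]


def drop_fraction_cmds(probability, port=MEMCACHED_PORT):
    """Test 3: drop a random fraction of packets in both directions."""
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    match = ["-m", "statistic", "--mode", "random",
             "--probability", str(probability)]
    incoming = (["iptables", "-A", "INPUT", "-p", "tcp",
                 "--dport", str(port)] + match + ["-j", "DROP"])
    outgoing = (["iptables", "-A", "OUTPUT", "-p", "tcp",
                 "--sport", str(port)] + match + ["-j", "DROP"])
    return [incoming, outgoing]


def undo(cmd):
    """Turn an append (-A) rule into the matching delete (-D) rule."""
    return ["-D" if arg == "-A" else arg for arg in cmd]


if __name__ == "__main__":
    plan = [stop_service_cmd(), drop_all_cmd()] + drop_fraction_cmds(0.1)
    for cmd in plan:
        print(shlex.join(cmd))
    # Rollback for the firewall-based test
    print(shlex.join(undo(drop_all_cmd())))
```

Keeping the rollback as a mechanical transformation of the original rule makes it harder to leave a stale DROP rule behind once the test is over.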
colewhite triaged this task as Medium priority. May 5 2020, 4:21 PM

@elukey let's schedule this test for 6:00Z on Monday, May 11th?

+1

Mentioned in SAL (#wikimedia-operations) [2020-05-11T07:08:59Z] <_joe_> dropping requests to mc1020 via a firewall rule T251378

Mentioned in SAL (#wikimedia-operations) [2020-05-11T08:30:03Z] <_joe_> removing the iptables DROP rule on mc1020 T251378

Joe claimed this task.

We ran this test, and it passed with flying colors:

  • A transient peak of memcached errors, lasting less than 1 minute
  • The gutter pool picks up the slack pretty fast
  • No noticeable effect on latency.
  • The cache hit ratio on the gutter pool was good (88% after less than one hour in the pool, but probably capped around that value by the 10-minute TTL)
  • As soon as the server became available again, the memcached traffic went back quickly but not instantly, over the span of ~2 minutes. This also reduces the risk of thundering herds from the deletes that get replayed.
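
For reference, a hit ratio like the one quoted above can be derived from the `get_hits` and `get_misses` counters in the output of memcached's `stats` command. The sketch below is a hypothetical helper, not the tooling used for this test (the figure in the comment most likely came from existing dashboards), and the sample counters are made up for illustration.

```python
def parse_stats(raw):
    """Parse the text output of memcached's `stats` command into a dict."""
    stats = {}
    for line in raw.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def hit_ratio(stats):
    """Fraction of GET requests served from cache since the counters started."""
    hits = int(stats["get_hits"])
    misses = int(stats["get_misses"])
    total = hits + misses
    return hits / total if total else 0.0


# Illustrative counters only, NOT data from this test.
SAMPLE = "STAT get_hits 750\r\nSTAT get_misses 250\r\nEND\r\n"

if __name__ == "__main__":
    print(f"hit ratio: {hit_ratio(parse_stats(SAMPLE)):.2%}")
```

Because these counters are cumulative since the daemon started, a ratio for a specific window (such as the first hour in the pool) would need the difference between two samples rather than a single reading.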