Chaos Engineering - Stop for x hours one or more mc10xx memcached shards
Closed, Resolved (Public)

Assigned To

	Joe

Authored By

	elukey
	Apr 29 2020, 8:57 AM

Description

Since we are now using a gutter pool, it would be great to set up a controlled outage to verify the following:

Whether the settings on mc-gp100[1-3] (the gutter pool hosts) are ok and whether the hosts behave as we expect. They run Buster with memcached 1.5.x, and memcached changes a lot between our version, 1.4.x, and 1.5.x.
mcrouter's behavior during a long outage.

The idea is to verify that our infrastructure is now resilient to shards going down without live testing it on a Sunday morning :)

If this test goes as expected, we could think about the re-image of one memcached shard to Buster!

Status	Subtype	Assigned	Task
Resolved		None	T244852 Upgrade and improve our application object caching service (memcached)
Resolved		Joe	T251378 Chaos Engineering - Stop for x hours one or more mc10xx memcached shards

I think we should run 3 different tests, and I would run them for 1 host first.

colewhite triaged this task as Medium priority. May 5 2020, 4:21 PM

@elukey let's schedule this test for 06:00Z on Monday, May 11th?

Mentioned in SAL (#wikimedia-operations) [2020-05-11T07:08:59Z] <_joe_> dropping requests to mc1020 via a firewall rule T251378

Mentioned in SAL (#wikimedia-operations) [2020-05-11T08:30:03Z] <_joe_> removing the iptables DROP rule on mc1020 T251378
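
For reference, here is a minimal sketch of what the two firewall steps logged above could look like. The exact rule used on mc1020 is not recorded in this task, so everything here is an assumption: the port is memcached's default (11211), the rule is a plain inbound TCP DROP on the INPUT chain, and the helper name drop_rule is hypothetical. The script only prints the commands, it does not run them.

```python
import shlex

# Assumption: memcached listens on its default port. The port actually used
# on mc1020 is not stated in this task.
MEMCACHED_PORT = 11211


def drop_rule(action, port=MEMCACHED_PORT):
    """Build the iptables argv that inserts or deletes an inbound DROP rule.

    action is "insert" (start the simulated outage) or "delete" (end it).
    Hypothetical helper: the real rule may have matched on other criteria.
    """
    flag = {"insert": "-I", "delete": "-D"}[action]
    return [
        "iptables", flag, "INPUT",
        "-p", "tcp", "--dport", str(port),
        "-j", "DROP",
    ]


if __name__ == "__main__":
    # Print the commands instead of executing them, so this is safe to run.
    print(shlex.join(drop_rule("insert")))
    print(shlex.join(drop_rule("delete")))
```

A DROP target silently discards packets, so clients see timeouts rather than refused connections. That is closer to an unresponsive host than to a stopped memcached process, which is probably the more interesting failure mode for mcrouter.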

We ran this test, and it passed with flying colors:

A transient peak of memcached errors, lasting less than 1 minute.
The gutter pool picked up the slack pretty fast.
No noticeable effect on latency.
The cache hit ratio on the gutter pool was good (88% after less than one hour in the pool, but probably capped around that value by the 10-minute TTL).
As soon as the server became available again, memcached traffic moved back to it quickly but not instantly, over a span of about 2 minutes. This gradual return also reduces the risk of thundering herds from the deletes that get replayed.
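
To illustrate why a short TTL would cap the hit ratio, here is a toy model, not a description of our actual traffic. It assumes that requests for a key arrive as a Poisson process at a constant rate, and that the TTL is set when the key is stored on a miss and is not refreshed by later hits. Under those assumptions each miss is followed by rate * TTL hits on average before the key expires, so the steady-state hit ratio is rate * TTL / (1 + rate * TTL). The function names are made up for this sketch.

```python
TTL_S = 600  # the 10-minute TTL on the gutter pool, in seconds


def ttl_hit_ratio(rate_per_s, ttl_s=TTL_S):
    """Steady-state hit ratio for one key under the toy model.

    Assumes Poisson arrivals at rate_per_s and a fixed TTL that starts on the
    miss that stores the key. Each expiry cycle has exactly one miss and, on
    average, rate_per_s * ttl_s hits.
    """
    expected_hits = rate_per_s * ttl_s
    return expected_hits / (1.0 + expected_hits)


def rate_for_hit_ratio(hit_ratio, ttl_s=TTL_S):
    """Inverse of ttl_hit_ratio: the per-key request rate giving hit_ratio."""
    return hit_ratio / ((1.0 - hit_ratio) * ttl_s)


if __name__ == "__main__":
    for interval_s in (30, 60, 120, 600):
        ratio = ttl_hit_ratio(1.0 / interval_s)
        print(f"one request every {interval_s}s per key: hit ratio {ratio:.2f}")
```

In this model the hit ratio only approaches 100% for keys that are requested far more often than once per TTL, so a mix of hot and lukewarm keys would plateau below that no matter how long the host stays in the pool. Real traffic is not uniform, so this is only a rough intuition for the cap.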