
Test gutter pool failover in production and memcached 1.5.x
Closed, ResolvedPublic

Description

Goals

  • Test if failover works, and which failover strategies to use.

First, check that mcrouter fails over to the gutter servers when a shard becomes unavailable. How does FailoverWithExptimeRoute work?

  • Check key integrity during and after a failover

Investigate what happens with the existing keys in a shard that was unavailable and is now back online. We would like to know, for instance, whether it will serve stale keys. One way to test this is by creating keys with either short or long TTLs.

  • Test how LRU behaves in buster

Memcached 1.5.x (buster) has a few changes, including to how keys are evicted from memory. We would like to keep one (or more) shards down for a long period of time, have servers fail over to the gutter pool ones, gather metrics and compare with our memcached 1.4.x servers (a quick way to pull the relevant counters is sketched below).
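
A minimal sketch of such a stats query (not from the task; the hostname, reachability and the choice of counters are assumptions):

# Global eviction/expiry counters (standard memcached stats):
echo stats | nc -q 2 mc-gp1001.eqiad.wmnet 11211 | grep -E 'evictions|evicted_unfetched|expired_unfetched'
# Per-slab LRU detail, useful for the segmented LRU used by 1.5.x:
echo "stats items" | nc -q 2 mc-gp1001.eqiad.wmnet 11211 | grep -E 'evicted|moves_to_cold|moves_to_warm'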

  • Test 'gutter proxies'

Mediawiki sets/gets some keys with the prefix /*/mw-wan/. Those keys are replicated from the primary to the secondary DC via mcrouter. To do so, we have defined a set of 4 mcrouter proxies located at the destination DC. We would like to have an extra set of "gutter proxies", i.e. another 4 mcrouter instances, that an mcrouter in the primary DC can fail over to if one of the destination proxies is down. Note that each mediawiki server runs one mcrouter instance.

Testing Environment

  • mwdebug1001: we have deployed a configuration where we instruct mcrouter to use the gutter pool when a shard fails, config.json: P10383
    • We push iptables rules to block traffic to a specific memcached server (or to all of them) in the main pool, so as to cause connection errors
  • mc-gp100[1-3]: gutter pool servers aka the gutter pool cluster, running memcached 1.5.x on buster
  • mediawiki-07 (beta): we generate traffic towards mwdebug1001 by going through a list of 90 URLs at 1 req/s (a minimal sketch of this loop follows the list)
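
A minimal sketch of the traffic loop mentioned above (not the actual script; urls.txt and the use of the X-Wikimedia-Debug header to reach mwdebug1001 are assumptions):

# One request per second over the URL list, routed to mwdebug1001:
while read -r url; do
  curl -s -o /dev/null -H 'X-Wikimedia-Debug: backend=mwdebug1001.eqiad.wmnet' "$url"
  sleep 1
done < urls.txt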

We will be blocking traffic from mwdebug -> mc* (a hedged iptables sketch follows the list) and gather metrics/data in the following cases:

  • block random shards at random intervals
  • block a shard for a long period of time (eg 1 hour, 2 hours, 2 days)
  • block multiple shards for a long period of time (eg 1 hour, 2 hours, 2 days)
  • block shards for a very long period of time (1 week)
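
A hedged sketch of how such a block can be applied and lifted (the exact rules may differ; this follows the form of the iptables command recorded later on this task, run as root on the appserver under test):

MC_SHARD=10.64.0.83   # hypothetical mc10xx address, substitute the shard under test
iptables -A INPUT -s "$MC_SHARD" -j DROP   # start the block: drop replies coming back from the shard
# ...leave in place for the duration of the test case (1 hour / 2 hours / 2 days / 1 week)...
iptables -D INPUT -s "$MC_SHARD" -j DROP   # lift the block and let mcrouter fail back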

Additionally, we will run a similar test to observe how mcrouter behaves when it fails over to a secondary set of proxies while replicating keys (aka gutter proxies).

Graphs and logs:

Testing in Production Roadmap
Initially we want to test the failover function in production with minimum risk. The keys we can easily afford to lose without user impact are the keys we replicate from eqiad to codfw
(/*/mw-wan keys) via the proxies. We can then move forward with trying out the gutter pool cluster on the canary servers.

Current issues (non blocking):

  • the per-server tko metric in the exporter did not seem to work - @elukey fixed it! Dev25/mcrouter_exporter
  • investigate if there are other failover metrics that we can use, and if they have value

Event Timeline

jijiki removed a subtask: Unknown Object (Task).Jan 5 2020, 7:57 PM
jijiki renamed this task from Plan to upgrade our application object caching service (memcached) to Upgrade and improve our application object caching service (memcached).Jan 5 2020, 8:34 PM
jijiki updated the task description. (Show Details)
jijiki triaged this task as Medium priority.Jan 6 2020, 2:45 PM
jijiki added projects: serviceops, SRE.
jijiki added subscribers: elukey, mark.

The gutter pool has been initially tested in Beta and looks good. To make this test work, we deployed a config (attached at the bottom) to the mcrouter running on deployment-mediawiki-07, where we set deployment-memc05 as the "main/primary" memcached server and deployment-memc08 as the failover one. In other words, mcrouter will fail over to deployment-memc08 when deployment-memc05 is marked with a TKO (unavailable). We also changed the mcrouter command line argument to --timeouts-until-tko=3 instead of --timeouts-until-tko=10, i.e. 3 consecutive timeouts (roughly 3s) before the shard is marked with a TKO. Lastly, every 60s we would block or unblock traffic towards deployment-memc05, so as to trigger TKOs.

Mcrouter would fail over to deployment-memc08 after 3s, which is what we configured. It would, however, take about 30s after we lifted the block towards deployment-memc05 for mcrouter to restore traffic to it.

Next step is to test this configuration on mwdebug* servers in production, as soon as the eqiad gutter pool is ready.

{
  "pools": {
    "eqiad": {
      "servers": [
        "deployment-memc05:11211:ascii:plain"
      ]
    },
    "eqiad-gutter": {
      "servers": [
        "deployment-memc08:11211:ascii:plain"
      ]
    }
  },
  "routes": [
    {
      "aliases": [
        "/eqiad/mw/"
      ],
      "route": {
        "type": "FailoverRoute",
        "children": [
          "PoolRoute|eqiad",
          "PoolRoute|eqiad-gutter"
        ],
        "failover_errors": [
          "tko"
        ]
      }
    },
    {
      "aliases": [
        "/eqiad/mw-wan/"
      ],
      "route": {
        "default_policy": "PoolRoute|eqiad",
        "operation_policies": {
          "delete": {
            "children": [
              "PoolRoute|eqiad"
            ],
            "type": "AllSyncRoute"
          },
          "set": {
            "children": [
              "PoolRoute|eqiad"
            ],
            "type": "AllSyncRoute"
          }
        },
        "type": "OperationSelectorRoute"
      }
    }
  ]
}

Couple of random thoughts:

  • we should check the diff between our mcrouter version, 0.37, and the last upstream 0.41, to see if any important bug was present/fixed for the failover route.
  • Where are those 30s of recovery from a failover coming from? Is it some default set by mcrouter?
  • When we are confident that mcrouter works with failover etc., I'd come up with a configuration to allow failover of the mw2 proxies. We currently use 4 mw2 hosts in codfw as mcrouter proxies to replicate certain keys from eqiad to codfw, but they are memcached shards from the point of view of mcrouter, so if timeouts are registered (say somebody reboots the host etc.) then a TKO happens and traffic is not replicated. We could use the failover route to use another 4 mw2 hosts as a "gutter pool" and increase resiliency. It should be an easy config to deploy, and it would also allow more in-depth testing in production (without affecting too much traffic).

@RLazarus has taken a look at 0.41 some time ago, not sure if he remembers something in that direction.

Change 569579 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hieradata: put memcached gutter hosts in cluster memcached_gutter

https://gerrit.wikimedia.org/r/569579

Change 569579 merged by Effie Mouzeli:
[operations/puppet@production] hieradata: put memcached gutter hosts in cluster memcached_gutter

https://gerrit.wikimedia.org/r/569579

Change 570059 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hieradata: put memcached gutter cluster in monitoring.yaml

https://gerrit.wikimedia.org/r/570059

Change 570059 merged by Effie Mouzeli:
[operations/puppet@production] hieradata: put memcached gutter cluster in monitoring.yaml

https://gerrit.wikimedia.org/r/570059

Mentioned in SAL (#wikimedia-operations) [2020-02-05T09:51:14Z] <effie> install libmemcached-tools on mc-gp* servers - T240684

Change 570364 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::mediawiki::memcached:gutter: set threads to 16

https://gerrit.wikimedia.org/r/570364

Change 570364 merged by Elukey:
[operations/puppet@production] role::mediawiki::memcached:gutter: set threads to 16

https://gerrit.wikimedia.org/r/570364

Just updated https://grafana.wikimedia.org/d/000000317/memcache-slabs adding a new row at the bottom, '1.5.x metrics', with all the new metrics of the prometheus exporter (available only when selecting memcached_gutter as the cluster).

On mwdebug1001 we have deployed a config similar to T240684#5826966, where we fail over to the gutter pool servers (mc-gp*) in case of a TKO. We have again set --timeouts-until-tko=3.

We have been conducting the following experiments on mwdebug1001, where in both cases we curl a list of 91 URLs through mwdebug1001:

A) Block access from mwdebug1001 to a random set of 3 memcached servers, for 60" or 240"

(will run again to get sample graphs)

B) Block access to 3 memcached servers "permanently"

https://logstash.wikimedia.org/goto/fc08db38a992f0aed931e9529f2eb249

https://grafana.wikimedia.org/d/5XF4XXyWz/memcache-gutter-pool?orgId=1&from=1581324360000&to=1581358259000

  • We would expect that after we block a shard or shards, we wouldn't see any memcached errors (or we would see a few) since mcrouter would failover to the gutter pool.
  • A handful of URLs are producing the majority of errors (could be related to the shards we blocked)

What now?

  • If there are specific keys that fail, find out what is so special about them
  • We are testing "failover_errors": "tko"; we also have to try "connect_timeout", "timeout" and "connect_error".
  • It takes about 60" for mcrouter to start using a shard again after a TKO; find out why (as @elukey has proposed)

Interesting discovery this morning while chatting with Effie.

Scenario:
All memcached shards down from mwdebug1001's point of view (outgoing traffic dropped via iptables by Effie). TKOs logged by mcrouter, then all traffic is expected to failover to the gutter pool.

Problem:
Mediawiki reports constant memcached errors. We'd have expected a spike when the traffic block happened, to then recover when mcrouter failed over to the gutter pool.

I got a pcap on mwdebug using sudo tcpdump -i lo tcp port 11213 -w test.pcap and noticed something like the following:

add WANCache:v:advisorywiki:page-restrictions:v1:129:411 4 86400 333
a:4:{i:0;i:1;i:1;a:2:{i:0;O:8:"stdClass":4:{s:7:"pr_type";s:4:"edit";s:9:"pr_expiry";s:8:"infinity";s:8:"pr_level";s:13:"autoconfirmed";s:10:"pr_cascade";s:1:"0";}i:1;O:8:"stdClass":4:{s:7:"pr_type";s:4:"move";s:9:"pr_expiry";s:8:"infinity";s:8:"pr_level";s:5:"sysop";s:10:"pr_cascade";s:1:"0";}}i:2;i:86400;i:3;d:1581410938.956572;}
SERVER_ERROR unavailable

Before that, some cas/get/etc.. commands were succeeding. Then I ran the following command:

elukey@mwdebug1001:~$ echo "get __mcrouter__.route(add,'WANCache:v:advisorywiki:page-restrictions:v1:129:411')" | nc -q 2 localhost 11213
VALUE __mcrouter__.route(add,'WANCache:v:advisorywiki:page-restrictions:v1:129:411') 0 16
10.64.0.53:11211
END

10.64.0.53 is mc-gp1001, so the add command seemed to be routed to the gutter pool as expected. I tried to run the add command manually via netcat, and got SERVER_ERROR unavailable in a reproducible way.
I then tried to change the name of the key (adding ssssss at the end), verified that it was routed to a different shard in the gutter pool, and retried the test. No errors reported.

The SERVER_ERROR unavailable returns immediately after issuing the command, so it shouldn't be related to any timeout; but, as the mcrouter docs say, it looks as if mc-gp1001 was considered to be in TKO state by mcrouter.

I also triple-checked the server stats to confirm that mc-gp1001 was not considered to be in TKO state (there is usually a tko string mentioned for the shard if so):

elukey@mwdebug1001:~$ echo stats servers | nc -q 2 localhost 11213 | grep 10.64.0.53
STAT 10.64.0.53:11211:ascii:plain:notcompressed-1000 avg_latency_us:569.655 pending_reqs:0 inflight_reqs:0 avg_retrans_ratio:0 max_retrans_ratio:0 min_retrans_ratio:0 up:5; touched:1061 found:206495 notfound:28909 notstored:9405 stored:19841 remote_error:1

Up to now it looks very weird. The mc-gp servers are running buster with a new version of memcached, new settings, etc., so it could be something related to that, but so far, as Effie pointed out, it looks like mcrouter has some weird issue with the failover route.

I also expected some failover metrics to be logged by our exporter, but I don't see any. The mcrouter stats list mentions something like cmd_[operation]_out_failover, which is also supported by our exporter:

https://github.com/Dev25/mcrouter_exporter/blob/ff0cd9b3e897fffac6c9bbd5079085e1193c55d4/main.go#L131

But I can only see:

elukey@mwdebug1001:~$ echo stats all| nc -q2 localhost 11213| grep failover
STAT failover_all 0
STAT failover_conditional 0
STAT failover_all_failed 0
STAT failover_rate_limited 0
STAT failover_inorder_policy 0
STAT failover_inorder_policy_failed 0
STAT failover_least_failures_policy 0
STAT failover_least_failures_policy_failed 0
STAT failover_custom_policy 0
STAT failover_custom_policy_failed 0

This also looks weird (all zeros; mcrouter should have registered errors). I can't find anything in the code that confirms which failover metrics we should see :(

@jijiki you were wondering why mc-gp1003 does not receive traffic, and I found this:

elukey@mwdebug1001:~$ echo stats suspect_servers | nc -q2 localhost 11213
STAT 10.64.48.32:1121 status:tko num_failures:2088
END

suspect_servers Similar to servers but returns list of servers that replied with an error to last request sent to them.

Restarted mcrouter on mwdebug, the error cleared:

elukey@mwdebug1001:~$ echo stats suspect_servers | nc -q2 localhost 11213
END

EDIT: After some debugging we found that it was a config issue related to mc-gp1003; now it seems to work as expected! Testing unblocked :)

jijiki renamed this task from Upgrade and improve our application object caching service (memcached) to Test gutter pool failover in production .Feb 11 2020, 1:50 PM
jijiki updated the task description. (Show Details)
jijiki renamed this task from Test gutter pool failover in production to Test gutter pool failover in production and memcached 1.5.x.Feb 12 2020, 9:44 AM
jijiki updated the task description. (Show Details)
jijiki updated the task description. (Show Details)

@elukey thank you for unblocking this !!!

Test if failover works and when to failover

Failover when we hit a TKO.
Failover works properly: when we block access from mwdebug1001 to all mc* hosts, it takes about ~10s (--timeouts-until-tko=10) until the TKO kicks in, and then mwdebug1001 switches to using the mc-gp1* hosts. During that brief interval, we see mediawiki errors like P10429

"Memcached error for key "WANCache:v:global:SqlBlobStore-blob:enwiki:tt%3A473085455" on server "127.0.0.1:11213": A TIMEOUT OCCURRED"

which go away as soon as mcrouter fails over to the gutter pool. We generally tested the case of failing over when a server TKOs.

Failover on other errors
With other failover strategies (connect_timeout, timeout, connect_error), we just got errors like P10433

"Memcached error for key "WANCache:v:global:SqlBlobStore-blob:enwiki:tt%3A566487073" on server "127.0.0.1:11213": SERVER ERROR"

memcached traffic didn't seem to fail over, so we didn't test them any further.

Test failover routes

We have two options here:

  • FailoverRoute, where mcrouter sends a request to the first child and, if that errors, moves to the second. P10436
    {
      "aliases": [ "/eqiad/mw/" ],
      "route": {
        "type": "FailoverRoute",
        "children": [ "PoolRoute|eqiad", "PoolRoute|eqiad-gutter" ],
        "failover_errors": [ "tko" ]
      }
    }
  • FailoverWithExptimeRoute, where we also define how long the keys updated (eg set/add) during the failover will live on the gutter pool. All keys set during a failover with the following config had a TTL of 10s. P10437
    {
      "aliases": [ "/codfw/mw/" ],
      "route": {
        "type": "FailoverWithExptimeRoute",
        "normal": "PoolRoute|codfw",
        "failover": "PoolRoute|codfw-gutter",
        "failover_exptime": 10,
        "failover_errors": [ "tko" ]
      }
    }

Test 'gutter proxies', aka replicating /*/mw-wan/ keys

We used mwdebug2001.codfw.wmnet as the 'gutter proxy' for codfw. We did similar tests where mcrouter happily started using the gutter proxy to send keys to the other DC. In the following config we tell mcrouter to use eqiad and eqiad-gutter for all commands except delete and set.

{
      "aliases": [ "/codfw/mw-wan/"],
      "route": {
        "type": "OperationSelectorRoute",
        "default_policy": {
            "type": "FailoverRoute",
            "children": [ "PoolRoute|eqiad", "PoolRoute|eqiad-gutter" ],
            "failover_errors": [ "tko" ]
        },
        "operation_policies": {
          "delete": {
            "type": "AllSyncRoute",
            "children": [ "AllSyncRoute|Pool|codfw", "AllSyncRoute|Pool|codfw-gutter" ]
            },
          "set": {
            "type": "FailoverRoute",
            "children": [ "PoolRoute|codfw", "PoolRoute|codfw-gutter" ],
            "failover_errors": [ "tko" ]
         }
        }
      }
    }

Other comments

  • Mcrouter does not log when it has failed over
  • It is not completely predictable how fast an mcrouter will fail over; --timeouts-until-tko=10 is an indication, provided the shard is used extensively. The same goes for failing back when a shard is available again: failing back might take from a few seconds up to 30s
  • After a failover, when the original shard is available again, it will of course serve stale keys. We need to discuss how we will deal with this
  • With the configuration we used for /codfw/mw-wan/, deletes are sent to all hosts in the codfw and codfw-gutter pools. We need to investigate whether that is useful to us or not
  • In the original /codfw/mw-wan/ config, as well as the ones tested here, we do have a special route for set, but not for add; we need to find out if that is what we need.
  • Regarding delete, mcrouter keeps deletes in a log, which it replays to the original shard when it comes back online. For a brief amount of time (between the shard being unmarked as TKO and mcrouter replaying the log), we may get stale keys. This needs to be tested again.

Very nice summary, thanks!

A couple of questions:

FailoverWithExptimeRoute, where we also define how long the keys updated (eg set/add) during the failover will live on the gutter pool. All keys set during a failover with the following config had a TTL of 10s.

What happens if we try to set a key with TTL of say 5s? Will it get TTL 10s or 5s? I would hope for 5s but I suspect it might not be the case. I am asking since setting a high enough TTL like 10/20/30 mins could be good if keys set with lower TTLs are not affected.

We used mwdebug2001.codfw.wmnet as the 'gutter proxy' for codfw. We did similar tests where mcrouter happily started using the gutter proxy to send keys to the other DC. In the following config we tell mcrouter to use eqiad and eqiad-gutter for all commands except delete and set.

This bit is not clear to me - set/deletes should use the eqiad pools and also the codfw proxies (regular pool + gutter), right?

A few notes:

  • We cannot really worry too much about stale keys over failovers - we had a system before mcrouter where this was happening regularly and we never had too many issues. Moreover, if a caching system cannot support stale reads, it's a database, not a caching layer.
  • I'd be more aggressive on the failover side, going down to less than 5 seconds for the failover to happen.
  • I would use the FailoverWithExptimeRoute, and set a reasonable TTL there (1 minute?)

About the TTL, I'd involve Timo and Aaron. For some keys, that are expensive to generate or that might cause a ton of traffic if regenerated and set frequently, it might be harmful to set a low TTL like 1 minute. We'd need to find a good compromise, but I don't have enough experience with MediaWiki to propose one :)

One use case for the failover is, in my opinion, being comfortable using the gutter cluster for some hours. For example, if a mc10XX shard shows constant saturation of rx/tx bandwidth, it would be great to just disable puppet on it, stop memcached for the time needed and let the traffic flow to the gutter cluster (with 10G) without a constant flap. Having keys all of a sudden with a 1 min TTL might cause even more traffic to the gutter servers; they have better network connectivity, but we should be careful anyway.

Very nice summary, thanks!

A couple of questions:

FailoverWithExptimeRoute, where we also define how long the keys updated (eg set/add) during the failover will live on the gutter pool. All keys set during a failover with the following config had a TTL of 10s.

What happens if we try to set a key with TTL of say 5s? Will it get TTL 10s or 5s? I would hope for 5s but I suspect it might not be the case. I am asking since setting a high enough TTL like 10/20/30 mins could be good if keys set with lower TTLs are not affected.

When we set a key with a TTL lower than what we have defined in `failover_exptime`, it is respected; higher TTLs are capped to `failover_exptime`.
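
A minimal way to see this capping from an appserver (assuming mcrouter listens on localhost:11213, is currently failing over to the gutter pool, and failover_exptime is set to e.g. 600; the key names are made up):

# TTL below the cap: stored as-is (expect STORED; the key expires after 30s)
printf 'set WANCache:test:short 0 30 5\r\nhello\r\n' | nc -q 2 localhost 11213
# TTL above the cap: stored on the gutter shard with its TTL reduced to failover_exptime
printf 'set WANCache:test:long 0 86400 5\r\nhello\r\n' | nc -q 2 localhost 11213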

We used mwdebug2001.codfw.wmnet as the 'gutter proxy' for codfw. We did similar tests where mcrouter happily started using the gutter proxy to send keys to the other DC. In the following config we tell mcrouter to use eqiad and eqiad-gutter for all commands except delete and set.

This bit is not clear to me - set/deletes should use the eqiad pools and also the codfw proxies (regular pool + gutter), right?

We can use "type": "AllSyncRoute" like in the example for deletes where a delete will be forwarded to both the main and the gutter pool, where in turn it will be forwarded to its respected shard, if children are using PoolRoute e.g. "children": [ "PoolRoute|codfw", "PoolRoute|codfw-gutter" ]. We can discuss that if we want it only for /*/wm-wan keys or for all keys.

A few notes:

  • We cannot really worry too much about stale keys over failovers - we had a system before mcrouter where this was happening regularly and we never had too many issues. Moreover, if a caching system cannot support stale reads, it's a database, not a caching layer.

I agree with Luca. @aaron, @Krinkle, what is your opinion?

  • I'd be more aggressive on the failover side, going down to less than 5 seconds for the failover to happen.

We can control this a bit more with the --timeouts-until-tko command line option; I am not very sure we gain much from a more sensitive failover mechanism. After a server is marked as TKO, mcrouter will periodically send probes until the shard is back online, and if it is, it will switch back. We risk the case where we have minor network glitches, or a server's NIC being briefly saturated every X seconds, and mcrouter fails over back and forth.

We should keep in mind that the gutter pool is cold, so failing over will initially produce some load. We could keep the gutter pool warm by also forwarding to it all commands we send to the main pool, but whether that would be useful is a different discussion.

  • I would use the FailoverWithExptimeRoute, and set a reasonable TTL there (1 minute?)

This is a bit debatable. The way I see it, we have some keys with long TTLs (eg a day); if we keep a shard down for, say, 1hr, such a key will be created and stored multiple times within that hour. Furthermore, if that key requires processing in order to be produced, we will put the mw servers under some extra load within that hour.

To proceed with testing, we will puppetise the following configuration, roll it out to a couple of canary servers, and block traffic towards mc* like we did earlier. Our goal is to test memcached 1.5.x on Debian Buster, since that is what the gutter pool is running.

@aaron @Krinkle Please let us know if the configuration and the information on this task are enough to proceed with testing the gutter proxies for /*/mw-wan/, or if there is anything that needs further investigation/explanation.

What we are going to do is block traffic from mwdebug* and possibly a couple of canary app servers towards the 4 mcrouter proxies in codfw, this way we will force those servers to use the codfw gutter proxies. Does that sound ok as a first test?

Two more things to put in the mix. Firstly, it is not clear yet if failing over to the gutter pool is something that will help us with short unavailabilities. This is something that we will see in practice (or if we decide to run chaos engineering-like tests in prod to simulate that behaviour). On the other hand, we are more confident that this will be very useful when we need to perform maintenance tasks on memcached servers (eg reimaging.)

Secondly, regardless of when we use the gutter pool, do you think we need to clear the keys of the gutter pool when we fail back to the main cluster? Our general assumption so far has been that it does not matter much; is that so?

A few notes:

  • We cannot really worry too much about stale keys over failovers - we had a system before mcrouter where this was happening regularly and we never had too many issues. Moreover, if a caching system cannot support stale reads, it's a database, not a caching layer.

Yeah, we don't depend on mutable cache for important DB writes as a policy. Failover routing should only trigger on TKO, which reduces inconsistency issues from dropped packets and fast flapping. Nevertheless, the TKO trigger thresholds will always be much smaller than the failover max cache value TTL, so relaying deletes to the gutter servers normally seems helpful for slower flapping and split-brains (when some sides of a network partition can reach the gutter servers).

  • I'd be more aggressive on the failover side, going down to less than 5 seconds for the failover to happen.

Yeah, the eviction ("tko") delay should be low to avoid prolonging DB traffic spikes. Gutter tombstoning and FailoverWithExptimeRoute helps mitigate flapping from low TKO trigger thresholds.

  • I would use the FailoverWithExptimeRoute, and set a reasonable TTL there (1 minute?)

1 minute should be OK for most keys. It will cause evictions unexpected by WANCache that "lowTTL"/"hotTTR" cannot avoid. That could lead to brief query spikes, though various mitigations have been made over the years to reduce cache set()/CAS/ADD spike size; https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/570979/ is still pending CR. It's worth testing out 1 minute once that lands.

Couple of random thoughts:

  • we should check the diff between our mcrouter version, 0.37, and the last upstream 0.41, to see if any important bug was present/fixed for the failover route.

It uses MIN( failover TTL, store command TTL), which makes sense to me. See https://github.com/facebook/mcrouter/blob/c5fdc21fb3e9d0f32d01d1878896006b61ce2c79/mcrouter/routes/FailoverWithExptimeRouteFactory.h

Other comments

  • It is not completely predictable how fast an mcrouter will fail over; --timeouts-until-tko=10 is an indication, provided the shard is used extensively. The same goes for failing back when a shard is available again: failing back might take from a few seconds up to 30s

Probably more reason to use a smaller nominal TKO until-timeout.

  • After a failover, when the original shard is available again, it of course serves stale keys. We need to discuss how we will deal with this

The default "hotTTR" value in https://github.com/wikimedia/mediawiki/blob/master/includes/libs/objectcache/wancache/WANObjectCache.php will probably correct the most popular keys in due course (an exceptional case would be when a key is only hit from an access pattern involving heavy CDN caching, were the object cache access rate is itself low). We already have this problem now if a cache server becomes inaccessible (but does not restart) for a while and purges get lost. Improvements can perhaps be made, though I don't we'd be much worse off than now unless --probe-timeout-initial was set very high.

  • With the configuration we used for /codfw/mw-wan/, deletes are sent to all hosts in the codfw and codfw-gutter pool. We need to investigate if that is useful to us or not

This was my early "strict" configuration idea for WANCache, from when an implicit operation-based broadcasting model was being considered and a failover model like that of twemproxy was in my mind (tombstones to all servers/DCs). I ended up going with wildcard route key broadcasting. The config was optimized to just follow the normal hashing for purges, especially given the lack of failover.

I think having tombstones/check-keys (SET) and check key resets (DELETE) also go to all gutter servers/DCs using an async route is doable. Tombstone traffic is far less than get/add/cas, by qps and mbps, and gutter servers would normally have even lower get/add/cas traffic than other servers (e.g. when nothing is failing). It would help with some split-brain cases, flapping between gutter <=> normal servers. OTOH, we won't failover *within* the gutter pool, so tombstones really only need to follow the normal PoolRoute to the gutter pools in each DC (one cache key mapping to one server per DC). That would keep purges "horizontally scalable" for good measure.

  • In the original /codfw/mw-wan/ config, as well as the ones tested here, we do have a special route for set, but not for add, we need to find out if that is what we need.

add/cas/touch should follow the same routing as 'get' and should be synchronous as well. They should ideally go to one server at a time, and only to the local DC. No need to involve gutter servers outside of TKO cases.

  • Regarding delete, mcrouter keeps deletes in a log, which replays it to the original shard when it comes back online. For a brief amount of time (between the shard being unmarked as TKO and mcrouter replays the log), we may get stale keys. This needs to be tested again.

From https://github.com/facebook/mcrouter/wiki/Features#reliable-delete-stream it seems like mcrouter would just record the log but not replay it (custom software is needed for that AFAIK). I also find the wording on their wiki, "mcrouter ensures that write to disk completes fully before sending the reply", a bit silly, as that would require slow fsync()/O_DIRECT stuff and I see none of that at https://github.com/facebook/mcrouter/blob/d9f7e8691b9f692e2b4bbcf8dacede855cb654c9/mcrouter/AsyncLog.cpp .

Secondly, regardless of when we use the gutter pool, do you think we need to clear the keys of the gutter pool when we fail back to the main cluster? Our general assumption so far has been that it does not matter much; is that so?

With FailoverWithExptimeRoute, I doubt that would be necessary. Staleness would be more of an issue when switching back off the gutter server.

Yeah, the eviction ("tko") delay should be low to avoid prolonging DB traffic spikes. Gutter tombstoning and FailoverWithExptimeRoute helps mitigate flapping from low TKO trigger thresholds.

Our concern here is that if we have a low TKO threshold, we might end up failing over and failing back repeatedly. Given that the gutter pool will be cold, this could eventually produce more load and spikes. Either way, this is something we can easily test in production and decide based on the results. We can cause artificial network flapping on a specific mcrouter (a sketch follows) and see how the server behaves overall. We can try blocking one shard or all of them.
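
A sketch of such artificial flapping (the same kind of iptables rule later recorded in SAL on this task; run as root on the appserver under test, and the shard address is only an example):

# Randomly drop ~10% of the replies from one shard, so mcrouter hovers around the TKO threshold:
iptables -A INPUT -s 10.64.32.208 -m statistic --mode random --probability 0.1 -j DROP
# ...observe TKO/failover metrics and mediawiki errors, then remove the rule:
iptables -D INPUT -s 10.64.32.208 -m statistic --mode random --probability 0.1 -j DROP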

  • I would use the FailoverWithExptimeRoute, and set a reasonable TTL there (1 minute?)

1 minute should be OK for most keys. It will cause evictions unexpected by WANCache that "lowTTL"/"hotTTR" cannot avoid. That could lead to brief query spikes, though various mitigations have been made over the years to reduce cache set()/CAS/ADD spike size; https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/570979/ is still pending CR. It's worth testing out 1 minute once that lands.

1 minute could be OK if our case were only short unavailabilities. One of the reasons we decided to try out the gutter pool is to assist with longer shard downtimes, eg the upcoming memcached upgrade to Buster. We will have to keep a shard down (and failed over) for a few hours. If we have keys that normally have a TTL of eg 1h, lowering that to 1 min is likely to cause a problem. The suggestion here is to start with a larger TTL and work our way towards the best fit. As mentioned earlier, mcrouter will respect TTLs lower than FailoverWithExptimeRoute's TTL.

There is one more solution that might be worth at least testing (and deploying if it makes sense). We could consider using AllAsyncRoute for updates for /eqiad/mw (or /codfw/mw). That way we would always have a warm gutter pool, which has the potential to work well even for short shard unavailabilities. Of course, we would take a hit in terms of traffic, which is something we can measure to determine how much of a problem it would be.

@aaron Please have a look at https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/569541/ and let us know if it reflects what we are going for, or if there is some nitpicking to be done. We can afterwards merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/574200/ to test routes for the /*/mw-wan/ keys.

1 minute could be OK if our case were only short unavailabilities. One of the reasons we decided to try out the gutter pool is to assist with longer shard downtimes, eg the upcoming memcached upgrade to Buster. We will have to keep a shard down (and failed over) for a few hours. If we have keys that normally have a TTL of eg 1h, lowering that to 1 min is likely to cause a problem. The suggestion here is to start with a larger TTL and work our way towards the best fit. As mentioned earlier, mcrouter will respect TTLs lower than FailoverWithExptimeRoute's TTL.

I think 2 minutes is much safer than 1 minute since it is higher than WANObjectCache::AGE_NEW. This means that "hotTTR" can refresh very popular keys before they expire (within a 1 minute window). 1 hour is fine as a starting value, I suppose, given a fairly high TKO threshold and purges always going to both when possible.

There is one more solution that might be worth at least testing (and deploying if it makes sense). We could consider using AllAsyncRoute for updates for /eqiad/mw (or /codfw/mw). That way we would always have a warm gutter pool, which has the potential to work well even for short shard unavailabilities. Of course, we would take a hit in terms of traffic, which is something we can measure to determine how much of a problem it would be.

That would be tricky. Most key sets are ADD/CAS, the former for new/evicted values and the latter for updates (pre-emptive refresh or indirect invalidation). A replicated CAS request (say to a normal and a gutter server) is likely to fail on all but one server (the one that handled the GET). ADD also relates to various "lock" key commands. I'd imagine that really hot keys are more likely to get updates in the form of CAS.

Mentioned in SAL (#wikimedia-operations) [2020-04-21T10:49:05Z] <_joe_> mwdebug1001:~# iptables -A INPUT -s 10.64.32.208 -m statistic --mode random --probability 0.1 -j DROP (T240684)

Krinkle edited projects, added Performance-Team (Radar); removed Performance-Team.
Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.
Krinkle added a subscriber: aaron.

Update - The gutter pools are live. The conversation here does not look finished though, so it'd be good to sync up so that we know where things stand.

  • When was it enabled for non-debug servers?
  • Is the TTL capped on gutters, and if so, to how much, and where is this configured?
  • Is it expected that all app servers will generally TKO the memc backend?
    • If so, how long would it typically take for all of them to have discovered and decided to TKO? (In other words, how many seconds do we need to add to the overlapping window of uncertainty).
    • If not, then we need to broadcast the mw-wan events to gutter pools even when a server is not in TKO. This should be fine since the mw-wan events should only make up a small portion of the overall mcrouter traffic. Do we do this currently? If not, can we?

@aaron You may want to ask additional questions and/or correct something I got wrong. E.g. can you give a short summary of what the "hard" expectations/needs are from WANCache's perspective?

Trying to answer :)

Update - The gutterpools are live. The conversation here does not look finished though, so it'd be good to sync up so that we know what the state of things are.

  • When was it enabled for non-debug servers?

https://phabricator.wikimedia.org/T244852#6089302

  • Is the TTL capped on gutters, and if so, to how much, and where is this configured?

It is 10 mins, and we set it in hiera like the following:

https://github.com/wikimedia/puppet/blob/c20a90fc8f1569250ee097358119992b88897bb5/hieradata/role/common/mediawiki/appserver.yaml#L10

  • Is it expected that all app servers will generally TKO the memc backend?

It depends; the decision to mark a shard as TKO is taken by every mcrouter independently. Usually, when the failure is severe for a mc shard (host down, severe network congestion, etc.), all the mcrouters converge quickly on flagging the misbehaving shard with a TKO.

  • If so, how long would it typically take for all of them to have discovered and decided to TKO? (In other words, how many seconds do we need to add to the overlapping window of uncertainty).

For a single mcrouter, it takes 10 consecutive 1s timeouts to flag a mc shard as TKO. For all mcrouters to converge I don't have a good answer, but I think not a lot longer, especially for severe issues like a host being down.

  • If not, then we need to broadcast the mw-wan events to gutter pools even when a server is not in TKO. This should be fine since the mw-wan events should only make up a small portion of the overall mcrouter traffic. Do we do this currently? If not, can we?

We don't, and doing it would require a big change in the mcrouter config IIUC. The failover route (with TTL) is used only under certain conditions, like TKOs, and the sets are all time-capped with a 10 min TTL. Broadcasting mw-wan commands is very tricky in this scenario; is there a valid use case to support this request?

@aaron You may want to ask additional questions and/or correct something I got wrong. E.g. can you give a short summary of what the "hard" expectations/needs are from WANCache's perspective?

(Moving to team inbox for next meeting.)

We don't, and doing it should require a big change in the mcrouter config IIUC. The Failover Route (with TTL) is used only under certain conditions, like TKOs, and the sets are all time capped with 10 mins TTL. Broadcasting mw-wan commands is very tricky in this scenario, is there a valid use case to support this request?

I'll try to be more explicit in the spirit of understanding each other: we don't and we won't.

Using the gutter pool can introduce short term inconsistencies that should be expected from a caching system. This is still exponentially better than anything we had before in terms of consistency[1], but we also have to consider availability:

if the gutter pool gets used, it's because we're outside of normal operating conditions and the alternative would be just failing badly. Adding complexity to an already overly complex setup to avoid any inconsistency in a caching layer seems like not only an overkill, but also just the wrong approach.

[1] To clarify, our evolution has been:

  • nutcracker (up to circa 2018): no consistency guarantee, constant resharding, and very high availability guarantee
  • mcrouter plain: generally better consistency and no auto-resharding, but can still lose operations during failures, and has lower availability
  • mcrouter + gutter pool: the best consistency (because deletes get replayed during a failover) and availability of the three solutions.

So, if we have a use case where we CANNOT lose a delete to memcached, I would say that use-case is better served by some datastore with ACID transactions and should not be in memcached.

We don't really need purges to go to the gutter cache, given the low TTL there.

Lost purges during convergence is undesirable though acceptable given the nature of the system as a lightweight cache. Longer lasting network split-brains are interesting, though out of scope for such a lightweight cache. This is why we avoid updating the DB based on cache, have ?action=purge, adaptive TTLs in some places, <1 day TTLs in many places, "hotTTR", and so on.

The current setup seems acceptable IMO.

  • mcrouter + gutter pool: better consistency (because deletes get replayed during a failover) […]

MediaWiki does not use DELETE. This came up during the on-host-memcached meetings (meeting notes), but recording it here as well for future reference. MW uses broadcasted SET tombstones with a ttl of 10 seconds (more about that on Wikitech). As such, the gutter pool's TTL of 5 minutes goes very significantly beyond the ~11s tolerance. I would therefore expect in the new setup that whenever one app server's mcrouter perceives one of its Memc servers as down and enters TKO, we will miss all purges issued by that app server, both intra-dc (and cross-dc?). For all other app servers, and once that TKO expires, the original cache key will still be there with its stale value and potentially multiple days of TTL left, with no eventual consistency happening before then.

I might be missing something, but that doesn't seem great. Per @aaron above, I don't think it's a "problem" given everything else. It just makes it difficult to document and explain how things work (which is what I'm currently trying to do). Anyway, before closing this task my only question is to confirm that the above is accurate so that I can try to document that.

  • mcrouter + gutter pool: better consistency (because deletes get replayed during a failover) […]

MediaWiki does not use DELETE. This came up during the on-host-memcached meetings (meeting notes), but recording it here as well for future reference. MW uses broadcasted SET tombstones with a ttl of 10 seconds (more about that on Wikitech). As such, the gutter pool's TTL of 5 minutes goes very significantly beyond the ~11s tolerance. I would therefore expect in the new setup that whenever one app server's mcrouter perceives one of its Memc servers as down and enters TKO, we will miss all purges issued by that app server, both intra-dc and cross-dc. For all other app servers, and once that TKO expires, the original cache key will still be there with its stale value and potentially multiple days of TTL left, with no eventual consistency happening before then.

I might be missing something, but that doesn't seem great. Per @aaron above, I don't think it's a "problem" given everything else. It just makes it difficult to document and explain how things work (which is what I'm currently trying to do). Anyway, before closing this task my only question is to confirm that the above is accurate so that I can try to document that.

A clarification about the 5 min TTL - this is the upper bound on the TTL values that a key can be set with when mcrouter on an appserver uses the gutter pool. Any TTL lower than that will be left unchanged, so in this case the 10s for a tombstone is not changed to 5 mins. I completely get your point about setting a tombstone on the gutter pool; it is surely not an ideal scenario. Long term, we could think about removing any "state" from memcached to avoid these kinds of problems, but I am aware that it is a long and difficult task :(

As mentioned above, mcrouter keeps an async log to record DELETEs that happened during a TKO, allowing them to be replayed when the TKO is over. I completely get why we use tombstones, but maybe a possible solution (long term) could be to move to something that leverages this feature rather than SET/tombstones.

In T240684#6339437, @Krinkle wrote:

I might be missing something, but that doesn't seem great. Per @aaron above, I don't think it's a "problem" given everything else. It just makes it difficult to document and explain how things work (which is what I'm currently trying to do). Anyway, before closing this task my only question is to confirm that the above is accurate so that I can try to document that.

Let me try to help you with this: memcached is a cache used for speed and scale, and has no guarantee whatsoever of consistency, reliability, or persistence. Speed is really the only thing that matters, and in fact memcached is orders of magnitude faster than any datastore that offers any of those features.

When using memcached at high volumes and heavily distributed like we do you have to assume that:

  • Two servers can and will see different values at the same time
  • Two requests from the same server might see different values at the same time
  • You can set a value now, and it can disappear before you fetch it two lines of code later
  • You can remove a value, and it can still be seen elsewhere

While these conditions are rare enough that it's ok to not program defensively against them, it's definitely a bad idea to rely on the idea of consistency and guarantees this storage system can't provide you.

The gutter pool is just a measure that improves *availability* at large scale, and helps with consistency if you use DELETEs.

Btw, I'm not strictly convinced that MediaWiki doesn't use DELETEs; per mcrouter stats, we send out about 1k deletes/second from MediaWiki.

Yeah, technically, all sorts of anomalies are possible, so callers should always (a) avoid DB updates based on cache data, (b) choose reasonable, possibly adaptive, TTLs for the given use case (e.g. would the world suffer if cache was wrong for X days/hours/minutes?), (c) be cognizant of CDN/ObjectCache TTL interactions (race conditions already dictate this awareness anyway, even without lost updates).

The goal of WANCache is all "speed" and "best-effort correctness". There is DB lag and race-condition mitigation, which should be correct if there are no network partitions nor network slowness. While we want to operationally minimize lost packets, flapping, split brains, lost purges, and so on, they can and will happen sometimes. In PACELC terms, it's a PA/EL system. WANCache only has to be "correct" if the network is both non-partitioned and responsive (as defined by our ms timeouts/SLOs). Since the network is reliable enough that the vast majority of operations can succeed, then cache-related features should *normally* "just work"...degrading sometimes only when the network has notable issues.

DELETE is used by resetCheckKey() and also by delete() when given a 0 second "holdoff" by a caller (overriding the default).