mcrouter production architecture
Closed, Resolved, Public

Description

We need to decide how to set up mcrouter in production.

The requirements are:

  1. All memcached traffic must go through mcrouter, and be routed consistently
  2. All cross-dc traffic (basically, the replication of SET and DELETE commands) must be encrypted
  3. Mcrouter should ensure row-level redundancy

We're currently testing how mcrouter-to-mcrouter encrypted connections work, and it seems they only work with IP-based SANs, which are not currently supported by puppet, adding another layer of complexity.
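
For reference, a CSR with an IP-based SAN can be generated outside puppet with plain openssl; the key and host names below are hypothetical, and -addext needs OpenSSL 1.1.1 or later:

openssl req -new -newkey rsa:2048 -nodes \
  -keyout mcrouter.key -out mcrouter.csr \
  -subj "/CN=mc1001.eqiad.wmnet" \
  -addext "subjectAltName = IP:10.64.0.10"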

Strawman proposal

Install two mcrouter instances per appserver row, giving them proper CNAMEs, and make them communicate with:

  • the local memcacheds directly
  • the memcacheds in the other datacenter via SSL, by means of the mcrouter machines in that datacenter.

I would prefer this to going the same way we went with nutcracker (installing the proxy on all appservers), because of the need for inter-dc connections and of the complexity of managing certificates with IP-based SANs (which don't allow wildcards).
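
To make the topology concrete, here is a minimal sketch of the pools section of such a mcrouter config; the host names, addresses and ports are hypothetical, and I'm assuming the trailing :ssl suffix in a server entry is what marks it as reached over TLS:

{
  "pools": {
    "eqiad-local": {
      "servers": [ "10.64.0.10:11211", "10.64.0.11:11211" ]
    },
    "codfw-proxies": {
      "servers": [ "mcrouter-proxy1.codfw.wmnet:11214:ssl", "mcrouter-proxy2.codfw.wmnet:11214:ssl" ]
    }
  }
}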

Event Timeline

After some consideration, I see three options moving forward:

Option A:

  • Mcrouter is installed on one memcached host per row; the size of the memcached db on those hosts will be somewhat reduced as a consequence (mcrouter uses some memory)
  • Possibly we want to cap mcrouter's memory usage, or even dedicate some servers to it; but that would seem like a waste of resources
  • MediaWiki gets configured to talk to nutcracker locally on each appserver, using the murmur hash function (the fastest one) and random distribution, with the dc-local mcrouter instances as backends (see the sketch below)
  • Mcrouter has all of the mc* hosts in the local dc as a local pool, and the remote mcrouters, reached via SSL, as a remote pool
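
A minimal nutcracker (twemproxy) sketch of that appserver-side pool, assuming a hypothetical listener port and backend addresses:

mcrouter_pool:
  listen: 127.0.0.1:11213
  hash: murmur
  distribution: random
  auto_eject_hosts: false
  timeout: 250
  servers:
    - 10.64.0.10:11213:1
    - 10.64.32.10:11213:1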

Option B:

  • Set up something like stunnel on the memcached hosts, listening on port 11212 (see the sketch below)
  • Add mcrouter to every appserver, in place of twemproxy for memcached connections (we will still need twemproxy for redis)
  • Configure mcrouter to talk to the local dc via port 11211, and to the remote dc via port 11212 with TLS
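
A minimal server-side stunnel sketch for the memcached hosts; the service name and certificate paths are hypothetical:

[memcached-tls]
; terminate TLS on 11212 and forward to the local memcached
accept = 11212
connect = 127.0.0.1:11211
cert = /etc/stunnel/memcached.pem
key = /etc/stunnel/memcached.key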

Option C:

  • We install mcrouter without SSL on all memcached hosts
  • Nutcracker on the appservers will load-balance across all those instances (as in option B)
  • Mcrouter is configured to talk to both pools, local and remote, unencrypted
  • We use IPsec to secure cross-dc communications; this would mean every server needs to hold 18 IPsec connections (see the sketch below)
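
A minimal strongSwan transport-mode sketch of one such IPsec association; the addresses, names and cipher choice are assumptions, and PSK is used only for illustration:

conn mc1001-mc2001
    # transport mode: encrypt host-to-host traffic without a tunnel interface
    type=transport
    left=10.64.0.10
    right=10.192.0.10
    # PSK for illustration only (secret would live in ipsec.secrets);
    # production would use certificate-based auth
    authby=psk
    esp=aes128gcm16!
    auto=start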

Let's break down the pros and cons I see:

Option A:
Pros:

  • no indirection layer, SSL is supported natively
  • transition is relatively easy
  • Only one piece of software on the appservers
  • Cross-dc unencrypted traffic is impossible

Cons:

  • much more complex than our current setup
  • Creates more serious points of failure

Option B:
Pros:

  • Very simple; it would be mostly straightforward to configure
  • The solution most resilient to single-machine failure
  • Cross-dc unencrypted traffic is impossible

Cons:

  • We rely on the availability of stunnel or something similar for cross-dc sets
  • We have to run nutcracker and mcrouter at the same time on the appservers

Option C:
Pros:

  • transition is relatively easy
  • Only one piece of software on the appservers

Cons:

  • Cross-dc unencrypted traffic is possible; or, alternatively, we rely on the availability of the IPsec tunnel for cross-dc sets

At the moment, my choice would be option C if we decide cross-dc unencrypted traffic is admissible for short periods of time, because of its relative simplicity, and option B otherwise, as it's the most straightforward and reasonable solution IMHO. I must say the only option that would surely work is option C; in the other cases we have the following risks:

  • in option A, we have to avoid circular replication of SET and DELETE commands. That might be doable via preserve_route_prefix in mcrouter, but it needs testing
  • in option B, we have to ensure that changing the server port doesn't change the hash placement in mcrouter; if it did, keys written from one DC would be unreadable in the other (see the sketch below)
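
A minimal ketama-style sketch of that risk; this is not mcrouter's actual hashing code, just an illustration that the full server string, port included, determines where keys land:

import hashlib

def ring(servers, points_per_server=100):
    # place each server at many pseudo-random points on the ring,
    # derived from its full "host:port" string
    points = []
    for s in servers:
        for i in range(points_per_server):
            h = int(hashlib.md5(f"{s}-{i}".encode()).hexdigest(), 16)
            points.append((h, s))
    return sorted(points)

def lookup(points, key):
    # a key maps to the first server point at or after its own hash
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    for point, server in points:
        if h <= point:
            return server
    return points[0][1]  # wrap around the ring

plain = ring(["mc1001:11211", "mc1002:11211"])
tls = ring(["mc1001:11212", "mc1002:11212"])  # same hosts, stunnel port

moved = sum(
    lookup(plain, f"key{i}").split(":")[0] != lookup(tls, f"key{i}").split(":")[0]
    for i in range(1000)
)
print(f"{moved}/1000 keys land on a different host when only the port changes")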

I have an additional question: what is the expected behaviour in the following failure scenarios, for each option?

  • One of the hosts running mcrouter dies (and are there differences between the local DC and the remote DC?)
  • The mcrouter process dies on one host (and are there differences between the local DC and the remote DC?)

With options A and C, the failure scenario is basically that a single mcrouter failure should only affect a few requests across all appservers, while in option B the failure of one mcrouter would make all requests from one appserver fail, including all GETs, SETs and DELETEs, pretty much like it is now with twemproxy.

After some tests:

  • Option B wouldn't work. Using a different host:port combination changes the host_id in consistent hashing, so the keys end up on another host.
  • Option A works, as long as certs include an IP-based SAN. It needs a code tweak at the PHP level: we need to prepend /$dc/mw-wan/ or some other dc-specific prefix to the keys WANCache sets and fetches to/from memcached. This made me consider rolling out the new system per-wiki rather than per-percentage of traffic.
  • Option C works as well, and has the same deployment needs as option A.

As things stand, I am considering a tweaked version of option A: install mcrouter on each appserver, and use N hosts per DC as proxies. This needs a solution for generating certificates with IP-based SANs automatically, or something like that; basically a better CA than puppet. We could live with manual generation via cergen for the time being, though.

Change 432977 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[cergen@master] Add the ability to create CSRs with IP SANs as well

https://gerrit.wikimedia.org/r/432977

Change 432977 merged by jenkins-bot:
[cergen@master] Add the ability to create CSRs with IP SANs as well

https://gerrit.wikimedia.org/r/432977

If SET/DELETE go to all mc* servers in the wancache-(eqiad/codfw) pools (as mediawiki_wancache is configured to do in puppet), then option B would still work, since the consistent hashing wouldn't matter. Having broadcast operations go to all mc* servers rather than just one per DC (based on hash) is not required for WANCache, and keeping it this way wouldn't scale well if the rate of those (purge) operations increased hugely for some reason. I do like the conceptual simplicity, though.
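
For illustration, a minimal sketch of what prefix-based replication looks like in mcrouter config terms; the pool names, addresses and exact route types are assumptions, not the production config (JSON allows no comments, so all caveats live here):

{
  "pools": {
    "eqiad": { "servers": [ "10.64.0.10:11211" ] },
    "codfw-proxies": { "servers": [ "10.192.0.10:11214:ssl" ] }
  },
  "routes": [
    {
      "aliases": [ "/eqiad/mw-wan/" ],
      "route": {
        "type": "OperationSelectorRoute",
        "default_policy": "PoolRoute|eqiad",
        "operation_policies": {
          "set": { "type": "AllSyncRoute", "children": [ "PoolRoute|eqiad", "PoolRoute|codfw-proxies" ] },
          "delete": { "type": "AllSyncRoute", "children": [ "PoolRoute|eqiad", "PoolRoute|codfw-proxies" ] }
        }
      }
    }
  ]
}

A key written with the /*/mw-wan/ broadcast prefix matches the alias of every route entry, so the operation is sent to both pools.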

The hybrid proxy approach in https://gerrit.wikimedia.org/r/#/c/431737/4/hieradata/common/mcrouter.yaml seems reasonable to me. I assume the eventual choice of proxies would at some point no longer be MW appservers though.

The reason for the hybrid proxy approach is that mcrouter is known to use a non-negligible amount of memory when under write pressure, so I wanted to avoid sharing the same machines as memcached itself.

We can for sure think of having global proxies in both datacenters in the future, but we can reassess at a later time.

Yeah, it seems fine for now.

Change 436240 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] utils: add script to generate mcrouter-related certs

https://gerrit.wikimedia.org/r/436240

Change 436240 merged by Giuseppe Lavagetto:
[operations/puppet@production] puppetmaster::frontend: add cergen-managed CA for mcrouter

https://gerrit.wikimedia.org/r/436240

Change 436531 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] profile::mediawiki::mcrouter_wancache: update the ssl paths

https://gerrit.wikimedia.org/r/436531

Change 436532 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] mcrouter: fix hiera labels, install on mwdebug servers

https://gerrit.wikimedia.org/r/436532

Change 436531 merged by Giuseppe Lavagetto:
[operations/puppet@production] profile::mediawiki::mcrouter_wancache: update the ssl paths

https://gerrit.wikimedia.org/r/436531

Change 436532 merged by Giuseppe Lavagetto:
[operations/puppet@production] mcrouter: fix hiera labels, install on mwdebug servers

https://gerrit.wikimedia.org/r/436532

Mcrouter is now installed across the fleet (minus the deployment servers), and I confirmed that replication works as expected:

Setting a key with the broadcast prefix /*/mw-wan does replicate:

EQIAD APPSERVER
set /*/mw-wan/test 1 0 12
Hello World2
STORED
get test
VALUE test 1 12
Hello World2
END

and subsequently reading the key in codfw succeeds; we then send a broadcast delete:

CODFW APPSERVER
get test
VALUE test 1 12
Hello World2
END
delete /*/mw-wan/test
DELETED
get test
END

which results in the key disappearing in the other datacenter:

EQIAD APPSERVER
get test
END