Enable TLS on memcached for cross-dc replication
Closed, ResolvedPublic

Description

Since version v1.5.13, memcached supports TLS!

Problem

Backstory: Our mcrouter instances have 2 server pools: one that includes all mc* hosts in the local primary DC, and another that consists of 4 mw* servers which act as a "mcrouter proxy" to the other primary DC. For example, in eqiad we have:

"codfw": {
  "servers": [
    "10.192.0.61:11214:ascii:ssl",
    "10.192.16.56:11214:ascii:ssl",
    "10.192.32.113:11214:ascii:ssl",
    "10.192.48.94:11214:ascii:ssl"
  ]
},
"eqiad": {
  "servers": [
    "10.64.0.80:11211:ascii:plain",
    "10.64.0.81:11211:ascii:plain",
    "10.64.0.82:11211:ascii:plain",
    "10.64.0.83:11211:ascii:plain",
    "10.64.0.84:11211:ascii:plain",
    "10.64.16.107:11211:ascii:plain",
    "10.64.16.108:11211:ascii:plain",
    "10.64.16.109:11211:ascii:plain",
    "10.64.16.110:11211:ascii:plain",
    "10.64.32.208:11211:ascii:plain",
    "10.64.32.209:11211:ascii:plain",
    "10.64.32.210:11211:ascii:plain",
    "10.64.32.211:11211:ascii:plain",
    "10.64.32.212:11211:ascii:plain",
    "10.64.48.155:11211:ascii:plain",
    "10.64.48.156:11211:ascii:plain",
    "10.64.48.157:11211:ascii:plain",
    "10.64.48.158:11211:ascii:plain"
  ]
}
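As a reading aid, each server entry above packs four colon-separated fields: host, port, protocol and transport. The following is a minimal Python sketch that splits an entry and summarises which transports a pool uses. The names `PoolServer`, `parse_server` and `pool_transports` are hypothetical helpers, not part of mcrouter, and the sketch assumes IPv4 addresses as in the config above.

```python
from typing import NamedTuple


class PoolServer(NamedTuple):
    """One mcrouter pool entry (hypothetical helper type, not an mcrouter API)."""
    host: str
    port: int
    protocol: str
    transport: str


def parse_server(entry: str) -> PoolServer:
    # Entries look like "10.192.0.61:11214:ascii:ssl".
    # Assumes an IPv4 host, so a plain split on ":" yields four fields.
    host, port, protocol, transport = entry.split(":")
    return PoolServer(host, int(port), protocol, transport)


def pool_transports(pools: dict) -> dict:
    # Map each pool name to the sorted set of transports its servers use.
    return {
        name: sorted({parse_server(s).transport for s in pool["servers"]})
        for name, pool in pools.items()
    }


pools = {
    "codfw": {"servers": ["10.192.0.61:11214:ascii:ssl"]},
    "eqiad": {"servers": ["10.64.0.80:11211:ascii:plain"]},
}
print(pool_transports(pools))
```

With the two sample entries taken from the config above, the codfw pool reports only `ssl` and the eqiad pool only `plain`.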

Goal
If we enable TLS, we will eliminate the need for those "mcrouter proxies" and secure the connectivity between mediawiki and the memcached cluster. This will remove 4 snowflake mediawiki servers from production! We can run memcached on two ports: a TLS one for cross-dc replication, and a non-TLS one for local datacentre traffic.

Versions:

  • v1.6.6: we have this version packaged and ready, but it will need to be deployed with caution, since it includes changes that could affect a busy cluster like ours

How? (mediawiki is on eqiad)

We will set enable_tls so that memcached listens on 11214 for TLS connections and on 11211 for notls connections. When both clusters are listening on both ports, we can replace the relevant pools in mcrouter.

  • Create the relevant puppet changes
  • Test on mwdebug2001: we can set enable_tls on mc2019, add it to mwdebug2001's pool, and run a simple URL list against mwdebug2001.
  • Enable both tls and notls listening ports on codfw
  • Enable both tls and notls listening ports on eqiad (after June 2021 switchover)
  • Replace the eqiad pool in the codfw mcrouter configs (after June 2021 switchover)
  • Replace the codfw pool in the eqiad mcrouter configs
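The two "replace the pool" steps amount to pointing the remote pool at the memcached hosts themselves, on the TLS port, instead of at the mcrouter proxies. A minimal Python sketch of that rewrite follows, assuming the same `host:port:protocol:transport` entry format shown in the Problem section. The function names are hypothetical; the real change is made in puppet hieradata, not with a script like this.

```python
def to_tls_entry(entry: str, tls_port: int = 11214) -> str:
    # Turn e.g. "10.64.0.80:11211:ascii:plain" into "10.64.0.80:11214:ascii:ssl".
    # Assumes an IPv4 host and the four-field entry format used above.
    host, _port, protocol, _transport = entry.split(":")
    return f"{host}:{tls_port}:{protocol}:ssl"


def remote_tls_pool(local_pool: dict, tls_port: int = 11214) -> dict:
    # Build the pool that the *other* DC's mcrouter would use to reach these hosts.
    return {"servers": [to_tls_entry(s, tls_port) for s in local_pool["servers"]]}


eqiad_local = {
    "servers": [
        "10.64.0.80:11211:ascii:plain",
        "10.64.0.81:11211:ascii:plain",
    ]
}
print(remote_tls_pool(eqiad_local))
```

The local pool stays untouched on notls:11211; only the cross-dc view of the same hosts moves to tls:11214.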

Notes
We could consider switching all memcached traffic to TLS, but this comes with a major drawback: all tools that provide real-time key traffic (such as memkeys et al.) essentially dump the network traffic. If this traffic is encrypted, those tools become useless. We are going to solve this problem at a later time.

Event Timeline

My 2c: I'd vote for 1.6.x since it is close to what upstream is currently supporting, plus I don't think that it would be less stable than the last 1.5.x version. In 1.6 a lot of new things were added (like extstore), but nothing big changed in the rest IIRC. Any issue that might arise from running 1.6 could be easily debugged and fixed with upstream if needed (we cannot really follow up with Debian upstream anymore if we package 1.5.22).

Also, memcached 1.6.6 is already used on the IDPs and available in a component.

jijiki triaged this task as Medium priority. Jan 15 2021, 10:02 AM

Can I ask how do we intend to perform the transition from non-tls to tls in detail? I see a series of pitfalls with our current setup and the code I see in puppet, but please be explicit about the steps you want to take to switch one server to enable tls.

jijiki renamed this task from Enable TLS on memcached to Enable TLS on memcached for cross-dc replication. Mar 4 2021, 5:14 PM
jijiki updated the task description.

I updated the task description (the task was used as a placeholder until I came up with a better thought-out plan), comments welcome!

I ran a test on mwdebug1001 where I switched mcrouter's onhost memcached pool from plain to ssl:

"onhost": {
  "servers": [
    "127.0.0.1:11209:ascii:ssl"
  ]
}

and ran mwdebug1001's onhost memcached so that it listens for tls on 0.0.0.0:11209 and for notls on 0.0.0.0:11210, with the following arguments:

/usr/bin/memcached  -vv -p  11209 -m 986 -u nobody -c 25000 -f 1.25 -n 48 -l 0.0.0.0  -l notls:0.0.0.0:11210  -Z -o ssl_chain_cert=/var/lib/puppet/ssl/certs/mwdebug1001.eqiad.wmnet.pem  -o ssl_key=/var/lib/puppet/ssl/private_keys/mwdebug1001.eqiad.wmnet.pem

Running memcached in verbose mode (since I can no longer snoop the onhost memcached traffic) shows the requests being served:

<32 get eswiki:pcache:idoptions:5623639
>32 END
<32 add eswiki:pcache:idoptions:5623639 0 10 707
>32 STORED

So far, so good :)
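Since packet capture no longer works on the encrypted port, the `-vv` output is the only view of the key traffic in this test. A minimal Python sketch for turning those lines into structured records follows. It assumes the format seen above, where `<N` marks data received from the client on connection N and `>N` marks the reply; `parse_verbose` is a hypothetical helper, not a memcached tool.

```python
import re

# "<32 get some:key" or ">32 END": direction marker, connection number, payload.
VERBOSE_LINE = re.compile(r"^([<>])(\d+) (.*)$")


def parse_verbose(line: str):
    """Return (direction, connection, payload), or None for unrelated log lines."""
    match = VERBOSE_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    direction = "request" if match.group(1) == "<" else "response"
    return direction, int(match.group(2)), match.group(3)


sample = [
    "<32 get eswiki:pcache:idoptions:5623639",
    ">32 END",
    "<32 add eswiki:pcache:idoptions:5623639 0 10 707",
    ">32 STORED",
]
for line in sample:
    print(parse_verbose(line))
```

In the sample, the `get` is answered with a bare `END` (a miss) and the following `add` is answered with `STORED`.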

Change 693474 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] WIP: add notls support for external addresses to memcached

https://gerrit.wikimedia.org/r/693474

Change 694465 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] (WIP) profile::memcached::instance: Add TLS support

https://gerrit.wikimedia.org/r/694465

Change 694484 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] (WIP) hieradata: enable tls on mc2019

https://gerrit.wikimedia.org/r/694484

I wonder if we have considered just having the TLS port everywhere except localhost?

Regardless, I guess we need a transition config, but I'm still curious whether that is the long-term goal or not.

Change 695377 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] WIP: add notls support for external addresses to memcached (1)

https://gerrit.wikimedia.org/r/695377

We want our memcached servers to listen on notls:11211 for DC-local memcached traffic and on tls:11214 for cross-DC traffic. We want the local traffic to stay notls because the tools we use to view live key traffic rely on that traffic being unencrypted.

The configuration generated by the patches I submitted is the final one. Since memcached will be listening on notls:11211 both before and after the change, we are good. Once all memcached servers are listening on both ports, we will switch mcrouter on the mediawiki servers to use tls:11214 directly, which is when the change will actually take effect.
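Before switching mcrouter, one could confirm that a host answers on both ports. The sketch below sends the memcached `version` command over a plain or a TLS connection, using only the Python standard library. The function name, the `ca_file` parameter and the example hostname are hypothetical. It assumes the server certificate validates against the given CA for the name being connected to.

```python
import socket
import ssl


def memcached_version(host, port, use_tls, ca_file=None, timeout=2.0):
    """Send "version" to memcached and return the reply line, e.g. "VERSION 1.6.6"."""
    raw = socket.create_connection((host, port), timeout=timeout)
    if use_tls:
        # Verifies the server certificate against ca_file (or the system store).
        context = ssl.create_default_context(cafile=ca_file)
        conn = context.wrap_socket(raw, server_hostname=host)
    else:
        conn = raw
    with conn:
        conn.sendall(b"version\r\n")
        return conn.recv(1024).decode("ascii", "replace").strip()


# Hypothetical usage against one host (hostname and CA path are placeholders):
# print(memcached_version("mc-host.example", 11211, use_tls=False))
# print(memcached_version("mc-host.example", 11214, use_tls=True, ca_file="/path/to/ca.pem"))
```

If both calls return a `VERSION` line, the host is serving notls:11211 and tls:11214 at the same time, which is the state the rollout relies on.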

Change 695377 merged by Effie Mouzeli:

[operations/puppet@production] modules::memcached: add notls support for external addresses

https://gerrit.wikimedia.org/r/695377

Change 694465 merged by Effie Mouzeli:

[operations/puppet@production] profile::memcached::instance: Add TLS support

https://gerrit.wikimedia.org/r/694465

Change 699727 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] hieradata: enable tls on codfw gutter pool

https://gerrit.wikimedia.org/r/699727

Change 699727 merged by Effie Mouzeli:

[operations/puppet@production] hieradata: enable tls on codfw gutter pool

https://gerrit.wikimedia.org/r/699727

Change 699738 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] hieradata: enable TLS for memcached on all codfw hosts

https://gerrit.wikimedia.org/r/699738

Change 694484 abandoned by Effie Mouzeli:

[operations/puppet@production] (WIP) hieradata: enable tls on mc2019 (3)

Reason:

used regex.yaml instead

https://gerrit.wikimedia.org/r/694484

Change 699738 merged by Effie Mouzeli:

[operations/puppet@production] hieradata: enable TLS for memcached on all codfw hosts

https://gerrit.wikimedia.org/r/699738

Change 699764 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] hieradata: Replace mcrouter proxies with codfw hosts

https://gerrit.wikimedia.org/r/699764

Change 699764 merged by Effie Mouzeli:

[operations/puppet@production] hieradata: Replace mcrouter proxies with codfw hosts on mwdebug1002

https://gerrit.wikimedia.org/r/699764

Change 700861 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] hieradata: Use TLS codfw pool for memcached replication on eqiad

https://gerrit.wikimedia.org/r/700861

Change 700861 merged by Effie Mouzeli:

[operations/puppet@production] hieradata: Use TLS codfw pool for memcached replication on eqiad

https://gerrit.wikimedia.org/r/700861

Change 702590 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] hieradata: enable TLS on memcached eqiad hosts

https://gerrit.wikimedia.org/r/702590

Change 702592 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] hieradata: replace mcrouter proxies in with eqiad hosts

https://gerrit.wikimedia.org/r/702592

Mentioned in SAL (#wikimedia-operations) [2021-07-21T06:35:45Z] <effie> disable puppet on mc1* hosts and icinga - T271967

Change 702590 merged by Effie Mouzeli:

[operations/puppet@production] hieradata: enable TLS on memcached eqiad hosts

https://gerrit.wikimedia.org/r/702590

Change 702592 merged by Effie Mouzeli:

[operations/puppet@production] hieradata: replace mcrouter proxies in with eqiad hosts

https://gerrit.wikimedia.org/r/702592

Mcrouter instances in codfw are connecting directly to memcached hosts in eqiad.

jijiki claimed this task.
jijiki updated the task description.