
IPSec: roll-out plan
Closed, Resolved, Public

Description

This task is to plan the incremental deployment of IPsec transports in production.

The smallest possible fraction of production traffic can be moved to IPsec transport by selecting a single pair of nodes from the largest pools: one text node in ESAMS and one in EQIAD. Alternatively, we could use one pair of nodes from the upload pool, so that in the worst case a failure results only in images failing to load rather than whole pages.

Because firewalls enforcing IPsec transport have not been configured (T85823), any failure to establish encrypted transport is expected to leave communication between the affected hosts uninterrupted, via standard unencrypted transport. If traffic is interrupted for any reason, a fallback can be effected by executing 'ipsec-global down' on at least one of the affected nodes, as described in T88536.
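A minimal sketch of how the fallback on one node of an affected pair could be scripted. Only the 'ipsec-global down' command comes from this task; running it over plain ssh, the host names, and the helper names `fallback_command` and `run_fallback` are assumptions made for illustration, and the command may need elevated privileges in practice.

```python
import subprocess


def fallback_command(host):
    """Build the argv for the fallback on one host.

    'ipsec-global down' is the command named in this task; invoking it
    through ssh is an assumption of this sketch.
    """
    return ["ssh", host, "ipsec-global", "down"]


def run_fallback(hosts, runner=subprocess.run):
    """Try each affected host in turn and stop at the first success.

    The task says the fallback is needed on at least one of the affected
    nodes, so the first host where the command exits 0 is returned.
    Returns None if the command failed on every host.
    """
    for host in hosts:
        result = runner(fallback_command(host), check=False)
        if result.returncode == 0:
            return host
    return None


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    done_on = run_fallback(["node-a.example", "node-b.example"])
    print("fallback applied on:", done_on)
```

The `runner` parameter exists only so the logic can be exercised without touching real hosts.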

Once this milestone is passed and we are satisfied with the function and performance of this single pair of nodes, deployment will continue with application of the ipsec role to greater numbers of text caches, and following a similar strategy for other cache classes.

Event Timeline

Gage raised the priority of this task from to Needs Triage.
Gage updated the task description.
Gage subscribed.
Gage triaged this task as Medium priority.
Gage set Security to None.

Status update:

  • ipsec puppet role is running on
    • codfw: all multi-tier cache clusters (text, mobile, bits, upload)
    • eqiad: only the text cluster
    • esams: only cp3030 (text cluster)
    • ulsfo: none

The net of this is that only cp3030 is actually using ipsec in practice, and with the changes today all of its backhaul to eqiad (and in theory, codfw too) is ipsec. This finally gives us a full picture of the perf impact of full ipsec deployment on a tier2 cache node. Will watch stats there as we ramp into the next EU daytime and then move forward as appropriate.
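The conclusion that only cp3030 is using ipsec in practice can be checked with a small model. This is a sketch under two assumptions not spelled out in the status update: that esams and ulsfo are the tier2 sites backhauling to eqiad and codfw, and that a link is encrypted only when both ends run the ipsec role. The names `ROLE`, `TIER1`, `TIER2`, and `encrypted_backhaul` are hypothetical.

```python
# Assumption: tier2 sites backhaul to tier1 sites, and a link is only
# encrypted when BOTH ends run the ipsec role.
TIER1 = ("codfw", "eqiad")
TIER2 = ("esams", "ulsfo")

ALL = "all"  # marker meaning every host in that site/cluster has the role

# Where the ipsec puppet role is running, per the status update.
ROLE = {
    ("codfw", "text"): ALL,
    ("codfw", "mobile"): ALL,
    ("codfw", "bits"): ALL,
    ("codfw", "upload"): ALL,
    ("eqiad", "text"): ALL,
    ("esams", "text"): {"cp3030"},
    # ulsfo: none
}


def encrypted_backhaul(role):
    """List (tier2 site, cluster, hosts, tier1 site) links with the role on both ends."""
    links = []
    for site, cluster in sorted(role):
        if site not in TIER2:
            continue  # only tier2 -> tier1 backhaul is modelled here
        for peer in TIER1:
            if (peer, cluster) in role:
                links.append((site, cluster, role[(site, cluster)], peer))
    return links


if __name__ == "__main__":
    for link in encrypted_backhaul(ROLE):
        print(link)
```

Under these assumptions the only links listed are cp3030's text backhaul to codfw and to eqiad. The codfw mobile, bits, and upload roles have no tier2 peer running the role yet.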

The probable next steps are:

  1. Ramp in the rest of the cp30xx text-cluster nodes, which will show us a somewhat-full load on the tier1 side on the eqiad text clusters. Should be pretty unsurprising once we've seen the tier2 impact.
  2. Turn on ulsfo text-cluster as well, which gives us full ipsec for the text cluster globally at that point.
  3. Turn on the other multi-tier clusters: mobile, upload, bits (unless bits goes away before we get to this step).

Change 227692 had a related patch set uploaded (by BBlack):
enable ipsec on cp3031,40,41

https://gerrit.wikimedia.org/r/227692

Change 227693 had a related patch set uploaded (by BBlack):
enable ipsec on all remaining esams text

https://gerrit.wikimedia.org/r/227693

CPU impact on cp3030 seems to be minimal. You can see it if you squint at the graph, but it's not significant in any decision-making sort of way. Moving forward today with remainder of text cluster in pieces.

However, I forgot to test another scenario: we should do a cache wipe (backend + frontend) on a depooled cp3030 and then repool it, to see the ipsec spike from the cache reload. Will do that before turning on any more nodes in esams.

Cache-wipe test didn't induce any notable spike, probably because the order-of-magnitude (or more) traffic reduction we see from fe->be in the text-cache case makes text-cache backend dataset reloads pretty slow and relatively insignificant. Proceeding with remainder of text rollout.

Should do a similar isolated test for the upload caches as well before fully enabling that cluster, as their backend dataset is much more active and thus could have more impact on reload.

BBlack renamed this task from "IPsec: roll-out plan" to "IPSec: roll-out plan". Jul 29 2015, 1:44 PM
BBlack claimed this task.

Change 227692 merged by BBlack:
enable ipsec on cp3031,40,41

https://gerrit.wikimedia.org/r/227692

Change 227693 merged by BBlack:
enable ipsec on all remaining esams text

https://gerrit.wikimedia.org/r/227693

Update: all of esams text cluster is now using ipsec for backhaul to tier1.

Change 227867 had a related patch set uploaded (by BBlack):
enable ipsec on ulsfo text cluster

https://gerrit.wikimedia.org/r/227867

Change 227867 merged by BBlack:
enable ipsec on ulsfo text cluster

https://gerrit.wikimedia.org/r/227867

Update: all of the text cluster has ipsec turned on globally now. Holding here until tomorrow in case there's some subtle fallout not yet being observed.

Change 227980 had a related patch set uploaded (by BBlack):
enable ipsec for mobile and bits clusters

https://gerrit.wikimedia.org/r/227980

Change 227980 merged by BBlack:
enable ipsec for mobile and bits clusters

https://gerrit.wikimedia.org/r/227980

Change 228811 had a related patch set uploaded (by BBlack):
enable upload ipsec for eqiad cp3034 for upload-reload testing

https://gerrit.wikimedia.org/r/228811

Change 228811 merged by BBlack:
enable upload ipsec for eqiad cp3034 for upload-reload testing

https://gerrit.wikimedia.org/r/228811

cp3034 (upload esams) has had all its backhaul to eqiad over ipsec now for ~1h. CPU impact is, again, virtually non-existent. Attempting wipe-test now.

Wipe-test success. The bump in iowait% from rewriting the cache (itself minor) is easier to see than any CPU increase from the related extra crypto.

Change 228838 had a related patch set uploaded (by BBlack):
enable ipsec for all upload caches

https://gerrit.wikimedia.org/r/228838

Change 228838 merged by BBlack:
enable ipsec for all upload caches

https://gerrit.wikimedia.org/r/228838

ipsec is now active for cache<->cache traffic from tier2 to tier1 on all of the primary clusters: text, upload, mobile, bits. On bits it doesn't technically protect anything, since that cluster's inter-DC traffic goes via LVS, but the cluster is going away ASAP regardless and isn't handling much of the traditional bits traffic anymore anyway.