
HTTPS for internal service traffic
Closed, ResolvedPublic

Description

Eventually we'll want all HTTP traffic on internal networks converted to HTTPS. We should ideally be using client certificate auth with this access as well, so that link traffic injection into supposedly-private service endpoints isn't so easy.

The most critical cases are traffic that's currently crossing inter-datacenter WAN links, or will be soon. However, it's simpler and more secure in the long run if we just aim to do this for everything, regardless of the locality of the traffic sources.

Key cases to work on first:

  1. Tier-2->Tier-1 varnish cache traffic - Currently secured by IPSec, but we could drop IPSec in favor of an HTTPS solution and keep things simpler and more standardized. This is also a relatively-easy target to work out a lot of implementation and puppetization issues before moving on to other cases.
  2. Tier-1 -> *.svc.(codfw|eqiad).wmnet - We'll likely have the ability and desire to put user and cache-backhaul traffic through the codfw cache clusters well ahead of when we're ready for multi-DC at the application layer. This implies codfw cache clusters backending to eqiad service addresses. The IPSec solution currently used for inter-tier varnish traffic above doesn't work for this case, as the service traffic routes through LVS, but HTTPS would work fine here.

In certificate terms, we'll want to use a new local CA to issue certificates within wmnet. The idea would be to create SAN-based certs per cluster for the service hostnames offered by that cluster. For example, mw[0-9]+.eqiad.wmnet machines might share a cert with SAN elements for e.g. appservers.svc.eqiad.wmnet and api.svc.eqiad.wmnet, and the sca cluster machines might have SANs for citoid.svc.eqiad.wmnet, graphoid.svc.eqiad.wmnet, etc...
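
As a rough sketch of what such a per-cluster SAN cert could look like (the file names san.cnf, internal-ca.crt, cluster.key etc. are purely illustrative, and the hostnames are just the examples above), an openssl request config plus signing steps might be:

    # sketch: per-cluster cert request config (e.g. san.cnf) with SAN entries
    [ req ]
    prompt             = no
    distinguished_name = req_dn
    req_extensions     = san_ext

    [ req_dn ]
    CN = appservers.svc.eqiad.wmnet

    [ san_ext ]
    subjectAltName = DNS:appservers.svc.eqiad.wmnet, DNS:api.svc.eqiad.wmnet

    # key/CSR generation and signing by the internal CA might then be roughly:
    #   openssl req -new -newkey rsa:2048 -nodes -keyout cluster.key -out cluster.csr -config san.cnf
    #   openssl x509 -req -in cluster.csr -CA internal-ca.crt -CAkey internal-ca.key -CAcreateserial \
    #     -extfile san.cnf -extensions san_ext -days 365 -out cluster.crt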

In case 1, the server-side HTTPS termination can be the same nginx instance used for production frontend traffic, with some additional configuration and/or listeners defined.
In case 2, the server-side HTTPS termination would probably be easiest with a separate inbound TLS proxy (probably a simple variant on the cache clusters' nginx tlsproxy puppet module), so that we don't have to integrate it with all of the server/alias stanzas in the apache configs for now.
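
To make case 2 concrete, here's a minimal sketch of what such an inbound TLS proxy could look like, assuming an nginx variant of the tlsproxy module; the cert paths and the local backend port are made up for illustration, not real puppetized values:

    # sketch: standalone inbound TLS proxy in front of the existing apache config
    server {
        listen 443 ssl;
        server_name appservers.svc.eqiad.wmnet api.svc.eqiad.wmnet;

        ssl_certificate     /etc/ssl/localcerts/cluster.crt;   # SAN cert from the internal CA
        ssl_certificate_key /etc/ssl/private/cluster.key;

        location / {
            proxy_pass http://127.0.0.1:80;            # hand plain HTTP to the local apache
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-Proto https;
        }
    }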

In both cases, the primary (most important for the moment, anyways) client traffic source is the varnish instances on the cache clusters. These don't do outbound HTTPS natively, but I think we can address that by using a local proxy on each machine like stunnel. For example, instead of varnish defining the appservers backend as direct access to appservers.svc.eqiad.wmnet:443, it would define it as connecting to localhost:12345, which is an stunnel daemon configured to connect to appservers.svc.eqiad.wmnet:443 for it.
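
A minimal stunnel client config for that example might look like the following (the local port 12345 is just the placeholder from above, and the CA path is illustrative):

    # sketch: one stunnel client service per HTTPS backend on a cache host
    [appservers-eqiad]
    client  = yes
    accept  = 127.0.0.1:12345
    connect = appservers.svc.eqiad.wmnet:443
    verify  = 2
    CAfile  = /etc/ssl/certs/wmf-internal-ca.pem

The varnish appservers backend definition would then simply point at 127.0.0.1:12345 instead of the service hostname.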

Client cert auth would be based on per-machine certificates. e.g. cp1065.eqiad.wmnet would have a cert for its own hostname for the outbound stunnel proxy to use, and we'd need a local CA that the appservers trust for client certs. The easiest path for this would be to re-use the puppet machine certs and CA, but one issue there is that the last time we re-did the puppet cert infrastructure we inexplicably upgraded them to 4K RSA, which could have too much of a performance impact for this kind of scenario. 2K would have been better. If we're going to re-use puppet certs as client certs, it would be best to fix the 4K problem first.
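
Sketching the client-cert piece under that puppet-cert-reuse assumption (the paths below are the usual puppet 3 ssldir locations, shown only for illustration, not a decision):

    # added to the stunnel client service on e.g. cp1065: present the per-machine cert
    cert = /var/lib/puppet/ssl/certs/cp1065.eqiad.wmnet.pem
    key  = /var/lib/puppet/ssl/private_keys/cp1065.eqiad.wmnet.pem

    # and on the server-side nginx proxy: require a client cert signed by the trusted CA
    ssl_client_certificate /var/lib/puppet/ssl/certs/ca.pem;
    ssl_verify_client on;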

Related Objects

Event Timeline

BBlack raised the priority of this task from to Needs Triage.
BBlack updated the task description. (Show Details)
BBlack added projects: HTTPS, Traffic.
BBlack subscribed.
Restricted Application added a subscriber: Aklapper.

Change 230541 had a related patch set uploaded (by BBlack):
cache::config: replace lvs IP refs with service hostnames

https://gerrit.wikimedia.org/r/230541

BBlack triaged this task as Medium priority. Aug 10 2015, 6:00 PM

Change 230541 merged by BBlack:
cache::config: replace lvs IP refs with service hostnames

https://gerrit.wikimedia.org/r/230541

BBlack set Security to None.

All subtasks are gone, but there are technically still a few edge cases showing up in the trafficserver backend-facing config. Specifically:

$ grep 'replacement: http:' hieradata/common/profile/trafficserver/backend.yaml 
      replacement: http://puppetmaster1001.eqiad.wmnet
      #replacement: http://puppetmaster2001.codfw.wmnet
      replacement: http://contint.wikimedia.org
      replacement: http://cloudweb2001-dev.wikimedia.org
      replacement: http://cloudweb2001-dev.wikimedia.org

Do we need to clean these up in some new subtasks, and/or implement some check to prevent adding new http:// replacements? @ema?

> Do we need to clean these up in some new subtasks

Yup, tasks created!

> and/or implement some check to prevent adding new http:// replacements? @ema?

I think it's enough for now to remember this at code review time and only add a check if we see that in practice new http backends do in fact get added.
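
For what it's worth, if such a check is ever wanted, it could be as simple as a CI shell step along these lines (the allowlist is just the currently-known exceptions from the grep above, and the whole thing is a sketch, not an implemented check):

    # hypothetical check: fail if a new plain-http origin replacement shows up
    if grep 'replacement: http:' hieradata/common/profile/trafficserver/backend.yaml \
        | grep -v -e puppetmaster -e contint.wikimedia.org -e cloudweb2001-dev; then
        echo "ERROR: new http:// replacement added to trafficserver backend config" >&2
        exit 1
    fi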

> $ grep 'replacement: http:' hieradata/common/profile/trafficserver/backend.yaml
>       replacement: http://puppetmaster1001.eqiad.wmnet
>       #replacement: http://puppetmaster2001.codfw.wmnet
>       replacement: http://contint.wikimedia.org
>       replacement: http://cloudweb2001-dev.wikimedia.org
>       replacement: http://cloudweb2001-dev.wikimedia.org
>
> Do we need to clean these up in some new subtasks

This is also already T210411

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

ema claimed this task.

Many of the assumptions made when this task was created have changed since the migration to ATS for cache backends (no more IPSec, the difference between Tier1 and Tier2 DCs is now gone, ...). We are now in a world where all backend caches access the origins via TLS, which I think largely covers what we wanted to achieve here. @BBlack: I'm marking the task as resolved, but of course feel free to reopen / create other tasks as needed if you think that anything is missing.

This comment was removed by taavi.