HTTPS for internal service traffic
Open, Normal, Public

Description

Eventually we'll want all HTTP traffic on internal networks converted to HTTPS. Ideally we should also use client certificate auth for this access, so that injecting traffic on a link into supposedly-private service endpoints isn't so easy.

The most critical cases are traffic that is crossing inter-datacenter WAN links today, or will be soon. However, it's simpler and more secure in the long run if we just aim to do this for everything, regardless of where the traffic originates.

Key cases to work on first:

  1. Tier-2 -> Tier-1 varnish cache traffic - Currently secured by IPSec, but we could drop IPSec in favor of an HTTPS solution and keep things simpler and more standardized. This is also a relatively easy target for working out a lot of implementation and puppetization issues before moving on to other cases.
  2. Tier-1 -> *.svc.(codfw|eqiad).wmnet - We'll likely have the ability and desire to put user and cache-backhaul traffic through the codfw cache clusters well ahead of when we're ready for multi-DC at the application layer. This implies codfw cache clusters backending to eqiad service addresses. The IPSec solution currently used for inter-tier varnish traffic above doesn't work for this case, as the service traffic routes through LVS, but HTTPS would work fine here.

In certificate terms, we'll want to use a new local CA to issue certificates within wmnet. The idea would be to create SAN-based certs per cluster for the service hostnames offered by that cluster. For example, mw[0-9]+.eqiad.wmnet machines might share a cert with SAN elements for appservers.svc.eqiad.wmnet and api.svc.eqiad.wmnet, and the sca cluster machines might have SANs for citoid.svc.eqiad.wmnet, graphoid.svc.eqiad.wmnet, etc.
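
To make the per-cluster SAN idea concrete, here is a minimal Python sketch that turns a cluster's service hostnames into an OpenSSL-style subjectAltName value. The cluster keys ("appserver", "sca") and the function name are hypothetical; only the service hostnames come from the examples above, and the real mapping would presumably live in puppet.

```python
# Hypothetical mapping of cluster name -> service hostnames offered by that
# cluster. The hostnames are the examples from the task description; the
# cluster keys are made up for illustration.
CLUSTER_SERVICES = {
    "appserver": ["appservers.svc.eqiad.wmnet", "api.svc.eqiad.wmnet"],
    "sca": ["citoid.svc.eqiad.wmnet", "graphoid.svc.eqiad.wmnet"],
}


def subject_alt_name(cluster: str) -> str:
    """Build the subjectAltName value for the cert shared by one cluster."""
    names = CLUSTER_SERVICES[cluster]
    # OpenSSL expects a comma-separated list of "DNS:<hostname>" entries.
    return ",".join(f"DNS:{name}" for name in names)


if __name__ == "__main__":
    for cluster in CLUSTER_SERVICES:
        print(cluster, "->", subject_alt_name(cluster))
```

Every machine in a cluster would then carry the same cert, and any service hostname LVS routes to that cluster would validate against it.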

In case 1, the server-side HTTPS termination can be the same nginx instance used for production frontend traffic, with some additional configuration and/or listeners defined.
In case 2, the server-side HTTPS termination would probably be easiest with a separate inbound TLS proxy (probably a simple variant on the cache clusters' nginx tlsproxy puppet module), so that we don't have to integrate it with all of the server/alias stanzas in the apache configs for now.

In both cases, the primary client traffic source (the most important one for the moment, anyway) is the varnish instances on the cache clusters. These don't do outbound HTTPS natively, but I think we can address that by using a local proxy on each machine, such as stunnel. For example, instead of varnish defining the appservers backend as direct access to appservers.svc.eqiad.wmnet:443, it would define it as connecting to localhost:12345, where an stunnel daemon listens and connects to appservers.svc.eqiad.wmnet:443 on its behalf.
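
As a rough illustration of the local-proxy idea, the sketch below renders one client-mode stunnel service section per backend. The function name, the section name and the CA file path are assumptions for illustration only; the option names (client, accept, connect, CAfile, verify) are standard stunnel options, but the exact set we'd want depends on the stunnel version deployed.

```python
def render_stunnel_service(name, local_port, remote_host, remote_port=443,
                           ca_file="/etc/ssl/localca.pem"):
    """Render one client-mode stunnel section for a varnish backend.

    ca_file is a hypothetical path to the local CA certificate.
    """
    lines = [
        f"[{name}]",
        "client = yes",                            # stunnel initiates TLS outbound
        f"accept = 127.0.0.1:{local_port}",        # varnish connects here in plain HTTP
        f"connect = {remote_host}:{remote_port}",  # TLS connection to the real service
        f"CAfile = {ca_file}",                     # trust only the local CA
        "verify = 2",                              # require a valid peer certificate
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_stunnel_service("appservers", 12345,
                                 "appservers.svc.eqiad.wmnet"))
```

With a section like this in place, the varnish backend definition would point at 127.0.0.1:12345 and never need to speak TLS itself.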

Client cert auth would be based on per-machine certificates, e.g. cp1065.eqiad.wmnet would have a cert for its own hostname for the outbound stunnel proxy to use, and we'd need a local CA that the appservers trust for client certs. The easiest path would be to re-use the puppet machine certs and CA, but one issue there is that the last time we re-did the puppet cert infrastructure we inexplicably upgraded them to 4K RSA, which could be too much perf impact for this kind of scenario. 2K would have been better. If we're going to re-use puppet certs as client certs, it would be best to first fix the 4K problem.
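
The server side of client cert auth boils down to "require a client certificate and trust only the local CA". The real termination would be nginx, not Python; the sketch below only uses Python's standard ssl module to show those two settings, and the function name and file path arguments are hypothetical.

```python
import ssl


def make_server_context(ca_file=None, cert_file=None, key_file=None):
    """Build a server-side TLS context that demands a client certificate.

    All three paths are hypothetical placeholders for the local CA and the
    server's own cert/key; they are optional so the sketch runs without files.
    """
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    # Reject any client that does not present a certificate we can verify.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if ca_file:
        # Trust only the local CA for client certificates.
        ctx.load_verify_locations(cafile=ca_file)
    if cert_file:
        # The server's own certificate and private key.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


if __name__ == "__main__":
    ctx = make_server_context()
    print(ctx.verify_mode)
```

The key size concern applies on both ends: every new connection from a cache machine involves a private-key operation with its client key, which is where 4K RSA would cost noticeably more than 2K.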

BBlack created this task. Aug 10 2015, 1:40 PM
BBlack updated the task description.
BBlack raised the priority of this task to Needs Triage.
BBlack added projects: HTTPS, Traffic.
BBlack added a subscriber: BBlack.
Restricted Application added a project: acl*operations-team. Aug 10 2015, 1:40 PM
Restricted Application added a subscriber: Aklapper.

Change 230541 had a related patch set uploaded (by BBlack):
cache::config: replace lvs IP refs with service hostnames

https://gerrit.wikimedia.org/r/230541

Restricted Application added a subscriber: Matanya. Aug 10 2015, 1:47 PM
BBlack added a subscriber: faidon. Aug 10 2015, 2:13 PM
BBlack triaged this task as Normal priority. Aug 10 2015, 6:00 PM

Change 230541 merged by BBlack:
cache::config: replace lvs IP refs with service hostnames

https://gerrit.wikimedia.org/r/230541

BBlack updated the task description. Aug 11 2015, 12:35 PM
BBlack set Security to None.
Dzahn moved this task from Backlog to Big Picture on the HTTPS board. Dec 4 2015, 8:42 PM
Krinkle removed a subscriber: Krinkle. May 10 2016, 4:40 PM
BBlack moved this task from Triage to TLS on the Traffic board. Sep 30 2016, 1:44 PM