Eventually we'll want all HTTP traffic on internal networks converted to HTTPS. Ideally we'd also use client-certificate authentication for this access, so that injecting traffic into supposedly-private service endpoints isn't so easy.
The most critical cases are traffic that's currently crossing inter-datacenter WAN links, or will be soon. However, it's simpler and more secure in the long run if we just aim to do this for everything, regardless of the locality of the traffic sources.
Key cases to work on first:
1. Tier-2 -> Tier-1 varnish cache traffic - Currently secured by IPSec, but we could drop IPSec in favor of an HTTPS solution and keep things simpler and more standardized. This is also a relatively easy target for working out a lot of the implementation and puppetization issues before moving on to other cases.
2. Tier-1 -> *.svc.(codfw|eqiad).wmnet - We'll likely have the ability and desire to put user and cache-backhaul traffic through the codfw cache clusters well ahead of when we're ready for multi-DC at the application layer. This implies codfw cache clusters backending to eqiad service addresses. The IPSec solution currently used for the inter-tier varnish traffic above doesn't work for this case, since the service traffic routes through LVS; HTTPS would work fine here.
In certificate terms, we'll probably want a local CA to issue certificates for these service endpoints within wmnet. We'd also authorize a local CA which issues per-machine client certificates to the consumers (it's possible we can re-use puppet's client certs for this; if so, we should fix the 4K key problem there first so that performance isn't awful at scale), and have the service endpoints only accept traffic from authorized clients.
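The local-CA flow above can be sketched with plain openssl commands. All of the filenames and the `cp1001.eqiad.wmnet` hostname are hypothetical placeholders, and the 2048-bit key size is an assumption chosen to dodge the 4K-key performance problem mentioned above:

```shell
# Create the CA key and a self-signed root (hypothetical names throughout).
openssl genrsa -out wmnet-ca.key 2048
openssl req -new -x509 -key wmnet-ca.key -sha256 -days 3650 \
    -subj "/CN=wmnet internal CA" -out wmnet-ca.crt

# Issue a per-machine client cert, e.g. for one cache host.
openssl genrsa -out cp1001.key 2048
openssl req -new -key cp1001.key -subj "/CN=cp1001.eqiad.wmnet" -out cp1001.csr
openssl x509 -req -in cp1001.csr -CA wmnet-ca.crt -CAkey wmnet-ca.key \
    -CAcreateserial -sha256 -days 365 -out cp1001.crt

# Sanity check: the client cert should chain to the local CA.
openssl verify -CAfile wmnet-ca.crt cp1001.crt
```

In practice we'd want this driven by puppet rather than by hand, but the moving parts are the same: one root trusted by the service endpoints, and a per-machine leaf presented by each client.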
In case 1, the server-side HTTPS termination can be the same nginx instance used for production frontend traffic, with some additional configuration and/or listeners defined.
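For case 1, the extra nginx configuration might look roughly like the sketch below. The cert paths, server name, and backend port are all illustrative assumptions, not actual puppetized values:

```nginx
# Hypothetical additional listener on the existing production nginx
# terminator, requiring client certs issued by our local client CA.
server {
    listen 443 ssl;
    server_name cp1001.eqiad.wmnet;

    ssl_certificate        /etc/ssl/localcerts/cp1001.eqiad.wmnet.crt;
    ssl_certificate_key    /etc/ssl/private/cp1001.eqiad.wmnet.key;

    # Only accept clients presenting a cert signed by the local client CA.
    ssl_client_certificate /etc/ssl/localcerts/wmnet-client-ca.crt;
    ssl_verify_client      on;

    location / {
        # Hand off to the local varnish instance as today.
        proxy_pass http://127.0.0.1:80;
    }
}
```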
In case 2, the server-side HTTPS termination would be the relevant apache or nginx server already serving the HTTP traffic (the most-common cases on mw* don't have port-443 listeners at all currently).
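For the apache side of case 2, adding a port-443 vhost next to the existing HTTP one might look like this sketch (again, all paths and names are hypothetical):

```apache
# Hypothetical mw* vhost: terminate HTTPS and require a client cert
# issued by the local client CA.
Listen 443
<VirtualHost *:443>
    ServerName appservers.svc.eqiad.wmnet
    SSLEngine on
    SSLCertificateFile    /etc/ssl/localcerts/appservers.svc.eqiad.wmnet.crt
    SSLCertificateKeyFile /etc/ssl/private/appservers.svc.eqiad.wmnet.key
    SSLCACertificateFile  /etc/ssl/localcerts/wmnet-client-ca.crt
    SSLVerifyClient require
    SSLVerifyDepth 1
</VirtualHost>
```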
In both cases, the primary (most important for the moment, anyways) client traffic source is the varnish instances on the cache clusters. These don't do outbound HTTPS natively, but I think we can address that with a local proxy on each machine, like [[ https://www.stunnel.org/ | stunnel ]]. For example, instead of varnish defining the appservers backend as a direct connection to appservers.svc.eqiad.wmnet:443, it would define it as a connection to localhost:12345, where an stunnel daemon is configured to connect onward to appservers.svc.eqiad.wmnet:443 for it.
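A minimal stunnel client config for that setup might look like the following; the port, cert paths, and hostnames are illustrative assumptions:

```
; Hypothetical stunnel config on a cache host. Varnish connects to
; 127.0.0.1:12345 in plaintext; stunnel carries the traffic to the
; service endpoint over TLS, presenting our per-machine client cert.
[appservers]
client = yes
accept = 127.0.0.1:12345
connect = appservers.svc.eqiad.wmnet:443
cert = /etc/ssl/localcerts/cp1001.eqiad.wmnet.crt
key = /etc/ssl/private/cp1001.eqiad.wmnet.key
CAfile = /etc/ssl/localcerts/wmnet-ca.crt
verify = 2
```

The corresponding varnish backend definition would then simply point at 127.0.0.1 port 12345 instead of the service address, with no TLS awareness in varnish itself.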