
HTTPS for internal service traffic
Closed, ResolvedPublic

Description

Eventually we'll want all HTTP traffic on internal networks converted to HTTPS. We should ideally be using client certificate auth with this access as well, so that link traffic injection into supposedly-private service endpoints isn't so easy.

The most critical cases are traffic that's currently crossing inter-datacenter WAN links, or will be soon. However, it's simpler and more secure in the long run if we just aim to do this for everything, regardless of the locality of the traffic sources.

Key cases to work on first:

  1. Tier-2->Tier-1 varnish cache traffic - Currently secured by IPSec, but we could drop IPSec in favor of an HTTPS solution and keep things simpler and more standardized. This is also a relatively-easy target to work out a lot of implementation and puppetization issues before moving on to other cases.
  2. Tier-1 -> *.svc.(codfw|eqiad).wmnet - We'll likely have the ability and desire to put user and cache-backhaul traffic through the codfw cache clusters well ahead of when we're ready for multi-DC at the application layer. This implies codfw cache clusters backending to eqiad service addresses. The IPSec solution currently used for inter-tier varnish traffic above doesn't work for this case, as the service traffic routes through LVS, but HTTPS would work fine here.

In certificate terms, we'll want to use a new local CA to issue certificates within wmnet. The idea would be to create SAN-based certs per cluster for the service hostnames offered by that cluster. For example, mw[0-9]+.eqiad.wmnet machines might share a cert with SAN elements for e.g. appservers.svc.eqiad.wmnet and api.svc.eqiad.wmnet, and the sca cluster machines might have SANs for citoid.svc.eqiad.wmnet, graphoid.svc.eqiad.wmnet, etc...
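
As a rough sketch of what such a per-cluster SAN cert could look like (the file names san.cnf, internal-ca.crt, cluster.key etc. are purely illustrative, and the hostnames are just the examples above), an openssl request config plus signing steps might be:

    # sketch: per-cluster cert request config (e.g. san.cnf) with SAN entries
    [ req ]
    prompt             = no
    distinguished_name = req_dn
    req_extensions     = san_ext

    [ req_dn ]
    CN = appservers.svc.eqiad.wmnet

    [ san_ext ]
    subjectAltName = DNS:appservers.svc.eqiad.wmnet, DNS:api.svc.eqiad.wmnet

    # key/CSR generation and signing by the internal CA might then be roughly:
    #   openssl req -new -newkey rsa:2048 -nodes -keyout cluster.key -out cluster.csr -config san.cnf
    #   openssl x509 -req -in cluster.csr -CA internal-ca.crt -CAkey internal-ca.key -CAcreateserial \
    #     -extfile san.cnf -extensions san_ext -days 365 -out cluster.crt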

In case 1, the server-side HTTPS termination can be the same nginx instance used for production frontend traffic, with some additional configuration and/or listeners defined.
In case 2, the server-side HTTPS termination would probably be easiest with a separate inbound TLS proxy (probably a simple variant on the cache clusters' nginx tlsproxy puppet module), so that we don't have to integrate it with all of the server/alias stanzas in the apache configs for now.
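
To make case 2 concrete, here's a minimal sketch of what such an inbound TLS proxy could look like, assuming an nginx variant of the tlsproxy module; the cert paths and the local backend port are made up for illustration, not real puppetized values:

    # sketch: standalone inbound TLS proxy in front of the existing apache config
    server {
        listen 443 ssl;
        server_name appservers.svc.eqiad.wmnet api.svc.eqiad.wmnet;

        ssl_certificate     /etc/ssl/localcerts/cluster.crt;   # SAN cert from the internal CA
        ssl_certificate_key /etc/ssl/private/cluster.key;

        location / {
            proxy_pass http://127.0.0.1:80;            # hand plain HTTP to the local apache
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-Proto https;
        }
    }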

In both cases, the primary (most important for the moment, anyways) client traffic source is the varnish instances on the cache clusters. These don't do outbound HTTPS natively, but I think we can address that by using a local proxy on each machine like stunnel. For example, instead of varnish defining the appservers backend as direct access to appservers.svc.eqiad.wmnet:443, it would define it as connecting to localhost:12345, which is an stunnel daemon configured to connect to appservers.svc.eqiad.wmnet:443 for it.
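
A minimal stunnel client config for that example might look like the following (the local port 12345 is just the placeholder from above, and the CA path is illustrative):

    # sketch: one stunnel client service per HTTPS backend on a cache host
    [appservers-eqiad]
    client  = yes
    accept  = 127.0.0.1:12345
    connect = appservers.svc.eqiad.wmnet:443
    verify  = 2
    CAfile  = /etc/ssl/certs/wmf-internal-ca.pem

The varnish appservers backend definition would then simply point at 127.0.0.1:12345 instead of the service hostname.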

Client cert auth would be based on per-machine certificates. e.g. cp1065.eqiad.wmnet would have a cert for its own hostname for the outbound stunnel proxy to use, and we'd need a local CA that the appservers trust for client certs. The easiest path for this would be to re-use the puppet machine certs and CA, but one issue there is that the last time we re-did the puppet cert infrastructure we inexplicably upgraded them to 4K RSA, which could have too much of a performance impact for this kind of scenario. 2K would have been better. If we're going to re-use puppet certs as client certs, it would be best to fix the 4K problem first.
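
Sketching the client-cert piece under that puppet-cert-reuse assumption (the paths below are the usual puppet 3 ssldir locations, shown only for illustration, not a decision):

    # added to the stunnel client service on e.g. cp1065: present the per-machine cert
    cert = /var/lib/puppet/ssl/certs/cp1065.eqiad.wmnet.pem
    key  = /var/lib/puppet/ssl/private_keys/cp1065.eqiad.wmnet.pem

    # and on the server-side nginx proxy: require a client cert signed by the trusted CA
    ssl_client_certificate /var/lib/puppet/ssl/certs/ca.pem;
    ssl_verify_client on;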

Related Objects

Event Timeline

BBlack raised the priority of this task from to Needs Triage.
BBlack updated the task description. (Show Details)
BBlack added projects: HTTPS, Traffic.
BBlack subscribed.
Restricted Application added a subscriber: Aklapper.

Change 230541 had a related patch set uploaded (by BBlack):
cache::config: replace lvs IP refs with service hostnames

https://gerrit.wikimedia.org/r/230541

BBlack triaged this task as Medium priority. Aug 10 2015, 6:00 PM

Change 230541 merged by BBlack:
cache::config: replace lvs IP refs with service hostnames

https://gerrit.wikimedia.org/r/230541

BBlack set Security to None.

All subtasks are gone, but there are technically still a few edge cases showing up in the trafficserver backend-facing config. Specifically:

$ grep 'replacement: http:' hieradata/common/profile/trafficserver/backend.yaml 
      replacement: http://puppetmaster1001.eqiad.wmnet
      #replacement: http://puppetmaster2001.codfw.wmnet
      replacement: http://contint.wikimedia.org
      replacement: http://cloudweb2001-dev.wikimedia.org
      replacement: http://cloudweb2001-dev.wikimedia.org

Do we need to clean these up in some new subtasks, and/or implement some check to prevent adding new http:// replacements? @ema?

> Do we need to clean these up in some new subtasks

Yup, tasks created!

> and/or implement some check to prevent adding new http:// replacements? @ema?

I think it's enough for now to remember this at code review time and only add a check if we see that in practice new http backends do in fact get added.
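
For what it's worth, if such a check is ever wanted, it could be as simple as a CI shell step along these lines (the allowlist is just the currently-known exceptions from the grep above, and the whole thing is a sketch, not an implemented check):

    # hypothetical check: fail if a new plain-http origin replacement shows up
    if grep 'replacement: http:' hieradata/common/profile/trafficserver/backend.yaml \
        | grep -v -e puppetmaster -e contint.wikimedia.org -e cloudweb2001-dev; then
        echo "ERROR: new http:// replacement added to trafficserver backend config" >&2
        exit 1
    fi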

> $ grep 'replacement: http:' hieradata/common/profile/trafficserver/backend.yaml
>       replacement: http://puppetmaster1001.eqiad.wmnet
>       #replacement: http://puppetmaster2001.codfw.wmnet
>       replacement: http://contint.wikimedia.org
>       replacement: http://cloudweb2001-dev.wikimedia.org
>       replacement: http://cloudweb2001-dev.wikimedia.org
>
> Do we need to clean these up in some new subtasks

This is also already T210411

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

ema claimed this task.

Many of the assumptions made when this task was created have changed since the migration to ATS for cache backends (no more IPSec, the difference between Tier1 and Tier2 DCs is now gone, ...). We are now in a world where all backend caches access the origins via TLS, which I think largely covers what we wanted to achieve here. @BBlack: I'm marking the task as resolved, but of course feel free to reopen / create other tasks as needed if you think that anything is missing.

This comment was removed by taavi.