
Change TLS/load balancer configuration for cloudelastic
Closed, Resolved; Public

Description

Per the parent ticket, migrating cloudelastic to private IPs also means its domain will change from wikimedia.org to eqiad.wmnet.

That means we can no longer rely on acme-chief and letsencrypt to provide certificates (to the best of my knowledge, LE only supports registered TLDs).

I believe we'll also have to add some service discovery/ATS config.

Creating this ticket to:

  • Prepare new traffic path (ATS/pybal)
  • Prepare new TLS configuration (CFSSL is preferred; see this CR for an example of how this might work)

Event Timeline

We'll need Cloudelastic TLS to look more like production Elastic's TLS config, see
modules/profile/manifests/elasticsearch/cirrus.pp
and
modules/elasticsearch/manifests/tlsproxy.pp

We'll probably need a discovery cert too. Will pick this up tomorrow.

Change 992547 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: Configure post-migration TLS

https://gerrit.wikimedia.org/r/992547

This is a bit more complicated, unfortunately. The node FQDNs (cloudelasticXXXX) will indeed move to eqiad.wmnet and need to use CFSSL certs, but the LVS VIP is not moving and will stay a wikimedia.org name. Certificates for cloudelastic.wikimedia.org need to stay as LE certs for the clients in WMCS.
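
To make the split described in this comment concrete, here is a minimal Python sketch. It is not taken from the Puppet code: the function name `issuer_for`, the returned labels, and the example node name are all hypothetical, and the only rule it encodes is the one stated above (public wikimedia.org names keep Let's Encrypt certs via acme-chief, internal .wmnet names need an internal CA such as CFSSL).

```python
def issuer_for(name: str) -> str:
    """Return which certificate source a DNS name would use (illustrative only)."""
    # Normalize: drop a trailing root dot and compare case-insensitively.
    name = name.rstrip(".").lower()

    # Public names can be validated by a public CA such as Let's Encrypt.
    if name == "wikimedia.org" or name.endswith(".wikimedia.org"):
        return "acme-chief"

    # .wmnet is an internal-only domain, so a public CA cannot issue for it.
    if name.endswith(".wmnet"):
        return "cfssl"

    raise ValueError(f"no certificate source known for {name!r}")


if __name__ == "__main__":
    # The LVS VIP name from the ticket, plus a hypothetical node FQDN.
    for host in ("cloudelastic.wikimedia.org", "cloudelastic1010.eqiad.wmnet"):
        print(host, "->", issuer_for(host))
```

Under these assumptions, the service name and the node names end up with different certificate sources even though they refer to the same hosts.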

Change 992748 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: lay the groundwork for private IP migration

https://gerrit.wikimedia.org/r/992748

Change 993014 had a related patch set uploaded (by Bking; author: Bking):

[operations/dns@master] cloudelastic: add CNAME for migration canary

https://gerrit.wikimedia.org/r/993014

Change 993103 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: use CFSSL for TLS on canary

https://gerrit.wikimedia.org/r/993103

Change 993148 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: apply cloudelastic role to canary

https://gerrit.wikimedia.org/r/993148

Change 992547 merged by Bking:

[operations/puppet@production] cloudelastic: config changes for migration canary

https://gerrit.wikimedia.org/r/992547

Change 993150 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: remove references to cloudelastic1010

https://gerrit.wikimedia.org/r/993150

Change 993150 merged by Bking:

[operations/puppet@production] cloudelastic: remove references to cloudelastic1010

https://gerrit.wikimedia.org/r/993150

Change 993764 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: Add migration canary to cloudelastic cluster

https://gerrit.wikimedia.org/r/993764

Change 993148 abandoned by Bking:

[operations/puppet@production] cloudelastic: apply cloudelastic role to canary

Reason:

superseded by I0627fb939764e5bd12156d1bffb410e3732e36ba

https://gerrit.wikimedia.org/r/993148

Change 993103 abandoned by Bking:

[operations/puppet@production] cloudelastic: use CFSSL for TLS on canary

Reason:

superseded by I0627fb939764e5bd12156d1bffb410e3732e36ba

https://gerrit.wikimedia.org/r/993103

Change 993764 merged by Bking:

[operations/puppet@production] cloudelastic: Add migration canary to cloudelastic cluster

https://gerrit.wikimedia.org/r/993764

Change 994321 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: bind to all interfaces

https://gerrit.wikimedia.org/r/994321

Change 994321 merged by Bking:

[operations/puppet@production] cloudelastic: bind to all interfaces

https://gerrit.wikimedia.org/r/994321

https://etherpad.wikimedia.org/p/cloudelastic-T355617 proposes changing the current traffic ingress method from direct access to the LVS service to a setup where ingress traffic is routed via the CDN. This seems unrelated to the reimage and is also problematic, because the CDN only listens on the standard HTTPS port (443) and not on the alternative ports that cloudelastic uses. Instead, I think the simplest option would be:

  1. Update the Puppetization to serve the cloudelastic acme-chief certificate on the public ports (this fixes the problem I mentioned in T355720#9483823)
  2. Continue to use the existing LVS load balancer as usual. When migrating nodes to the new .eqiad.wmnet names, update the names in conftool-data and pool them as they were before the renames.
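
The reason option 1 works is that clients connect to the service name, not to the node FQDN, so the certificate only has to cover cloudelastic.wikimedia.org. The Python sketch below illustrates that check with a simplified subject-alternative-name match (exact names plus a single leftmost wildcard label). The helper names `san_matches` and `cert_covers` and the example node name are hypothetical; this is not how Puppet, acme-chief, or any TLS library performs the check.

```python
def san_matches(hostname: str, pattern: str) -> bool:
    """Simplified SAN match: exact name, or '*' as the whole leftmost label."""
    host_labels = hostname.lower().rstrip(".").split(".")
    pat_labels = pattern.lower().rstrip(".").split(".")

    # A wildcard stands for exactly one label, so label counts must agree.
    if len(host_labels) != len(pat_labels):
        return False

    if pat_labels[0] == "*":
        return host_labels[1:] == pat_labels[1:]
    return host_labels == pat_labels


def cert_covers(hostname: str, sans: list[str]) -> bool:
    """True if any SAN entry on the certificate matches the hostname."""
    return any(san_matches(hostname, san) for san in sans)


if __name__ == "__main__":
    sans = ["cloudelastic.wikimedia.org"]  # assumed SAN list for illustration
    # Clients in WMCS connect to the service name, which is covered.
    print(cert_covers("cloudelastic.wikimedia.org", sans))
    # A (hypothetical) node FQDN is not covered, and does not need to be.
    print(cert_covers("cloudelastic1010.eqiad.wmnet", sans))
```

With this model, renaming the nodes to .eqiad.wmnet does not affect clients as long as the service name stays in the certificate.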

Change 992748 abandoned by Bking:

[operations/puppet@production] cloudelastic: enable DNS discovery/VIP for test service

Reason:

We are not going to use discovery service after all

https://gerrit.wikimedia.org/r/992748

Change 993014 abandoned by Bking:

[operations/dns@master] cloudelastic: add CNAME for migration canary

Reason:

we are not going to use service discovery after all

https://gerrit.wikimedia.org/r/993014

Change 994338 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: use acme-chief/letsencrypt with canary

https://gerrit.wikimedia.org/r/994338

Change 994338 merged by Bking:

[operations/puppet@production] cloudelastic: use acme-chief/letsencrypt with canary

https://gerrit.wikimedia.org/r/994338

Change 994763 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: allow wmnet hosts to request certs from acme-chief

https://gerrit.wikimedia.org/r/994763

Change 994763 merged by Bking:

[operations/puppet@production] cloudelastic: allow wmnet hosts to request certs from acme-chief

https://gerrit.wikimedia.org/r/994763

Change 994800 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: Don't validate certs against FQDN

https://gerrit.wikimedia.org/r/994800

Change 994800 merged by Bking:

[operations/puppet@production] cloudelastic: stop issuing certs for soon-to-be defunct FQDNs

https://gerrit.wikimedia.org/r/994800

Change 994838 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: stop issuing certs for soon-to-be defunct FQDNs

https://gerrit.wikimedia.org/r/994838

Change 994838 merged by Bking:

[operations/puppet@production] cloudelastic: stop issuing certs for soon-to-be defunct FQDNs

https://gerrit.wikimedia.org/r/994838

We're about to roll back the last patch. Here's the error we're getting from puppet:

Error: /Stage[main]/Profile::Elasticsearch::Cirrus/Elasticsearch::Tlsproxy[cloudelastic-chi-eqiad]/Tlsproxy::Localssl[cloudelastic-chi-eqiad]/Acme_chief::Cert[cloudelastic.wikimedia.org]/File[/etc/acmecerts/cloudelastic.wikimedia.org]: Could not evaluate: Could not retrieve information from environment production source(s) puppet://acmechief1001.eqiad.wmnet/acmedata/cloudelastic.wikimedia.org

Change 995041 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: stop issuing certs for soon-to-be defunct FQDNs

https://gerrit.wikimedia.org/r/995041

Change 995041 merged by Bking:

[operations/puppet@production] cloudelastic: stop issuing certs for soon-to-be defunct FQDNs

https://gerrit.wikimedia.org/r/995041

Change 995107 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: remove soon-to-be-defunct hostnames from SNI

https://gerrit.wikimedia.org/r/995107

Change 995110 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: Add private IP canary back to load balancer pool

https://gerrit.wikimedia.org/r/995110

Change 995110 merged by Bking:

[operations/puppet@production] cloudelastic: Add private IP canary back to load balancer pool

https://gerrit.wikimedia.org/r/995110

After the above changes, we were able to add our canary back to LVS. It's passing health checks and receiving traffic, so we should just be able to repeat the same process for all hosts as part of the larger migration.

As such, I'm closing out this ticket. Thanks to @taavi for your advice and patience! Work continues in T355617.

Change 995107 merged by Bking:

[operations/puppet@production] cloudelastic: remove unneeded hostnames from cert alt names

https://gerrit.wikimedia.org/r/995107