
Move https termination from nginx to envoy (if possible)
Closed, Resolved · Public

Description

In T227860 we added HTTPS support to Analytics UIs using nginx. This has revealed some missing support in puppet for Buster, and overall it seems that nobody will maintain TLS settings for nginx in the long term (since Traffic moved to ATS). The other services are using envoy, so we should investigate if using it is feasible.

Services to do:

  • hue.wikimedia.org - analytics-tool1001.eqiad.wmnet
  • hue-next.wikimedia.org - an-tool1009.eqiad.wmnet
  • yarn.wikimedia.org - an-tool1008.eqiad.wmnet
  • turnilo.wikimedia.org - an-tool1007.eqiad.wmnet
  • superset.wikimedia.org - analytics-tool1004.eqiad.wmnet
  • analytics.wikimedia.org - thorium.eqiad.wmnet
  • stats.wikimedia.org - thorium.eqiad.wmnet
  • piwik.wikimedia.org - matomo1002.eqiad.wmnet

More context:

We want the services listed in T227860 to go through Envoy instead of directly to Nginx on the service hosts. Envoy will do TLS termination, and then proxy the request to the backend service over http instead of https. We can then remove our custom Nginx based TLS termination from the backend service boxes.

Event Timeline

fdans triaged this task as Medium priority.Dec 23 2019, 5:06 PM
fdans moved this task from Incoming to Ops Week on the Analytics board.

@razzi, I only know how this works about 60%. You should probably thoroughly read through the puppet code starting in profile::tlsproxy::envoy. Trace it all the way down through included classes and try to figure out what is going on. I expect it is installing envoy and setting up certificates and keys and configs to terminate TLS traffic and then proxy to a local HTTP port.

We should ask questions in #wikimedia-sre for help on this when we need it.

@razzi a little bit of context on this; if anything is unclear please ask questions in the task and we'll try to answer asap (or we could chat over Meet, no problem).

Current settings

We have several UIs for analytics, all indicated in the description. They are all set up with the same settings. Let's check one example, turnilo:

  • the client makes a request to https://turnilo.wikimedia.org
  • dns resolution happens: the client gets the IP of an LVS wikimedia host (our load balancers), which in turn forwards the TCP connection to our caching layer (cpXXXX hosts).
  • the TCP connection carrying the https request lands on a cpXXXX host, depending on the user's location (esams, ulsfo, etc.). In your case it will probably be the ulsfo caching layer, in my case esams (easy to check with traceroute; see some examples in https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue)
  • At this point, ATS (Apache Traffic Server) terminates the TLS connection for turnilo.wikimedia.org and looks up which host runs the related backend service, in our case an-tool1007 (see hieradata/common/profile/trafficserver/backend.yaml in puppet).
  • If you are hitting ATS in ulsfo (SFO), a connection to eqiad (IAD) needs to happen, and we use TLS since it is a cross-dc call. In order for ATS to set up a TLS connection with an-tool1007, there must be an http server on the latter that offers a valid TLS certificate for turnilo.wikimedia.org.
  • This last certificate is signed by the puppet CA, which is only used for our internal tools; it is not valid for external clients. In our case we do something probably not optimal, namely we have only one TLS certificate with several SANs (so it is valid for multiple domains). You will not need to care about this since it is already created and in use, but keep in mind that it exists.
  • So on an-tool1007 we have 3 things: nginx, httpd and nodejs. The first is a lightweight proxy that does only TLS termination for the connection from ATS, proxying all the traffic to httpd, which in turn guards the nodejs service (turnilo) with SSO authentication (the famous CAS).
  • Why do we have both nginx and httpd? Couldn't we have only one? Yes we could, but it is easier to have a lightweight proxy like nginx that does only one thing (TLS termination) and that can be reused as-is in multiple places (caching nodes, mediawiki nodes, analytics nodes, etc.).
  • nginx was the TLS terminator used by the caching nodes (now replaced by ATS itself), since Varnish doesn't support TLS, and it has been reused in multiple places over time (including our hosts). For a lot of reasons we have to replace it with envoy, which does basically the same thing but is better supported at Wikimedia (it is, for example, the tool of choice for cross-service TLS connections in kubernetes).
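To make the swap concrete, here is a minimal sketch of the kind of static config envoy needs to play the nginx role described above: terminate TLS on 443 and proxy plaintext HTTP to httpd on 127.0.0.1:80. This is illustrative only — in production the config is generated by puppet via profile::tlsproxy::envoy, and the certificate paths and ports below are assumptions:

```yaml
# Illustrative envoy static config for TLS termination (not the puppet-generated one).
static_resources:
  listeners:
  - address:
      socket_address: { address: 0.0.0.0, port_value: 443 }
    filter_chains:
    - transport_socket:
        name: envoy.transport_sockets.tls
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
          common_tls_context:
            tls_certificates:
            # Hypothetical paths -- the real cert/key come from the puppet CA.
            - certificate_chain: { filename: /etc/ssl/localcerts/yarn.wikimedia.org.crt }
              private_key: { filename: /etc/ssl/private/yarn.wikimedia.org.key }
      filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: tls_term
          route_config:
            virtual_hosts:
            - name: local
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                route: { cluster: local_service }
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  # Plaintext proxy target: httpd on the same host.
  - name: local_service
    connect_timeout: 5s
    type: STATIC
    load_assignment:
      cluster_name: local_service
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: 127.0.0.1, port_value: 80 }
```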

This task is about replacing all the instances of nginx in analytics with envoy. Since both are only doing TLS termination, they are very easy to configure, and it should be a matter of replacing puppet profiles (and doing some cleanup).

For example, let's use turnilo again. We can start from role::analytics_cluster::turnilo, which includes several profiles, like profile::tlsproxy::service. If you check the profile, what it does is basically set up nginx with the configs needed by turnilo, namely:

  • what is the port of the backend (in this case, httpd)
  • what is the TLS certificate of the backend

Where are the above configured? In the hiera config for the role:

# TLS Terminator settings
# Please note: the service name is used to pick up the
# TLS certificate that nginx will use. Since the Analytics
# team has only one certificate for its UIs, then the service
# name listed does not reflect how the nginx proxy is set up.
# turnilo.wikimedia.org and pivot.wikimedia.org are SANs
# of the yarn.wikimedia.org TLS certificate.
profile::tlsproxy::instance::ssl_compatibility_mode: strong
profile::tlsproxy::service::cert_domain_name: yarn.wikimedia.org
profile::tlsproxy::service::upstream_ports:
  - 80
profile::tlsproxy::service::check_uri: "/health_check"
profile::tlsproxy::service::check_service: "turnilo.wikimedia.org"
profile::tlsproxy::service::notes_url: "https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster"
profile::tlsproxy::service::contact_group: 'analytics'

You can explore all the options; one is very weird and deserves more info: profile::tlsproxy::service::cert_domain_name: yarn.wikimedia.org

Why yarn.wikimedia.org and not turnilo.wikimedia.org? This is due to the choice mentioned above of using one TLS certificate with multiple SANs. In this case the cert's domain is yarn.w.o, but turnilo.w.o is among its SANs, so it acts basically as a multi-domain TLS cert. It was not the best choice; we may want to change it in the future.

Anyway, there is another profile called profile::tlsproxy::envoy, which is similar to profile::tlsproxy::service, so we should come up with a procedure to swap all occurrences of the latter with the former on our hosts/VMs.
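For instance, the turnilo hiera above might end up looking something like the following. This is a hypothetical sketch — the key names are assumptions, so verify them against the actual parameters of profile::tlsproxy::envoy in puppet:

```yaml
# Hypothetical envoy equivalent of the nginx hiera above.
# Key names are illustrative -- check profile::tlsproxy::envoy for the real ones.
profile::envoy::ensure: present
profile::tlsproxy::envoy::global_cert_name: 'yarn.wikimedia.org'
profile::tlsproxy::envoy::services:
  - server_names: ['turnilo.wikimedia.org']
    port: 80
```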

Change 633227 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] turnilo: switch from nginx to envoy for tls termination

https://gerrit.wikimedia.org/r/633227

Change 633227 merged by Razzi:
[operations/puppet@production] yarn: add envoy on unprivileged port 8443 for yarn.wikimedia.org

https://gerrit.wikimedia.org/r/633227

Change 634306 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] yarn: replace nginx with envoy for tls

https://gerrit.wikimedia.org/r/634306

Change 634306 merged by Razzi:
[operations/puppet@production] yarn: replace nginx with envoy for tls

https://gerrit.wikimedia.org/r/634306

Mentioned in SAL (#wikimedia-analytics) [2020-10-15T17:57:09Z] <razzi> taking yarn.wikimedia.org offline momentarily to test new tls configuration: T240439

@Ottomata and I tested and deployed the envoy changes on yarn.wikimedia.org.

To test, we added envoy on unprivileged port 8443, created a local ssh tunnel with ssh -N an-tool1008.eqiad.wmnet -L 8443:127.0.0.1:8443, then navigated to https://localhost:8443 and were presented with the CAS login as expected.

To deploy, we created a puppet change moving the envoy port to 443 and removing the nginx profile. We caused a moment of downtime as we stopped nginx and ran puppet agent on the node; when the puppet change had applied, everything worked as expected.

We attempted to remove the nginx apt packages, and found /var/lib/nginx was mounted as a tmpfs; unmounting it with sudo umount /var/lib/nginx allowed sudo apt-get purge nginx-common nginx-full to complete, and in addition to the apt packages, that removed nginx config and the systemd units.

The procedure should work the same on the various other hosts. There will be downtime as we stop nginx and apply the puppet changes, so the next steps are to pick a time to update the internal hosts and communicate that out. For stats.wikimedia.org, an external-facing service, we should find a way to do this with minimal downtime and consider notifying the community in advance that this will be happening.

@razzi for all the sites except stats.wikimedia.org, I think you can just pick a time (next week maybe?), send an email to analytics-announce@lists.wikimedia.org communicating it, and then migrate them. stats.wikimedia.org can be done the same way, but should be communicated out through different channels, although I'm not totally sure which ones. Maybe @Milimetric or @fdans knows?

@razzi @Ottomata I am ok with the procedure, nice work! For stats.wikimedia.org we can do as I suggested, namely having envoy on 8443 and changing the settings for ATS. It is surely a little longer but it will not require any downtime or maintenance announcement schedule (and it would also give Razzi the chance to experiment with a change to ATS, talking with the Traffic team, etc.). Anyway, I'll let you two decide :)

Something really AWESOME that I just discovered is https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=analytics&var-origin_instance=All&var-destination=All

Prometheus is instructed to poll all hosts that include profile::envoy, so we are going to get a lot of useful metrics after this migration!

@elukey I like that plan to keep both proxies running and switch ATS to 8443.

Change 634660 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] hue: switch from nginx to envoy for tls

https://gerrit.wikimedia.org/r/634660

Change 634661 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] turnilo: use envoy instead of nginx for tls

https://gerrit.wikimedia.org/r/634661

Change 634662 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] superset: use envoy instead of nginx for tls

https://gerrit.wikimedia.org/r/634662

Change 634664 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] piwik: use envoy instead of nginx for tls

https://gerrit.wikimedia.org/r/634664

Change 634667 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] stats: use envoy instead of nginx for tls

https://gerrit.wikimedia.org/r/634667

Change 634669 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] stats: temporarily switch analytics sites to port 8443

https://gerrit.wikimedia.org/r/634669

I made patches for the various hosts that need this upgrade. The trickiest one will be stats.wikimedia.org (which also hosts analytics.wikimedia.org and datasets.wikimedia.org); for that, I have 2 patches.

At this point, everything will be working, and we can disable nginx manually, then add a patch to remove nginx from the config. My hope is that if we then change the config to run envoy on port 443, puppet will not "clean up" the envoy process on 8443, so that there will actually be 2 envoy processes on one host. Then we can switch traffic back to the envoy process on port 443 and manually stop the one on port 8443, and we'll be done.

After discussing with @elukey, we can leave stats.wikimedia.org running on port 8443, since it's not an address end users will see.

@razzi as a note for the future, another useful test is via openssl s_client, like the following:

echo y | openssl s_client -CApath /etc/ssl/certs/ -connect analytics-tool1001.eqiad.wmnet:443
echo y | openssl s_client -CApath /etc/ssl/certs/ -connect analytics-tool1001.eqiad.wmnet:443 | openssl x509 -text

The former is just a test to see if the certificate is valid, useful if you run it from, say, a cpXXXX node (where ATS runs).

In the latter's output, two things are worth checking:

  1. TLS SANs
X509v3 Subject Alternative Name:
    DNS:yarn.wikimedia.org, DNS:hue.wikimedia.org, DNS:hue-next.wikimedia.org, DNS:superset.wikimedia.org, DNS:pivot.wikimedia.org, DNS:turnilo.wikimedia.org, DNS:stats.wikimedia.org, DNS:analytics.wikimedia.org, DNS:piwik.wikimedia.org, DNS:datasets.wikimedia.org

If the domain that you are testing is not in there, then (in our config of one multi-purpose cert for multiple domains) ATS will likely fail to establish a TLS connection.
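If you want to practice this SAN inspection without touching production, you can generate a throwaway self-signed cert with a SAN list and inspect it the same way. This needs OpenSSL 1.1.1+ for -addext; the file paths are illustrative, and the real cert is of course signed by the puppet CA, not self-signed:

```shell
# Create a throwaway cert whose CN is yarn.wikimedia.org with extra SANs,
# mimicking the multi-domain setup described above.
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/san_test.key -out /tmp/san_test.crt -days 1 \
  -subj "/CN=yarn.wikimedia.org" \
  -addext "subjectAltName=DNS:yarn.wikimedia.org,DNS:turnilo.wikimedia.org"

# Inspect the SAN list, just as you would on the s_client output.
openssl x509 -in /tmp/san_test.crt -noout -text | grep -A1 "Subject Alternative Name"
```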

  2. CA that issued the certificate
Issuer: CN = Puppet CA: palladium.eqiad.wmnet

This indicates that the CA is the one that you expect (in this case, the puppet internal one).

@razzi something useful to do in the task is also to make a list of domain -> backend that you will work on, so others can double check. Something like:

  • stats.wikimedia.org -> analytics-tool1001
  • etc..

For example, Hue runs on two nodes (hue.wikimedia.org and hue-next.wikimedia.org), so you'll have two swap moves to do instead of one. Having a list in here will avoid confusion when executing the plan :)

razzi updated the task description.

Good idea @elukey - added.

Change 634664 merged by Razzi:
[operations/puppet@production] piwik: use envoy instead of nginx for tls

https://gerrit.wikimedia.org/r/634664

Change 634662 merged by Razzi:
[operations/puppet@production] superset: use envoy instead of nginx for tls

https://gerrit.wikimedia.org/r/634662

Change 634661 merged by Razzi:
[operations/puppet@production] turnilo: use envoy instead of nginx for tls

https://gerrit.wikimedia.org/r/634661

Change 634660 merged by Razzi:
[operations/puppet@production] hue: switch from nginx to envoy for tls

https://gerrit.wikimedia.org/r/634660

Change 634667 merged by Razzi:
[operations/puppet@production] stats: Add envoy on port 8443 alongside nginx

https://gerrit.wikimedia.org/r/634667

Very strange: when I tested piwik.wikimedia.org I got a white-background screen with upstream connect error or disconnect/reset before headers. reset reason: connection termination. The connection was redirecting me to idp.wikimedia.org. Then I restarted httpd and everything started working again. I checked the logs and didn't see much, so I'm not sure what was happening. Let's keep an eye on it.

Change 634669 merged by Razzi:
[operations/puppet@production] stats: switch analytics sites to use Envoy on port 8443

https://gerrit.wikimedia.org/r/634669

Change 636514 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] stats: Remove nginx from thorium

https://gerrit.wikimedia.org/r/636514

Change 636514 merged by Razzi:
[operations/puppet@production] stats: Remove nginx from thorium

https://gerrit.wikimedia.org/r/636514

Change 638185 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] nginx: Remove profile::tlsproxy::service

https://gerrit.wikimedia.org/r/638185

Change 638185 merged by Razzi:
[operations/puppet@production] nginx: Remove profile::tlsproxy::service

https://gerrit.wikimedia.org/r/638185