
Move https termination from nginx to envoy (if possible)
Closed, Resolved · Public

Description

In T227860 we added HTTPS support to Analytics UIs using nginx. This has revealed some missing support in puppet for Buster, and overall it seems that nobody will maintain TLS settings for nginx in the long term (since Traffic moved to ATS). The other services are using envoy, so we should investigate if using it is feasible.

Services to do:

  • hue.wikimedia.org - analytics-tool1001.eqiad.wmnet
  • hue-next.wikimedia.org - an-tool1009.eqiad.wmnet
  • yarn.wikimedia.org - an-tool1008.eqiad.wmnet
  • turnilo.wikimedia.org - an-tool1007.eqiad.wmnet
  • superset.wikimedia.org - analytics-tool1004.eqiad.wmnet
  • analytics.wikimedia.org - thorium.eqiad.wmnet
  • stats.wikimedia.org - thorium.eqiad.wmnet
  • piwik.wikimedia.org - matomo1002.eqiad.wmnet

More context:

We want the services listed in T227860 to go through Envoy instead of directly to Nginx on the service hosts. Envoy will do TLS termination, and then proxy the request to the backend service over http instead of https. We can then remove our custom Nginx based TLS termination from the backend service boxes.

Event Timeline

fdans triaged this task as Medium priority.Dec 23 2019, 5:06 PM
fdans moved this task from Incoming to Ops Week on the Analytics board.

@razzi, I only know how this works about 60%. You should probably thoroughly read through the puppet code starting in profile::tlsproxy::envoy. Trace it all the way down through included classes and try to figure out what is going on. I expect it is installing envoy and setting up certificates and keys and configs to terminate TLS traffic and then proxy to a local HTTP port.

We should ask questions in #wikimedia-sre for help on this when we need it.

@razzi a little bit of context on this; if anything is unclear please ask questions in the task and we'll try to answer asap (or we could chat over Meet, no problem).

Current settings

We have several UIs for analytics, all indicated in the description. They are all set up with the same settings. Let's check one example, turnilo:

  • the client makes a request to https://turnilo.wikimedia.org
  • dns resolution happens: the client gets the IP of an LVS wikimedia host (our load balancers), which in turn forwards the TCP connection to our caching layer (cpXXXX hosts).
  • the TCP connection carrying the https request lands on a cpXXXX host, depending on the user's location (esams, ulsfo, etc.). In your case it will probably be the ulsfo caching layer, in my case esams (easy to check with traceroute; see some examples in https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue)
  • At this point, ATS (Apache Traffic Server) terminates the TLS connection for turnilo.wikimedia.org and looks up which host runs the related backend service, in our case an-tool1007 (see hieradata/common/profile/trafficserver/backend.yaml in puppet).
  • If you are hitting ATS in ulsfo (SFO), a connection to eqiad (IAD) needs to happen, and we use TLS since it is a cross-dc call. In order for ATS to set up a TLS connection with an-tool1007, there must be an http server on the latter that offers a valid TLS certificate for turnilo.wikimedia.org.
  • This last certificate is signed by the puppet CA, which is only used for our internal tools; it is not valid for external clients. In our case we do something probably not optimal, namely we have only one TLS certificate with several SANs (so it is valid for multiple domains). You will not need to care about this since it is already created and in use, but keep in mind that it exists.
  • So on an-tool1007 we have 3 things: nginx, httpd and nodejs. The first is a lightweight proxy that does only TLS termination for the connection from ATS, proxying all the traffic to httpd, which in turn guards the nodejs service (turnilo) with SSO authentication (the famous CAS).
  • Why do we have both nginx and httpd? Couldn't we have only one? Yes we could, but it is easier to have a lightweight proxy like nginx that does only one thing (TLS termination) and that can be reused as-is in multiple places (caching nodes, mediawiki nodes, analytics nodes, etc.).
  • nginx was the TLS terminator used by the caching nodes (now replaced by ATS itself), since Varnish doesn't support TLS, and it has been reused in multiple places over time (including our hosts). For a lot of reasons we have to replace it with envoy, which does basically the same thing but is better supported at Wikimedia (it is, for example, the tool of choice for cross-service TLS connections in kubernetes).
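To make the swap concrete, here is a minimal sketch of the kind of static config envoy needs to play the nginx role described above: terminate TLS on 443 and proxy plaintext HTTP to httpd on 127.0.0.1:80. This is illustrative only — in production the config is generated by puppet via profile::tlsproxy::envoy, and the certificate paths and ports below are assumptions:

```yaml
# Illustrative envoy static config for TLS termination (not the puppet-generated one).
static_resources:
  listeners:
  - address:
      socket_address: { address: 0.0.0.0, port_value: 443 }
    filter_chains:
    - transport_socket:
        name: envoy.transport_sockets.tls
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
          common_tls_context:
            tls_certificates:
            # Hypothetical paths -- the real cert/key come from the puppet CA.
            - certificate_chain: { filename: /etc/ssl/localcerts/yarn.wikimedia.org.crt }
              private_key: { filename: /etc/ssl/private/yarn.wikimedia.org.key }
      filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: tls_term
          route_config:
            virtual_hosts:
            - name: local
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                route: { cluster: local_service }
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  # Plaintext proxy target: httpd on the same host.
  - name: local_service
    connect_timeout: 5s
    type: STATIC
    load_assignment:
      cluster_name: local_service
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: 127.0.0.1, port_value: 80 }
```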

This task is about replacing all the instances of nginx in analytics with envoy. Since both are only doing TLS termination, they are very easy to configure, and it should be a matter of replacing puppet profiles (and doing some cleanup).

For example, let's use turnilo again. We can start from role::analytics_cluster::turnilo, which includes several profiles, like profile::tlsproxy::service. If you check the profile, what it does is basically set up nginx with the configs needed by turnilo, namely:

  • what is the port of the backend (in this case, httpd)
  • what is the TLS certificate of the backend

Where are the above configured? In the hiera config for the role:

# TLS Terminator settings
# Please note: the service name is used to pick up the
# TLS certificate that nginx will use. Since the Analytics
# team has only one certificate for its UIs, then the service
# name listed does not reflect how the nginx proxy is set up.
# turnilo.wikimedia.org and pivot.wikimedia.org are SANs
# of the yarn.wikimedia.org TLS certificate.
profile::tlsproxy::instance::ssl_compatibility_mode: strong
profile::tlsproxy::service::cert_domain_name: yarn.wikimedia.org
profile::tlsproxy::service::upstream_ports:
  - 80
profile::tlsproxy::service::check_uri: "/health_check"
profile::tlsproxy::service::check_service: "turnilo.wikimedia.org"
profile::tlsproxy::service::notes_url: "https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster"
profile::tlsproxy::service::contact_group: 'analytics'

You can explore all the options; one is very weird and deserves more info: profile::tlsproxy::service::cert_domain_name: yarn.wikimedia.org

Why yarn.wikimedia.org and not turnilo.wikimedia.org? This is due to the choice mentioned above of using one TLS certificate with multiple SANs. In this case the cert's domain is yarn.w.o, but turnilo.w.o is among its SANs, so it acts basically as a multi-domain TLS cert. It was not the best choice; we may want to change it in the future.

Anyway, there is another profile called profile::tlsproxy::envoy, which is similar to profile::tlsproxy::service, so we should come up with a procedure to swap all occurrences of the latter with the former on our hosts/VMs.
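For instance, the turnilo hiera above might end up looking something like the following. This is a hypothetical sketch — the key names are assumptions, so verify them against the actual parameters of profile::tlsproxy::envoy in puppet:

```yaml
# Hypothetical envoy equivalent of the nginx hiera above.
# Key names are illustrative -- check profile::tlsproxy::envoy for the real ones.
profile::envoy::ensure: present
profile::tlsproxy::envoy::global_cert_name: 'yarn.wikimedia.org'
profile::tlsproxy::envoy::services:
  - server_names: ['turnilo.wikimedia.org']
    port: 80
```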

Change 633227 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] turnilo: switch from nginx to envoy for tls termination

https://gerrit.wikimedia.org/r/633227

Change 633227 merged by Razzi:
[operations/puppet@production] yarn: add envoy on unprivileged port 8443 for yarn.wikimedia.org

https://gerrit.wikimedia.org/r/633227

Change 634306 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] yarn: replace nginx with envoy for tls

https://gerrit.wikimedia.org/r/634306

Change 634306 merged by Razzi:
[operations/puppet@production] yarn: replace nginx with envoy for tls

https://gerrit.wikimedia.org/r/634306

Mentioned in SAL (#wikimedia-analytics) [2020-10-15T17:57:09Z] <razzi> taking yarn.wikimedia.org offline momentarily to test new tls configuration: T240439

@Ottomata and I tested and deployed the envoy changes on yarn.wikimedia.org.

To test, we added envoy on unprivileged port 8443, created a local ssh tunnel with ssh -N an-tool1008.eqiad.wmnet -L 8443:127.0.0.1:8443, then navigated to https://localhost:8443 and were presented with the CAS login as expected.

To deploy, we created a puppet change moving the envoy port to 443 and removing the nginx profile. We caused a moment of downtime as we stopped nginx and ran puppet agent on the node; when the puppet change had applied, everything worked as expected.

We attempted to remove the nginx apt packages, and found /var/lib/nginx was mounted as a tmpfs; unmounting it with sudo umount /var/lib/nginx allowed sudo apt-get purge nginx-common nginx-full to complete, and in addition to the apt packages, that removed nginx config and the systemd units.

The procedure should work the same on the various other hosts. There will be downtime as we stop nginx and apply the puppet changes, so the next steps are to pick a time to update the internal hosts and communicate that out. For stats.wikimedia.org, an external-facing service, we should find a way to do this with minimal downtime and consider notifying the community in advance that this will be happening.

@razzi for all the sites except stats.wikimedia.org, I think you can just pick a time (next week maybe?), send an email to analytics-announce@lists.wikimedia.org communicating it, and then migrate them. stats.wikimedia.org can be done the same way, but should be communicated out through different channels, although I'm not totally sure which ones. Maybe @Milimetric or @fdans knows?

@razzi @Ottomata I am ok with the procedure, nice work! For stats.wikimedia.org we can do as I suggested, namely having envoy on 8443 and changing the settings for ATS. It is surely a little longer but it will not require any downtime or maintenance announcement schedule (and it would also give Razzi the chance to experiment with a change to ATS, talking with the Traffic team, etc.). Anyway, I'll let you two decide :)

Something really AWESOME that I just discovered is https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=analytics&var-origin_instance=All&var-destination=All

Prometheus is instructed to poll all hosts that include profile::envoy, so we are going to get a lot of useful metrics after this migration!

@elukey I like that plan to keep both proxies running and switch ATS to 8443.

Change 634660 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] hue: switch from nginx to envoy for tls

https://gerrit.wikimedia.org/r/634660

Change 634661 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] turnilo: use envoy instead of nginx for tls

https://gerrit.wikimedia.org/r/634661

Change 634662 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] superset: use envoy instead of nginx for tls

https://gerrit.wikimedia.org/r/634662

Change 634664 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] piwik: use envoy instead of nginx for tls

https://gerrit.wikimedia.org/r/634664

Change 634667 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] stats: use envoy instead of nginx for tls

https://gerrit.wikimedia.org/r/634667

Change 634669 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] stats: temporarily switch analytics sites to port 8443

https://gerrit.wikimedia.org/r/634669

I made patches for the various hosts that need this upgrade. The trickiest one will be stats.wikimedia.org (which also hosts analytics.wikimedia.org and datasets.wikimedia.org); for that, I have 2 patches.

At this point, everything will be working, and we can disable nginx manually, then add a patch to remove nginx from the config. My hope is that if we then change the config to run envoy on port 443, puppet will not "clean up" the envoy process on 8443, so that there will actually be 2 envoy processes on one host. Then we can switch traffic back to the envoy process on port 443 and manually stop the one on port 8443, and we'll be done.

After discussing with @elukey, we can leave stats.wikimedia.org running on port 8443, since it's not an address end users will see.

@razzi as a note for the future, another useful test is via openssl s_client, like the following:

echo y | openssl s_client -CApath /etc/ssl/certs/ -connect analytics-tool1001.eqiad.wmnet:443
echo y | openssl s_client -CApath /etc/ssl/certs/ -connect analytics-tool1001.eqiad.wmnet:443 | openssl x509 -text

The former is just a test to see if the certificate is valid, useful if you run it from, say, a cpXXXX node (where ATS runs).

In the latter's output, two things are worth checking:

  1. TLS SANs
X509v3 Subject Alternative Name:
    DNS:yarn.wikimedia.org, DNS:hue.wikimedia.org, DNS:hue-next.wikimedia.org, DNS:superset.wikimedia.org, DNS:pivot.wikimedia.org, DNS:turnilo.wikimedia.org, DNS:stats.wikimedia.org, DNS:analytics.wikimedia.org, DNS:piwik.wikimedia.org, DNS:datasets.wikimedia.org

If the domain that you are testing is not in there, then (in our config of one multi-purpose cert for multiple domains) ATS will likely fail to establish a TLS connection.
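If you want to practice this SAN inspection without touching production, you can generate a throwaway self-signed cert with a SAN list and inspect it the same way. This needs OpenSSL 1.1.1+ for -addext; the file paths are illustrative, and the real cert is of course signed by the puppet CA, not self-signed:

```shell
# Create a throwaway cert whose CN is yarn.wikimedia.org with extra SANs,
# mimicking the multi-domain setup described above.
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/san_test.key -out /tmp/san_test.crt -days 1 \
  -subj "/CN=yarn.wikimedia.org" \
  -addext "subjectAltName=DNS:yarn.wikimedia.org,DNS:turnilo.wikimedia.org"

# Inspect the SAN list, just as you would on the s_client output.
openssl x509 -in /tmp/san_test.crt -noout -text | grep -A1 "Subject Alternative Name"
```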

  2. CA that issued the certificate
Issuer: CN = Puppet CA: palladium.eqiad.wmnet

This indicates that the CA is the one that you expect (in this case, the puppet internal one).

@razzi something useful to do in the task is also to make a list of domain -> backend that you will work on, so others can double check. Something like:

  • stats.wikimedia.org -> analytics-tool1001
  • etc..

For example, Hue runs on two nodes (hue.wikimedia.org and hue-next.wikimedia.org), so you'll have two swap moves to do instead of one. Having a list in here will avoid confusion when executing the plan :)

razzi updated the task description.

Good idea @elukey - added.

Change 634664 merged by Razzi:
[operations/puppet@production] piwik: use envoy instead of nginx for tls

https://gerrit.wikimedia.org/r/634664

Change 634662 merged by Razzi:
[operations/puppet@production] superset: use envoy instead of nginx for tls

https://gerrit.wikimedia.org/r/634662

Change 634661 merged by Razzi:
[operations/puppet@production] turnilo: use envoy instead of nginx for tls

https://gerrit.wikimedia.org/r/634661

Change 634660 merged by Razzi:
[operations/puppet@production] hue: switch from nginx to envoy for tls

https://gerrit.wikimedia.org/r/634660

Change 634667 merged by Razzi:
[operations/puppet@production] stats: Add envoy on port 8443 alongside nginx

https://gerrit.wikimedia.org/r/634667

Very strange: when I tested piwik.wikimedia.org I got a white-background screen with upstream connect error or disconnect/reset before headers. reset reason: connection termination. The connection was redirecting me to idp.wikimedia.org. Then I restarted httpd and everything started working again. I checked the logs and didn't see much, so I'm not sure what was happening. Let's keep an eye on it.

Change 634669 merged by Razzi:
[operations/puppet@production] stats: switch analytics sites to use Envoy on port 8443

https://gerrit.wikimedia.org/r/634669

Change 636514 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] stats: Remove nginx from thorium

https://gerrit.wikimedia.org/r/636514

Change 636514 merged by Razzi:
[operations/puppet@production] stats: Remove nginx from thorium

https://gerrit.wikimedia.org/r/636514

Change 638185 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] nginx: Remove profile::tlsproxy::service

https://gerrit.wikimedia.org/r/638185

Change 638185 merged by Razzi:
[operations/puppet@production] nginx: Remove profile::tlsproxy::service

https://gerrit.wikimedia.org/r/638185