Page MenuHomePhabricator

Using port in Host header for thanos-swift / thanos-query breaks vhost selection
Open, MediumPublic

Description

While debugging probes in the parent task I came across this behaviour:

  • thanos-fe hosts have TLS terminated by envoy with two virtual hosts: thanos-swift and thanos-query (with corresponding svc and discovery entries)
  • we're probing both services by:
    • connecting to thanos-swift.svc.SITE.wmnet / thanos-query.svc.SITE.wmnet
    • sending the corresponding thanos-swift / thanos-query discovery names as SNI
    • using the svc names plus port as Host headers (though using discovery entries doesn't change behaviour)

I noticed that including the port number seems to break vhost selection: (404 I suspect is because thanos-query is selected as a vhost and /healthcheck doesn't exist there)

prometheus1003:~$ curl --header 'Host: thanos-swift.discovery.wmnet:443'  https://thanos-swift.discovery.wmnet:443/healthcheck -v 2>&1 | grep HTTP/
> GET /healthcheck HTTP/1.1
< HTTP/1.1 404 Not Found
prometheus1003:~$ curl --header 'Host: thanos-swift.discovery.wmnet'  https://thanos-swift.discovery.wmnet:443/healthcheck -v 2>&1 | grep HTTP/
> GET /healthcheck HTTP/1.1
< HTTP/1.1 200 OK

cc @JMeybohm @Joe in case you have come across this before ?

Event Timeline

Looking at the configuration on thanos-fe1001, I see that the listener has:

  • SNI support only for thanos-query
  • NON-SNI support for both thanos-query and thanos-swift
  • The non-sni support has domain-based routing and that seems to break down if you add the port in the Host header

I suspect that fixing the SNI setup (or fixing the envoy manifests if the error is there!) should resolve your immediate problem.

I'll take a further look

Joe triaged this task as Medium priority.Jan 26 2022, 12:22 PM
Joe added projects: serviceops, envoy.

Change 757452 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] thanos::frontend: fix envoy configuration

https://gerrit.wikimedia.org/r/757452

For the record, what happened here is as follows:

  • envoy had thanos-query registered via SNI, then a non-sni virtualhost for both sites
  • Any request with Host: <hostname>:port to the non-sni virtualhost would fail because of a bug (still to fix!) in the envoy configuration.

So for instance:

# Calls thanos-query, SNI vhost responds, correctly directs to the backend
$ curl --header 'Host: thanos-query.discovery.wmnet:443'  https://thanos-query.discovery.wmnet:443/healthcheck -Is | grep 'HTTP\|server'
HTTP/1.1 404 Not Found
server: Apache
# Calls thanos-swift, so falls back to the non-sni vhost in envoy that doesn't recognize the host:port format
$ curl --header 'Host: thanos-query.discovery.wmnet:443'  https://thanos-swift.discovery.wmnet:443/healthcheck -Is | grep 'HTTP\|server'
HTTP/1.1 404 Not Found
server: envoy
# Same thing happens for thanos-swift in the host header, it's again the envoy vhost failing to recognize ports
$ curl --header 'Host: thanos-swift.discovery.wmnet:443'  https://thanos-swift.discovery.wmnet:443/healthcheck -Is | grep 'HTTP\|server'
HTTP/1.1 404 Not Found
server: envoy

I'll now merge the patch adding proper SNI support to both backends, which should fix the issue at least for SNI-enabled clients

Change 757452 merged by Giuseppe Lavagetto:

[operations/puppet@production] thanos::frontend: fix envoy configuration

https://gerrit.wikimedia.org/r/757452

Mentioned in SAL (#wikimedia-operations) [2022-02-02T11:28:16Z] <_joe_> depooling thanos-fe1001 for testing T300119

Results after merging the patch on thanos-fe1001 solve our immediate problem for SNI clients.

Specifically:

$ curl --resolve thanos-swift.discovery.wmnet:443:$(dig +short thanos-fe1001.eqiad.wmnet) -Is --header 'Host: thanos-swift.discovery.wmnet:443'  https://thanos-swift.discovery.wmnet:443/healthcheck | grep 'HTTP\|server'
HTTP/1.1 200 OK
server: envoy

but non-sni clients would still have the same issue if the header included a port.

In fact using the IP addess disables/bypasses SNI, and we still get a 404:

$ curl -Iks --header 'Host: thanos-swift.discovery.wmnet:443'  https://$(dig +short thanos-fe1001.eqiad.wmnet):443/healthcheck 
HTTP/1.1 404 Not Found
date: Wed, 02 Feb 2022 11:42:45 GMT
server: envoy
transfer-encoding: chunked

Given the "expected" if not necessarily correct behaviour has been introduced in envoy in https://github.com/envoyproxy/envoy/pull/10960, this should be naturally fixed once T300324 is underway.

Actually I'll suggest there we start from the thanos-fe cluster with the upgrade to envoy 1.18 as that should fix the issue.

I just upgraded thanos-fe to envoy 1.18.3, but out of the box I see the same behavior:

rzl@cumin2002:~$ curl --resolve thanos-swift.discovery.wmnet:443:$(dig +short thanos-fe1001.eqiad.wmnet) -Is --header 'Host: thanos-swift.discovery.wmnet:443'  https://thanos-swift.discovery.wmnet:443/healthcheck | grep 'HTTP\|server'
HTTP/1.1 200 OK
server: envoy

rzl@cumin2002:~$ curl -Iks --header 'Host: thanos-swift.discovery.wmnet:443'  https://$(dig +short thanos-fe1001.eqiad.wmnet):443/healthcheck 
HTTP/1.1 404 Not Found
date: Thu, 10 Mar 2022 18:03:37 GMT
server: envoy
transfer-encoding: chunked

I haven't dug into the linked PR to see if there's a feature flag to be enabled, but if so, we can now do so -- but only on thanos-fe and other upgraded hosts. Upgrading the rest of the fleet is still in progress.

Oh, yep, it's strip_matching_host_port in the HTTP connection manager: https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/network/http_connection_manager/v3/http_connection_manager.proto

We can flip that on for thanos-fe hosts at any time.

Change 769749 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] envoy: manage strip_matching_host_port setting and enable on thanos-fe

https://gerrit.wikimedia.org/r/769749

Hi all, with the Envoy upgrade and the new option, what's the next step ?

Change #769749 abandoned by Herron:

[operations/puppet@production] envoy: manage strip_matching_host_port setting and enable on thanos-fe

Reason:

spring cleaning -- didn't end up using this patch

https://gerrit.wikimedia.org/r/769749