
Envoy and swift HEAD with 204 response turns into 503
Closed, Resolved, Public

Description

I am debugging an interaction between Swift and Envoy. When fetching account statistics from Swift, a HEAD request is issued and Swift answers with a 204 plus headers (shown below). The request is fine from Swift's perspective, but the client talking to Envoy gets a 503.

thanos-fe1001:~#    curl -k -i https://thanos-swift.discovery.wmnet/v1/AUTH_tegola --resolve thanos-swift.discovery.wmnet:443:127.0.0.1 -I -H "X-Auth-Token: AUTH_XXX"
HTTP/1.1 503 Service Unavailable
content-length: 95
content-type: text/plain
date: Fri, 13 Aug 2021 12:11:16 GMT
server: envoy

This is the response from swift:

HTTP/1.1 204 No Content
Content-Type: text/plain; charset=utf-8
X-Account-Object-Count: 8787256
X-Account-Storage-Policy-Standard-Container-Count: 2
X-Timestamp: 1626432137.40592
X-Account-Storage-Policy-Standard-Object-Count: 8787256
X-Account-Bytes-Used: 26354037115
X-Account-Container-Count: 2
X-Account-Storage-Policy-Standard-Bytes-Used: 26354037115
Accept-Ranges: bytes
Vary: Accept
X-Trans-Id: txaf6619fbb7314c188b660-006116623d
X-Openstack-Request-Id: txaf6619fbb7314c188b660-006116623d
Date: Fri, 13 Aug 2021 12:14:53 GMT
Transfer-Encoding: chunked

Apparently what confuses Envoy is Transfer-Encoding (or Content-Length) on a 204 response. The band-aid is to set envoy.reloadable_features.strict_1xx_and_204_response_headers=false, which indeed does what it says:

thanos-fe1001:~# curl -X POST 'localhost:9631/runtime_modify?envoy.reloadable_features.strict_1xx_and_204_response_headers=false'
OK

Wait some time for the setting to take effect (?), then retry:

thanos-fe1001:~#    curl -k -i https://thanos-swift.discovery.wmnet/v1/AUTH_tegola --resolve thanos-swift.discovery.wmnet:443:127.0.0.1 -I -H "X-Auth-Token: AUTH_XXX"
HTTP/1.1 204 No Content
content-type: text/plain; charset=utf-8
x-account-object-count: 8787256
x-account-storage-policy-standard-container-count: 2
x-timestamp: 1626432137.40592
x-account-storage-policy-standard-object-count: 8787256
x-account-bytes-used: 26354037115
x-account-container-count: 2
x-account-storage-policy-standard-bytes-used: 26354037115
accept-ranges: bytes
vary: Accept
x-trans-id: txa82f71b6be5849e4b8515-0061166302
x-openstack-request-id: txa82f71b6be5849e4b8515-0061166302
date: Fri, 13 Aug 2021 12:18:10 GMT
x-envoy-upstream-service-time: 16
server: envoy
transfer-encoding: chunked

However, I haven't been able to figure out what I'm supposed to put in the config to make this setting permanent.

I've opened a bug with Swift upstream for further enlightenment: https://bugs.launchpad.net/swift/+bug/1939888
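
For anyone wanting to spot this kind of response elsewhere, here is a minimal Python sketch of the rule involved. RFC 7230 says a server must not send Transfer-Encoding or Content-Length on a 1xx or 204 response. The helper name forbidden_framing_headers is hypothetical, and this is not Envoy's implementation, which may be more lenient in some cases. It only checks the RFC rule against a set of headers such as the Swift response quoted above.

```python
# Sketch: flag framing headers that RFC 7230 forbids on 1xx and 204 responses.
# Hypothetical helper for illustration; not taken from Envoy or Swift.
FORBIDDEN_FRAMING_HEADERS = ("transfer-encoding", "content-length")


def forbidden_framing_headers(status, headers):
    """Return the framing headers that should not appear for this status.

    `headers` maps header names to values; names are compared
    case-insensitively.
    """
    if not (100 <= status < 200 or status == 204):
        return []
    present = {name.lower() for name in headers}
    return [name for name in FORBIDDEN_FRAMING_HEADERS if name in present]


# Subset of the Swift response quoted in the description.
swift_headers = {
    "Content-Type": "text/plain; charset=utf-8",
    "X-Account-Container-Count": "2",
    "Transfer-Encoding": "chunked",
}

print(forbidden_framing_headers(204, swift_headers))  # ['transfer-encoding']
```

The same headers on a 200 response return an empty list, since the restriction applies only to 1xx and 204.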

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2021-08-13T12:53:54Z] <godog> set runtime envoy.reloadable_features.strict_1xx_and_204_response_headers=false on thanos-fe* - T288815

Summarizing the discussion from IRC:

  • "Permanent" is relative -- it looks like this only exists as a runtime option for temporary backward compatibility, and it'll probably be removed in a future Envoy release. The upstream fix with Swift is the best way to go about handling this in the long term. (In the intermediate term, the setting gets renamed from strict_1xx_and_204_response_headers to require_strict_1xx_and_204_response_headers in Envoy 1.19, so we'll have to keep up with that if we're still using it then.)
  • But we can at least make it survive an Envoy restart: We can add the runtime option to our static config, so that it's picked up on startup -- the envoy.yaml stanza should look like this (at the top level, so layered_runtime is a sibling of static_resources):
layered_runtime:
  layers:
    - name: static_layer
      static_layer:
        envoy.reloadable_features.strict_1xx_and_204_response_headers: false
    # Include an empty "admin layer" *after* the static layer, so that we can continue to make changes via the admin console and they'll overwrite values from the previous layer.
    - name: admin_layer
      admin_layer: {}
  • We haven't used that config field previously, so we'll have to modify build_envoy_config.py to write it (presumably in a generic way that reads in zero or more runtime values from somewhere).
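
As a rough illustration of the "generic way" mentioned in the last bullet, here is a minimal Python sketch that turns zero or more runtime values into the layered_runtime stanza shown above. The function name build_layered_runtime and the placeholder config dict are assumptions; the real build_envoy_config.py is not reproduced here and may be structured quite differently.

```python
import json


def build_layered_runtime(runtime):
    """Build a layered_runtime stanza from a dict of runtime keys to values.

    Returns None when no runtime values are given, so the caller can omit
    the field entirely. Hypothetical helper for illustration only.
    """
    if not runtime:
        return None
    return {
        "layers": [
            {"name": "static_layer", "static_layer": dict(runtime)},
            # Empty admin layer last, so changes made via the admin
            # console still override values from the static layer.
            {"name": "admin_layer", "admin_layer": {}},
        ]
    }


# Placeholder for the rest of the generated config (assumption).
config = {"static_resources": {}}

layered = build_layered_runtime(
    {"envoy.reloadable_features.strict_1xx_and_204_response_headers": False}
)
if layered is not None:
    # layered_runtime sits at the top level, as a sibling of static_resources.
    config["layered_runtime"] = layered

# Printed as JSON only to keep the sketch free of a YAML dependency.
print(json.dumps(config, indent=2))
```

Returning None for an empty input keeps the generated config unchanged for services that set no runtime values.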

Change 713504 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] envoyproxy: Add $runtime field to set a static runtime layer.

https://gerrit.wikimedia.org/r/713504

Change 713725 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] thanos::frontend: Disable Envoy's strict 204 header parsing

https://gerrit.wikimedia.org/r/713725

Change 713815 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] swift: prefix Bullseye pipelines with proxy-logging

https://gerrit.wikimedia.org/r/713815

Change 713504 merged by RLazarus:

[operations/puppet@production] envoyproxy: Add $runtime field to set a static runtime layer.

https://gerrit.wikimedia.org/r/713504

Change 713815 merged by Filippo Giunchedi:

[operations/puppet@production] swift: prefix Bullseye pipelines with proxy-logging

https://gerrit.wikimedia.org/r/713815

Mentioned in SAL (#wikimedia-operations) [2021-08-20T08:48:31Z] <godog> roll depool/pool thanos-fe to apply swift change - T288815

I've deployed the fix from Swift upstream and it is working (i.e. Swift DTRT and Envoy's happy). @RLazarus I believe we're okay to resolve the task and abandon https://gerrit.wikimedia.org/r/713725 ?

Sounds good to me! That means the $runtime field isn't used anywhere yet, but I think it's a useful knob to have, so I'll leave it in place.

Change 713725 abandoned by RLazarus:

[operations/puppet@production] thanos::frontend: Disable Envoy's strict 204 header parsing

Reason:

https://gerrit.wikimedia.org/r/713725