
Upgrade Envoy to v1.29.12
Closed, Resolved (Public)

Description

As of this writing we're still concluding T402584 for the 1.23 -> 1.26 bump, but in parallel we can start planning the next step, 1.26 -> 1.29.

Release notes of potential interest (1.27, 1.28, 1.29)

Stats changes

These may need dashboard updates if they're in use anywhere.

  • (1.29.0) stats connection_limit: Fixed tag extraction so that stat_prefix is properly extracted. This changes the Prometheus name from envoy_connection_limit_myprefix_limited_connections{} to envoy_connection_limit_limited_connections{envoy_connection_limit_prefix="myprefix"}.
  • (1.27.0) stats tls: Fixed metric tag extraction so that TLS parameters are properly extracted from the stats, both for listeners and clusters. This changes the Prometheus names from envoy_listener_ssl_ciphers_ECDHE_RSA_AES128_GCM_SHA256{envoy_listener_address="0.0.0.0_10000"} to envoy_listener_ssl_ciphers{envoy_listener_address="0.0.0.0_10000", envoy_ssl_cipher="ECDHE_RSA_AES128_GCM_SHA256"}, and similar for envoy_listener_ssl_versions_TLSv1_2, envoy_cluster_ssl_versions_TLSv1_2, envoy_listener_ssl_curves_P_256, envoy_cluster_ssl_curves_P_256, envoy_listener_ssl_sigalgs_rsa_pss_rsae_sha256.
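
For dashboard queries that still use the old TLS metric names, the renaming above can be sketched as a small helper. This is only an illustration under assumptions: the function name split_tls_metric is hypothetical, and only the envoy_ssl_cipher label is confirmed by the release note; the other label names are guesses that should be checked against a live /stats/prometheus endpoint before use.

```python
import re

# Metric families named in the 1.27.0 release note. Only "envoy_ssl_cipher"
# appears in the note; the other three label names are ASSUMPTIONS.
TLS_LABELS = {
    "ssl_ciphers": "envoy_ssl_cipher",
    "ssl_versions": "envoy_ssl_version",  # assumed label name
    "ssl_curves": "envoy_ssl_curve",      # assumed label name
    "ssl_sigalgs": "envoy_ssl_sigalg",    # assumed label name
}

_OLD_NAME = re.compile(
    r"^(envoy_(?:listener|cluster)_(ssl_ciphers|ssl_versions|ssl_curves|ssl_sigalgs))_(\w+)$"
)


def split_tls_metric(old_name):
    """Split a pre-1.27 TLS metric name into (new_name, extra_labels).

    Names that do not look like an old-style TLS metric are returned
    unchanged with no extra labels.
    """
    match = _OLD_NAME.match(old_name)
    if match is None:
        return old_name, {}
    base, family, value = match.groups()
    return base, {TLS_LABELS[family]: value}


name, labels = split_tls_metric("envoy_listener_ssl_ciphers_ECDHE_RSA_AES128_GCM_SHA256")
print(name, labels)
```

The connection_limit rename is not covered here, because the stat_prefix sits in the middle of the old name and cannot be split out without knowing the configured prefixes.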

Config changes pre-upgrade

We should make sure we're using canonical filter names everywhere, in which case this downgrading change has no effect. If we do have any non-canonical names set, before upgrading we should either modify the config or make sure this runtime flag is set for whichever behavior we want (probably the default).

  • (1.29.0) http: Flip runtime flag envoy.reloadable_features.no_downgrade_to_canonical_name to true. Name downgrading in the per filter config searching will be disabled by default. This behavior can be temporarily reverted by setting the flag to false explicitly. See doc Http filter route specific config or issue https://github.com/envoyproxy/envoy/issues/29461 for more specific detail and examples.
  • (1.28.0) http: Introduced a new runtime flag envoy.reloadable_features.no_downgrade_to_canonical_name to disable the name downgrading in the per filter config searching. See doc Http filter route specific config or issue https://github.com/envoyproxy/envoy/issues/29461 for more specific detail and examples.
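
One way to audit for this before upgrading is to look for per-route filter config keys that do not match any configured filter name. The sketch below assumes that, with name downgrading disabled, a typed_per_filter_config key is only found when it equals the name given to the filter in http_filters. It takes an HTTP connection manager config already parsed into a Python dict; the function name is hypothetical, and it only inspects virtual hosts and routes inside an inline route_config, so it is not exhaustive.

```python
def find_unmatched_per_filter_keys(hcm_config):
    """Return (virtual_host_name, key) pairs whose typed_per_filter_config
    key matches no filter name in http_filters.

    Assumption: such keys previously worked only via the canonical-name
    fallback and would stop applying once the fallback is disabled.
    """
    filter_names = {f["name"] for f in hcm_config.get("http_filters", [])}
    unmatched = []
    route_config = hcm_config.get("route_config", {})
    for vhost in route_config.get("virtual_hosts", []):
        # Per-filter config can be attached to the virtual host or to a route.
        scopes = [vhost] + list(vhost.get("routes", []))
        for scope in scopes:
            for key in scope.get("typed_per_filter_config", {}):
                if key not in filter_names:
                    unmatched.append((vhost.get("name", "?"), key))
    return unmatched


example = {
    "http_filters": [{"name": "my-lua"}, {"name": "envoy.filters.http.router"}],
    "route_config": {
        "virtual_hosts": [
            {
                "name": "vh",
                "routes": [
                    {"typed_per_filter_config": {"envoy.filters.http.lua": {}}}
                ],
            }
        ]
    },
}
print(find_unmatched_per_filter_keys(example))
```

An empty result suggests the flag flip has no effect for that config; any reported key is worth checking by hand.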

Config changes post-upgrade

We should update all our configs for these deprecated fields, before they're removed, but we can do that after the upgrade.

  • (1.28.0) listener: deprecated runtime key overload.global_downstream_max_connections in favor of downstream connections monitor.
  • (1.27.0) health_check: deprecated the HealthCheck event_log_path in favor of HealthCheck event_logger extension.

We'll need to start setting this field too, although I'll need to dig into what's appropriate based on what we're using it for (i.e. what Envoy features are actually restricted to trusted addresses only). If necessary, we could just set it to the RFC1918 ranges and call it a day; that's the old behavior, and it's only changing due to multi-tenant environment considerations, which aren't an issue for us.

  • (1.29.9) http: The default configuration of Envoy will continue to trust internal addresses while in the future it will not trust them by default. If you have tooling such as probes on your private network which need to be treated as trusted (e.g. changing arbitrary x-envoy headers) please explicitly include those addresses or CIDR ranges into internal_address_config. See the config examples from the above internal_address_config link. The future no-trust default for internal addresses can be turned on now by setting runtime guard envoy.reloadable_features.explicit_internal_address_config to true.
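
If the RFC1918 fallback is chosen, the value for internal_address_config could be generated along these lines. This is a minimal sketch: it emits cidr_ranges entries with address_prefix and prefix_len, covers only the three RFC1918 blocks named above, and the helper names are hypothetical. Whether loopback or IPv6 ranges are also needed is left open.

```python
import ipaddress

# The three private IPv4 blocks from RFC 1918.
RFC1918 = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"]


def internal_address_config(cidrs):
    """Build a dict shaped like internal_address_config with cidr_ranges."""
    ranges = []
    for cidr in cidrs:
        net = ipaddress.ip_network(cidr)
        ranges.append(
            {"address_prefix": str(net.network_address), "prefix_len": net.prefixlen}
        )
    return {"cidr_ranges": ranges}


def is_internal(addr, cidrs=RFC1918):
    """Return True if addr falls inside any of the given CIDR ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in ipaddress.ip_network(c) for c in cidrs)


print(internal_address_config(RFC1918))
print(is_internal("10.64.0.1"), is_internal("8.8.8.8"))
```

The is_internal helper is only there to sanity-check a candidate list against known probe or client addresses before it goes into a chart.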

HTTP/1 and HTTP/2 parser changes

Probably no effect, especially since our Envoys receive no untrusted traffic, but documenting in case of edge-case behavior changes. (Note the net effect of the middle two is that oghttp2 is off by default, as it was previously, so it shouldn't make a difference yet unless an issue is caused by some of the surrounding code to support it. The change to BalsaParser for HTTP/1.1 traffic is in effect.)

  • (1.29.3) http2: Simplifies integration with the codec by removing translation between nghttp2 callbacks and Http2VisitorInterface events. Guarded by envoy.reloadable_features.http2_skip_callback_visitor.
  • (1.29.2) http2: Changes the default value of envoy.reloadable_features.http2_use_oghttp2 to false. This changes the codec used for HTTP/2 requests and responses. A number of users have reported issues with oghttp2 including issue 32611 and issue 32401. This behavior can be reverted by setting the feature to true.
  • (1.29.0) http2: Changes the default value of envoy.reloadable_features.http2_use_oghttp2 to true. This changes the codec used for HTTP/2 requests and responses. This behavior can be reverted by setting the feature to false.
  • (1.28.0) http: Switch from http_parser to BalsaParser for handling HTTP/1.1 traffic. See https://github.com/envoyproxy/envoy/issues/21245 for details. This behavioral change can be reverted by setting runtime flag envoy.reloadable_features.http1_use_balsa_parser to false.
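
The net effect described above can be checked by replaying the default changes in version order. The sketch below encodes only the three defaults stated in these release notes; the pre-1.28 baseline of both flags being off is an assumption inferred from the text, and http2_skip_callback_visitor is left out because the note does not state its default.

```python
BALSA = "envoy.reloadable_features.http1_use_balsa_parser"
OGHTTP2 = "envoy.reloadable_features.http2_use_oghttp2"

# (version where the default changed, flag, new default), from the notes above.
FLAG_CHANGES = [
    ((1, 28, 0), BALSA, True),
    ((1, 29, 0), OGHTTP2, True),
    ((1, 29, 2), OGHTTP2, False),
]

# ASSUMPTION: both flags defaulted to off before 1.28.0.
BASELINE = {BALSA: False, OGHTTP2: False}


def effective_defaults(version, baseline=BASELINE):
    """Apply every default change at or before `version`, oldest first."""
    defaults = dict(baseline)
    for changed_in, flag, value in sorted(FLAG_CHANGES):
        if changed_in <= version:
            defaults[flag] = value
    return defaults


print(effective_defaults((1, 29, 12)))
```

For 1.29.12 this yields BalsaParser on and oghttp2 off, which matches the summary in the paragraph above.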

HTTP/TLS behavior

Probably no effect, but documenting in case of edge-case behavior changes.

  • (1.29.0) http2: Discard the Host header if the :authority header was received to bring Envoy into compliance with https://www.rfc-editor.org/rfc/rfc9113#section-8.3.1. This behavioral change can be reverted by setting runtime flag envoy.reloadable_features.http2_discard_host_header to false.
  • (1.28.0) tls: changed ssl failure reason format in ssl socket for a better handling. It can be disabled by the runtime guard envoy.reloadable_features.ssl_transport_failure_reason_format.
  • (1.28.0) http: Close HTTP/2 and HTTP/3 connections that prematurely reset streams. The runtime key overload.premature_reset_min_stream_lifetime_seconds determines the interval where received stream reset is considered premature (with 1 second default). The runtime key overload.premature_reset_total_stream_count, with the default value of 500, determines the number of requests received from a connection before the check for premature resets is applied. The connection is disconnected if more than 50% of resets are premature, or if the number of suspect streams is already large enough to guarantee that more than 50% of the streams will be suspect upon reaching the total stream threshold (even if all the remaining streams are considered benign). Setting the runtime key envoy.restart_features.send_goaway_for_premature_rst_streams to false completely disables this check.
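
The premature-reset rule is dense in prose, so here is a rough model of the decision as the release note describes it. It is a reading of the note, not Envoy's implementation: the function and parameter names are hypothetical, the defaults are the ones quoted above (1 second, 500 streams), and the exact boundary handling at 50% is an assumption.

```python
def is_premature(stream_lifetime_seconds, min_lifetime_seconds=1.0):
    """A reset counts as premature if the stream lived less than the minimum."""
    return stream_lifetime_seconds < min_lifetime_seconds


def should_disconnect(closed_streams, premature_resets, total_stream_count=500):
    """Decide whether a connection would be closed under the described rule.

    closed_streams: streams seen so far on the connection.
    premature_resets: how many of those were reset prematurely.
    """
    if closed_streams == 0:
        return False
    if closed_streams < total_stream_count:
        # Below the threshold: close early only if the suspect streams already
        # guarantee more than 50% once the threshold is reached.
        return premature_resets * 2 > total_stream_count
    # At or past the threshold: close if more than 50% were premature.
    return premature_resets * 2 > closed_streams


print(should_disconnect(300, 251), should_disconnect(1000, 300))
```

Under this model a well-behaved internal client that occasionally cancels requests stays far below the limit, which fits the "probably no effect" assessment above.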

This update also fixes the following security issues:

Fixed in 1.30.2 / 1.29.5 / 1.28.4 / 1.27.6:

CVE-2024-34362: Crash (use-after-free) in EnvoyQuicServerStream
https://github.com/envoyproxy/envoy/security/advisories/GHSA-hww5-43gv-35jv

CVE-2024-34363: Crash due to uncaught nlohmann JSON exception
https://github.com/envoyproxy/envoy/security/advisories/GHSA-g979-ph9j-5gg4

CVE-2024-34364: Envoy OOM vector from HTTP async client with unbounded response buffer for mirror response, and other components
https://github.com/envoyproxy/envoy/security/advisories/GHSA-xcj3-h7vf-fw26

CVE-2024-32974: Crash in EnvoyQuicServerStream::OnInitialHeadersComplete()
https://github.com/envoyproxy/envoy/security/advisories/GHSA-mgxp-7hhp-8299

CVE-2024-32975: Crash in QuicheDataReader::PeekVarInt62Length()
https://github.com/envoyproxy/envoy/security/advisories/GHSA-g9mq-6v96-cpqc

CVE-2024-32976: Endless loop while decompressing Brotli data with extra input
https://github.com/envoyproxy/envoy/security/advisories/GHSA-7wp5-c2vq-4f8m

CVE-2024-23326: Envoy incorrectly accepts HTTP 200 response for entering upgrade mode
https://github.com/envoyproxy/envoy/security/advisories/GHSA-vcf8-7238-v74c

Fixed in 1.30.4 / 1.29.7 / 1.28.5 / 1.27.7:

Use after free when route hash policy is configured with cookie attributes (CVE-2024-39305)
https://github.com/envoyproxy/envoy/security/advisories/GHSA-fp35-g349-h66f
https://github.com/envoyproxy/envoy/commit/02a06681fbe0e039b1c7a9215257a7537eddb518
https://github.com/envoyproxy/envoy/commit/50b384cb203a1f2894324cbae64b6d9bc44cce45
https://github.com/envoyproxy/envoy/commit/99b6e525fb9f6f6f19a0425f779bc776f121c7e5
https://github.com/envoyproxy/envoy/commit/b7f509607ad860fd6a63cde4f7d6f0197f9f63bb

Fixed in 1.31.2 / 1.30.6 / 1.29.9 / 1.28.7:

Potential to manipulate x-envoy headers from external sources (CVE-2024-45806)
https://github.com/envoyproxy/envoy/security/advisories/GHSA-ffhv-fvxq-r6mf

Oghttp2 crash on OnBeginHeadersForStream (CVE-2024-45807)
https://github.com/envoyproxy/envoy/security/advisories/GHSA-qc52-r4x5-9w37

Malicious log injection via access logs (CVE-2024-45808)
https://github.com/envoyproxy/envoy/security/advisories/GHSA-p222-xhp9-39rc

JWT filter crash in the clear route cache with remote JWKs (CVE-2024-45809)
https://github.com/envoyproxy/envoy/security/advisories/GHSA-wqr5-qmq7-3qw3

Envoy crashes for LocalReply in HTTP async client (CVE-2024-45810)
https://github.com/envoyproxy/envoy/security/advisories/GHSA-qm74-x36m-555q

Fixed in 1.32.3 / 1.31.5 / 1.30.9 / 1.29.12:

HTTP/1.1 multiple issues with envoy.reloadable_features.http1_balsa_delay_reset (CVE-2024-53271)
https://github.com/envoyproxy/envoy/security/advisories/GHSA-rmm5-h2wv-mg4f
https://github.com/envoyproxy/envoy/commit/da56f6da63079baecef9183436ee5f4141a59af8

HTTP/1: sending overload crashes when the request is reset beforehand (CVE-2024-53270)
https://github.com/envoyproxy/envoy/security/advisories/GHSA-q9qv-8j52-77p3
https://github.com/envoyproxy/envoy/pull/37743/commits/6cf8afda956ba67c9afad185b962325a5242ef02

Happy Eyeballs: Validate that additional_address are IP addresses instead of crashing when sorting (CVE-2024-53269)
https://github.com/envoyproxy/envoy/security/advisories/GHSA-mfqp-7mmj-rm53
https://github.com/envoyproxy/envoy/pull/37743/commits/3f62168d86aceb90f743f63b50cc711710b1c401
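
To check whether a given installed version includes all four fix batches listed above, a comparison along these lines could be used. This is a sketch: the data is copied from the "Fixed in" headings, the function name is hypothetical, and release branches not named in a heading are conservatively reported as missing that batch, since the list above does not say either way.

```python
# One dict per "Fixed in" heading above: (major, minor) -> first fixed patch.
FIX_BATCHES = [
    {(1, 30): 2, (1, 29): 5, (1, 28): 4, (1, 27): 6},
    {(1, 30): 4, (1, 29): 7, (1, 28): 5, (1, 27): 7},
    {(1, 31): 2, (1, 30): 6, (1, 29): 9, (1, 28): 7},
    {(1, 32): 3, (1, 31): 5, (1, 30): 9, (1, 29): 12},
]


def missing_batches(version):
    """Return indexes of fix batches not confirmed present in `version`.

    Accepts upstream versions ("1.29.12") and Debian-style package
    versions ("1.29.12-1").
    """
    upstream = version.split("-")[0]
    major, minor, patch = (int(part) for part in upstream.split("."))
    missing = []
    for index, batch in enumerate(FIX_BATCHES):
        first_fixed = batch.get((major, minor))
        if first_fixed is None or patch < first_fixed:
            missing.append(index)
    return missing


print(missing_batches("1.29.12-1"))
print(missing_batches("1.29.9"))
```

An empty list for 1.29.12-1 reflects that 1.29.12 is at or above every 1.29.x fix release named above.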

Event Timeline


Change #1185232 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/docker-images/production-images@master] envoy-future: Update to v1.29.12

https://gerrit.wikimedia.org/r/1185232

Change #1185232 merged by RLazarus:

[operations/docker-images/production-images@master] envoy-future: Update to v1.29.12

https://gerrit.wikimedia.org/r/1185232

Change #1186056 had a related patch set uploaded (by RLazarus; author: RLazarus):

[integration/config@master] helm-linter: Bump for envoy 1.29.12

https://gerrit.wikimedia.org/r/1186056

Change #1186057 had a related patch set uploaded (by RLazarus; author: RLazarus):

[integration/config@master] jjb: Update to helm-linter:0.7.4 to pick up envoy-future 1.29.12

https://gerrit.wikimedia.org/r/1186057

Change #1186056 merged by jenkins-bot:

[integration/config@master] helm-linter: Bump for envoy 1.29.12

https://gerrit.wikimedia.org/r/1186056

Change #1186057 merged by jenkins-bot:

[integration/config@master] jjb: Update to helm-linter:0.7.4 to pick up envoy-future 1.29.12

https://gerrit.wikimedia.org/r/1186057

Mentioned in SAL (#wikimedia-operations) [2025-09-08T23:31:12Z] <rzl> helmfile -e eqiad -i apply --set mesh.image_name=envoy-future --set mesh.image_version=1.29.12-1 --context=5 # T403663

Change #1186099 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] mathoid: Upgrade to Envoy 1.29.12

https://gerrit.wikimedia.org/r/1186099

Change #1186099 merged by jenkins-bot:

[operations/deployment-charts@master] mathoid: Upgrade to Envoy 1.29.12

https://gerrit.wikimedia.org/r/1186099

Change #1186676 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] {api,rest}-gateway: Upgrade to Envoy 1.29.12 in staging

https://gerrit.wikimedia.org/r/1186676

Change #1186676 merged by jenkins-bot:

[operations/deployment-charts@master] {api,rest}-gateway: Upgrade to Envoy 1.29.12 in staging

https://gerrit.wikimedia.org/r/1186676

Mentioned in SAL (#wikimedia-operations) [2025-09-10T23:25:51Z] <rzl> sudo -i reprepro -C main includedeb bullseye-wikimedia /srv/wikimedia/pool/component/envoy-future/e/envoyproxy/envoyproxy_1.29.12-1_amd64.deb # T403663

Mentioned in SAL (#wikimedia-operations) [2025-09-10T23:26:05Z] <rzl> sudo -i reprepro copy bookworm-wikimedia bullseye-wikimedia envoyproxy # T403663

Mentioned in SAL (#wikimedia-operations) [2025-09-10T23:26:12Z] <rzl> sudo -i reprepro copy trixie-wikimedia bullseye-wikimedia envoyproxy # T403663

Change #1187134 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/docker-images/production-images@master] envoy: Update to v1.29.12

https://gerrit.wikimedia.org/r/1187134

Change #1187134 merged by RLazarus:

[operations/docker-images/production-images@master] envoy: Update to v1.29.12

https://gerrit.wikimedia.org/r/1187134

Mentioned in SAL (#wikimedia-operations) [2025-09-17T07:02:57Z] <moritzm> upgrading Envoy on debmonitor T403663

Mentioned in SAL (#wikimedia-operations) [2025-09-17T08:27:58Z] <moritzm> upgrading Envoy on IDM hosts T403663

Mentioned in SAL (#wikimedia-operations) [2025-09-17T08:42:18Z] <moritzm> upgrading Envoy on deployment hosts T403663

Mentioned in SAL (#wikimedia-operations) [2025-09-17T14:07:29Z] <moritzm> upgrading Envoy on IDP hosts T403663

Mentioned in SAL (#wikimedia-operations) [2025-09-17T17:23:15Z] <mutante> upgrading envoyproxy on releases* hosts T403663

Mentioned in SAL (#wikimedia-operations) [2025-09-17T17:24:26Z] <mutante> upgrading envoyproxy on doc* hosts T403663

Mentioned in SAL (#wikimedia-operations) [2025-09-17T17:29:07Z] <mutante> upgrading envoyproxy on zuul* hosts T403663

Mentioned in SAL (#wikimedia-operations) [2025-09-17T17:35:57Z] <mutante> upgrading envoyproxy on planet* and people* hosts T403663

Mentioned in SAL (#wikimedia-operations) [2025-09-17T18:07:52Z] <mutante> upgrading envoyproxy on etherpad* and stewards* hosts T403663

Dzahn changed the task status from Open to In Progress. Sep 17 2025, 11:12 PM

Mentioned in SAL (#wikimedia-operations) [2025-09-18T14:34:36Z] <moritzm> upgrading Envoy on cloudweb hosts T403663

Mentioned in SAL (#wikimedia-operations) [2025-09-18T17:15:11Z] <mutante> upgrading envoyproxy on aphlict1002 (active phab notifications) and contint2002 (active CI) T403663

Mentioned in SAL (#wikimedia-operations) [2025-09-18T17:19:00Z] <mutante> upgrading envoyproxy on lists1004 (active lists server) T403663

Mentioned in SAL (#wikimedia-operations) [2025-09-18T17:32:43Z] <mutante> upgrading envoyproxy on vrts1003 (active ticket.wikimedia.org ) T403663

Mentioned in SAL (#wikimedia-operations) [2025-09-18T23:33:48Z] <mutante> upgrading envoyproxy on production phabricator (phab1004) - T403663

All services owned by collaboration-services have been upgraded to 1.29.12-1.

 sudo cumin 'A:owner-collaboration-services' 'dpkg -l | grep envoyproxy'

.. 45 hosts ...

.. 1.29.12-1 ..

Mentioned in SAL (#wikimedia-operations) [2025-09-22T08:02:34Z] <moritzm> upgrading Envoy on webperf hosts T403663

Mentioned in SAL (#wikimedia-operations) [2025-09-22T11:41:40Z] <moritzm> upgrading Envoy on puppetboard hosts T403663

Change #1190363 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] mw-*: Upgrade to Envoy 1.29.12 in the MW canary releases and mw-debug

https://gerrit.wikimedia.org/r/1190363

Change #1190363 merged by jenkins-bot:

[operations/deployment-charts@master] mw-*: Upgrade to Envoy 1.29.12 in the MW canary releases and mw-debug

https://gerrit.wikimedia.org/r/1190363

Mentioned in SAL (#wikimedia-operations) [2025-09-23T00:37:48Z] <rzl@deploy1003> Finished scap sync-world: https://gerrit.wikimedia.org/r/1190363 T403663 (duration: 05m 12s)

Mentioned in SAL (#wikimedia-operations) [2025-09-23T16:20:08Z] <denisse> Upgrade Envoy to v1.29.12 on grafana hosts - T403663

Mentioned in SAL (#wikimedia-operations) [2025-09-23T16:22:05Z] <denisse> Upgrade Envoy to v1.29.12 on logstash hosts - T403663

Mentioned in SAL (#wikimedia-operations) [2025-09-23T16:32:15Z] <denisse> Upgrade Envoy to v1.29.12 on graphite hosts - T403663

Mentioned in SAL (#wikimedia-operations) [2025-09-23T16:37:21Z] <denisse> Upgrade Envoy to v1.29.12 on prometheus hosts - T403663

Mentioned in SAL (#wikimedia-operations) [2025-09-23T16:39:54Z] <denisse> Upgrade Envoy to v1.29.12 on prometheus::pop hosts - T403663

Mentioned in SAL (#wikimedia-operations) [2025-09-23T16:40:53Z] <denisse> Upgrade Envoy to v1.29.12 on titan hosts - T403663

Change #1190376 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] {api,rest}-gateway: Upgrade to Envoy 1.29.12 in production

https://gerrit.wikimedia.org/r/1190376

Change #1190376 merged by jenkins-bot:

[operations/deployment-charts@master] {api,rest}-gateway: Upgrade to Envoy 1.29.12 in production

https://gerrit.wikimedia.org/r/1190376

Mentioned in SAL (#wikimedia-operations) [2025-09-24T10:23:34Z] <claime> Upgraded envoy to v1.29.12 on api-gateway and rest-gateway - T403663

Mentioned in SAL (#wikimedia-operations) [2025-09-24T13:27:10Z] <moritzm> upgrade Envoy on puppet servers T403663

Dzahn triaged this task as High priority. Sep 24 2025, 5:24 PM

This seems like it's treated as High priority. That being said, I am not sure if clinic duty is still supposed to make decisions on priority at all.

> That being said, I am not sure if clinic duty is still supposed to make decisions on priority at all.

It's not; the only thing clinic duty needs to do is add team-specific tags to tasks that are only tagged with SRE. Prioritisation is done on the team-specific workboards.

Change #1191522 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] mw-*: Upgrade to Envoy 1.29.12

https://gerrit.wikimedia.org/r/1191522

Change #1191523 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] mw-videoscaler: Upgrade to Envoy 1.29.12

https://gerrit.wikimedia.org/r/1191523

Change #1191526 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] kubernetes: Set default Envoy version to 1.29.12

https://gerrit.wikimedia.org/r/1191526

Mentioned in SAL (#wikimedia-operations) [2025-09-29T06:37:48Z] <moritzm> upgrade Envoy on chartmuseum hosts T403663

Mentioned in SAL (#wikimedia-operations) [2025-09-29T07:37:11Z] <moritzm> upgrade Envoy on config-master* T403663

Mentioned in SAL (#wikimedia-operations) [2025-09-29T10:53:07Z] <moritzm> upgrade Envoy on an-web1001 T403663

Change #1191522 merged by jenkins-bot:

[operations/deployment-charts@master] mw-*: Upgrade to Envoy 1.29.12

https://gerrit.wikimedia.org/r/1191522

Mentioned in SAL (#wikimedia-operations) [2025-09-29T21:46:37Z] <rzl@deploy2002> Finished scap sync-world: https://gerrit.wikimedia.org/r/1191522 T403663 (duration: 06m 44s)

Change #1191523 merged by jenkins-bot:

[operations/deployment-charts@master] mw-videoscaler: Upgrade to Envoy 1.29.12

https://gerrit.wikimedia.org/r/1191523

The RESTBase cluster has been upgraded to v1.29.12 (sorry for the delay, I was out all last week and missed the message).

Mentioned in SAL (#wikimedia-operations) [2025-10-06T12:08:28Z] <moritzm> upgrade Envoy on yarn/turnilo hosts T403663

Change #1191526 merged by RLazarus:

[operations/puppet@production] kubernetes: Set default Envoy version to 1.29.12

https://gerrit.wikimedia.org/r/1191526