Right now we're running Envoy 1.23.10 in production. There is no immediate urgency to update, but while passing by I wanted to preserve the changelog parsing already done in T300324 in a new, open task.
The upstream Envoy binaries for versions >= 1.24 are no longer compatible with buster due to libc version requirements. That's why we settled on 1.23.10 for now.
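To make the libc constraint concrete, here is a minimal sketch of a version check. buster ships glibc 2.28; the minimum version an upstream binary actually needs is not stated in this task, so the `2.30` used in the usage note is a purely hypothetical placeholder, and the function names are illustrative.

```python
import platform


def parse_version(text):
    """Turn a dotted version string such as '2.28' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def libc_is_new_enough(installed, required):
    """Return True if the installed libc version satisfies the requirement."""
    return parse_version(installed) >= parse_version(required)


def installed_glibc():
    """Return the glibc version of the running interpreter, or None if unknown."""
    lib, version = platform.libc_ver()
    return version if lib == "glibc" and version else None
```

For example, `libc_is_new_enough("2.28", "2.30")` is `False`, which is the shape of the problem on buster if the real requirement is above 2.28.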
Work was previously done on this task toward updating to 1.20.x or 1.21.x, and some config migrations are still open (see subtasks) that can/should be completed while we're on 1.18.x:
- T303230: Refactor envoy HTTP protocol options to new version
- T303231: Refactor envoy access_log_path to access loggers
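As an illustration of what the access_log_path migration could look like, here is a rough sketch that rewrites a config held as a Python dict. It assumes the subtask concerns the `admin.access_log_path` field of the bootstrap config, which may not match our templates. The helper name and the log path in the usage note are hypothetical.

```python
import copy

# Type URL of Envoy's file access logger extension.
FILE_ACCESS_LOG_TYPE = (
    "type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog"
)


def migrate_admin_access_log(bootstrap):
    """Return a copy of the config with admin.access_log_path moved to access_log."""
    migrated = copy.deepcopy(bootstrap)
    admin = migrated.get("admin", {})
    path = admin.pop("access_log_path", None)
    if path is not None:
        # Append rather than overwrite, in case access loggers are already set.
        admin.setdefault("access_log", []).append(
            {
                "name": "envoy.access_loggers.file",
                "typed_config": {"@type": FILE_ACCESS_LOG_TYPE, "path": path},
            }
        )
    return migrated
```

Calling it on `{"admin": {"access_log_path": "/var/log/envoy/admin.log"}}` yields an `admin.access_log` list with one file logger pointing at the same path, and leaves the input untouched.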
Prereqs:
- Choose a new target version
- Check all the intermediate release notes for any other compatibility issues in our config that need to be resolved before we begin
- Check changelogs
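Once a target is chosen, the release lines whose notes need reading can be listed mechanically. A small sketch, assuming we stay within the same major version; the target `1.26.0` in the usage note is only an example, not a decision.

```python
def minor_releases_to_review(current, target):
    """List the minor release lines after `current`, up to and including `target`."""
    cur_major, cur_minor = (int(part) for part in current.split(".")[:2])
    tgt_major, tgt_minor = (int(part) for part in target.split(".")[:2])
    if cur_major != tgt_major:
        raise ValueError("this sketch only handles upgrades within one major version")
    if tgt_minor < cur_minor:
        raise ValueError("target is older than the current version")
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]
```

With `current="1.23.10"` and `target="1.26.0"` this returns the 1.24, 1.25 and 1.26 lines, each of which would need its release notes checked against our config.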
Changelog (copied from T300324):
v1.24
- original_dst: ORIGINAL_DST cluster will not attempt to remove and drain the stale hosts during cleanup if they are still used by the connection pools. For HTTP pools, please set idle timeouts (https://www.envoyproxy.io/docs/envoy/v1.24.7/faq/configuration/timeouts#faq-configuration-connection-timeouts) to limit the duration of the upstream connections (the default value is 1h, and the recommended value is 5min). This behavior change can be reverted by setting runtime guard envoy.reloadable_features.original_dst_rely_on_idle_timeout to false.
- stats http local_rate_limit: Fixed metric tag extraction so that stat_prefix is properly extracted. This changes the Prometheus name from envoy_http_local_rate_limit_myprefix_rate_limited{} to envoy_http_local_rate_limit_rate_limited{envoy_local_http_ratelimit_prefix="myprefix"}.
- stats network local_rate_limit: Fixed metric tag extraction so that stat_prefix is properly extracted. This changes the Prometheus name from envoy_local_rate_limit_myprefix_rate_limited{} to envoy_local_rate_limit_rate_limited{envoy_local_ratelimit_prefix="myprefix"}.
- stats: Default tag extraction rules were changed for worker_id extraction. Previously, worker_ was removed from the original name during the extraction. This led to the same base stat name for both the per-worker and overall stat. For instance, in Prometheus stats, the following stats were produced: envoy_listener_downstream_cx_total{} 2, envoy_listener_downstream_cx_total{envoy_worker_id="0"} 1, envoy_listener_downstream_cx_total{envoy_worker_id="1"} 1. This resulted in sum(envoy_listener_downstream_cx_total) producing 4, even though there are only 2 connections. The new behavior results in stats such as this: envoy_listener_downstream_cx_total{} 2, envoy_listener_worker_downstream_cx_total{envoy_worker_id="0"} 1, envoy_listener_worker_downstream_cx_total{envoy_worker_id="1"} 1.
- tcp: added idle_timeout (https://www.envoyproxy.io/docs/envoy/v1.25.6/api-v3/extensions/upstreams/tcp/v3/tcp_protocol_options.proto#envoy-v3-api-field-extensions-upstreams-tcp-v3-tcpprotocoloptions-idle-timeout) to support a per-client idle timeout for the TCP connection pool. The timeout is guarded by envoy.reloadable_features.tcp_pool_idle_timeout and defaults to 10 minutes if the runtime flag is enabled.
- tls: added support for intermediate CA as trusted CA. The peer certificate issued by an intermediate CA will be trusted by building a valid partial chain. Before, it could not be verified without trusting its ancestor root CA and building a full chain. See trusted_ca: https://www.envoyproxy.io/docs/envoy/v1.25.6/api-v3/extensions/transport_sockets/tls/v3/common.proto#envoy-v3-api-field-extensions-transport-sockets-tls-v3-certificatevalidationcontext-trusted-ca. This change can be reverted via the runtime flag envoy.reloadable_features.enable_intermediate_ca.
- nothing standing out
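The worker_id entry in the changelog above matters for any dashboard or alert that sums listener stats. The following sketch replays the sample values from that entry in plain Python to show the double counting; `sum_metric` is an illustrative stand-in for a PromQL `sum()` over one metric name, not a real query API.

```python
# (metric name, labels, value) samples copied from the changelog example.
OLD_SAMPLES = [
    ("envoy_listener_downstream_cx_total", {}, 2),
    ("envoy_listener_downstream_cx_total", {"envoy_worker_id": "0"}, 1),
    ("envoy_listener_downstream_cx_total", {"envoy_worker_id": "1"}, 1),
]
NEW_SAMPLES = [
    ("envoy_listener_downstream_cx_total", {}, 2),
    ("envoy_listener_worker_downstream_cx_total", {"envoy_worker_id": "0"}, 1),
    ("envoy_listener_worker_downstream_cx_total", {"envoy_worker_id": "1"}, 1),
]


def sum_metric(samples, name):
    """Add up every sample of one metric name, regardless of labels."""
    return sum(value for metric, _labels, value in samples if metric == name)


old_total = sum_metric(OLD_SAMPLES, "envoy_listener_downstream_cx_total")
new_total = sum_metric(NEW_SAMPLES, "envoy_listener_downstream_cx_total")
per_worker = sum_metric(NEW_SAMPLES, "envoy_listener_worker_downstream_cx_total")
```

Here `old_total` is 4 while `new_total` and `per_worker` are both 2. Queries that compensated for the old double counting, or that filter on envoy_worker_id under the old metric name, would need to be revisited after the upgrade.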
Things to do:
- Build new envoy packages
- Build new envoy-future container
- Update envoy as in https://wikitech.wikimedia.org/wiki/Envoy#Update_envoy