We're currently running Envoy 1.18.3 in production.
Work was done on this task to update to 1.20.x or 1.21.x, and some config migrations are still open (see subtasks) that can and should be completed while we're on 1.18.x:
- T303230: Refactor envoy HTTP protocol options to new version
- T303231: Refactor envoy access_log_path to access loggers
Prereqs:
- Use v3 configuration API everywhere (done in https://gerrit.wikimedia.org/r/754460)
- Check all the intermediate release notes for any other compatibility issues in our config that need to be resolved before we begin
- Choose a new target version: 1.25.x or 1.26.x.
From the previous (1.18.x) upgrade:
- Update everything to 1.15.5, the current master version at operations/debs/envoyproxy - that is, clean up 1.15.4 first
- Advance the master branch to 1.18.3 (the current envoy-future version)
- Test 1.18.3 in the helm-linter image, to verify the config is compatible and check for deprecation warnings
- Roll out 1.18.3 to all Envoy environments
From bluntly trying to validate our current Envoy config on the appservers and the mobileapps k8s deployment, I get:
- {T337405: Refactor envoy.filters.http.router and envoy.filters.listener.tls_inspector}
- [source/common/protobuf/message_validator_impl.cc:21] Deprecated field: type envoy.config.cluster.v3.Cluster Using deprecated option 'envoy.config.cluster.v3.Cluster.max_requests_per_connection' from file cluster.proto. T304124
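To collect warnings like the one above across several validation runs, a minimal sketch along these lines could help. It assumes the log format matches the line quoted above; the function name `find_deprecated_options` and the regular expression are illustrative and not part of Envoy or any existing tooling.

```python
import re

# Assumption: deprecation warnings look like the line quoted in this task, i.e.
# "... Using deprecated option '<fully.qualified.option>' from file <name>.proto."
DEPRECATED_RE = re.compile(r"Using deprecated option '([^']+)' from file (\S+?\.proto)")


def find_deprecated_options(log_text):
    """Return sorted, de-duplicated (option, proto_file) pairs found in log_text."""
    found = set()
    for line in log_text.splitlines():
        for match in DEPRECATED_RE.finditer(line):
            found.add((match.group(1), match.group(2)))
    return sorted(found)


if __name__ == "__main__":
    sample = (
        "[source/common/protobuf/message_validator_impl.cc:21] Deprecated field: "
        "type envoy.config.cluster.v3.Cluster Using deprecated option "
        "'envoy.config.cluster.v3.Cluster.max_requests_per_connection' "
        "from file cluster.proto."
    )
    for option, proto_file in find_deprecated_options(sample):
        print(f"{option} ({proto_file})")
```

Feeding it the combined validation output of every deployment would give one de-duplicated list of options that still need a migration subtask.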
Changelog:
v1.22
- tls: set TLS v1.2 as the default minimal version for servers. Users can still explicitly opt-in to 1.0 and 1.1 using tls_minimum_protocol_version.
- We set tls_minimum_protocol_version: TLSv1_2 everywhere, that can probably be removed
- config: type URL is used to lookup extensions regardless of the name field. This may cause problems for empty filter configurations or mis-matched protobuf as the typed configurations. This behavioral change can be temporarily reverted by setting runtime guard envoy.reloadable_features.no_extension_lookup_by_name to false. T337405
- http: validate upstream request header names and values. The new runtime flag envoy.reloadable_features.validate_upstream_headers can be used to revert this behavior.
- router: updated all HTTP filters to get per-filter config by the HTTP filter config name. If there is no entry referred by the filter config name, the canonical filter name (e.g., envoy.filters.http.buffer for the HTTP buffer filter) will be used for the backwards compatibility.
- stats listener: fixed metric tag extraction so that stat_prefix is properly extracted. This changes the Prometheus name from envoy_listener_myprefix_downstream_cx_overflow{} to envoy_listener_downstream_cx_overflow{envoy_listener_address="myprefix"}. This does not affect the Prometheus name if stat_prefix is not set.
- stats listener: fixed metric tag extraction so that worker_id is properly extracted from the listener stats. This changes the Prometheus name from envoy_listener_worker_1_downstream_cx_active{envoy_listener_address="0.0.0.0_10000"} to envoy_listener_downstream_cx_active{envoy_listener_address="0.0.0.0_10000", envoy_worker_id="1"}.
- stats server: fixed metric tag extraction so that worker_id is properly extracted from the server stats. This changes the Prometheus name from envoy_server_worker_1_watchdog_miss{} to envoy_server_watchdog_miss{envoy_worker_id="1"}.
- original_dst: ORIGINAL_DST cluster will not attempt to remove and drain the stale hosts during cleanup if they are still used by the connection pools. For HTTP pools, please set idle timeouts (https://www.envoyproxy.io/docs/envoy/v1.24.7/faq/configuration/timeouts#faq-configuration-connection-timeouts) to limit the duration of the upstream connections (the default value is 1h, and the recommended value is 5min). This behavior change can be reverted by setting runtime guard envoy.reloadable_features.original_dst_rely_on_idle_timeout.
- stats http local_rate_limit: Fixed metric tag extraction so that stat_prefix is properly extracted. This changes the Prometheus name from envoy_http_local_rate_limit_myprefix_rate_limited{} to envoy_http_local_rate_limit_rate_limited{envoy_local_http_ratelimit_prefix="myprefix"}.
- stats network local_rate_limit: Fixed metric tag extraction so that stat_prefix is properly extracted. This changes the Prometheus name from envoy_local_rate_limit_myprefix_rate_limited{} to envoy_local_rate_limit_rate_limited{envoy_local_ratelimit_prefix="myprefix"}.
- stats: Default tag extraction rules were changed for worker_id extraction. Previously, worker_ was removed from the original name during the extraction. This led to the same base stat name for both the per-worker and overall stat. For instance, in Prometheus stats, the following stats were produced: envoy_listener_downstream_cx_total{} 2, envoy_listener_downstream_cx_total{envoy_worker_id="0"} 1, envoy_listener_downstream_cx_total{envoy_worker_id="1"} 1. This resulted in sum(envoy_listener_downstream_cx_total) producing 4, even though there are only 2 connections. The new behavior results in stats such as this: envoy_listener_downstream_cx_total{} 2, envoy_listener_worker_downstream_cx_total{envoy_worker_id="0"} 1, envoy_listener_worker_downstream_cx_total{envoy_worker_id="1"} 1.
- tcp: added idle_timeout (https://www.envoyproxy.io/docs/envoy/v1.25.6/api-v3/extensions/upstreams/tcp/v3/tcp_protocol_options.proto#envoy-v3-api-field-extensions-upstreams-tcp-v3-tcpprotocoloptions-idle-timeout) to support per client idle timeout for tcp connection pool. The timeout is guarded by envoy.reloadable_features.tcp_pool_idle_timeout and timeout defaults to 10 minutes if runtime flag is enabled.
- tls: added support for intermediate CA as trusted CA. The peer certificate issued by an intermediate CA will be trusted by building valid partial chain. Before, it could not be verified without trusting its ancestor root CA and building a full chain. See trusted_ca (https://www.envoyproxy.io/docs/envoy/v1.25.6/api-v3/extensions/transport_sockets/tls/v3/common.proto#envoy-v3-api-field-extensions-transport-sockets-tls-v3-certificatevalidationcontext-trusted-ca). This change can be reverted via the runtime flag envoy.reloadable_features.enable_intermediate_ca.
- nothing standing out
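The double counting described in the worker_id tag extraction note above can be reproduced with a few lines of arithmetic. This is only an illustration: the sample values are the ones quoted in the changelog entry, and `sum_by_name` is a hypothetical helper that mimics what a label-agnostic sum over one metric name does.

```python
from collections import defaultdict


def sum_by_name(samples):
    """Sum sample values per metric name, ignoring labels."""
    totals = defaultdict(int)
    for name, _labels, value in samples:
        totals[name] += value
    return dict(totals)


# Old behavior: per-worker and overall stats share one base name.
old_samples = [
    ("envoy_listener_downstream_cx_total", {}, 2),
    ("envoy_listener_downstream_cx_total", {"envoy_worker_id": "0"}, 1),
    ("envoy_listener_downstream_cx_total", {"envoy_worker_id": "1"}, 1),
]

# New behavior: per-worker stats keep "worker" in the metric name.
new_samples = [
    ("envoy_listener_downstream_cx_total", {}, 2),
    ("envoy_listener_worker_downstream_cx_total", {"envoy_worker_id": "0"}, 1),
    ("envoy_listener_worker_downstream_cx_total", {"envoy_worker_id": "1"}, 1),
]

if __name__ == "__main__":
    print(sum_by_name(old_samples))
    print(sum_by_name(new_samples))
```

With the old names, the overall total and the per-worker values land in the same sum (4 for 2 connections). With the new names, each metric sums to 2 on its own. Any dashboards or alerts that aggregate these metrics may need their queries checked after the upgrade.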
The upstream binaries of Envoy versions >= 1.24 are no longer compatible with buster due to libc version requirements. That's why we settled on 1.23.10 for now.
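A quick way to see whether a given host would clear such a libc requirement is to compare its glibc version against the minimum a binary needs. The sketch below is a rough illustration only: the exact glibc version required by upstream Envoy binaries is not stated in this task, so `REQUIRED_GLIBC` is a hypothetical placeholder that would have to be replaced with the real value.

```python
import platform


def parse_version(text):
    """Turn a dotted version string such as '2.28' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def libc_is_sufficient(available, required):
    """Return True if the available libc version is at least the required one."""
    return parse_version(available) >= parse_version(required)


if __name__ == "__main__":
    # Hypothetical placeholder, NOT the actual requirement of any Envoy release.
    REQUIRED_GLIBC = "2.30"
    lib, version = platform.libc_ver()
    if lib == "glibc" and version:
        ok = libc_is_sufficient(version, REQUIRED_GLIBC)
        print(f"glibc {version}: {'ok' if ok else 'too old'} for {REQUIRED_GLIBC}")
    else:
        print("could not detect glibc on this host")
```

Comparing parsed tuples rather than raw strings avoids the usual pitfall where "2.9" sorts after "2.28" lexically.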
Things to do:
- Check changelogs of 1.22 to 1.26
- Come up with a new method of packaging envoy
- Build new envoy packages
- Build new envoy-future container
- Update envoy as in https://wikitech.wikimedia.org/wiki/Envoy#Update_envoy