Upgrade Envoy to >= 1.24
Closed, Resolved (Public)

Description

Right now we're running Envoy 1.23.10 in production. There is no immediate urgency to update, but I was passing by and wanted to preserve the changelog parsing already done in T300324 in a new, open task.

The upstream binaries of Envoy versions >= 1.24 are no longer compatible with buster due to libc version requirements. That's why we settled on 1.23.10 for now.

Work has already been done on this task toward updating to 1.20.x or 1.21.x, and some config migrations remain open (see subtasks) that can and should be completed while we're on 1.18.x:

Prereqs:

  • Choose a new target version
  • Check all the intermediate release notes for any other compatibility issues in our config that need to be resolved before we begin
  • Check changelogs
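As a rough aid for the "intermediate release notes" step, the minor releases between the running version and a candidate target can be enumerated. This is only a sketch: the helper name `intermediate_minors` is made up for illustration, and the target `1.26` is picked only because the changelog notes here stop at v1.26, not because it is the chosen target.

```python
def intermediate_minors(current: str, target: str) -> list[str]:
    """Return every minor release after `current`, up to and including `target`.

    Hypothetical helper: it only compares major.minor and ignores patch levels.
    """
    cur_major, cur_minor = (int(p) for p in current.split(".")[:2])
    tgt_major, tgt_minor = (int(p) for p in target.split(".")[:2])
    if cur_major != tgt_major:
        raise ValueError("sketch only handles upgrades within one major version")
    return [f"{cur_major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]


# Running 1.23.10 with an illustrative target of 1.26:
print(intermediate_minors("1.23.10", "1.26"))  # ['1.24', '1.25', '1.26']
```

Each entry in the resulting list is a set of release notes to read before the jump.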

Changelog (copied from T300324):
v1.24

  • stats http local_rate_limit: Fixed metric tag extraction so that stat_prefix is properly extracted. This changes the Prometheus name from envoy_http_local_rate_limit_myprefix_rate_limited{} to envoy_http_local_rate_limit_rate_limited{envoy_local_http_ratelimit_prefix="myprefix"}.
  • stats network local_rate_limit: Fixed metric tag extraction so that stat_prefix is properly extracted. This changes the Prometheus name from envoy_local_rate_limit_myprefix_rate_limited{} to envoy_local_rate_limit_rate_limited{envoy_local_ratelimit_prefix="myprefix"}.
  • stats: Default tag extraction rules were changed for worker_id extraction. Previously, worker_ was removed from the original name during the extraction. This led to the same base stat name for both the per-worker and overall stat. For instance, in Prometheus stats, the following stats were produced:
      envoy_listener_downstream_cx_total{} 2
      envoy_listener_downstream_cx_total{envoy_worker_id="0"} 1
      envoy_listener_downstream_cx_total{envoy_worker_id="1"} 1
    This resulted in sum(envoy_listener_downstream_cx_total) producing 4, even though there are only 2 connections. The new behavior results in stats such as this:
      envoy_listener_downstream_cx_total{} 2
      envoy_listener_worker_downstream_cx_total{envoy_worker_id="0"} 1
      envoy_listener_worker_downstream_cx_total{envoy_worker_id="1"} 1
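To make the v1.24 stats changes concrete for anyone auditing dashboards or alerts, here is a small sketch. It sums the sample series from the changelog by metric name, before and after the worker_id change, and builds the new-style selector for the HTTP local rate limit stat. The function names and the simplified line parser are assumptions for illustration only; the parser is not a full Prometheus exposition parser and nothing here is part of Envoy or Prometheus tooling.

```python
import re
from collections import defaultdict

# Sample series copied from the v1.24 changelog entry (old vs. new naming).
OLD_SAMPLES = """\
envoy_listener_downstream_cx_total{} 2
envoy_listener_downstream_cx_total{envoy_worker_id="0"} 1
envoy_listener_downstream_cx_total{envoy_worker_id="1"} 1
"""

NEW_SAMPLES = """\
envoy_listener_downstream_cx_total{} 2
envoy_listener_worker_downstream_cx_total{envoy_worker_id="0"} 1
envoy_listener_worker_downstream_cx_total{envoy_worker_id="1"} 1
"""

# Simplified matcher: metric name, a label block in braces, then a value.
LINE_RE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)\{[^}]*\}\s+(\S+)$")


def sum_by_name(text):
    """Add up all sample values per metric name, like sum(<name>) would."""
    totals = defaultdict(float)
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            totals[match.group(1)] += float(match.group(2))
    return dict(totals)


def http_local_rate_limit_selector(prefix):
    """Build the post-1.24 selector for a given stat_prefix (hypothetical helper)."""
    return (
        "envoy_http_local_rate_limit_rate_limited"
        '{envoy_local_http_ratelimit_prefix="%s"}' % prefix
    )


old = sum_by_name(OLD_SAMPLES)
new = sum_by_name(NEW_SAMPLES)
print(old["envoy_listener_downstream_cx_total"])         # 4.0 (double-counted)
print(new["envoy_listener_downstream_cx_total"])         # 2.0
print(new["envoy_listener_worker_downstream_cx_total"])  # 2.0
print(http_local_rate_limit_selector("myprefix"))
```

Any query that relied on the old per-prefix metric name, or that summed the listener stat across workers, would need a corresponding adjustment after the upgrade.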

v1.25

v1.26

  • nothing standing out

Things to do:

Event Timeline

jijiki triaged this task as Medium priority. Nov 20 2024, 4:57 PM
jijiki moved this task from Incoming 🐫 to 🛠 Upgrades and Hardware on the serviceops board.

Ideally we'd need to go to 1.33 or later with this work. I have not scoped how many more complications this will entail, but for WE5.1.3 and future rate limiting work we'll need the added rate limiting features that these versions introduce.

RLazarus subscribed.

Ideally we'd need to go to 1.33 or later with this work

Ack. We're running 1.23 and the current release is 1.35, so there's quite a gap to cover -- I'll look into whether it makes more sense to do this in one big jump (given the nontrivial effort involved in each upgrade) or several smaller ones (given the delta in config versions, especially as we skip over entire deprecation periods). But the WE5.1.3 dependency on 1.33 is a good callout, thanks.

Change #1180904 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] aptrepo: Add envoy-future component for bullseye

https://gerrit.wikimedia.org/r/1180904

Change #1180904 merged by RLazarus:

[operations/puppet@production] aptrepo: Add envoy-future component for bullseye

https://gerrit.wikimedia.org/r/1180904

ssingh subscribed.

For awareness: I checked with @RLazarus and am removing the Traffic tag. We can add it back later as required.

Change #1187036 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/deployment-charts@master] otelcol: fix service name munging post-Envoy upgrade

https://gerrit.wikimedia.org/r/1187036

Change #1187036 merged by jenkins-bot:

[operations/deployment-charts@master] otelcol: fix service name munging post-Envoy upgrade

https://gerrit.wikimedia.org/r/1187036

Change #1188456 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] deployment_server: Add a script for mass-deploying helmfile services

https://gerrit.wikimedia.org/r/1188456

Change #1188456 merged by RLazarus:

[operations/puppet@production] deployment_server: Add a script for mass-deploying helmfile services

https://gerrit.wikimedia.org/r/1188456

Still some hosts remaining to upgrade to 1.35 in T410975, but we don't need this umbrella task open to track the multi-stage upgrade anymore.