Upgrade Envoy to supported version
Closed, Resolved, Public

Description

Right now we're running Envoy 1.18.3 in production.

Work has been done on this task to update to 1.20.x or 1.21.x, and some config migrations are still open (see subtasks) that can and should be completed while we're on 1.18.x:

Prereqs:

  • Use v3 configuration API everywhere (done in https://gerrit.wikimedia.org/r/754460)
  • Check all the intermediate release notes for any other compatibility issues in our config that need to be resolved before we begin
  • Choose a new target version, 1.25.x or 1.26.x.

From the previous (1.18.x) upgrade:

  • Update everything to 1.15.5, the current master version at operations/debs/envoyproxy - that is, clean up 1.15.4 first
  • Advance the master branch to 1.18.3 (the current envoy-future version)
  • Test 1.18.3 in the helm-linter image, to verify the config is compatible and check for deprecation warnings
  • Roll out 1.18.3 to all Envoy environments

From bluntly trying to validate our current Envoy config on the appservers and on the mobileapps k8s deployment, I get:

Changelog:
v1.22

  • tls: set TLS v1.2 as the default minimal version for servers. Users can still explicitly opt-in to 1.0 and 1.1 using tls_minimum_protocol_version.
    • We set tls_minimum_protocol_version: TLSv1_2 everywhere, that can probably be removed
  • config: type URL is used to lookup extensions regardless of the name field. This may cause problems for empty filter configurations or mis-matched protobuf as the typed configurations. This behavioral change can be temporarily reverted by setting runtime guard envoy.reloadable_features.no_extension_lookup_by_name to false. T337405
  • http: validate upstream request header names and values. The new runtime flag envoy.reloadable_features.validate_upstream_headers can be used to revert this behavior.
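
To find the tls_minimum_protocol_version: TLSv1_2 settings that the v1.22 note suggests may be removable, a config that has already been parsed into Python objects can be walked recursively. This is only a sketch: the sample_config layout and the helper name find_redundant_tls_minimum are hypothetical and do not reflect our real charts or puppet templates. Whether each hit is really safe to drop still needs a manual look, since the changelog only talks about the server-side default.

```python
from typing import Any, Iterator

FIELD = "tls_minimum_protocol_version"
NEW_DEFAULT = "TLSv1_2"  # server-side default since v1.22, per the changelog entry


def find_redundant_tls_minimum(node: Any, path: str = "") -> Iterator[str]:
    """Yield the path of every FIELD entry that only restates the new default."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else str(key)
            if key == FIELD and value == NEW_DEFAULT:
                yield child
            else:
                yield from find_redundant_tls_minimum(value, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_redundant_tls_minimum(item, f"{path}[{index}]")


# Hypothetical, heavily trimmed config (NOT a real Envoy layout), already
# loaded from YAML/JSON into plain dicts and lists.
sample_config = {
    "listeners": [
        {"name": "tls_a", "tls_params": {FIELD: "TLSv1_2"}},
        {"name": "tls_b", "tls_params": {FIELD: "TLSv1_3"}},
    ]
}

redundant = list(find_redundant_tls_minimum(sample_config))
print(redundant)
```

Only the first listener is reported, because the second one asks for a stricter minimum than the default.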

v1.23

  • router: updated all HTTP filters to get per-filter config by the HTTP filter config name. If there is no entry referred by the filter config name, the canonical filter name (e.g., envoy.filters.http.buffer for the HTTP buffer filter) will be used for the backwards compatibility.
  • stats listener: fixed metric tag extraction so that stat_prefix is properly extracted. This changes the Prometheus name from envoy_listener_myprefix_downstream_cx_overflow{} to envoy_listener_downstream_cx_overflow{envoy_listener_address="myprefix"}. This does not affect the Prometheus name if stat_prefix is not set.
  • stats listener: fixed metric tag extraction so that worker_id is properly extracted from the listener stats. This changes the Prometheus name from envoy_listener_worker_1_downstream_cx_active{envoy_listener_address="0.0.0.0_10000"} to envoy_listener_downstream_cx_active{envoy_listener_address="0.0.0.0_10000", envoy_worker_id="1"}.
  • stats server: fixed metric tag extraction so that worker_id is properly extracted from the server stats. This changes the Prometheus name from envoy_server_worker_1_watchdog_miss{} to envoy_server_watchdog_miss{envoy_worker_id="1"}.
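
Dashboards and alerts that reference the old per-worker names would need the same rewrite. A minimal sketch of the v1.23 worker_id change only, using the two examples from the changelog entries above; the helper name to_v123_name is made up for illustration, it does not handle the stat_prefix case (that needs to know the prefix), and the v1.24 entry below changes per-worker names again.

```python
import re

# Pre-1.23 form: <scope>_worker_<N>_<stat>{<labels>}
_OLD_WORKER_STAT = re.compile(
    r"^(?P<scope>envoy_(?:listener|server))_worker_(?P<worker>\d+)_"
    r"(?P<stat>\w+)\{(?P<labels>[^}]*)\}$"
)


def to_v123_name(old_selector: str) -> str:
    """Move the worker number out of the metric name into an envoy_worker_id label."""
    match = _OLD_WORKER_STAT.match(old_selector)
    if match is None:
        return old_selector  # not a per-worker listener/server stat, leave untouched
    labels = [match["labels"]] if match["labels"] else []
    labels.append('envoy_worker_id="' + match["worker"] + '"')
    return match["scope"] + "_" + match["stat"] + "{" + ", ".join(labels) + "}"


print(to_v123_name(
    'envoy_listener_worker_1_downstream_cx_active{envoy_listener_address="0.0.0.0_10000"}'
))
print(to_v123_name("envoy_server_worker_1_watchdog_miss{}"))
```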

v1.24

  • stats http local_rate_limit: Fixed metric tag extraction so that stat_prefix is properly extracted. This changes the Prometheus name from envoy_http_local_rate_limit_myprefix_rate_limited{} to envoy_http_local_rate_limit_rate_limited{envoy_local_http_ratelimit_prefix="myprefix"}.
  • stats network local_rate_limit: Fixed metric tag extraction so that stat_prefix is properly extracted. This changes the Prometheus name from envoy_local_rate_limit_myprefix_rate_limited{} to envoy_local_rate_limit_rate_limited{envoy_local_ratelimit_prefix="myprefix"}.
  • stats: Default tag extraction rules were changed for worker_id extraction. Previously, worker_ was removed from the original name during the extraction. This led to the same base stat name for both the per-worker and overall stat. For instance, in Prometheus stats, the following stats were produced:
    • envoy_listener_downstream_cx_total{} 2
    • envoy_listener_downstream_cx_total{envoy_worker_id="0"} 1
    • envoy_listener_downstream_cx_total{envoy_worker_id="1"} 1
    This resulted in sum(envoy_listener_downstream_cx_total) producing 4, even though there are only 2 connections. The new behavior results in stats such as this:
    • envoy_listener_downstream_cx_total{} 2
    • envoy_listener_worker_downstream_cx_total{envoy_worker_id="0"} 1
    • envoy_listener_worker_downstream_cx_total{envoy_worker_id="1"} 1
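
The double counting described in the last v1.24 entry can be reproduced with plain arithmetic. The sketch below models the three samples from the changelog as dictionary entries and mimics what sum() over a metric name does in Prometheus; sum_by_name is a stand-in written for this illustration, not a Prometheus API.

```python
from typing import Dict, Tuple

Sample = Tuple[str, Tuple[Tuple[str, str], ...]]  # (metric name, labels)

# Before v1.24: overall and per-worker stats share one metric name.
old_samples: Dict[Sample, int] = {
    ("envoy_listener_downstream_cx_total", ()): 2,
    ("envoy_listener_downstream_cx_total", (("envoy_worker_id", "0"),)): 1,
    ("envoy_listener_downstream_cx_total", (("envoy_worker_id", "1"),)): 1,
}

# From v1.24 on: per-worker stats get their own metric name.
new_samples: Dict[Sample, int] = {
    ("envoy_listener_downstream_cx_total", ()): 2,
    ("envoy_listener_worker_downstream_cx_total", (("envoy_worker_id", "0"),)): 1,
    ("envoy_listener_worker_downstream_cx_total", (("envoy_worker_id", "1"),)): 1,
}


def sum_by_name(samples: Dict[Sample, int], name: str) -> int:
    """Add up every series with the given metric name, ignoring labels."""
    return sum(value for (metric, _labels), value in samples.items() if metric == name)


old_total = sum_by_name(old_samples, "envoy_listener_downstream_cx_total")
new_total = sum_by_name(new_samples, "envoy_listener_downstream_cx_total")
new_worker_total = sum_by_name(new_samples, "envoy_listener_worker_downstream_cx_total")
print(old_total, new_total, new_worker_total)
```

With the old naming the sum is 4 for 2 real connections; with the new naming both the overall and the per-worker sums come out as 2.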

v1.25

v1.26

  • nothing standing out

The upstream binaries of Envoy versions >= 1.24 are no longer compatible with buster due to libc version requirements. That's why we settled on 1.23.10 for now.

Things to do:

Details

Other Assignee
hnowlan
Project                                     Branch      Lines +/-   Subject
operations/puppet                           production  +1 -1
operations/deployment-charts                master      +677 -14 K
operations/deployment-charts                master      +447 -0
operations/deployment-charts                master      +0 -4
operations/deployment-charts                master      +1 -1
operations/puppet                           production  +2 -0
operations/puppet                           production  +1 -4
operations/docker-images/production-images  master      +7 -0
operations/deployment-charts                master      +0 -1
operations/deployment-charts                master      +2 -0
operations/deployment-charts                master      +2 -3
integration/config                          master      +1 -1
integration/config                          master      +10 -0
operations/deployment-charts                master      +2 -0
operations/deployment-charts                master      +3 -0
operations/deployment-charts                master      +8 -6
operations/deployment-charts                master      +29 -9
operations/deployment-charts                master      +7 -4
operations/deployment-charts                master      +72 -0
operations/docker-images/production-images  master      +40 -8
operations/docker-images/production-images  master      +2 -2
operations/debs/envoyproxy                  v1.23       +17 -9
operations/debs/envoyproxy                  v1.26       +31 -1
operations/docker-images/production-images  master      +1 -0
operations/docker-images/production-images  master      +12 -0
operations/puppet                           production  +2 -1
operations/deployment-charts                master      +1 K -14 K
operations/deployment-charts                master      +534 -80
operations/deployment-charts                master      +14 -4
operations/deployment-charts                master      +6 -2
operations/deployment-charts                master      +50 -43
operations/deployment-charts                master      +429 -0
integration/config                          master      +8 -1
operations/deployment-charts                master      +0 -4
operations/puppet                           production  +1 -1
operations/deployment-charts                master      +1 -0
integration/config                          master      +1 -1
integration/config                          master      +7 -1
integration/config                          master      +1 -1
integration/config                          master      +5 -0
operations/deployment-charts                master      +0 -4
operations/docker-images/production-images  master      +6 -0
operations/puppet                           production  +1 -1
operations/deployment-charts                master      +1 -0
operations/deployment-charts                master      +3 -0

Related Objects

Event Timeline

Change 919848 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Update charts from mesh.configuration 1.2.0 to 1.2.1

https://gerrit.wikimedia.org/r/919848

Change 919849 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Update charts from mesh.configuration 1.1 to 1.2

https://gerrit.wikimedia.org/r/919849

Change 919848 merged by jenkins-bot:

[operations/deployment-charts@master] Update charts from mesh.configuration 1.2.0 to 1.2.1

https://gerrit.wikimedia.org/r/919848

Change 919849 merged by jenkins-bot:

[operations/deployment-charts@master] Update charts from mesh.configuration 1.1 to 1.2

https://gerrit.wikimedia.org/r/919849

Change 916499 merged by JMeybohm:

[operations/puppet@production] envoyproxy: Add python 3.11 to tox

https://gerrit.wikimedia.org/r/916499

I've added a v1.26 branch to the envoyproxy repo with the upstream code removed and packaging the upstream binary instead:
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/envoyproxy/+/refs/heads/v1.26

As with kubernetes and istio, I chose a branch per minor version instead of "envoy-future" to make it clearer and to allow for easier upgrades of older versions while already running a newer one somewhere. I will go over the changelogs now, but it would be nice to have a second pair of eyes on the packaging proposal.
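
The branch-per-minor-version scheme boils down to a simple naming rule. As an illustration only (packaging_branch is a hypothetical helper, not something in the envoyproxy repo), this maps an upstream version to the branch it would be packaged on, matching the v1.26 and v1.23 branches used in this task:

```python
def packaging_branch(upstream_version: str) -> str:
    """Map an upstream version such as 1.26.1 to its per-minor branch name."""
    major, minor, *_patch = upstream_version.split(".")
    return f"v{major}.{minor}"


print(packaging_branch("1.26.1"))
print(packaging_branch("1.23.10"))
```

Patch releases of one minor version all land on the same branch, so an older minor can still be updated while a newer one is already in use elsewhere.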

A couple of comments (here, as this is an already pushed branch without a code review in Gerrit).

  • Is there a specific reason that debian/source/format says 1.0 instead of 3.0 (quilt)?
  • debian/changelog should have an entry to document the change in how we build the package now and that we are no longer trying to get the source code built.

Otherwise, LGTM

Change 922837 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/debs/envoyproxy@v1.26] Add README, enhance changelog and switch to source format 3

https://gerrit.wikimedia.org/r/922837

  • Is there a specific reason that debian/source/format says 1.0 instead of 3.0 (quilt)?
  • debian/changelog should have an entry to document the change in how we build the package now and that we are no longer trying to get the source code built.

Thanks. I chose 1.0 because it won't try to create patches for sources etc. I think "3.0 (native)" will do the same, so I'm changing that.

JMeybohm updated the task description.

Change 934335 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/docker-images/production-images@master] envoy-future: Update to 1.26.1 and add draining script

https://gerrit.wikimedia.org/r/934335

Mentioned in SAL (#wikimedia-operations) [2023-06-29T14:04:09Z] <jayme> imported envoyproxy 1.26.1 to component/envoy-future in buster-wikimedia - T300324

Change 934335 merged by JMeybohm:

[operations/docker-images/production-images@master] envoy-future: Update to 1.26.1 and add draining script

https://gerrit.wikimedia.org/r/934335

Change 934340 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/docker-images/production-images@master] envoy-future: Update to 1.26.1 and add draining script

https://gerrit.wikimedia.org/r/934340

Change 934340 merged by JMeybohm:

[operations/docker-images/production-images@master] envoy-future: Update to 1.26.1 and add draining script

https://gerrit.wikimedia.org/r/934340

Mentioned in SAL (#wikimedia-operations) [2023-06-29T14:46:59Z] <jayme> published image docker-registry.discovery.wmnet/envoy-future:1.26.1-1 - T300324

Change 922837 merged by JMeybohm:

[operations/debs/envoyproxy@v1.26] Add README, enhance changelog and switch to source format 3

https://gerrit.wikimedia.org/r/922837

Mentioned in SAL (#wikimedia-operations) [2023-06-30T07:52:33Z] <jayme> removed docker-registry.discovery.wmnet/envoy-future:1.26.1-1 - T300324

Since 1.24, Envoy requires libc 2.29, and buster only contains 2.28. Pulling back the image for now, as it won't work (it's based on buster) and we don't currently have a clear path forward here.
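
The incompatibility is a plain version comparison. A small sketch, using only the two version numbers stated above; the function names are made up here, and the host check at the end relies on Python's platform.libc_ver(), which only reports a usable value on glibc systems.

```python
import platform
from typing import Tuple

REQUIRED_LIBC = (2, 29)  # needed by upstream Envoy binaries since 1.24, per the comment above
BUSTER_LIBC = (2, 28)    # what buster ships, per the comment above


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as 2.28 into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def can_run_upstream_binary(
    available: Tuple[int, ...], required: Tuple[int, ...] = REQUIRED_LIBC
) -> bool:
    """True if the available libc is at least the required version."""
    return available >= required


print("buster:", can_run_upstream_binary(BUSTER_LIBC))

# Optional check of the local host; skipped where the libc cannot be determined.
lib, version = platform.libc_ver()
if lib == "glibc" and version.replace(".", "").isdigit():
    print("this host:", can_run_upstream_binary(parse_version(version)))
```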

Mentioned in SAL (#wikimedia-operations) [2023-06-30T08:00:56Z] <jayme> rolled back envoyproxy package in buster-wikimedia component/envoy-future to 1.18.3-1 - T300324

Change 934490 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/debs/envoyproxy@v1.23] New upstream version 1.23.10

https://gerrit.wikimedia.org/r/934490

Change 934490 merged by JMeybohm:

[operations/debs/envoyproxy@v1.23] New upstream version 1.23.10

https://gerrit.wikimedia.org/r/934490

Change 934494 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/docker-images/production-images@master] Downgrade envoy-future from 1.26 to 1.23

https://gerrit.wikimedia.org/r/934494

Change 934494 merged by JMeybohm:

[operations/docker-images/production-images@master] Downgrade envoy-future from 1.26 to 1.23

https://gerrit.wikimedia.org/r/934494

Change 934506 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/docker-images/production-images@master] envoy*: Fix envoy-basic-config, add tests

https://gerrit.wikimedia.org/r/934506

Change 934506 merged by JMeybohm:

[operations/docker-images/production-images@master] envoy*: Fix envoy-basic-config, add tests

https://gerrit.wikimedia.org/r/934506

Mentioned in SAL (#wikimedia-operations) [2023-06-30T11:14:17Z] <jayme> imported envoyproxy 1.23.10 to component/envoy-future in buster-wikimedia - T300324

Mentioned in SAL (#wikimedia-operations) [2023-06-30T11:15:20Z] <jayme> published image docker-registry.discovery.wmnet/envoy:1.18.3-2-s3 and docker-registry.discovery.wmnet/envoy-future:1.23.10-1-s1 - T300324

Change 934509 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[integration/config@master] helm-linter: Switch to envoy 1.23.10

https://gerrit.wikimedia.org/r/934509

Change 934512 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Add new mesh.deployment version 1.2.1

https://gerrit.wikimedia.org/r/934512

Change 934513 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] mesh.deployment: Allow to configure the envoy image name

https://gerrit.wikimedia.org/r/934513

Change 934514 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] mathoid: Update to mesh.deployment:1.2

https://gerrit.wikimedia.org/r/934514

Change 934520 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] mediawiki: Update to mesh.deployment 1.2.1

https://gerrit.wikimedia.org/r/934520

Change 934521 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] mathoid: Use envoy-future in staging

https://gerrit.wikimedia.org/r/934521

Change 934512 merged by jenkins-bot:

[operations/deployment-charts@master] Add new mesh.deployment version 1.2.1

https://gerrit.wikimedia.org/r/934512

Change 934513 merged by jenkins-bot:

[operations/deployment-charts@master] mesh.deployment: Allow to configure the envoy image name

https://gerrit.wikimedia.org/r/934513

Change 934514 merged by jenkins-bot:

[operations/deployment-charts@master] mathoid: Update to mesh.deployment:1.2

https://gerrit.wikimedia.org/r/934514

Change 934520 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Update to mesh.deployment 1.2.1

https://gerrit.wikimedia.org/r/934520

Change 934521 merged by jenkins-bot:

[operations/deployment-charts@master] mathoid: Use envoy-future in staging

https://gerrit.wikimedia.org/r/934521

Change 934580 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] mathoid: Enable telemetry

https://gerrit.wikimedia.org/r/934580

Change 934580 merged by jenkins-bot:

[operations/deployment-charts@master] mathoid: Enable telemetry

https://gerrit.wikimedia.org/r/934580

Change 934509 merged by jenkins-bot:

[integration/config@master] Docker: [helm-linter] Switch to envoy 1.23.10

https://gerrit.wikimedia.org/r/934509

Mentioned in SAL (#wikimedia-releng) [2023-06-30T18:33:39Z] <James_F> Docker: [helm-linter] Switch to envoy 1.23.10 for T300324

Change 934510 had a related patch set uploaded (by Jforrester; author: JMeybohm):

[integration/config@master] jjb: [helm-lint] Update to new helm-linter image with envoy 1.23.10

https://gerrit.wikimedia.org/r/934510

Change 934510 merged by jenkins-bot:

[integration/config@master] jjb: [helm-lint] Update to new helm-linter image with envoy 1.23.10

https://gerrit.wikimedia.org/r/934510

Change 934996 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] mathoid: Switch to envoy 1.23.10

https://gerrit.wikimedia.org/r/934996

Change 934998 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] mw-debug: Switch to envoy 1.23.10

https://gerrit.wikimedia.org/r/934998

Change 934996 merged by jenkins-bot:

[operations/deployment-charts@master] mathoid: Switch to envoy 1.23.10

https://gerrit.wikimedia.org/r/934996

Change 934998 merged by jenkins-bot:

[operations/deployment-charts@master] mw-debug: Switch to envoy 1.23.10

https://gerrit.wikimedia.org/r/934998

Change 935008 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] mw-debug: Remove envoy version override in codfw

https://gerrit.wikimedia.org/r/935008

Change 935008 merged by jenkins-bot:

[operations/deployment-charts@master] mw-debug: Remove envoy version override in codfw

https://gerrit.wikimedia.org/r/935008

Mentioned in SAL (#wikimedia-operations) [2023-07-03T10:42:28Z] <jayme> imported envoyproxy 1.23.10 to buster-wikimedia, bullseye-wikimedia, bookworm-wikimedia - T300324

Change 935073 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/docker-images/production-images@master] envoy: Promote 1.23.10 from envoy-future to envoy

https://gerrit.wikimedia.org/r/935073

Change 935073 merged by JMeybohm:

[operations/docker-images/production-images@master] envoy: Promote 1.23.10 from envoy-future to envoy

https://gerrit.wikimedia.org/r/935073

Change 935074 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] deployment_server::general: bump default envoy version to 1.23.10

https://gerrit.wikimedia.org/r/935074

Change 935097 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] kubernetes::deployment_server: Globally enable envoy telemetry

https://gerrit.wikimedia.org/r/935097

The mw and restbase canaries, as well as mathoid, have been running 1.23.10 since today. If nothing comes up, I will roll the update out to the rest of the fleet tomorrow.

Change 935074 merged by JMeybohm:

[operations/puppet@production] deployment_server::general: bump default envoy version to 1.23.10

https://gerrit.wikimedia.org/r/935074

Change 935097 merged by JMeybohm:

[operations/puppet@production] kubernetes::deployment_server: Globally enable envoy telemetry

https://gerrit.wikimedia.org/r/935097

Mentioned in SAL (#wikimedia-operations) [2023-07-04T09:38:44Z] <jayme> updated envoyproxy to 1.23.10 on all nodes - T300324

Change 935404 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] similar-users: Bump version

https://gerrit.wikimedia.org/r/935404

Change 935404 merged by jenkins-bot:

[operations/deployment-charts@master] similar-users: Bump version

https://gerrit.wikimedia.org/r/935404

JMeybohm closed subtask Restricted Task as Resolved. Jul 4 2023, 11:57 AM
JMeybohm closed subtask Restricted Task as Resolved.

Change 935435 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] mathoid: Switch back to default envoy

https://gerrit.wikimedia.org/r/935435

Change 935435 merged by jenkins-bot:

[operations/deployment-charts@master] mathoid,mw-debug: Switch back to default envoy

https://gerrit.wikimedia.org/r/935435

All nodes and most k8s deployments have been updated to run 1.23.10. The only exceptions are api-gateway and rest-gateway, which still run 1.18, and datahub (cc @BTullis), which I did not deploy because it has a huge diff I'm not able to reason about.

Change 935679 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Add mesh.configuration 1.3.2

https://gerrit.wikimedia.org/r/935679

Change 935702 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] mesh.configuration: Limit the total number of active connections

https://gerrit.wikimedia.org/r/935702

... as datahub (cc @BTullis ) which I did not deploy because it has a huge diff I'm not able to reason about.

Thanks ever so much and apologies about the current state of datahub on wikikube. I'll take care of deploying the new envoy version, once I've finished working on fixing the staging deployment of datahub.

Change 935754 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] mesh.configuration: Update all charts to 1.3.2

https://gerrit.wikimedia.org/r/935754

Change 935679 merged by jenkins-bot:

[operations/deployment-charts@master] Add mesh.configuration 1.3.2

https://gerrit.wikimedia.org/r/935679

Change 935754 merged by jenkins-bot:

[operations/deployment-charts@master] mesh.configuration: Update all charts to 1.3.2

https://gerrit.wikimedia.org/r/935754

Change 937042 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] envoy: Absent check for zero runtime changes

https://gerrit.wikimedia.org/r/937042

Change 937042 merged by JMeybohm:

[operations/puppet@production] envoy: Absent check for zero runtime changes

https://gerrit.wikimedia.org/r/937042

JMeybohm closed subtask Restricted Task as Resolved.