Upgrade Envoy to supported version
Open, Medium, Public

Description

Right now we're running three versions of Envoy in production: 1.15.4 and 1.15.5 (mixed across appservers and misc services hosts), and 1.18.3 (on cp hosts for T271421). On Kubernetes we're using 1.15.4 and 1.18.3.

None of these versions is currently supported; we should update to either 1.20.x or 1.21.x. (All else being equal, we might as well just get all the way up-to-date and use 1.21, but if there are feature/compatibility issues, or just a preference to use the version with more miles on it, 1.20 is still viable.)

Prereqs:

  • Use v3 configuration API everywhere (done in https://gerrit.wikimedia.org/r/754460)
  • Check all the intermediate release notes for any other compatibility issues in our config that need to be resolved before we begin
  • Choose a target version, 1.20.x or 1.21.x. (I'll inquire with Traffic and with the API Gateway folks about any preference between the two.) Decision: we'll go with v1.21.3.

Here's one way the rollout might go, exact plan still TBD:

  • Update everything to 1.15.5, the current master version at operations/debs/envoyproxy - that is, clean up 1.15.4 first
  • Advance the master branch to 1.18.3 (the current envoy-future version)
  • Test 1.18.3 in the helm-linter image, to verify the config is compatible and check for deprecation warnings
  • Roll out 1.18.3 to all Envoy environments
  • Test 1.21.3 in the helm-linter image
  • Advance the envoy-future branch to 1.21.3
  • Roll out 1.21.3 to all environments that use envoy-future
  • Advance the master branch to 1.21.3
  • Roll out 1.21.3 everywhere
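
The branch ordering in the steps above can be sketched as a small simulation. This is only an illustration, assuming the two branches start at 1.15.5 (master) and 1.18.3 (envoy-future) as described; the helper names `parse_version` and `apply_steps` are hypothetical and not part of any existing tooling.

```python
# Sketch of the branch ordering in the rollout plan.
# Branch names and versions come from the plan; helper names are hypothetical.

def parse_version(v: str) -> tuple[int, ...]:
    """Turn '1.18.3' into (1, 18, 3) so versions compare numerically."""
    return tuple(int(part) for part in v.split("."))

# Starting point: master at 1.15.5, envoy-future at 1.18.3.
branches = {"master": "1.15.5", "envoy-future": "1.18.3"}

# Each step advances one branch of operations/debs/envoyproxy.
steps = [
    ("master", "1.18.3"),
    ("envoy-future", "1.21.3"),
    ("master", "1.21.3"),
]

def apply_steps(state: dict[str, str], steps: list[tuple[str, str]]) -> list[dict[str, str]]:
    """Apply each bump in order and record the branch state after every step."""
    state = dict(state)
    history = []
    for branch, version in steps:
        # A bump must never move a branch backwards.
        if parse_version(version) < parse_version(state[branch]):
            raise ValueError(f"{branch} would go backwards to {version}")
        state[branch] = version
        # envoy-future should always be at or ahead of master.
        if parse_version(state["envoy-future"]) < parse_version(state["master"]):
            raise ValueError("envoy-future fell behind master")
        history.append(dict(state))
    return history

history = apply_steps(branches, steps)
for snapshot in history:
    print(snapshot)
```

Comparing parsed tuples rather than raw strings matters here: as strings, "1.9.0" would sort after "1.15.5".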

Event Timeline

RLazarus triaged this task as Medium priority.Jan 28 2022, 1:10 AM
RLazarus created this task.

Thanks, this looks like an excellent plan. When we move to 1.18, I'd suggest starting with the thanos-fe cluster, where the upgrade would fix a real issue; see T300119#7670883

this looks great :) in traffic we're already using 1.18.3 from the envoy-future component, thanks @RLazarus

I think the question we need to answer with you (not immediately, but soon) is "do we go for 1.21 or do we go for 1.20 next"?

Based on the release notes I think the API gateway will most likely have no issue going straight to 1.21. If there are issues they will most likely be minor enough that we can adapt.

Hm, reprepro only has 1.15.4 in wikimedia-stretch, compared to 1.15.5 in buster and bullseye. I assume that's an oversight and not an intentional holdback, but so far I haven't been able to confirm for sure; @JMeybohm do you happen to know? I'll copy the 1.15.5 build to stretch unless there's a reason not to.

(That's separate from the fact that some buster machines are running 1.15.4 and showing upgradable in debmonitor -- I'll take care of that with a regular debdeploy run.)

I don't know for sure and would assume oversight as well. I also don't see a reason not to copy it over.

reprepro only has 1.15.4 in wikimedia-stretch, compared to 1.15.5 in buster and bullseye.

Correction: 1.15.5 is only in buster; both stretch and bullseye have 1.15.4. Going ahead with 1.15.4 -> 1.15.5 everywhere.

Mentioned in SAL (#wikimedia-operations) [2022-02-03T21:27:42Z] <rzl> root@apt1001:/home/rzl# reprepro copy stretch-wikimedia buster-wikimedia envoyproxy # T300324

Mentioned in SAL (#wikimedia-operations) [2022-02-03T21:27:59Z] <rzl> root@apt1001:/home/rzl# reprepro copy bullseye-wikimedia buster-wikimedia envoyproxy # T300324

Change 766208 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] miscweb: Update envoy to 1.15.5-1 in staging

https://gerrit.wikimedia.org/r/766208

Change 766209 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] miscweb: Update envoy to 1.15.5-1

https://gerrit.wikimedia.org/r/766209

Change 766208 merged by jenkins-bot:

[operations/deployment-charts@master] miscweb: Update envoy to 1.15.5-1 in staging

https://gerrit.wikimedia.org/r/766208

Change 766209 merged by jenkins-bot:

[operations/deployment-charts@master] miscweb: Update envoy to 1.15.5-1

https://gerrit.wikimedia.org/r/766209

Change 766840 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] kubernetes: Upgrade default envoy version to 1.15.5

https://gerrit.wikimedia.org/r/766840

Change 766842 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] miscweb: Restore envoy image_version to the inherited default

https://gerrit.wikimedia.org/r/766842

Change 766840 merged by RLazarus:

[operations/puppet@production] kubernetes: Upgrade default envoy version to 1.15.5

https://gerrit.wikimedia.org/r/766840

1.15.4 is still running in a few places on k8s -- after bumping the default version, I rolled out all services where that was the only diff. Some services had some undeployed changes from who knows how long ago, so I left them untouched (T265979 for that problem in general).

After we bump everything to 1.18 or maybe 1.21, I'll file a task with service owners to clean up the stragglers, but no real reason to do it now.
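
Finding those stragglers amounts to a version comparison over whatever inventory is available. A minimal sketch, assuming the inventory is a plain mapping of service name to Envoy package version; the service names below are made up for illustration, and `stragglers` is a hypothetical helper rather than an existing tool.

```python
# Sketch: list services still running an Envoy older than the target.
# The inventory format and service names are assumptions for illustration.

def parse_version(v: str) -> tuple[int, ...]:
    """Parse '1.15.5' or '1.15.5-1' into (1, 15, 5), dropping the Debian revision."""
    upstream = v.split("-")[0]
    return tuple(int(part) for part in upstream.split("."))

def stragglers(running: dict[str, str], target: str) -> list[str]:
    """Return the sorted names of services whose version is below the target."""
    wanted = parse_version(target)
    return sorted(name for name, v in running.items() if parse_version(v) < wanted)

# Hypothetical inventory: service name -> deployed Envoy version.
running = {
    "service-a": "1.15.4",
    "service-b": "1.18.3-1",
    "service-c": "1.15.5-1",
}

print(stragglers(running, "1.18.3"))
```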

Mentioned in SAL (#wikimedia-operations) [2022-03-08T16:34:33Z] <rzl> rzl@apt1001:~$ sudo -i reprepro -C main includedeb buster-wikimedia /home/rzl/envoyproxy_1.18.3-1_amd64.deb # reimporting from component/envoy-future into main, for T300324

Mentioned in SAL (#wikimedia-operations) [2022-03-08T20:36:25Z] <rzl> rzl@apt1001:~$ sudo -i reprepro copy stretch-wikimedia buster-wikimedia envoyproxy # T300324

Mentioned in SAL (#wikimedia-operations) [2022-03-08T20:36:34Z] <rzl> rzl@apt1001:~$ sudo -i reprepro copy bullseye-wikimedia buster-wikimedia envoyproxy # T300324

Change 769110 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/docker-images/production-images@master] envoy: Update to 1.18.3

https://gerrit.wikimedia.org/r/769110

Change 769110 merged by RLazarus:

[operations/docker-images/production-images@master] envoy: Update to 1.18.3

https://gerrit.wikimedia.org/r/769110

Change 769119 had a related patch set uploaded (by RLazarus; author: RLazarus):

[integration/config@master] helm-linter: Update envoyproxy to 1.18.3

https://gerrit.wikimedia.org/r/769119

Change 766842 merged by jenkins-bot:

[operations/deployment-charts@master] miscweb: Restore envoy image_version to the inherited default

https://gerrit.wikimedia.org/r/766842

Change 769119 merged by jenkins-bot:

[integration/config@master] helm-linter: Update envoyproxy to 1.18.3

https://gerrit.wikimedia.org/r/769119

Change 769419 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] jjb: update helm-lint for envoyproxy 1.18.3

https://gerrit.wikimedia.org/r/769419

Change 769419 merged by jenkins-bot:

[integration/config@master] jjb: update helm-lint for envoyproxy 1.18.3

https://gerrit.wikimedia.org/r/769419

Change 769477 had a related patch set uploaded (by RLazarus; author: RLazarus):

[integration/config@master] helm-linter: Set permissions for /var/log/envoy

https://gerrit.wikimedia.org/r/769477

Change 769477 merged by jenkins-bot:

[integration/config@master] helm-linter: Set permissions for /var/log/envoy

https://gerrit.wikimedia.org/r/769477

Change 769482 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] jjb: update helm-lint for /var/log/envoy perms

https://gerrit.wikimedia.org/r/769482

Change 769482 merged by jenkins-bot:

[integration/config@master] jjb: update helm-lint for /var/log/envoy perms

https://gerrit.wikimedia.org/r/769482

Change 769793 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] miscweb: Update envoy to 1.18.3-1

https://gerrit.wikimedia.org/r/769793

Change 769793 merged by jenkins-bot:

[operations/deployment-charts@master] miscweb: Update envoy to 1.18.3-1

https://gerrit.wikimedia.org/r/769793

Change 771053 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] kubernetes: Upgrade default envoy version to 1.18.3

https://gerrit.wikimedia.org/r/771053

Change 771053 merged by RLazarus:

[operations/puppet@production] kubernetes: Upgrade default envoy version to 1.18.3

https://gerrit.wikimedia.org/r/771053

Change 772451 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] miscweb: Restore envoy image_version to the inherited default

https://gerrit.wikimedia.org/r/772451

Change 772451 merged by jenkins-bot:

[operations/deployment-charts@master] miscweb: Restore envoy image_version to the inherited default

https://gerrit.wikimedia.org/r/772451

As in T300324#7752134, I've rolled out all the k8s services where Envoy version was the only diff. We're now up to 1.18 everywhere, except for k8s services with other undeployed changes, and I'll follow up with those at the end.

Hmm, the 1.21.1 build didn't work out of the box. Running build-envoy-deb buster future got me this:

[...]
./ci/run_envoy_docker.sh ./ci/do_ci.sh bazel.release.server_only
Unable to find image 'envoyproxy/envoy-build-ubuntu:16.04.6-wmf' locally
docker: Error response from daemon: manifest for envoyproxy/envoy-build-ubuntu:16.04.6-wmf not found.
See 'docker run --help'.
[...]

I see that we're requesting 16.04.6-wmf in debian/rules (setting IMAGE_TAG which is read by run_envoy_docker.sh), but I can't find an image with that tag anywhere, including in our registry. Did we have one previously?

From the comment in that file, it sounds like it's a specific version 0c4a26... which does still exist on Docker Hub, but if "(plus sudo)" means we modified it for our purposes, I'm not sure how to proceed -- I suppose I could try setting IMAGE_TAG to that hash, and finding out whether it can build. @JMeybohm any pointers?

(I had a flash of hope that maybe we didn't have any stretch machines running Envoy anymore, so we didn't have to worry about the compatibility issue, but no such luck.)

It seems I've created T265357 because of this. I'm not 100% sure, but I guess I might just have created said image tag locally (maybe even by just docker exec && docker commit) on whatever envoy build host was active at that time. Sorry for not following up on this.

eventstreams has been pinned to envoy 1.15.5 (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/805843). It produces Envoy config that is not v3-compatible ("Configuration does not parse cleanly as v3. v2 configuration is deprecated and will be removed from Envoy at the start of Q1 2021 ...") because it does not use common_templates; see https://phabricator.wikimedia.org/T310721
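
One way to catch this kind of config before a rollout is to scan Envoy's log output for the warning quoted above. A minimal sketch, assuming the log text has already been captured somewhere (for example from a config test run); the function name and the sample log lines are hypothetical, and only the warning prefix is taken from the message in this task.

```python
# Sketch: flag log lines that carry Envoy's v2-configuration deprecation warning.
# Only the warning prefix comes from the task; everything else is illustrative.

V2_WARNING = "Configuration does not parse cleanly as v3"

def find_v2_warnings(log_text: str) -> list[str]:
    """Return every log line that contains the v2 deprecation warning."""
    return [line for line in log_text.splitlines() if V2_WARNING in line]

# Hypothetical captured output; the first line is invented for the example.
sample_log = "\n".join([
    "loading configuration",
    "Configuration does not parse cleanly as v3. v2 configuration is deprecated",
])

hits = find_v2_warnings(sample_log)
print(len(hits))
```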