Upgrade all TLS enabled charts to v0.2 tls_helper
Closed, Resolved · Public

Description

When all charts are migrated, we can:

  • discontinue maintenance for the "envoy-tls-local-proxy"
  • remove "envoy-tls-local-proxy" from git
  • maybe remove the images as well (so they are not used by accident)?

Charts using v0.1 _tls_helper:

  • blubberoid
  • citoid
  • cxserver
  • wikifeeds
  • chromium-render
  • mobileapps
  • recommendation-api

Special charts that don't use the common helper:

  • changeprop/templates/deployment.yaml: image: {{ .Values.docker.registry }}/envoy-tls-local-proxy:{{ .Values.tls.image_version }}
  • eventgate/templates/_tls_helpers.tpl: image: {{ .Values.docker.registry }}/envoy-tls-local-proxy:{{ .Values.tls.image_version }}
  • eventstreams/templates/_tls_helpers.tpl: image: {{ .Values.docker.registry }}/envoy-tls-local-proxy:{{ .Values.tls.image_version }}

To figure out if envoy-tls-local-proxy is still running somewhere:

kubectl get pods --all-namespaces -o go-template --template='{{range .items}}{{$n := .metadata.name}}{{range .spec.containers}}{{$n}}{{":\t"}}{{.image}}{{"\n"}}{{end}}{{end}}' | grep envoy-tls-local
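
A minimal sketch of how the output of the command above could be reduced to a unique list of affected pods. The function name pods_using_image is hypothetical; it only assumes the "pod:<TAB>image" line format that the go-template above prints. Like the original command, it only sees regular containers, not init containers.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "pod:<TAB>image" lines on stdin and prints the
# unique pod names that have at least one container image containing $1.
pods_using_image() {
    local pattern=$1
    # Split each line on the colon+tab separator; index() does a plain
    # substring match, so no regex escaping of the pattern is needed.
    awk -F $':\t' -v pat="$pattern" 'index($2, pat) { print $1 }' | sort -u
}

# Usage with the command from the task description (needs cluster access):
# kubectl get pods --all-namespaces -o go-template \
#   --template='{{range .items}}{{$n := .metadata.name}}{{range .spec.containers}}{{$n}}{{":\t"}}{{.image}}{{"\n"}}{{end}}{{end}}' \
#   | pods_using_image envoy-tls-local
```

An empty result would mean that no running pod still references the image.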

Event Timeline

Restricted Application added a subscriber: Aklapper. · May 22 2020, 5:05 PM

Change 598759 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] blubberoid: Update to v0.2 helpers

https://gerrit.wikimedia.org/r/598759

Change 598760 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] charts: Update to v0.2 helpers

https://gerrit.wikimedia.org/r/598760

Change 598766 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] cxserver: Update to v0.2 helpers

https://gerrit.wikimedia.org/r/598766

Change 598774 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] wikifeeds: Update to v0.2 helpers

https://gerrit.wikimedia.org/r/598774

Change 598777 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] chromium-render: Update to v0.2 helpers

https://gerrit.wikimedia.org/r/598777

Change 598779 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] mobileapps: Update to v0.2 helpers

https://gerrit.wikimedia.org/r/598779

Change 598780 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] recommendation-api: Update to v0.2 helpers

https://gerrit.wikimedia.org/r/598780

I've used something like this to update the charts:

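function update_helper() {
    # $1: chart directory; run from the directory that contains the charts.
    # Uses yq v3 syntax (yq r / yq w).
    local CHART version new_version
    local -a a
    CHART=$(basename "$1")
    git checkout -b "${CHART}_tls_v0.2" master

    pushd "$CHART"

    # Bump the patch level of the chart version (assumes MAJOR.MINOR.PATCH)
    version=$(yq r Chart.yaml version)
    a=( ${version//./ } )
    a[2]=$((a[2] + 1))
    new_version="${a[0]}.${a[1]}.${a[2]}"
    yq w -i Chart.yaml version "$new_version"

    # Re-point every symlinked template at its common_templates/0.2 counterpart
    pushd templates
    find . -type l -exec bash -c 'rm "$0"; ln -s "../../../common_templates/0.2/$(basename "$0")"' {} \;
    popd

    # Set the TLS proxy image version in the defaults and in the TLS fixture
    yq w -i values.yaml tls.image_version 1.13.1-2
    yq w -i .fixtures/tls_enabled.yaml tls.image_version 1.13.1-2
    popd

    # Package the chart and refresh the repository index
    helm package "$CHART"
    helm repo index .

    git add -u .
    git add "${CHART}-${new_version}.tgz"
    # Two -m flags produce a subject line and a separate body paragraph
    git commit -e -m "${CHART}: Update to v0.2 helpers" -m "Bug: T253396"
}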
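
The patch-level bump in the function above could also be pulled out into a small helper, which makes it easy to try in isolation. This is only a sketch: the name bump_patch is hypothetical, and it assumes a plain MAJOR.MINOR.PATCH version without any pre-release suffix.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the given MAJOR.MINOR.PATCH version with the
# patch level incremented by one.
bump_patch() {
    local version=$1 major minor patch
    # Split on dots into the three components
    IFS=. read -r major minor patch <<< "$version"
    # 10# forces base 10 so a patch level like "08" is not parsed as octal
    printf '%s.%s.%s\n' "$major" "$minor" "$((10#$patch + 1))"
}

# Usage, mirroring the yq calls in the function above (yq v3 syntax):
# yq w -i Chart.yaml version "$(bump_patch "$(yq r Chart.yaml version)")"
```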

Change 598759 merged by jenkins-bot:
[operations/deployment-charts@master] blubberoid: Update to v0.2 helpers

https://gerrit.wikimedia.org/r/598759

Change 599289 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] blubberoid: Use chart defaults for deployment

https://gerrit.wikimedia.org/r/599289

Change 599289 merged by jenkins-bot:
[operations/deployment-charts@master] blubberoid: Use chart defaults for deployment

https://gerrit.wikimedia.org/r/599289

Change 598780 merged by jenkins-bot:
[operations/deployment-charts@master] recommendation-api: Update to v0.2 helpers

https://gerrit.wikimedia.org/r/598780

Change 598779 merged by jenkins-bot:
[operations/deployment-charts@master] mobileapps: Update to v0.2 helpers

https://gerrit.wikimedia.org/r/598779

Change 598760 merged by jenkins-bot:
[operations/deployment-charts@master] citoid: Update to v0.2 helpers

https://gerrit.wikimedia.org/r/598760

Change 598777 merged by jenkins-bot:
[operations/deployment-charts@master] chromium-render: Update to v0.2 helpers

https://gerrit.wikimedia.org/r/598777

Change 598766 merged by jenkins-bot:
[operations/deployment-charts@master] cxserver: Update to v0.2 helpers

https://gerrit.wikimedia.org/r/598766

Change 598774 merged by jenkins-bot:
[operations/deployment-charts@master] wikifeeds: Update to v0.2 helpers

https://gerrit.wikimedia.org/r/598774

Change 600862 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] changeprop: Migrate to common_templates 0.2 tls_helper

https://gerrit.wikimedia.org/r/600862

So eventgate and eventstreams use forked tls_helpers (currently even the two forks differ slightly).

As far as I can tell, there are two kinds of reasons for this. One is the diverging use of .metadata.labels.app and .metadata.labels.chart, the other is the additional label .metadata.labels.routing_tag.

  • .metadata.labels.app is the chart's .Values.main_app.name. This could be replaced with the default value by making .Values.main_app.name part of the helm release name (for example main-production instead of production). Alternatively, .Values.chartName could be used to override the chart name with eventgate-main, keeping production as the helm release name.
  • .metadata.labels.chart uses the wmf.chartname template. The common_templates have wmf.chartname in .metadata.labels.app.
  • .metadata.labels.routing_tag is set to .Release.Name by default and is used as an additional selector for the services. I don't see why this is needed. As we run one release (except for canary) per namespace, and a kubernetes service only ever considers pods in its own namespace, it will select only the correct release anyway. So maybe this can be removed.

So, my suggestion would be to switch to common_helpers, use .Values.chartName instead of .Values.main_app.name and drop .metadata.labels.routing_tag completely.

@Ottomata what do you think?

Oh. I see that the current canary setup will not work with my suggestions, and as far as I can see there is currently no way to do it with the default scaffold/templates.

And I now see T242861, so please ignore what I said (or at least what I was suggesting).
Instead, I'll evaluate merging the common_templates v0.2 changes into the eventgate/eventstreams forks, so that this is not blocked.

Change 602060 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] eventstreams: Update to v0.2 helpers

https://gerrit.wikimedia.org/r/602060

Change 602061 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] eventgate: Update to v0.2 helpers

https://gerrit.wikimedia.org/r/602061

Change 600862 merged by jenkins-bot:
[operations/deployment-charts@master] changeprop: Migrate to common_templates 0.2 tls_helper

https://gerrit.wikimedia.org/r/600862

Change 602060 merged by jenkins-bot:
[operations/deployment-charts@master] eventstreams: Update to v0.2 helpers

https://gerrit.wikimedia.org/r/602060

Change 602061 merged by jenkins-bot:
[operations/deployment-charts@master] eventgate: Update to v0.2 helpers

https://gerrit.wikimedia.org/r/602061

All clusters are clean of the envoy-tls-local-proxy image!

Change 608277 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/docker-images/production-images@master] Remove deprecated and unmaintained image: envoy-tls-local-proxy

https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/608277

Mentioned in SAL (#wikimedia-operations) [2020-06-29T12:32:50Z] <jayme> deleted all tags for docker-registry.wikimedia.org/envoy-tls-local-proxy from docker registry - T253396

I tried to delete the tags/image with the process described here[1], but unfortunately the tags can still be pulled after a successful DELETE (a second DELETE even returns HTTP 404).
I guess a garbage-collection[2] run is needed to actually remove the tags from the registry. I tried that (with --dry-run) on registry2001, where it has been running for 5 hours and is still going. According to the output (which seems to go over all images in alphabetical order) it has reached the last image, but it's still doing a lot of swift requests, so it's probably not stuck...
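
For reference, a rough sketch of what a tag delete via the generic Docker Registry HTTP API v2 looks like. This is not necessarily the process linked at [1]: it assumes the registry has deletes enabled and that the caller is allowed to issue them (authentication is left out), and the function names extract_digest and delete_tag are hypothetical.

```shell
#!/usr/bin/env bash
# Registry host as used elsewhere in this task
REGISTRY="https://docker-registry.wikimedia.org"

# Hypothetical helper: reads HTTP response headers on stdin and prints the
# value of the Docker-Content-Digest header (header names are case-insensitive).
extract_digest() {
    awk -F': ' 'tolower($1) == "docker-content-digest" { sub(/\r$/, "", $2); print $2 }'
}

# Hypothetical helper: resolves repo:tag to its manifest digest, then deletes
# the manifest by digest (the v2 API deletes manifests by digest, not by tag).
delete_tag() {
    local repo=$1 tag=$2 digest
    digest=$(curl -sSI \
        -H 'Accept: application/vnd.docker.distribution.manifest.v2+json' \
        "${REGISTRY}/v2/${repo}/manifests/${tag}" | extract_digest)
    if [ -z "$digest" ]; then
        echo "no digest found for ${repo}:${tag}" >&2
        return 1
    fi
    curl -sS -X DELETE "${REGISTRY}/v2/${repo}/manifests/${digest}"
}

# Usage (not run here):
# delete_tag envoy-tls-local-proxy <tag>
```

The DELETE only removes the manifest reference; the underlying blobs stay in storage until a garbage-collection run.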

It seems to be required to fetch the tag list once while bypassing the caches, in order to have the remaining references removed:
curl 'https://docker-registry.wikimedia.org/v2/envoy-tls-local-proxy/tags/list?x=y'
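
A small sketch of the same idea with a varying query string, so that repeated checks do not hit a cached copy of an earlier response. It assumes the caches key on the full URL including the query string and that the registry ignores unknown parameters (as it evidently does for x=y above). The function name tags_list_url and the parameter name nocache are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the tags/list URL for a repository with a
# throwaway query parameter appended. $2 is an optional nonce; it defaults
# to the current epoch time.
tags_list_url() {
    local repo=$1 nonce=${2:-$(date +%s)}
    printf 'https://docker-registry.wikimedia.org/v2/%s/tags/list?nocache=%s\n' "$repo" "$nonce"
}

# Usage (not run here):
# curl -sS "$(tags_list_url envoy-tls-local-proxy)"
```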

Unfortunately I did not manage to remove the "repository" itself (docker-registry.wikimedia.org/v2/envoy-tls-local-proxy).

For reference:
There is a tool called docker-registryctl in https://gerrit.wikimedia.org/g/operations/docker-images/docker-report which can be used to delete image tags (via the docker-registry API as well).

Change 608277 merged by JMeybohm:
[operations/docker-images/production-images@master] Remove deprecated and unmaintained image: envoy-tls-local-proxy

https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/608277

Mentioned in SAL (#wikimedia-operations) [2020-06-30T11:31:58Z] <jayme> pushed a scratch docker image as docker-registry.discovery.wmnet/envoy-tls-local-proxy:dontuseme - T253396

Mentioned in SAL (#wikimedia-operations) [2020-06-30T11:32:02Z] <jayme> restarted docker-reporter-base-images and docker-reporter-releng-images on deneb - T253396

This led to failing docker-reporter-base-images.service on deneb. I'm definitely missing something here...

JMeybohm added a subscriber: Joe.

This is resolved now. For anyone passing by, please see: https://wikitech.wikimedia.org/wiki/Docker#Deleting_an_image_(from_registry) and T242604

Thanks @Joe