Corrupted helm charts in Swift that are not properly showing up in chartmuseum
Closed, Resolved · Public

Description

While trying to debug the -1 from jenkins in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/692736, we noticed the following in the console output:

Error: chart "shellbox" matching version "" not found in wmf-stable index. (try 'helm repo update'). no chart name found

despite the shellbox chart having been merged fine in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/667047

chartmuseum would indeed NOT show the chart in https://helm-charts.wikimedia.org/api/stable/charts or https://helm-charts.wikimedia.org/stable/index.yaml. However, judging from the comments on the gerrit change, it did show up for a period of time, indicating an unexpected change of state, i.e. the chart disappearing.
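For reference, the chart's presence in the published index can be checked with the same grep pattern used further down in this task:

curl -s https://helm-charts.wikimedia.org/stable/index.yaml | grep shellbox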

Going through helm-chartctl-package-all.service on chartmuseum1001, which is triggered by the respective systemd timer every 2 minutes, we see that it runs

ExecStart=/bin/sh -c 'cd /srv/deployment-charts && /usr/bin/git pull && /usr/bin/helm-chartctl --cm-url http://localhost:8080 walk /srv/deployment-charts/charts stable'
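
For context, the timer unit presumably looks roughly like this (a sketch reconstructed from the values mentioned in this task, not the actual puppet-managed file):

[Unit]
Description=Periodic execution of helm-chartctl-package-all.service

[Timer]
# re-run 120s after the service last became inactive (the "every 2m" above)
OnUnitInactiveSec=120s
# no randomized delay configured at the time of this task
RandomizedDelaySec=0

[Install]
WantedBy=timers.target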

Running the exact same command manually, passing -d to helm-chartctl for debug output, consistently and repeatedly returns the following:

INFO[helm-chartctl] wikifeeds-0.0.19 already exists in repo: stable
INFO[helm-chartctl] similar-users-0.0.8 already exists in repo: stable
INFO[helm-chartctl] rdf-streaming-updater-0.0.6 already exists in repo: stable
INFO[helm-chartctl] apertium-0.0.4 already exists in repo: stable
INFO[helm-chartctl] api-gateway-0.2.1 already exists in repo: stable
INFO[helm-chartctl] blubberoid-0.0.26 already exists in repo: stable
INFO[helm-chartctl] parsoid-0.0.6 already exists in repo: stable
INFO[helm-chartctl] termbox-0.0.17 already exists in repo: stable
INFO[helm-chartctl] cxserver-0.0.22 already exists in repo: stable
INFO[helm-chartctl] zotero-0.0.16 already exists in repo: stable
INFO[helm-chartctl] coredns-0.1.3 already exists in repo: stable
INFO[helm-chartctl] eventstreams-0.3.3 already exists in repo: stable
INFO[helm-chartctl] maps-vector-server-0.0.1 already exists in repo: stable
INFO[helm-chartctl] mobileapps-0.0.19 already exists in repo: stable
INFO[helm-chartctl] eventgate-0.2.14 already exists in repo: stable
INFO[helm-chartctl] Chart already exists: /tmp/tmpcrv8ri0w/mediawiki-0.0.1.tgz
INFO[helm-chartctl] Chart already exists in repository: mediawiki-0.0.1.tgz
INFO[helm-chartctl] chromium-render-0.0.17 already exists in repo: stable
INFO[helm-chartctl] calico-0.1.14 already exists in repo: stable
INFO[helm-chartctl] push-notifications-0.0.9 already exists in repo: stable
INFO[helm-chartctl] changeprop-0.9.56 already exists in repo: stable
INFO[helm-chartctl] calculator-service-0.0.1 already exists in repo: stable
INFO[helm-chartctl] eventrouter-0.3.6 already exists in repo: stable
INFO[helm-chartctl] Chart already exists: /tmp/tmp41_1l9tp/shellbox-0.0.1.tgz
INFO[helm-chartctl] Chart already exists in repository: shellbox-0.0.1.tgz
INFO[helm-chartctl] raw-0.2.5 already exists in repo: stable
INFO[helm-chartctl] kafka-dev-0.0.5 already exists in repo: stable
INFO[helm-chartctl] kask-0.0.11 already exists in repo: stable
INFO[helm-chartctl] recommendation-api-0.0.9 already exists in repo: stable
INFO[helm-chartctl] citoid-0.0.17 already exists in repo: stable
INFO[helm-chartctl] mathoid-0.0.35 already exists in repo: stable
INFO[helm-chartctl] linkrecommendation-0.1.12 already exists in repo: stable
INFO[helm-chartctl] mediawiki-dev-0.3.1 already exists in repo: stable
INFO[helm-chartctl] calico-crds-0.1.0 already exists in repo: stable

Of note are these two sets of interesting lines:

INFO[helm-chartctl] Chart already exists: /tmp/tmp41_1l9tp/shellbox-0.0.1.tgz
INFO[helm-chartctl] Chart already exists in repository: shellbox-0.0.1.tgz

and the corresponding pair for mediawiki-0.0.1.

Looking into the chartmuseum logs, we see

ERROR        [93773] Request served        {"path": "/api/stable/charts", "comment": "", "clientIP": "127.0.0.1", "method": "POST", "statusCode": 409, "latency": "42.573761ms",

in the same timeframe.
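
That 409 comes from chartmuseum's upload endpoint refusing a chart version that already exists in the index. It should be reproducible with a plain POST against the local instance (a hypothetical invocation matching the path in the log line; the chart path is illustrative):

curl -s -o /dev/null -w '%{http_code}\n' -X POST \
  --data-binary @shellbox-0.0.1.tgz \
  http://localhost:8080/api/stable/charts
# prints 409 when the chart version is already present in the repository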

On a hunch, I went and looked into swift directly for those charts. After doing the curl dance[1] to get a token from swift, the following returns a corrupted 0-byte result despite the 200 OK HTTP status code.

curl -v -s -X GET -H "X-Auth-Token: <REDACTED>" 'https://thanos-swift.discovery.wmnet/v1/AUTH_chartmuseum/charts/stable/mediawiki-0.0.1.tgz' 
> blablah removed for brevity
> GET /v1/AUTH_chartmuseum/charts/stable/mediawiki-0.0.1.tgz HTTP/1.1
> Host: thanos-swift.discovery.wmnet
> User-Agent: curl/7.64.0
> Accept: */*
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
< HTTP/1.1 200 OK
< content-length: 13670
< accept-ranges: bytes
< last-modified: Tue, 18 May 2021 06:54:59 GMT
< etag: 078b16ce6d0dfd8cd0337c8e2cfe01c4
< x-timestamp: 1621320898.89024
< content-type: application/x-tar
< x-trans-id: txeeb7b789cb254fe083297-0060a4d36c
< x-openstack-request-id: txeeb7b789cb254fe083297-0060a4d36c
< date: Wed, 19 May 2021 08:59:24 GMT
< x-envoy-upstream-service-time: 14
< server: envoy
< 
* Failed writing body (0 != 13670)
* Closing connection 0

Same thing for shellbox-0.0.1.tgz.

Issuing a DELETE (same command, albeit with -X DELETE) removes the chart from Swift (a subsequent GET returns a 404). Sure enough, a little later the systemd timer re-runs the service, and this time everything looks fine: the chart is present in both URLs above.
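Concretely, that is the GET command from above with the method switched:

curl -v -s -X DELETE -H "X-Auth-Token: <REDACTED>" 'https://thanos-swift.discovery.wmnet/v1/AUTH_chartmuseum/charts/stable/mediawiki-0.0.1.tgz'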

I also hit this again yesterday after a report from @Mvolz (citoid-0.0.17 was missing). That case was even more confusing, as the chart was apparently around long enough for the deployer to deploy to one of the environments (deployed on 2021-05-06) before disappearing from chartmuseum (gone by the following deploy window on 2021-05-13). The fix was exactly the same (DELETE and let the systemd timer sort it out).

My first instinct is a race (with delayed effects) between the two chartmuseum instances pushing to swift, which in some cases corrupts the underlying .tgz file when the container gets synced between the two DCs. Adding @fgiunchedi for his information.

[1] The curl dance

source /etc/default/chartmuseum
# request a token from the swift auth endpoint
curl -i -H "X-Auth-User: $ST_USER" -H "X-Auth-Key: $ST_KEY" "$ST_AUTH"
...
x-storage-url: https://thanos-swift.discovery.wmnet/v1/AUTH_chartmuseum
x-auth-token: AUTH_<TOKEN>
...

# then fetch the object using the returned token and storage URL
curl -sSL -D - -o /tmp/foo.tgz -H "X-Auth-Token: AUTH_<TOKEN>" 'https://thanos-swift.discovery.wmnet/v1/AUTH_chartmuseum/charts/stable/mediawiki-0.0.1.tgz'

Alternatively, use swift client directly: https://wikitech.wikimedia.org/wiki/Swift/How_To#Impersonating_a_specific_swift_account
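
With python-swiftclient, the equivalent fetch would look something like this (a sketch assuming the same credentials file and v1 auth; note the container is charts and the object path includes the stable/ prefix):

source /etc/default/chartmuseum
swift -A "$ST_AUTH" -U "$ST_USER" -K "$ST_KEY" download charts stable/mediawiki-0.0.1.tgz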

Event Timeline

This is not good, it could badly affect deploys. Setting to High.

I am wondering whether setting RandomizedDelaySec to something other than 0 (which is what it is set to right now) would be a sufficient mitigation. These are small files (a few KBs each), so giving swift a few seconds to propagate changes between DCs might be enough.
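
If we go that route, a drop-in along these lines should suffice (a sketch; the actual value and the puppet plumbing are to be decided):

# /etc/systemd/system/helm-chartctl-package-all.timer.d/override.conf (hypothetical drop-in)
[Timer]
RandomizedDelaySec=30s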

I agree with the analysis @akosiaris: if two processes are racing to update the same file(s), then anything could happen. As for the fix, does chartmuseum walk/upload in some defined order? If so, then yes, starting with a delay might work; the more robust solution IMHO is to stagger the start times across sites over the course of a day.

Change 693352 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] Allow absenting the helmchartctl systemd timer

https://gerrit.wikimedia.org/r/693352

Change 693352 merged by Alexandros Kosiaris:

[operations/puppet@production] Allow absenting the helmchartctl systemd timer

https://gerrit.wikimedia.org/r/693352

Change 693367 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] Add an extra latency bucket at 10s

https://gerrit.wikimedia.org/r/693367

Change 693367 merged by jenkins-bot:

[operations/deployment-charts@master] Add an extra latency bucket at 10s

https://gerrit.wikimedia.org/r/693367

mathoid-0.0.36 was uploaded fine to chartmuseum1001, so we have one data point corroborating the hypothesis, but we definitely need more.

The bad news is that chartmuseum2001 did not see the change (33 minutes after the merge)

akosiaris@chartmuseum1001:~$ curl -s -k https://chartmuseum1001.eqiad.wmnet/stable/index.yaml |grep mathoid-0.0.3[56]
    - charts/mathoid-0.0.36.tgz
    - charts/mathoid-0.0.35.tgz
akosiaris@chartmuseum1001:~$ curl -s -k https://chartmuseum2001.codfw.wmnet/stable/index.yaml |grep mathoid-0.0.3[56]
    - charts/mathoid-0.0.35.tgz

It did see it after a restart though. So it does hold some cache (probably in memory?) that gets regenerated on restart. That means an "active/passive" scenario wouldn't work reliably; we need to switch back to the active/active scenario we were in already, probably solving the corruption issue via some timer mangling so the runs don't coincide.
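
For the record, the manual workaround on the lagging instance boils down to (assuming the service unit is simply called chartmuseum.service):

sudo systemctl restart chartmuseum.service  # forces the in-memory index cache to be rebuilt from swift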

> The bad news is that chartmuseum2001 did not see the change (33 minutes after the merge)

We could also try setting --cache-interval= to a short interval, which should force cache invalidation after that time. And maybe disable the index-cache.yaml with --disable-statefiles, as the code suggests that file is only read on cache init anyway.
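
That is, something along these lines in the service invocation (both flags exist in chartmuseum; the binary path is assumed and the rest of the command line is elided):

/usr/bin/chartmuseum ... --cache-interval=60s --disable-statefiles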

Change 693419 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] mathoid: Switch to native prometheus exporter

https://gerrit.wikimedia.org/r/693419

Change 693419 merged by jenkins-bot:

[operations/deployment-charts@master] mathoid: Switch to native prometheus exporter

https://gerrit.wikimedia.org/r/693419

> On a hunch, I went and looked into swift directly for those charts. After doing the curl dance[1] to get a token from swift, the following returns a corrupted 0-byte result despite the 200 OK HTTP status code.

I think the 0-byte response is due to how curl was invoked. If the output is redirected into a file (or -o is used), all bytes are retrieved correctly.
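
Indeed, re-fetching with -o and comparing against the etag (which for non-segmented swift objects is the MD5 of the content) can confirm the object is intact:

curl -s -o /tmp/foo.tgz -H "X-Auth-Token: AUTH_<TOKEN>" 'https://thanos-swift.discovery.wmnet/v1/AUTH_chartmuseum/charts/stable/mediawiki-0.0.1.tgz'
md5sum /tmp/foo.tgz  # should match the etag header, e.g. 078b16ce6d0dfd8cd0337c8e2cfe01c4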

> This is not good, it could badly affect deploys. Setting to High.

> I am wondering whether setting RandomizedDelaySec to something other than 0 (which is what it is set to right now) would be a sufficient mitigation. These are small files (a few KBs each), so giving swift a few seconds to propagate changes between DCs might be enough.

> I agree with the analysis @akosiaris: if two processes are racing to update the same file(s), then anything could happen. As for the fix, does chartmuseum walk/upload in some defined order? If so, then yes, starting with a delay might work; the more robust solution IMHO is to stagger the start times across sites over the course of a day.

The timer is configured with OnUnitInactiveSec=120s, so there should already be some variance in when it actually runs. Given that the 0-byte response from curl was a red herring, I'm not convinced the chart tgz was actually broken; rather, the "other" chartmuseum instance (i.e. the one that had not uploaded the new chart version to swift) simply did not know about it.
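
The actual firing times on the two hosts can be compared with e.g.:

systemctl list-timers helm-chartctl-package-all.timer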

Reading through all the changes made between chartmuseum 0.12 and 0.13.1 revealed the sneaky change "Cleanup latest cache changes", which slipped past me during review of the changelog. It actually looks more like a rewrite of the caching code, plus it turns the cache interval off by default, meaning chartmuseum will no longer pull the most recent index-cache.yaml from swift on a regular basis.

While code comments suggest the cache interval was 5min, it was actually set to 60s by default (and is now disabled). So I will reintroduce that cache interval of 60s (as we did not have any issues prior to the chartmuseum upgrade) and revert the change from Alex.

Change 697584 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] chartmuseum: Set cache-interval to 60s

https://gerrit.wikimedia.org/r/697584

Change 697584 merged by JMeybohm:

[operations/puppet@production] chartmuseum: Set cache-interval to 60s

https://gerrit.wikimedia.org/r/697584

Both nodes run the packaging timers again and the cache-interval is set to 60s. I'll keep an eye on it.

Looks fine so far, I'll resolve this.