While trying to debug the -1 from jenkins in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/692736 we noticed at the console
Error: chart "shellbox" matching version "" not found in wmf-stable index. (try 'helm repo update'). no chart name found
despite the shellbox chart having been merged fine in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/667047
chartmuseum would indeed NOT show the chart in https://helm-charts.wikimedia.org/api/stable/charts and https://helm-charts.wikimedia.org/stable/index.yaml. However it seems like for a period of time that it did show it from the comments in the gerrit change (indicating an unexpected change of state, i.e. the chart disappearing)
Going through the helm-chartctl-package-all.service on chartmuseum1001, that is being triggered by the respective systemd timer every 2m we see that it does
ExecStart=/bin/sh -c 'cd /srv/deployment-charts && /usr/bin/git pull && /usr/bin/helm-chartctl --cm-url http://localhost:8080 walk /srv/deployment-charts/charts stable'
Doing the exact same thing manually and passing -d for debug to helm-chartctl returns the following output (consistently and repeatedly)
INFO[helm-chartctl] wikifeeds-0.0.19 already exists in repo: stable INFO[helm-chartctl] similar-users-0.0.8 already exists in repo: stable INFO[helm-chartctl] rdf-streaming-updater-0.0.6 already exists in repo: stable INFO[helm-chartctl] apertium-0.0.4 already exists in repo: stable INFO[helm-chartctl] api-gateway-0.2.1 already exists in repo: stable INFO[helm-chartctl] blubberoid-0.0.26 already exists in repo: stable INFO[helm-chartctl] parsoid-0.0.6 already exists in repo: stable INFO[helm-chartctl] termbox-0.0.17 already exists in repo: stable INFO[helm-chartctl] cxserver-0.0.22 already exists in repo: stable INFO[helm-chartctl] zotero-0.0.16 already exists in repo: stable INFO[helm-chartctl] coredns-0.1.3 already exists in repo: stable INFO[helm-chartctl] eventstreams-0.3.3 already exists in repo: stable INFO[helm-chartctl] maps-vector-server-0.0.1 already exists in repo: stable INFO[helm-chartctl] mobileapps-0.0.19 already exists in repo: stable INFO[helm-chartctl] eventgate-0.2.14 already exists in repo: stable INFO[helm-chartctl] Chart already exists: /tmp/tmpcrv8ri0w/mediawiki-0.0.1.tgz INFO[helm-chartctl] Chart already exists in repository: mediawiki-0.0.1.tgz INFO[helm-chartctl] chromium-render-0.0.17 already exists in repo: stable INFO[helm-chartctl] calico-0.1.14 already exists in repo: stable INFO[helm-chartctl] push-notifications-0.0.9 already exists in repo: stable INFO[helm-chartctl] changeprop-0.9.56 already exists in repo: stable INFO[helm-chartctl] calculator-service-0.0.1 already exists in repo: stable INFO[helm-chartctl] eventrouter-0.3.6 already exists in repo: stable INFO[helm-chartctl] Chart already exists: /tmp/tmp41_1l9tp/shellbox-0.0.1.tgz INFO[helm-chartctl] Chart already exists in repository: shellbox-0.0.1.tgz INFO[helm-chartctl] raw-0.2.5 already exists in repo: stable INFO[helm-chartctl] kafka-dev-0.0.5 already exists in repo: stable INFO[helm-chartctl] kask-0.0.11 already exists in repo: stable INFO[helm-chartctl] recommendation-api-0.0.9 already exists in repo: stable INFO[helm-chartctl] citoid-0.0.17 already exists in repo: stable INFO[helm-chartctl] mathoid-0.0.35 already exists in repo: stable INFO[helm-chartctl] linkrecommendation-0.1.12 already exists in repo: stable INFO[helm-chartctl] mediawiki-dev-0.3.1 already exists in repo: stable INFO[helm-chartctl] calico-crds-0.1.0 already exists in repo: stable
of note are the 2 set of interesting lines
INFO[helm-chartctl] Chart already exists: /tmp/tmp41_1l9tp/shellbox-0.0.1.tgz
INFO[helm-chartctl] Chart already exists in repository: shellbox-0.0.1.tgz
and respectively for mediawiki-0.0.1
Looking into chartmuseum logs we see
ERROR [93773] Request served {"path": "/api/stable/charts", "comment": "", "clientIP": "127.0.0.1", "method": "POST", "statusCode": 409, "latency": "42.573761ms",
at the same timeframe.
On a hunch, I went and looked into swift directly for those charts. After doing the curl dance[1] to get a token from swift the following returns a corrupted 0 byte result despite the 200 OK HTTP status code.
curl -v -s -X GET -H "X-Auth-Token: <REDACTED>" 'https://thanos-swift.discovery.wmnet/v1/AUTH_chartmuseum/charts/stable/mediawiki-0.0.1.tgz' > blablah removed for brevity > GET /v1/AUTH_chartmuseum/charts/stable/mediawiki-0.0.1.tgz HTTP/1.1 > Host: thanos-swift.discovery.wmnet > User-Agent: curl/7.64.0 > Accept: */* > * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4): * TLSv1.3 (IN), TLS handshake, Newsession Ticket (4): * old SSL session ID is stale, removing < HTTP/1.1 200 OK < content-length: 13670 < accept-ranges: bytes < last-modified: Tue, 18 May 2021 06:54:59 GMT < etag: 078b16ce6d0dfd8cd0337c8e2cfe01c4 < x-timestamp: 1621320898.89024 < content-type: application/x-tar < x-trans-id: txeeb7b789cb254fe083297-0060a4d36c < x-openstack-request-id: txeeb7b789cb254fe083297-0060a4d36c < date: Wed, 19 May 2021 08:59:24 GMT < x-envoy-upstream-service-time: 14 < server: envoy < * Failed writing body (0 != 13670) * Closing connection 0
Same thing for shellbox-0.0.1.tgz.
Issuing a DELETE (same command albeit with -X DELETE) removes the chart from Swift (we get a 404 if we try to GET again) and sure enough a little bit later the systemd timer restarts the service and this time around everything looks fine and the chart is present quite fine in the 2 URLs above.
I've also met this a yesterday again after a report from @Mvolz (citoid-0.0.17 was missing). In that case it was ever more confusing as it seems that the chart was around for long enough for the deployer to deploy to one of the environments (deployed on 2021-05-06) before disappearing from chartmuseum (gone by following deploy window on 2021-05-13). The fix was exactly the same (DELETE and allow the systemd timer to fix it).
My first instinct is a race (with delayed effects) between the 2 instances of chartmuseum and their pushing to swift that causes in some cases a corruption of the underlying .tgz file when the underlying container gets sync between the 2 DCs. Adding @fgiunchedi for his information.
[1] The curl dance
source /etc/default/chartmuseum curl -i -H "X-Auth-User: $ST_USER" -H "X-Auth-Key: $ST_KEY" "$ST_AUTH" ... x-storage-url: https://thanos-swift.discovery.wmnet/v1/AUTH_chartmuseum x-auth-token: AUTH_<TOKEN> ... curl -sSL -D - -o /tmp/foo.tgz -H "X-Auth-Token: AUTH_<TOKEN>" 'https://thanos-swift.discovery.wmnet/v1/AUTH_chartmuseum/charts/stable/mediawiki-0.0.1.tgz'
Alternatively, use swift client directly: https://wikitech.wikimedia.org/wiki/Swift/How_To#Impersonating_a_specific_swift_account