Tue, Jul 7
That sounds nice!
I would suggest updating the image version in the helmfile.d values (e.g. https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/staging/blubberoid/values.yaml#33) instead of the chart itself, though. In general a new chart release is (or at least should be) only needed when substantial changes have been made to the containerized application (changes that would change the way the container is deployed/run, not changes to what is run inside the container).
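For concreteness, something along these lines is what I mean (the key names are assumed from the linked blubberoid values.yaml and the tag value is illustrative):
  $ sed -n '/main_app:/,/version:/p' helmfile.d/services/staging/blubberoid/values.yaml
  main_app:
    image: wikimedia/blubberoid
    version: 2020-07-06-120000-production   # bump this tag to the new image; no new chart release needed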
Mon, Jul 6
VMs created, installed and ran puppet insetup role successfully. Both came up fine after reboot.
Fri, Jul 3
Thu, Jul 2
This is resolved now. For anyone coming across this later, please see: https://wikitech.wikimedia.org/wiki/Docker#Deleting_an_image_(from_registry) and T242604
Wed, Jul 1
Unfortunately removing all tags of an image (i.e. a repository) does not remove the repository itself from the registry. What that means is that the "image" will still be listed in the catalog (GET /v2/_catalog).
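For example, the repository keeps showing up here even with zero tags left (output trimmed, repository name illustrative):
  $ curl -s https://docker-registry.discovery.wmnet/v2/_catalog
  {"repositories":[..., "wikimedia/some-deleted-image", ...]}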
This is the old Puppet CA that some docker daemons have still loaded.
Unfortunately a docker reload does not reload the CA, so we need to do a docker restart on: kubernetes[2001-2004].codfw.wmnet, kubernetes[1001-1004].eqiad.wmnet. Newer Kubernetes nodes already started with the updated CA and are fine.
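Something along these lines should do it, one node at a time (the exact cumin invocation here is only a sketch):
  $ sudo cumin -b 1 'kubernetes[2001-2004].codfw.wmnet,kubernetes[1001-1004].eqiad.wmnet' 'systemctl restart docker'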
Raising prio as we do have the same situation on prod clusters.
It's only docker that is totally sure that the certificate is not valid, so I guess it does not reload ca-certificates (even on SIGHUP).
Still getting ErrImagePull in kubectl get events:
73s  Normal   Pulling  Pod  pulling image "docker-registry.discovery.wmnet/wikimedia/mediawiki-services-mobileapps:2020-06-29-163540-production"
73s  Warning  Failed   Pod  Failed to pull image "docker-registry.discovery.wmnet/wikimedia/mediawiki-services-mobileapps:2020-06-29-163540-production": rpc error: code = Unknown desc = Error response from daemon: Get https://docker-registry.discovery.wmnet/v1/_ping: x509: certificate has expired or is not yet valid
73s  Warning  Failed   Pod  Error: ErrImagePull
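A quick way to compare what the registry endpoint actually serves against what the docker daemon still trusts (just a diagnostic sketch):
  $ echo | openssl s_client -connect docker-registry.discovery.wmnet:443 2>/dev/null | openssl x509 -noout -issuer -dates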
Tue, Jun 30
This led to failing docker-reporter-base-images.service on deneb. I'm definitely missing something here...
Seems it is required to fetch the tag list once while bypassing the caches to have the lingering references removed:
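Roughly this kind of request, i.e. the tags/list endpoint with a no-cache header (whether that header is what actually bypasses our cache layer is an assumption on my part):
  $ curl -s -H 'Cache-Control: no-cache' https://docker-registry.discovery.wmnet/v2/<image-name>/tags/list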
Mon, Jun 29
I tried to delete the tags/image with the process described here but unfortunately the tags can still be pulled after successful DELETE (another DELETE even returns HTTP 404).
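For context, the standard registry V2 delete flow looks roughly like this (image name, tag and digest are placeholders):
  $ curl -sI -H 'Accept: application/vnd.docker.distribution.manifest.v2+json' https://docker-registry.discovery.wmnet/v2/<image-name>/manifests/<tag> | grep -i docker-content-digest
  $ curl -s -X DELETE https://docker-registry.discovery.wmnet/v2/<image-name>/manifests/<sha256-digest>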
I guess a garbage-collection run is needed to actually remove the tags from the registry. I tried that (--dry-run) on registry2001, where it has been running for 5 hours now and is still going. According to the output (which seems to go over all images in alphabetical order) it has reached the last image, but it's still doing a lot of swift requests, so it's probably not stuck...
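For reference, the run in question is the registry's own garbage collector, roughly like this (binary name and config path may differ on our registry hosts):
  $ docker-registry garbage-collect --dry-run /etc/docker-registry/config.yml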
Fri, Jun 26
Turns out our swift cluster only supports Swift V1 auth, which ChartMuseum does not. I've tried the S3 API as well, but that only supports "v2 signatures", which ChartMuseum ... does not (because the official aws-sdk-go only supports v4 signatures).
This is done and the account is working, thanks @fgiunchedi !
Wed, Jun 24
That could help but the alert should always be actionable. For that to happen the owner needs to acknowledge the need for it, which might not happen at the same time for all services.
With kube-state-metrics (sorry for me repeating this over and over 😂 ) there is kube_pod_container_status_restarts_total and kube_pod_container_status_last_terminated_reason which can be used to detect OOM on containers.
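For example, a query along these lines could drive such an alert (the Prometheus URL is illustrative and the expression is only a sketch):
  $ curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=sum by (namespace, pod, container) (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}) > 0'
and similarly increase(kube_pod_container_status_restarts_total[1h]) > 0 for plain container restarts.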
Commit in private is e427c266f2d6ac0a937bf5d972b759933a9f9a18
I seem unable to screenshot the tooltip, but it contains the repo name and the commit message.
Tue, Jun 23
We don't expect private data in the charts at all.
In addition, they are already publicly accessible via https://releases.wikimedia.org/charts/ and https://gerrit.wikimedia.org/g/operations/deployment-charts ofc.
Thanks for writing this up @akosiaris! I think it would be nice to have the follow up tasks linked here. Like the removal of the service-runner and splitting up changeprop into multiple deployments (one per topic?).
Maybe we should also add a follow-up to alert/warn on OOM kills / container restarts?
Mon, Jun 22
I need to make decisions regarding TLS and storage:
Fri, Jun 19
@Joe I think cxserver is missing the last two steps as well, correct?
Wed, Jun 17
Tue, Jun 16
@Michael thanks for writing this up!
Thu, Jun 11
All clusters are now free of the envoy-tls-local-proxy image!
Wed, Jun 10
Tue, Jun 9
Jun 5 2020
Add everywhere except eventstream and eventgate.
Oh, my bad. Then we'll create them for you ofc.
Unfortunately starting with TLS right away would not permit the gradual traffic shift Alex was suggesting so it's probably better to start without and migrate to TLS in a second step. :-/
Jun 4 2020
If you want to start with TLS (via envoy) right away (which would be great!), you need to go through the extra steps of generating certificates (current document draft at https://wikitech.wikimedia.org/wiki/User:Giuseppe_Lavagetto/Add_Tls_On_Kubernetes) and "registering" a TCP port at https://wikitech.wikimedia.org/wiki/Service_ports
Jun 3 2020
And I now see T242861, so please ignore what I said (or at least what I was suggesting).
I'll evaluate the route of merging the common_templates v0.2 changes into eventgate/eventstream forks instead to not have this blocked.
Oh. I see that the current canary setup will not work with my suggestions, and as far as I can tell there is currently no way to do it with the default scaffold/templates.
So eventgate and eventstream use forked tls_helpers (currently even the forks slightly differ).
May 29 2020
May 28 2020
Just to have the reference here. I guess it's: https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry
May 27 2020
tiller has been updated in all clusters and namespaces so this is resolved now