We use a bunch of Docker Hub images, but new rate limits are now in place.
Figure out a workaround, and ideally why we even hit the rate limit.
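Docker Hub exposes the current anonymous pull quota in response headers, which helps answer the "why are we even hitting it" part. A sketch of the documented check, assuming `curl` and `jq` are available (`ratelimitpreview/test` is Docker's designated probe image):

```shell
# Fetch an anonymous pull token for Docker Hub's rate-limit probe image.
TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" \
  | jq -r .token)

# A HEAD request returns the quota headers without counting against the limit.
curl -s --head -H "Authorization: Bearer $TOKEN" \
  https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest \
  | grep -i ratelimit
```

If the remaining count is far lower than any single host should produce, it likely means many nodes share one NAT'd source IP, since anonymous limits are applied per IP.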
Change 644286 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] kubeadm: use calico/kube-controllers image from our internal docker registry
Change 644286 merged by Bstorm:
[operations/puppet@production] kubeadm: use calico/kube-controllers image from our internal docker registry
Mentioned in SAL (#wikimedia-cloud) [2020-11-30T17:14:05Z] <bstorm> updated the calico-kube-controllers deployment to use our internal registry to deal with docker-hub rate-limiting T268669 T269016
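For reference, this kind of in-place image swap on an existing deployment can be done with `kubectl set image`; a sketch only, with the registry host and tag below as placeholder assumptions rather than the actual values used:

```shell
# Placeholder registry host and tag -- substitute the real internal registry.
kubectl -n kube-system set image deployment/calico-kube-controllers \
  calico-kube-controllers=docker-registry.example.internal/calico/kube-controllers:v3.14.0
```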
Mentioned in SAL (#wikimedia-cloud) [2020-12-07T22:56:46Z] <bstorm> pushed updated local copies of the typha, calico-cni and calico-pod2daemon-flexvol images to the tools internal registry T269016
Mentioned in SAL (#wikimedia-cloud) [2020-12-08T19:01:43Z] <bstorm> pushed updated calico node image (v3.14.0) to internal docker registry as well T269016
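The general pull/retag/push workflow for caching an upstream image like this in an internal registry is sketched below; the registry hostname is a placeholder, not our actual internal registry:

```shell
# Pull the upstream image from Docker Hub (counts against the rate limit once).
docker pull calico/node:v3.14.0

# Retag it for the internal registry (placeholder hostname).
docker tag calico/node:v3.14.0 docker-registry.example.internal/calico/node:v3.14.0

# Push the copy; subsequent cluster pulls hit the internal registry, not Docker Hub.
docker push docker-registry.example.internal/calico/node:v3.14.0
```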
Change 647094 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] kubeadm-k8s: use cached calico container images
Change 647094 merged by Bstorm:
[operations/puppet@production] kubeadm-k8s: use cached calico container images
After deploying the above change in toolsbeta, I can say that it does create a rolling network blackout as the calico/node daemonset restarts in some cases. We *might* need to deploy it in order to get typha and calico-kube-controllers to reschedule, and we probably should in general for stability. We should mention a possible brief network issue for running pods in our notification to cloud-announce when we start work. There's a good chance few will notice it, but it looked like that happened. Then again, the network flap I saw may have been just a timeout from the etcd servers. They seem to do that a lot without really good IO, and we probably didn't upgrade the etcd servers' image to the faster ceph one in toolsbeta. @dcaro take note on the rebuilds stuff :) Tools will need the faster setup.
It is possible that the flap was literally just the calico/node restart. It's just that, since it's a quiet cluster, the main network activity from pods was traffic to etcd (from API servers and such).
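To tell a calico/node restart apart from an etcd timeout next time, the rollout can be watched directly; a sketch assuming the standard kube-system namespace and the stock `k8s-app=calico-node` label:

```shell
# Watch the daemonset roll through its pods.
kubectl -n kube-system rollout status daemonset/calico-node

# Check pod restarts and placement afterwards.
kubectl -n kube-system get pods -l k8s-app=calico-node -o wide
```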