
Cloud: workaround new docker hub ratelimits
Closed, Resolved · Public

Description

We use a bunch of Docker Hub images, but Docker Hub has put new rate limits in place.

Figure out a workaround, and possibly why we even hit the rate limit in the first place.
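For reference, Docker Hub reports the current anonymous pull quota via response headers on a manifest request. A quick sketch for checking how much quota is left (assumes `curl` and `jq` are available; uses Docker's `ratelimitpreview/test` image, which exists solely for this purpose):

```
# fetch an anonymous pull token for the rate-limit preview image
TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token)

# HEAD the manifest and look at the ratelimit-* headers
curl -s --head -H "Authorization: Bearer $TOKEN" \
  "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest" \
  | grep -i 'ratelimit'
```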

Event Timeline

aborrero moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

Change 644286 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] kubeadm: use calico/kube-controllers image from our internal docker registry

https://gerrit.wikimedia.org/r/644286
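The gist of that change is pointing kubeadm at our internal registry instead of docker.io for the calico/kube-controllers image. Before switching, it's worth confirming the image is actually present in the registry; a rough check against the Docker Registry v2 API (the hostname below is illustrative, not necessarily the one the patch uses):

```
# list the tags available for the mirrored image (registry hostname is a placeholder)
curl -s "https://docker-registry.tools.wmflabs.org/v2/calico/kube-controllers/tags/list" | jq .
```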

Change 644286 merged by Bstorm:
[operations/puppet@production] kubeadm: use calico/kube-controllers image from our internal docker registry

https://gerrit.wikimedia.org/r/644286

Mentioned in SAL (#wikimedia-cloud) [2020-11-30T17:14:05Z] <bstorm> updated the calico-kube-controllers deployment to use our internal registry to deal with docker-hub rate-limiting T268669 T269016
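The deployment update mentioned above can be done in a few ways; a hedged sketch of the kubectl equivalent (container name, registry hostname and image tag are assumptions, not taken from the actual change):

```
# repoint the running deployment at the internal registry copy of the image
kubectl -n kube-system set image deployment/calico-kube-controllers \
  calico-kube-controllers=docker-registry.tools.wmflabs.org/calico/kube-controllers:v3.14.0

# wait for the replacement pods to become ready
kubectl -n kube-system rollout status deployment/calico-kube-controllers
```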

Mentioned in SAL (#wikimedia-cloud) [2020-12-07T22:56:46Z] <bstorm> pushed updated local copies of the typha, calico-cni and calico-pod2daemon-flexvol images to the tools internal registry T269016

Mentioned in SAL (#wikimedia-cloud) [2020-12-08T19:01:43Z] <bstorm> pushed updated calico node image (v3.14.0) to internal docker registry as well T269016
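Pushing local copies of those images boils down to a pull/tag/push cycle against the internal registry. A minimal sketch, assuming the registry hostname and the v3.14.0 tags (both placeholders):

```
# mirror the upstream calico images into the internal registry
for img in calico/typha:v3.14.0 calico/cni:v3.14.0 \
           calico/pod2daemon-flexvol:v3.14.0 calico/node:v3.14.0; do
  docker pull "docker.io/${img}"
  docker tag  "docker.io/${img}" "docker-registry.tools.wmflabs.org/${img}"
  docker push "docker-registry.tools.wmflabs.org/${img}"
done
```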

Change 647094 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] kubeadm-k8s: use cached calico container images

https://gerrit.wikimedia.org/r/647094

Change 647094 merged by Bstorm:
[operations/puppet@production] kubeadm-k8s: use cached calico container images

https://gerrit.wikimedia.org/r/647094
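Once the cached-image change is rolled out, a quick way to confirm the calico pods are actually pulling from the internal registry rather than docker.io (the kube-system namespace is an assumption on my part):

```
# print pod name and container image(s) for the calico components
kubectl -n kube-system get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}' \
  | grep -i calico
```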

After deploying the above change in toolsbeta, I can say that it does create a rolling network blackout in some cases as the calico/node daemonset restarts. We *might* need to deploy it in order to get typha and calico-kube-controllers to reschedule, and we probably should in general for stability. We should mention a possible brief network issue for running pods in our notification to cloud-announce when we start work. There's a good chance few will notice it, but it looked like that happened. That, or the network flap I saw was literally just a timeout from the etcd servers. They seem to do that a lot without really good IO, and we probably didn't upgrade the etcd servers' image to the faster ceph one in toolsbeta. @dcaro take note on the rebuilds stuff :) Tools will need the faster setup.

It is possible that the flap was literally just the calico node restart. It's just that the main network activity from pods was to etcd (from the API servers and such) because it's a quiet cluster.
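The blackout window is bounded by how the calico-node daemonset rolls. Checking its update strategy and watching the rollout gives a feel for how long pods are without networking; a sketch, assuming the usual calico-node daemonset name in kube-system:

```
# see how many nodes can be updated at once (maxUnavailable) and the strategy type
kubectl -n kube-system get daemonset calico-node \
  -o jsonpath='{.spec.updateStrategy}{"\n"}'

# watch the restart progress across nodes
kubectl -n kube-system rollout status daemonset/calico-node
```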

aborrero assigned this task to Bstorm.