We use a bunch of Docker Hub images, but new rate limits are now in place.
Figure out a workaround, and ideally why we even hit the rate limit.
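Docker Hub exposes the current anonymous pull quota in response headers, which helps answer the "why are we even hitting it" part. A sketch of the documented check, assuming `curl` and `jq` are available (`ratelimitpreview/test` is Docker's designated probe image):

```shell
# Fetch an anonymous pull token for Docker Hub's rate-limit probe image.
TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" \
  | jq -r .token)

# A HEAD request returns the quota headers without counting against the limit.
curl -s --head -H "Authorization: Bearer $TOKEN" \
  https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest \
  | grep -i ratelimit
```

If the remaining count is far lower than any single host should produce, it likely means many nodes share one NAT'd source IP, since anonymous limits are applied per IP.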
Change 644286 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] kubeadm: use calico/kube-controllers image from our internal docker registry
Change 644286 merged by Bstorm:
[operations/puppet@production] kubeadm: use calico/kube-controllers image from our internal docker registry
Mentioned in SAL (#wikimedia-cloud) [2020-11-30T17:14:05Z] <bstorm> updated the calico-kube-controllers deployment to use our internal registry to deal with docker-hub rate-limiting T268669 T269016
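For reference, this kind of in-place image swap on an existing deployment can be done with `kubectl set image`; a sketch only, with the registry host and tag below as placeholder assumptions rather than the actual values used:

```shell
# Placeholder registry host and tag -- substitute the real internal registry.
kubectl -n kube-system set image deployment/calico-kube-controllers \
  calico-kube-controllers=docker-registry.example.internal/calico/kube-controllers:v3.14.0
```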
Mentioned in SAL (#wikimedia-cloud) [2020-12-07T22:56:46Z] <bstorm> pushed updated local copies of the typha, calico-cni and calico-pod2daemon-flexvol images to the tools internal registry T269016
Mentioned in SAL (#wikimedia-cloud) [2020-12-08T19:01:43Z] <bstorm> pushed updated calico node image (v3.14.0) to internal docker registry as well T269016
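The general pull/retag/push workflow for caching an upstream image like this in an internal registry is sketched below; the registry hostname is a placeholder, not our actual internal registry:

```shell
# Pull the upstream image from Docker Hub (counts against the rate limit once).
docker pull calico/node:v3.14.0

# Retag it for the internal registry (placeholder hostname).
docker tag calico/node:v3.14.0 docker-registry.example.internal/calico/node:v3.14.0

# Push the copy; subsequent cluster pulls hit the internal registry, not Docker Hub.
docker push docker-registry.example.internal/calico/node:v3.14.0
```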
Change 647094 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] kubeadm-k8s: use cached calico container images
Change 647094 merged by Bstorm:
[operations/puppet@production] kubeadm-k8s: use cached calico container images
After deploying the above change in toolsbeta, I can say that it does create a rolling network blackout as the calico/node daemonset restarts in some cases. We *might* need to deploy it in order to get typha and calico-kube-controllers to reschedule, and we probably should in general for stability. We should mention a possible brief network issue for running pods in our notification to cloud-announce when we start work. There's a good chance few will notice it, but it looked like that happened. Then again, the network flap I saw may have been just a timeout from the etcd servers. They seem to do that a lot without really good IO, and we probably didn't upgrade the etcd servers' image to the faster ceph one in toolsbeta. @dcaro take note on the rebuilds stuff :) Tools will need the faster setup.
It is possible that the flap was literally just the calico/node restart. It's just that, since it's a quiet cluster, the main network activity from pods was traffic to etcd (from API servers and such).
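To tell a calico/node restart apart from an etcd timeout next time, the rollout can be watched directly; a sketch assuming the standard kube-system namespace and the stock `k8s-app=calico-node` label:

```shell
# Watch the daemonset roll through its pods.
kubectl -n kube-system rollout status daemonset/calico-node

# Check pod restarts and placement afterwards.
kubectl -n kube-system get pods -l k8s-app=calico-node -o wide
```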