
Upgrade calico to a more recent version (current is 3.14.0)
Closed, Resolved · Public

Description

Since calico is containerized, the update should be fairly simple. However, the configs must be compared and considerations such as iptables changes taken into account first. Our version is 3.8.0.
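
As a rough sketch (the file names and manifest URL below are assumptions, not from this task), the comparison could start with diffing the upstream manifest against what is currently deployed:

# Hypothetical pre-upgrade check: fetch the upstream manifest for the target release
# and diff it against the deployed calico-node DaemonSet to spot config changes.
curl -sLo calico-v3.14.yaml https://docs.projectcalico.org/v3.14/manifests/calico.yaml
kubectl -n kube-system get daemonset calico-node -o yaml > calico-node-deployed.yaml
diff calico-node-deployed.yaml calico-v3.14.yaml | less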

Event Timeline

JHedden moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

Mentioned in SAL (#wikimedia-cloud) [2020-05-12T17:35:56Z] <bstorm_> deployed an updated bit of yaml for calico without upgrading the version first T250863

The new liveness probe does NOT work in 3.8.0. Trying with the upgrade.
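
For context, a hedged sketch, assuming the manifest uses the exec-style probe from recent upstream releases (which calls the calico-node binary with flags the 3.8.0 image does not understand); the command is illustrative:

# Hypothetical check: print the liveness probe configured on the deployed calico-node container.
kubectl -n kube-system get daemonset calico-node \
  -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}'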

Mentioned in SAL (#wikimedia-cloud) [2020-05-12T17:44:16Z] <bstorm_> set the calico version to v3.14.0 because the new liveness probe isn't compatible with the old version. T250863

I can confirm that calico upgrades roll out with no real impact on network traffic. This makes sense because the upgrade should only affect changes to the network. That's a bit harder to see, but I have to trust that the daemonsets maintain a reconciliation loop around that.
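
A sketch of how the rollout can be watched (commands are illustrative; k8s-app=calico-node is the label used by the upstream manifest):

# Hypothetical way to watch the node-by-node replacement of calico-node pods.
kubectl -n kube-system rollout status daemonset/calico-node
kubectl -n kube-system get pods -l k8s-app=calico-node -o wide --watch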

Messing with adding an option to enable typha because it is totally unnecessary for toolsbeta (and we might need another node just to deploy it there).
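
Typha mainly pays off at larger node counts, so a quick look at cluster size (illustrative command below) is enough to justify keeping it off in toolsbeta:

# Hypothetical check: Typha is generally only worth the extra pods beyond a few
# dozen nodes, which toolsbeta is well under.
kubectl get nodes --no-headers | wc -l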

Mentioned in SAL (#wikimedia-cloud) [2020-05-12T18:35:41Z] <bstorm_> upgraded to using typha and rolled back to not doing so -- no effect on existing network T250863

Change 596012 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge-kubeadm: calico upgrade changes

https://gerrit.wikimedia.org/r/596012

@aborrero I *think* this will mean we won't need to worry about the iptables fixup anymore once this is the standard in Toolforge, right?

Bstorm renamed this task from Upgrade calico to a more recent version (current is 3.13.2) to Upgrade calico to a more recent version (current is 3.14.0). May 12 2020, 6:41 PM

> @aborrero I *think* this will mean we won't need to worry about the iptables fixup anymore once this is the standard in Toolforge, right?

To be on the safe side we would need to make another round of testing:

  • calico/felix. Probably needs FELIX_IPTABLESBACKEND=NFT
  • docker: make sure it plays well with iptables-nft
  • kube-proxy: make sure it plays well with iptables-nft

If we could validate all of these, I would be confident about dropping the workaround.
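
A minimal validation sketch for the three items above, assuming Debian Buster's iptables alternatives and the standard Felix environment variable:

# Which backend the iptables alternative on the host currently points at.
update-alternatives --query iptables
# Where rules actually land: compare what each backend holds once the components are up.
iptables-legacy-save | head
iptables-nft-save | head
# Felix can be pinned to the nft backend by setting FELIX_IPTABLESBACKEND=NFT
# on the calico-node container in the DaemonSet spec.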

On the other hand, our workaround (using iptables-legacy) is pretty harmless at least in this Buster release cycle. It may make more sense resource-wise to just keep using iptables-legacy for now?
There is technical debt here, but it should be primarily addressed by the upstream projects I mentioned.
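
For reference, the workaround boils down to pinning the hosts to the legacy backend through Debian's alternatives system (a sketch; the paths are Buster's defaults):

# Hypothetical illustration of the iptables-legacy workaround on Buster nodes.
update-alternatives --set iptables /usr/sbin/iptables-legacy
update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy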

Also, a side note, if we eventually start playing with IPv6 seriously we will need kube-proxy in ipvs mode, and all this stuff will change.
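
A hedged sketch of what that switch would involve, assuming the kubeadm-managed kube-proxy ConfigMap:

# Hypothetical check: kube-proxy's proxy mode lives in its ConfigMap; IPVS would
# mean setting mode: ipvs there and making sure the ip_vs kernel modules are loaded.
kubectl -n kube-system get configmap kube-proxy -o yaml | grep 'mode:'
lsmod | grep ip_vs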

> On the other hand, our workaround (using iptables-legacy) is pretty harmless at least in this Buster release cycle. It may make more sense resource-wise to just keep using iptables-legacy for now?
> There is technical debt here, but it should be primarily addressed by the upstream projects I mentioned.

Oh yeah! I don't mean right away. That would likely break things. Good notes on what to change, too.

> Also, a side note, if we eventually start playing with IPv6 seriously we will need kube-proxy in ipvs mode, and all this stuff will change.

Heh, fair! We might not even want to think about it until then, in that case.

Change 596012 merged by Bstorm:
[operations/puppet@production] toolforge-kubeadm: calico upgrade changes

https://gerrit.wikimedia.org/r/596012

Mentioned in SAL (#wikimedia-cloud) [2020-05-13T18:10:21Z] <bstorm_> set "profile::toolforge::k8s::typha_enabled: true" in tools project for calico upgrade T250863

Mentioned in SAL (#wikimedia-cloud) [2020-05-13T18:14:15Z] <bstorm_> upgrading calico to 3.14.0 with typha enabled in Toolforge K8s T250863

So far it looks good! 3 typha pods are stable and the calico-node pods are stable as they come up.

Better yet, the 3 typha pods were scheduled on separate nodes:

root@tools-k8s-control-1:~# kubectl -n kube-system get pods calico-typha-5cb967996c-j2klg -o wide
NAME                            READY   STATUS    RESTARTS   AGE     IP            NODE                  NOMINATED NODE   READINESS GATES
calico-typha-5cb967996c-j2klg   1/1     Running   0          9m27s   172.16.1.87   tools-k8s-worker-40   <none>           <none>
root@tools-k8s-control-1:~# kubectl -n kube-system get pods calico-typha-5cb967996c-xdkb4 -o wide
NAME                            READY   STATUS    RESTARTS   AGE     IP            NODE                  NOMINATED NODE   READINESS GATES
calico-typha-5cb967996c-xdkb4   1/1     Running   0          9m42s   172.16.1.70   tools-k8s-worker-41   <none>           <none>
root@tools-k8s-control-1:~# kubectl -n kube-system get pods calico-typha-5cb967996c-xzb6f -o wide
NAME                            READY   STATUS    RESTARTS   AGE     IP             NODE                  NOMINATED NODE   READINESS GATES
calico-typha-5cb967996c-xzb6f   1/1     Running   0          9m59s   172.16.1.170   tools-k8s-worker-56   <none>           <none>

Upgrade is complete. The overall difference in load is not easy to see because there are normally spikes here and there on the control plane. We'll have to observe over a longer period to see if typha makes a difference; if anything, it will likely matter more as we grow.

Bstorm claimed this task.