Since Calico is containerized, the update should be fairly simple. However, we first need to compare the upstream config with ours and take considerations such as iptables changes into account. Our current version is 3.8.0.
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
toolforge-kubeadm: calico upgrade changes | operations/puppet | production | +226 -13
Status | Subtype | Assigned | Task
---|---|---|---
Restricted Task | | |
Resolved | | Bstorm | T246122 Upgrade the Toolforge Kubernetes cluster to v1.16
Resolved | | Bstorm | T250863 Upgrade calico to a more recent version (current is 3.14.0)
Event Timeline
Mentioned in SAL (#wikimedia-cloud) [2020-05-12T17:35:56Z] <bstorm_> deployed an updated bit of yaml for calico without upgrading the version first T250863
Mentioned in SAL (#wikimedia-cloud) [2020-05-12T17:44:16Z] <bstorm_> set the calico version to v3.14.0 because the new liveness probe isn't compatible with the old version. T250863
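For context on that probe incompatibility: older calico-node manifests used an HTTP liveness endpoint, while the newer manifests exec the binary, which the old image does not support. A rough sketch of the two styles, assuming the stock upstream manifests (our puppetized copy may differ slightly):

```yaml
# v3.8-era calico-node liveness probe: HTTP endpoint served by felix's health
# aggregator on the node.
livenessProbe:
  httpGet:
    host: localhost
    path: /liveness
    port: 9099
```

```yaml
# v3.14-era probe: exec the calico-node binary directly, so rolling the new
# yaml against the old image makes its liveness checks fail.
livenessProbe:
  exec:
    command:
      - /bin/calico-node
      - -felix-live
      - -bird-live
```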
I can confirm that calico upgrades roll out with no real impact on network traffic. That makes sense, because the upgrade should only affect changes to the network. That part is a bit harder to see directly, but I have to have faith that the daemonsets maintain a reconciliation loop around it.
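The rolling behavior matches what the upstream calico-node DaemonSet declares; a minimal sketch of the relevant fragment, assuming the stock manifest (which our copy should mirror):

```yaml
# calico-node DaemonSet fragment: pods are replaced one node at a time, so
# dataplane programming on the remaining nodes keeps being reconciled while
# each node's pod restarts.
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
```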
Messing with adding an option to enable typha, since it is totally unnecessary for toolsbeta (and we might need another node just to deploy it there).
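In the upstream manifests the Typha hookup is just a value in the calico-config ConfigMap that calico-node reads into FELIX_TYPHAK8SSERVICENAME, so the toggle can be as small as templating that one value; a hedged sketch, assuming we follow the upstream layout:

```yaml
# calico-config ConfigMap: calico-node picks this key up via a configMapKeyRef
# into FELIX_TYPHAK8SSERVICENAME. "none" disables Typha; "calico-typha" points
# calico-node at the Typha Service instead.
kind: ConfigMap
apiVersion: v1
metadata:
  name: calico-config
  namespace: kube-system
data:
  typha_service_name: "none"
```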
Mentioned in SAL (#wikimedia-cloud) [2020-05-12T18:35:41Z] <bstorm_> upgraded to using typha and rolled back to not doing so -- no effect on existing network T250863
Change 596012 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge-kubeadm: calico upgrade changes
@aborrero I *think* this will mean we won't need to worry about the iptables fixup anymore once this is the standard in Toolforge, right?
To be on the safe side we would need to make another round of testing:
- calico/felix: probably needs FELIX_IPTABLESBACKEND=NFT (see the sketch after this list)
- docker: make sure it plays well with iptables-nft
- kube-proxy: make sure it plays well with iptables-nft
If we could validate all of these, I would be confident about dropping the workaround.
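For the felix piece, the backend selection is an environment variable on the calico-node container; a minimal sketch of what that would look like, assuming we keep the upstream env-var mechanism:

```yaml
# calico-node container env fragment: tell felix which iptables backend to
# program on the host. In this era of felix the values are Legacy or NFT;
# on a Buster host running iptables-nft we would want NFT here.
env:
  - name: FELIX_IPTABLESBACKEND
    value: "NFT"
```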
On the other hand, our workaround (using iptables-legacy) is pretty harmless at least in this Buster release cycle. It may make more sense resource-wise to just keep using iptables-legacy for now?
There is technical debt here, but it should be primarily addressed by the upstream projects I mentioned.
Also, a side note, if we eventually start playing with IPv6 seriously we will need kube-proxy in ipvs mode, and all this stuff will change.
Oh yeah! I don't mean right away. That would likely break things. Good notes on what to change, too.
> Also, a side note, if we eventually start playing with IPv6 seriously we will need kube-proxy in ipvs mode, and all this stuff will change.
Heh, fair! We might not even want to think about it until then, in that case.
Change 596012 merged by Bstorm:
[operations/puppet@production] toolforge-kubeadm: calico upgrade changes
Mentioned in SAL (#wikimedia-cloud) [2020-05-13T18:10:21Z] <bstorm_> set "profile::toolforge::k8s::typha_enabled: true" in tools project for calico upgrade T250863
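For the record, that is a per-project Hiera flag consumed by the puppetized manifests; the setting as logged above, in hiera YAML (only the key itself is confirmed here, the project-hiera placement is the usual convention):

```yaml
# tools project hiera: enable the Typha deployment for the Toolforge cluster.
profile::toolforge::k8s::typha_enabled: true
```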
Mentioned in SAL (#wikimedia-cloud) [2020-05-13T18:14:15Z] <bstorm_> upgrading calico to 3.14.0 with typha enabled in Toolforge K8s T250863
So far it looks good! 3 typha pods are stable and the calico-node pods are stable as they come up.
Better yet, the 3 typha pods were scheduled on separate nodes:
```
root@tools-k8s-control-1:~# kubectl -n kube-system get pods calico-typha-5cb967996c-j2klg -o wide
NAME                            READY   STATUS    RESTARTS   AGE     IP             NODE                  NOMINATED NODE   READINESS GATES
calico-typha-5cb967996c-j2klg   1/1     Running   0          9m27s   172.16.1.87    tools-k8s-worker-40   <none>           <none>
root@tools-k8s-control-1:~# kubectl -n kube-system get pods calico-typha-5cb967996c-xdkb4 -o wide
NAME                            READY   STATUS    RESTARTS   AGE     IP             NODE                  NOMINATED NODE   READINESS GATES
calico-typha-5cb967996c-xdkb4   1/1     Running   0          9m42s   172.16.1.70    tools-k8s-worker-41   <none>           <none>
root@tools-k8s-control-1:~# kubectl -n kube-system get pods calico-typha-5cb967996c-xzb6f -o wide
NAME                            READY   STATUS    RESTARTS   AGE     IP             NODE                  NOMINATED NODE   READINESS GATES
calico-typha-5cb967996c-xzb6f   1/1     Running   0          9m59s   172.16.1.170   tools-k8s-worker-56   <none>           <none>
```
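That spread is what a pod anti-affinity rule on the Typha Deployment would produce; a hypothetical sketch of such a rule (I have not checked whether our manifest states it explicitly or whether the scheduler's default spreading is doing the work):

```yaml
# Deployment spec fragment: prefer not to co-locate two Typha pods on one node.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              k8s-app: calico-typha
```

A single `kubectl -n kube-system get pods -l k8s-app=calico-typha -o wide` would list all three at once, assuming the upstream `k8s-app=calico-typha` label is in place.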
Upgrade is complete. The overall difference in load is not easy to see because there are normally spikes here and there on the control plane. We'll have to observe over a longer period to see whether typha makes a difference; it will most likely matter more as we grow.