
New Kubernetes nodes may end up with no Pod IPv4 block assigned
Open, Medium, Public

Description

The following situation was encountered after adding two new nodes to staging-eqiad (T293729):

The staging clusters have an IPv4 IPPool of 10.64.75.0/24 with the default block size of /26, which should, in theory, allow for 4 nodes with one /26 block each. Unfortunately each of the existing nodes (kubestage100[12]) currently has 2 IP blocks assigned, which leaves the new nodes (kubestage100[34]) without an IP block. This is not bad per se, as it just means that both nodes had to take > 64 Pods at some point during their lifetime, which is the case when one node is down (at the time of writing, there are ~90 Pods running in staging).

As a consequence of that, workloads scheduled to the new nodes borrow IPs from the blocks of kubestage100[12], which leads to broken Pod IP routes on the node the IP was borrowed from.
In this particular case, kubestage1003 borrowed 10.64.75.0 from kubestage1001, leading to kubestage1001 blackholing traffic to that Pod IP (it blackholes prefixes that it is authoritative for and relies on having specific routes for each Pod it hosts):

root@kubestage1003:~# ip route
default via 10.64.16.1 dev eno1 onlink
10.64.16.0/22 dev eno1 proto kernel scope link src 10.64.16.55
10.64.75.0 dev calidda3b144ba9 scope link

root@kubestage1001:~# ip route
default via 10.64.0.1 dev eno1 onlink
10.64.0.0/22 dev eno1 proto kernel scope link src 10.64.0.247
blackhole 10.64.75.0/26 proto bird
10.64.75.4 dev cali21fc9e9645e scope link
[...]
# But no explicit route for 10.64.75.0/32

According to the Calico docs, this is the default behavior:

[...] If there are no more blocks available then the host can take addresses from blocks allocated to other hosts. Specific routes are added for the borrowed addresses which has an impact on route table size.

This situation reveals 4 problems:

1. Nodes can claim more than one IP block / We can end up with hosts not having IPv4 blocks assigned when added

We could:

  • Only allow 64 Pods per Node (effectively ensuring the Node does not require more than one block)
  • Only allow one IPv4 block per node[1] (effectively limiting the Node to only <size-of-block> number of Pods)
  • Reduce the size of blocks (to allow for more flexible allocation; see the sketch after this list)
  • Figure out if we can free blocks when they are no longer used
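To illustrate the block-size option: the block size is a property of the IPPool, so a smaller blockSize would let the same /24 be split into more, smaller blocks. A rough sketch only, assuming the pool is managed as a projectcalico.org/v3 IPPool; the name is made up, and blockSize generally cannot be changed on an existing pool, so this would mean recreating it:

apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: staging-ipv4-pool    # hypothetical name
spec:
  cidr: 10.64.75.0/24
  blockSize: 27              # /27 blocks: up to 8 blocks of 32 IPs instead of 4 x /26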
2. Nodes without an IP block are able to launch Pods and assign IPs borrowed from other Nodes' blocks

This can be prevented by configuring the IPAM module with StrictAffinity, which forbids borrowing IPs from foreign blocks. I assume we would also need to modify existing blocks (ipamblocks.crd.projectcalico.org).
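A minimal sketch of what such an IPAMConfig might look like. This is an assumption based on the crd.projectcalico.org/v1 IPAMConfig CRD and the properties that calicoctl prints below; the exact apiVersion/kind should be checked against the linked chart change:

apiVersion: crd.projectcalico.org/v1
kind: IPAMConfig
metadata:
  name: default              # there is only one IPAM configuration, named "default"
spec:
  strictAffinity: true       # do not allow borrowing from foreign blocks
  autoAllocateBlocks: true
  maxBlocksPerHost: 0        # 0 = unlimited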

I tried to add an IPAMConfig (like in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/740858/) in staging-codfw and that gets picked up by calicoctl as one would expect:

root@kubestage2001:~# calicoctl ipam show --show-configuration
+--------------------+-------+
|      PROPERTY      | VALUE |
+--------------------+-------+
| StrictAffinity     | true  |
| AutoAllocateBlocks | true  |
| MaxBlocksPerHost   |     0 |
+--------------------+-------+

Unfortunately new blocks created after that change (and a restart of all calico components, just to be sure) still have "StrictAffinity: false" set. I have not found any docs on this yet, but I suspect this might only take effect per IPPool. So if there already is an affine IPAM block with StrictAffinity: false, it will stay that way for all blocks in that pool.

3. We never got notified by our alerting that we ran out of IP blocks

I have not thought about that yet. Our Calico version does not currently expose IPAM metrics; the next major release does: https://docs.projectcalico.org/archive/v3.18/release-notes/#ipam-prometheus-metrics
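Once we are on a version that exports those metrics, an alert along these lines could cover it. This is only a sketch: the metric and label names (ipam_borrowed_per_node, node) are assumptions based on the linked release notes and need to be verified against what calico-kube-controllers actually exports:

groups:
  - name: calico-ipam
    rules:
      - alert: CalicoIPAMBorrowedAddresses
        expr: ipam_borrowed_per_node > 0   # assumed metric name, verify before deploying
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.node }} is borrowing Pod IPs from another node's block"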

4. How to fix the current situation in staging

The fourth problem is that we need to fix this situation in staging somehow to make use of the new nodes.
As the generic logic for cleaning up block affinities, reclaiming blocks etc. is already in place in calico kube-controllers, I went ahead and recreated the problem in staging-codfw (by just creating a bunch of pods on each node to have them request a second block).
I then freed that block by removing the pods again, made sure no IPs were allocated out of them via calicoctl ipam show --show-blocks / --show-borrowed and then bluntly deleted the ipamblock and blockaffinities objects. After that I created a bunch of Pods again to fill up the remaining block on a node and saw the deleted one being recreated with proper affinity again.

# Commands used to inspect the block/affinity state before and after the cleanup:
calicoctl ipam show --show-blocks       # list all blocks and their affinities
calicoctl ipam show --show-borrowed     # list addresses borrowed from foreign blocks
kubectl get ipamblocks.crd.projectcalico.org,blockaffinities.crd.projectcalico.org

[1] https://github.com/projectcalico/libcalico-go/pull/1297

Event Timeline

JMeybohm created this task.

Change 740858 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] calico: Allow to configure the IPAM module

https://gerrit.wikimedia.org/r/740858

JMeybohm updated the task description.

I executed the steps outlined in "4. How to fix the current situation in staging" for staging-eqiad now to unblock T293729.

Nice writeup.

Regarding 3, it looks like the solution is to upgrade, then scrape the new metrics and use Alertmanager to set up alerts. That should make sure we aren't caught off guard in the future.

Regarding 4, it looks like we have a way out of the problem; +1 to using it as a stopgap.

And now to the more interesting parts. Regarding 2, that config seemed to be the default, so I went down the rabbit hole of independently reproducing @JMeybohm's findings and extending them to the ippool. It's described below, but as I point out later, the theory turned out to be false.

So already assigned ipam blocks keep whatever value they had when they were created. Looking at a sample ipamblocks.crd.projectcalico.org supports that theory, as it has:

kubectl describe ipamblocks.crd.projectcalico.org |grep -E 'Name:|Affinity'
Name:         10-192-75-0-26
  Affinity:  host:kubestage2001.codfw.wmnet
  Strict Affinity:  false

It does not look like strict affinity is an attribute that an ippool stores.

To verify the above, I set strict affinity for staging-codfw and then drained kubestage2002.

akosiaris@kubestage2001:~$ sudo calicoctl ipam configure --strictaffinity=true
Successfully set StrictAffinity to: true
root@deploy1002:~# kubectl drain --ignore-daemonsets --delete-local-data kubestage2002.codfw.wmnet

And a new block did get created, but it did not have Strict Affinity: true:

Name:         10-192-75-128-26
  Affinity:  host:kubestage2001.codfw.wmnet
  Strict Affinity:  false

This did not make much sense. To also exclude the possibility that it is somehow an attribute of the ippool, I also drained kubestage2001, effectively killing all pods. The ippool turned out empty, so I deleted it. I verified that calicoctl ipam show --show-blocks did not return any ipv4 blocks. I then recreated the previous ippool and uncordoned kubestage2001. Unfortunately, while 2 new ipam blocks were created (we run 83 pods in staging-codfw, which is > 64), in their data we don't see Strict Affinity true.

kubectl describe ipamblocks.crd.projectcalico.org |grep -E 'Name:|Affinity'
Name:         10-192-75-0-26
  Affinity:  host:kubestage2001.codfw.wmnet
  Strict Affinity:  false
Name:         10-192-75-64-26
  Affinity:  host:kubestage2001.codfw.wmnet
  Strict Affinity:  false

But then it hit me. Strict Affinity in IPAMConfig is not a default value, but a master toggle. That is, this design allows a node to prevent its own ipamblocks from having their IPs borrowed while all the other nodes can borrow each other's IP addresses (though I haven't found how that is exposed to operators). But the master toggle overrides all that if set.

To verify that, I went with the following plan. Due to kubelet and node sizing constraints, I deleted the IPv4 pool again and recreated it, albeit as a /25 and not a /24, so that only two /26 blocks would exist. I uncordoned one node and, due to the 83 pods, 2 /26 blocks were allocated on kubestage2001. Then I uncordoned the other node and deleted a pod so that it would get scheduled there. And soon enough:

3m1s        Warning   FailedCreatePodSandBox   pod/zotero-staging-84fbd5877d-8m2wx    Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "fece35f722ef9846720344c5795d8fb7091bf4b2bf4c5127768aacf4e2d4009f" network for pod "zotero-staging-84fbd5877d-8m2wx": networkPlugin cni failed to set up pod "zotero-staging-84fbd5877d-8m2wx_zotero" network: failed to request 1 IPv4 addresses. IPAM allocated only 0

and the pod stayed in a ContainerCreating status. No ip route rules get created. That's pretty OK and mostly what we want: no pods will be running or receiving traffic, which is good. The only bad thing here is the ContainerCreating status; that is, the pod stays scheduled to a node that cannot serve it. Getting out of that state requires deleting the pod so that it may be sent to another node.

Setting strict affinity back to false stops the above, and we almost instantly end up in the situation the task describes.

Given our infrastructure, it's probably best that we always run with strict affinity=true. We can revisit this if we end up enabling a BGP mesh, or at least a row-level BGP mesh.

Finally, regarding 1, it's all about striking the balance of how many pods we want per node. E.g. we could also set the kubelet to allow 129 pods (128 + 1 for calico-node) instead of the default 110 or the proposed 64. The best solution is one the scheduler is also aware of, so that we don't end up in situations like the one described above. I am not sure if we can have that though; we will need to experiment a bit. IPv4 private addresses are cheaper than hardware nodes, which are in turn cheaper than developer/SRE debugging time, so let's optimize for an environment where we have some peace of mind.
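For reference, the kubelet side of that knob would look roughly like this. This is a sketch using the upstream KubeletConfiguration type; how the setting is actually wired into our kubelet configuration is a separate question:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 129   # 128 workload Pods + 1 for calico-node, instead of the default 110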

Mentioned in SAL (#wikimedia-operations) [2021-11-26T11:41:17Z] <akosiaris> T296303 cleanup weird state of calico-codfw cluster

JMeybohm lowered the priority of this task from High to Medium. (Dec 8 2021, 3:34 PM)

But then it hit me. Strict Affinity in IPAMConfig is not a default value, but a master toggle. That is, this design allows a node to prevent its own ipamblocks from having their IPs borrowed while all the other nodes can borrow each other's IP addresses (though I haven't found how that is exposed to operators). But the master toggle overrides all that if set.

I fear I don't get that. What do you mean by "master toggle"? I don't quite understand how you set "strict affinity = True" during your following tests, but it feels like I'm totally missing something.

But then it hit me. Strict Affinity in IPAMConfig is not a default value, but a master toggle. That is, this design allows a node to prevent its own ipamblocks from having their IPs borrowed while all the other nodes can borrow each other's IP addresses (though I haven't found how that is exposed to operators). But the master toggle overrides all that if set.

I fear I don't get that. What do you mean by "master toggle"?

I mean an on/off switch: when strict affinity is set to true in the IPAMConfig, no borrowing will happen at all, regardless of the per-block setting. When it is set to false, it looks like it can be controlled on a per-block basis.

I don't quite understand how you set "strict affinity = True" during your following tests, but it feels like I'm totally missing something.

calicoctl ipam configure --strictaffinity=true (and false afterwards)

I'm still lost unfortunately.

My understanding was that calicoctl ipam configure --strictaffinity=true is basically the same as creating an IPAMConfig with strictAffinity: true, which has no effect on already assigned ipamblocks. Ah, maybe now I understand. 😃 Are you saying that, with strictAffinity: true, no borrowing happens regardless of the Strict Affinity value of the ipamblocks (so I just didn't see that happen because I only looked at the ipamblocks created rather than actually testing)?

Are you saying that, with strictAffinity: true, no borrowing happens regardless of the Strict Affinity value of the ipamblocks (so I just didn't see that happen because I only looked at the ipamblocks created rather than actually testing)?

Yup, exactly that!

Are you saying that, with strictAffinity: true, no borrowing happens regardless of the Strict Affinity value of the ipamblocks (so I just didn't see that happen because I only looked at the ipamblocks created rather than actually testing)?

Yup, exactly that!

Nice! How convenient. :-)
Thanks for explaining!

Change 740858 merged by jenkins-bot:

[operations/deployment-charts@master] calico: Allow to configure the IPAM module

https://gerrit.wikimedia.org/r/740858

Mentioned in SAL (#wikimedia-operations) [2021-12-09T14:37:37Z] <jayme> updated calico chart to calico-0.1.15 on all kubernetes clusters (introducing IPAMConfig) - T296303