The following situation was encountered after adding two new nodes to staging-eqiad (T293729):
The staging clusters have an IPv4 IPPool of `10.64.75.0/24` with the default block size of `/26`, which should, in theory, allow for 4 nodes with one `/26` block each. Unfortunately, each of the existing nodes (kubestage100[12]) currently has 2 IP blocks assigned, which leaves the new nodes (kubestage100[34]) without an IP block. This is not bad per se, as it just means that both nodes had to host more than 64 Pods at some point during their lifetime, which is the case whenever one node is down (at the time of writing, there are ~90 Pods running in staging).
As a consequence, workload scheduled to the new nodes //borrows// IPs from the blocks of kubestage100[12], which leads to broken Pod IP routes on the node the IP was borrowed from.
In this particular case, kubestage1003 borrowed `10.64.75.0` from kubestage1001, leading to kubestage1001 blackholing traffic to that Pod IP (it installs a blackhole route for the `/26` prefix it is authoritative for and relies on more specific per-Pod routes for the Pods it actually hosts):
```
root@kubestage1003:~# ip route
default via 10.64.16.1 dev eno1 onlink
10.64.16.0/22 dev eno1 proto kernel scope link src 10.64.16.55
10.64.75.0 dev calidda3b144ba9 scope link
root@kubestage1001:~# ip route
default via 10.64.0.1 dev eno1 onlink
10.64.0.0/22 dev eno1 proto kernel scope link src 10.64.0.247
blackhole 10.64.75.0/26 proto bird
10.64.75.4 dev cali21fc9e9645e scope link
[...]
# But no explicit route for 10.64.75.0/32
```
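To see which node holds the affinity for a given block and to spot borrowed addresses, the IPAM state can also be inspected with calicoctl. A hedged example (the exact output format depends on the calicoctl version in use):
```
# Per-block view: each /26, the node it is affine to, and how many IPs are in use
calicoctl ipam show --show-blocks

# Details for a single address, e.g. the borrowed one
calicoctl ipam show --ip=10.64.75.0
```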
According to the Calico docs, this is the default behavior:
>>! https://docs.projectcalico.org/archive/v3.17/reference/resources/ippool#block-sizes
> [...] If there are no more blocks available then the host can take addresses from blocks allocated to other hosts. Specific routes are added for the borrowed addresses which has an impact on route table size.
This situation reveals 3 problems:
==== 1. Nodes can claim more than one IP block / newly added hosts can end up without an IPv4 block assigned ====
We could:
* Only allow 64 Pods per Node (effectively ensuring a Node never requires more than one block; see the kubelet sketch after this list)
* Only allow one IPv4 block per Node[1] (effectively limiting the Node to <size-of-block> Pods)
* Reduce the size of blocks (to allow for more flexible allocation; see the IPPool sketch after this list)
* Figure out if we can free blocks when they are no longer used
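A rough sketch of what the first and third options could look like (the pool name and exact values below are illustrative, not the actual staging config). Note that Calico does not allow changing `blockSize` on an existing pool, so a smaller block size would mean creating a new pool and retiring the old one:
```
# Option 1: cap the node at 64 Pods so it never needs a second /26
# (KubeletConfiguration snippet, to be merged into the node's kubelet config)
cat <<EOF
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 64
EOF

# Option 3: a pool with /27 blocks gives 8 blocks per /24 instead of 4.
# Shown with the same CIDR purely for illustration; in practice the existing
# pool would have to be drained/deleted first (two pools cannot share a CIDR).
calicoctl apply -f - <<EOF
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: staging-ipv4-smallblocks
spec:
  cidr: 10.64.75.0/24
  blockSize: 27
EOF
```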
==== 2. Nodes without an IP block are able to launch Pods and assign IPs //borrowed// from other Nodes' blocks ====
This can be prevented by configuring the IPAM module with StrictAffinity, which forbids borrowing IPs from foreign blocks. I assume we would also need to modify existing blocks (`ipamblocks.crd.projectcalico.org`).
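For reference, strict affinity can be toggled cluster-wide via the IPAM configuration; a minimal sketch (assuming calicoctl is pointed at this cluster's datastore, and that the IPAMConfig object is named `default`):
```
# Forbid assigning IPs from blocks that are affine to another node
calicoctl ipam configure --strictaffinity=true

# The resulting setting is stored in the IPAMConfig object
kubectl get ipamconfigs.crd.projectcalico.org default -o yaml
```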
==== 3. We never got notified by our alerting that we ran out of IP blocks ====
I have not thought about this in detail yet.
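The current allocation state can at least be inspected manually via the Calico IPAM CRDs (per-node block affinities and the allocated blocks themselves):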
```
kubectl get blockaffinities.crd.projectcalico.org
kubectl get ipamblocks.crd.projectcalico.org
```
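As a starting point, a check could compare the number of blocks that already have a node affinity against the total number of blocks in the pool (4 for a `/24` pool with `/26` blocks) and alert before the last block is claimed. A rough sketch, not wired into our alerting; field names are taken from the `crd.projectcalico.org/v1` BlockAffinity objects:
```
# Blocks per node (each line of output: <count> <node>)
kubectl get blockaffinities.crd.projectcalico.org \
  -o jsonpath='{range .items[*]}{.spec.node}{"\n"}{end}' | sort | uniq -c

# Total number of blocks handed out so far; with a /24 pool and /26 blocks,
# a value of 4 means there is no free block left for a new node
kubectl get ipamblocks.crd.projectcalico.org --no-headers | wc -l
```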
[1] https://github.com/projectcalico/libcalico-go/pull/1297