The following situation was encountered after adding two new nodes to staging-eqiad (T293729):
The staging clusters have an IPv4 IPPool of `10.64.75.0/24` with the default block size of `/26`, which should, in theory, allow for 4 nodes with one `/26` block each. Unfortunately, each of the existing nodes (kubestage100[12]) currently has 2 IP blocks assigned, which leaves the new nodes (kubestage100[34]) without an IP block. This is not bad per se; it just means that both nodes had to host > 64 Pods at some point during their lifetime, which is the case whenever one node is down (at the time of writing, there are ~90 Pods running in staging).
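A quick sanity check of the block math, using Python's stdlib `ipaddress` module (pool and block sizes taken from above):

```python
import ipaddress

# The staging IPv4 pool and Calico's default block size, as described above.
pool = ipaddress.ip_network("10.64.75.0/24")
blocks = list(pool.subnets(new_prefix=26))

print(len(blocks))              # 4 /26 blocks fit in the /24 pool
print(blocks[0].num_addresses)  # 64 addresses (Pod IPs) per block
```

With the two existing nodes holding two blocks each, all four blocks are affined and nothing is left for kubestage100[34].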
As a consequence of that, workloads scheduled to the new nodes //borrow// IPs from the blocks of kubestage100[12], which leads to broken Pod IP routes on the node the IP was borrowed from.
In this particular case, kubestage1003 borrowed `10.64.75.0` from kubestage1001, leading to kubestage1001 blackholing traffic to that Pod IP (it blackholes prefixes that it is authoritative for and relies on having specific routes for each Pod it hosts):
```
root@kubestage1003:~# ip route
default via 10.64.16.1 dev eno1 onlink
10.64.16.0/22 dev eno1 proto kernel scope link src 10.64.16.55
10.64.75.0 dev calidda3b144ba9 scope link
root@kubestage1001:~# ip route
default via 10.64.0.1 dev eno1 onlink
10.64.0.0/22 dev eno1 proto kernel scope link src 10.64.0.247
blackhole 10.64.75.0/26 proto bird
10.64.75.4 dev cali21fc9e9645e scope link
[...]
# But no explicit route for 10.64.75.0/32
```
According to the Calico docs, this is the default behavior:
>>! https://docs.projectcalico.org/archive/v3.17/reference/resources/ippool#block-sizes
> [...] If there are no more blocks available then the host can take addresses from blocks allocated to other hosts. Specific routes are added for the borrowed addresses which has an impact on route table size.
This situation reveals 4 problems:
==== 1. Nodes can claim more than one IP block / We can end up with hosts not having IPv4 blocks assigned when added ====
We could:
* Only allow 64 Pods per Node (effectively ensuring the Node does not require more than one block)
* Only allow one IPv4 block per node[1] (effectively limiting the Node to only <size-of-block> number of Pods)
* Reduce the size of blocks (to allow for more flexible allocation)
* Figure out if we can free blocks when they are no longer used
** Calico >= v3.20 will release unused blocks (https://github.com/projectcalico/kube-controllers/pull/799)
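For the first option, a sketch of what capping Pods per node could look like (assuming kubelet is configured via a `KubeletConfiguration` file; the value 64 matches the size of one `/26` block):

```yaml
# Hypothetical kubelet configuration fragment: cap Pods per node at the
# size of a single /26 IPAM block, so a node never needs a second block.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 64
```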
==== 2. Nodes without an IP block are able to launch Pods and assign IPs //borrowed// from other Nodes' blocks ====
This can be prevented by configuring the IPAM module with StrictAffinity, which forbids borrowing IPs from foreign blocks. I assume we would also need to modify the existing blocks (`ipamblocks.crd.projectcalico.org`) accordingly.
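For reference, a minimal sketch of such an IPAMConfig (field names per the Calico v3 `IPAMConfig` resource; applying it cluster-wide via `calicoctl apply -f` is an assumption here):

```yaml
# Sketch of a cluster-wide Calico IPAMConfig enabling StrictAffinity.
# The resource is a singleton and must be named "default".
apiVersion: projectcalico.org/v3
kind: IPAMConfig
metadata:
  name: default
spec:
  strictAffinity: true
  autoAllocateBlocks: true
```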
I tried to add an IPAMConfig (like in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/740858/) in staging-codfw, and it gets picked up by calicoctl as one would expect:
```
root@kubestage2001:~# calicoctl ipam show --show-configuration
+--------------------+-------+
| PROPERTY | VALUE |
+--------------------+-------+
| StrictAffinity | true |
| AutoAllocateBlocks | true |
| MaxBlocksPerHost | 0 |
+--------------------+-------+
```
Unfortunately, new blocks created after that change (and a restart of all calico components, just to be sure) still have `StrictAffinity: false` set. I could not find any documentation on this so far, but I suspect the setting only takes effect per IPPool: if there already is an affined IPAM block with `StrictAffinity: false`, all blocks in that pool will stay that way.
==== 3. We never got notified by our alerting that we ran out of IP blocks ====
I have not thought about this in detail yet. Our current Calico version does not expose IPAM metrics; the next major release does: https://docs.projectcalico.org/archive/v3.18/release-notes/#ipam-prometheus-metrics
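Once we are on v3.18+, an alert could look roughly like this (a sketch; the metric name `ipam_borrowed_per_node` is my assumption based on the release notes and should be verified against the actual kube-controllers metrics endpoint):

```yaml
# Hypothetical Prometheus alerting rule: any borrowed Pod IP indicates
# a node ran out of blocks, which is exactly the situation described above.
groups:
  - name: calico-ipam
    rules:
      - alert: CalicoIPAMBorrowedAddresses
        expr: sum(ipam_borrowed_per_node) > 0
        for: 15m
        annotations:
          summary: "A node is borrowing Pod IPs from another node's IPAM block"
```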
==== 4. How to fix the current situation in staging ====
The fourth problem is that we need to fix the current situation in staging to actually make use of the new nodes.
As the generic logic for cleaning up block affinities, reclaiming blocks, etc. is already in place in calico kube-controllers, I went ahead and recreated the problem in staging-codfw (by creating a bunch of Pods on each node to make them request a second block).
I then freed that block by removing the Pods again, made sure no IPs were allocated out of it via `calicoctl ipam show --show-blocks / --show-borrowed`, and then bluntly deleted the ipamblock and blockaffinities objects. After that I created a bunch of Pods again to fill up the remaining block on a node and saw the deleted one being recreated with proper affinity.
```
# Verify which blocks exist and whether any IPs are still allocated or borrowed
calicoctl ipam show --show-blocks
calicoctl ipam show --show-borrowed
# Inspect (and, once confirmed empty, delete) the underlying CRD objects
kubectl get ipamblocks.crd.projectcalico.org,blockaffinities.crd.projectcalico.org
```
[1] https://github.com/projectcalico/libcalico-go/pull/1297