Wikikube staging clusters are out of IPv4 Pod IP's
Closed, Resolved · Public

Description

The title is not entirely correct:
The clusters have a /24 IPv4 network assigned for Pod IPs (the IP Pool). Calico IPAM splits that pool into four IP Blocks of size /26 (64 IPs each), which are assigned to nodes on demand. With one control-plane and two workers that was fine, as during maintenance one worker could request an additional block of 64 IPs to run all scheduled pods. With two control-planes and two workers (T329827) this no longer works and effectively limits the number of pods per worker to 64.
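
For illustration, a quick sketch of that arithmetic (the 10.64.75.0/24 CIDR below is a placeholder, not the actual staging pool):

```
import ipaddress

# Placeholder pool CIDR; the real staging Pod IP pool is not reproduced here.
pool = ipaddress.ip_network("10.64.75.0/24")
blocks = list(pool.subnets(new_prefix=26))
print(len(blocks), "blocks of", blocks[0].num_addresses, "IPs each")
# -> 4 blocks of 64 IPs each: with 2 control-planes + 2 workers every block is
#    already pinned to a node, so no worker can claim a second one during maintenance.
```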

The control-planes only use one of the 64 IPs in their block (running only the calico pod), effectively wasting the remaining 63 IPs, as we disabled "borrowing" of IPs as part of T296303: New Kubernetes nodes may end up with no Pod IPv4 block assigned.

I would suggest we lower the IP block size (at least in staging) to maybe /29 (8 IPs) or even /30.
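
For comparison, what each candidate block size would yield out of the same /24 pool (same placeholder CIDR as above):

```
import ipaddress

pool = ipaddress.ip_network("10.64.75.0/24")  # placeholder CIDR
for new_prefix in (26, 28, 29, 30):
    blocks = list(pool.subnets(new_prefix=new_prefix))
    print(f"/{new_prefix}: {len(blocks):2d} blocks x {blocks[0].num_addresses:2d} IPs per block")
# /26:  4 blocks x 64 IPs per block
# /28: 16 blocks x 16 IPs per block
# /29: 32 blocks x  8 IPs per block
# /30: 64 blocks x  4 IPs per block
```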

Unfortunately there is no way to change the block size of an already existing IP Pool, so we would either have to create a new pool or migrate the Pods through a temporary pool (briefly cutting connectivity), as described in: https://docs.tigera.io/archive/v3.23/networking/change-block-size

Event Timeline

JMeybohm created this task.

Change #1025783 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Use a blocksize of 30 for staging ipv4 pools

https://gerrit.wikimedia.org/r/1025783

Change #1025804 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Use a blocksize of /28 for staging-eqiad ipv4 pools

https://gerrit.wikimedia.org/r/1025804

I've moved staging-codfw to /28 blocks using the process outlined in the calico docs. Instead of re-scheduling all pods twice, I just drained both nodes and left the pods in state Pending during the migration.
I had to delete the IPAM block allocations and affinities manually as they were not freed automatically (or I was too impatient).
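
For reference, a minimal sketch of how such leftover IPAM objects can be listed (and, once verified, removed) when Calico uses the Kubernetes datastore, where blocks and affinities are cluster-scoped CRs; this is illustrative rather than the exact procedure used here:

```
# Illustrative sketch only: list stale Calico IPAM blocks and block affinities
# so they can be checked against the old pool before deleting them by hand.
# Assumes Calico with the Kubernetes datastore (cluster-scoped CRs).
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()
GROUP, VERSION = "crd.projectcalico.org", "v1"

for plural in ("ipamblocks", "blockaffinities"):
    for obj in api.list_cluster_custom_object(GROUP, VERSION, plural)["items"]:
        name = obj["metadata"]["name"]
        cidr = obj.get("spec", {}).get("cidr", "?")
        print(f"{plural}/{name}: {cidr}")
        # Once confirmed to belong to the old pool:
        # api.delete_cluster_custom_object(GROUP, VERSION, plural, name)
```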

Change #1025783 merged by JMeybohm:

[operations/deployment-charts@master] Use a blocksize of /28 for staging-codfw ipv4 pools

https://gerrit.wikimedia.org/r/1025783

For the record: /30 blocks led to too many prefix announcements, so the BGP sessions got blocked by the routers. As I wasn't sure about the actual limit there, I went with /28, which is still way better than /26 and allows additional nodes to join the cluster (and still get an IP block).
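
As a rough illustration of why the block size matters for the prefix limit (a sketch assuming one announced aggregate route per owned block and ignoring the extra /32 routes Calico can add for borrowed IPs; the pod count is hypothetical):

```
import math

PREFIX_LIMIT = 50   # per-session maximum-prefix limit configured on the routers
pods_on_node = 100  # hypothetical pod count for a busy worker

for prefix_len, ips_per_block in ((26, 64), (28, 16), (29, 8), (30, 4)):
    prefixes = math.ceil(pods_on_node / ips_per_block)
    print(f"/{prefix_len}: {prefixes:2d} block routes, headroom {PREFIX_LIMIT - prefixes}")
# /26:  2 block routes, headroom 48
# /28:  7 block routes, headroom 43
# /29: 13 block routes, headroom 37
# /30: 25 block routes, headroom 25
# Even /30 stays under 50 here, but it leaves far less room for the /32 routes
# that can appear during migrations (the leading theory further down for what
# actually tripped the sessions).
```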

Change #1025804 merged by JMeybohm:

[operations/deployment-charts@master] Use a blocksize of /28 for staging-eqiad ipv4 pools

https://gerrit.wikimedia.org/r/1025804

staging-eqiad has been migrated to /28 blocks as well.

Not sure if it might be worth taking a step back and weighing up what's happening here?

As I understand it there is a /24 IPv4 allocation for POD IPs for this cluster, and with the current IP block size at /26 that only provides 4 blocks?

Without knowing the details there are probably two ways to deal with this:

  1. Allocate a larger block than a /24 for such use, providing more /26 blocks that can be used
  2. Keep the /24 overall allocation as it is, but make the IP blocks smaller so there are more overall (/28, /29, /30 or whatever)

From a netops perspective we are relatively agnostic here; however, given this is private IP space, we have some flexibility. We definitely should try to avoid making any decisions that will potentially bite us down the road. Are we putting too tight a limit on the number of potential Pods per host if we use a block size of /28 or smaller? Might it be better to keep those block allocations at /26 to allow for growth?

Should be fine either way, but I just want to raise the question. We also need to size the 'prefix limit' on our network gear appropriately; the current value of 50 should be ok for /28, but we may want to adjust it up if using /30 or /32.

Sorry for not consulting with you beforehand.
We decided during the migration of production to a bigger Pod IP space that this would not be necessary for staging, and it actually is not. The issue there (as we figured out later) is that the IP space is split into /26 blocks, effectively limiting the cluster size to 4 nodes (including control-planes). The change to the IP block size was made to overcome this limitation without having to change the Pod IP space (and therefore reconfigure it in various places).

Ok cool, and yep, makes perfect sense; in staging we won't have a high number of pods. Plan sounds good then, I just wanted to make sure we weren't being too conservative with the allocations.

Regarding the current limit of max 50 routes announced from each host, I think that is still ok? We're still slightly confused about how it tripped; it seems like during the change the host briefly sent more than we expected? But it should be ok in general?

50 should be absolutely fine. My theory is that while I was moving Pods and IP blocks around in the cluster(s), calico tried to be clever (keeping connectivity up as far as possible) and started announcing /32 "blocks" for a bunch of pods, which tripped the limit.

Could be, yeah. Let's leave it for now; if we notice anything else strange we can take a deeper look.