
WikiKube clusters close to exhausting Calico IPPool allocations
Closed, ResolvedPublic

Description

Summary

We will merge the two /18s into a single /17 for IPv4 during the next k8s version upgrade, T341984: Update Kubernetes clusters to 1.31

Long form

While bringing the new wikikube-workers into service for T369744: wikikube-worker1240 to wikikube-worker1304 implementation tracking, we encountered an issue where a number of wikikube-workers were left without an IPv4 /26 prefix, because we had initially allocated a /18, which contains 256 /26 prefixes (we ended up with 286 nodes once the new nodes were pooled). While the original projection of 256 /26s being enough wasn't entirely wrong (221 was the node count before adding the nodes for the refresh), it doesn't leave enough headroom to perform large-ish operations (e.g. adding 65 nodes).
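The capacity math above can be sketched with Python's standard ipaddress module (a minimal illustration, not taken from any of our tooling):

```python
import ipaddress

# A /18 carved into per-node /26 blocks yields 2**(26 - 18) = 256 blocks,
# which is the hard node limit the paragraph above runs into.
pool = ipaddress.ip_network("10.67.128.0/18")  # the eqiad wikikube pool
blocks = list(pool.subnets(new_prefix=26))
print(len(blocks))  # 256

# 221 nodes before the refresh left only 35 spare blocks, so pooling the
# new batch (286 nodes total) overshot the pool by 30 blocks.
print(286 - len(blocks))  # 30
```

This is why the 256-blocks projection held for steady state but not for a 65-node batch added before the old nodes were decommissioned.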

While:

  1. we'll easily fix this by decommissioning the old nodes that the new nodes were slated to refresh, T375842: decommission mw[1349-1413]
  2. we could avoid this in the future by instituting rules like "don't ever add more than X nodes in a batch"

long term, it is probably more productive to just add 1 more /18 to the wikikube pool in both DCs. We are lucky (or I had the foresight, I can't remember) and the /18s we currently have are each followed by another empty /18 that is already reserved for Kubernetes. Namely:

we already have

10.67.128.0/18 (eqiad), 10.194.128.0/18 (codfw)

and

10.67.192.0/18 (eqiad), 10.194.192.0/18 (codfw) are available.
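Since each existing /18 and its adjacent reserved /18 are the two exact halves of a /17, they aggregate cleanly. A quick check with the standard ipaddress module:

```python
import ipaddress

# The current pool plus the adjacent reserved /18 collapse into one /17 per DC.
eqiad = [ipaddress.ip_network("10.67.128.0/18"),
         ipaddress.ip_network("10.67.192.0/18")]
print(list(ipaddress.collapse_addresses(eqiad)))
# [IPv4Network('10.67.128.0/17')]

codfw = [ipaddress.ip_network("10.194.128.0/18"),
         ipaddress.ip_network("10.194.192.0/18")]
print(list(ipaddress.collapse_addresses(codfw)))
# [IPv4Network('10.194.128.0/17')]
```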

I already went ahead and marked them as reserved, with a description pointing to this task, pending discussion.

Note that, per past experience, changing the IPPool is an arduous and dangerous process. We probably don't need that though, and can live without aggregating the two /18s into a /17 at the configuration level. At the BGP level, it's /26s that get announced anyway.

Tagging netops and serviceops for further discussion.

Event Timeline


Note that, per past experience, changing the IPPool is an arduous and dangerous process. We probably don't need that though, and can live without aggregating the two /18s into a /17 at the configuration level. At the BGP level, it's /26s that get announced anyway.

It indeed is, at least when changing the IP block size, see T345823: Wikikube staging clusters are out of IPv4 Pod IP's. It might be possible though to migrate to a new IPPool that contains the old one. We might as well aggregate the two pools into one when recreating the cluster for T341984: Update Kubernetes clusters to 1.31.

But we need to keep in mind that the cluster CIDR (pod IP range) is configured in multiple places in hiera (at least), and it needs to be configured in kube-proxy as well, where only one CIDR per stack (IPv4/IPv6) is supported.

Good points. We also have that configured in homer: https://gerrit.wikimedia.org/g/operations/homer/public/+/0fd0ec8d4881442c8471e46092ed2e2eff5ecab2/templates/includes/customers/64601.policy#1 and it probably should be the very first one to change.

It might be possible though to migrate to a new IPPool that contains the old one.

The ranges Alex proposes to add mean we can aggregate into two ranges, 10.67.128.0/17 (eqiad), 10.194.128.0/17 (codfw), which is quite neat.

For the homer filtering (or anywhere else we might have them), changing the netmask on the existing definitions from /18 to /17 would work fine, so I think from netops point of view this seems an easy way forward.

We might as well aggregate the two pools into one when recreating the cluster for T341984: Update Kubernetes clusters to 1.31.

If we do this we probably need to allocate a new single pool and renumber existing things. The current pools are from IP space assigned to codfw/eqiad respectively, so if we wanted a single pool covering both sites we would probably need to pick it from some "neutral" space not allocated to any particular site.

If we do this we probably need to allocate a new single pool, and renumber existing things. The current pools are from IP space assigned to the codfw/eqiad, so if we wanted a single pool covering both sites we probably need to pick it from some "neutral" space not allocated to any particular site.

Sorry, I wasn't being very precise. What I meant was that we can aggregate the two pools per DC into one when recreating the clusters for the k8s upgrade (so no cross-DC pools).

cmooney triaged this task as Medium priority.Oct 21 2024, 1:52 PM

What's the current thinking here? Are we agreed to widen both allocations to /17s?

We also have that configured in homer: https://gerrit.wikimedia.org/g/operations/homer/public/+/0fd0ec8d4881442c8471e46092ed2e2eff5ecab2/templates/includes/customers/64601.policy#1 and it probably should be the very first one to change.

If we're happy to go this way I can create patches to update those and push out on the network side.

Good question. Let me add some data points. We currently use:

root@deploy1003:~# kube_env admin eqiad 
root@deploy1003:~# kubectl get ipamblocks.crd.projectcalico.org  |grep ^10 | wc -l
243

243 in eqiad and

root@deploy1003:~# kube_env admin codfw
root@deploy1003:~# kubectl get ipamblocks.crd.projectcalico.org  |grep ^10 | wc -l
226

226 in codfw. The higher number in eqiad is indeed because we currently have more nodes in eqiad. But we do have a decom task for 18 of them, so we'll soon be at similar numbers. Incoming batches/refreshes for both DCs still keep us below that limit, but this is going to bite us again in the medium-term future. There is a chance that next year we may be able to not refresh a batch of hosts and instead decom them, shrinking down the WikiKube cluster, but I'd rather not rely on that.

That's for the urgency of this.
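The headroom implied by those counts can be made explicit (a back-of-the-envelope sketch using the numbers quoted above, 243 and 226 allocated /26 IPAM blocks against a /18 pool):

```python
# Each node gets a /26 IPAM block; a /18 pool holds 2**(26 - 18) = 256 of them.
capacity = 2 ** (26 - 18)

used = {"eqiad": 243, "codfw": 226}  # kubectl ipamblock counts from above
for dc, count in used.items():
    print(f"{dc}: {capacity - count} free /26 blocks")
# eqiad: 13 free /26 blocks
# codfw: 30 free /26 blocks
```

With only 13 free blocks in eqiad, even a modest node batch would exhaust the pool again, which is the urgency being described.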

Now for the difficulty. Adding the new /18 pool in homer doesn't appear difficult at all. Whether that's done as a /17 or as 2 /18s is mostly an implementation detail from homer's side.

There are also some easy pickings: various things that need to be altered from a /18 to a /17, mostly datastores that pods access as clients, informative data structures, and some policies.

All of these can and should be done before the actual change on the WikiKube cluster.

Now, for the actual change. The biggest issue is coordinating the rollout of changing this and this. We've never done it on a live cluster with hundreds of nodes and thousands of requests. Hence the "stalling".

We'll need to re-validate what we know about this setting and run some tests first to see how feasible it is. The fallback would be to do it during the upgrade of the clusters per T341984.

Good question. Let me add some data points. We currently use:
226 in codfw. The higher number in eqiad is indeed because we currently have more nodes in eqiad. But we do have a decom task for 18 of them, so we'll soon be at similar numbers.

Ok. That roughly tallies with what I see in BGP, although there is a slight difference.

cmooney@re1.cr1-eqiad> show route protocol bgp terse aspath-regex "^64601$" | except ">" | match "10\." | count 
Count: 190 lines

Anyway, yeah, we're close to the 256 /26-block limit in each, so we probably should not delay here.

In terms of the way forward it seems we will have to expand, and we are agreed on what the eventual larger ranges will be after that. Should we go ahead and prep the Homer changes in that case (seems to me they can be done any time without affecting current setup)? Or are there more unknowns and best to hold off until the full plan has been teased out?

This did bite us again and we had to T380473: Decommission parse20[01-20] in a hurry.
A quick fix to free up IPAM blocks without a decom is to stop puppet and kubelet on the old nodes, then kubectl delete them and wait an undefined number of minutes for calico to garbage-collect the blocks. The latter can probably be expedited by restarting calico-kube-controllers, but I'm not entirely sure.
What definitely works, but feels a bit 🔨, is deleting the ipamblock and blockaffinity objects that are assigned to the to-be-decommissioned nodes.

Now, for the actual change. The biggest issue is coordinating the rollout of changing this and this. We've never done it on a live cluster with hundreds of nodes and thousands of requests. Hence the "stalling".

The first this is used to configure kube-proxy, and it only supports one CIDR per IP family. It is used to detect whether traffic comes from this cluster's pods, so I *think* we can just extend that to the /17 and add a second IPPool to the calico deployment in a second step.
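The reason widening kube-proxy's single cluster CIDR first should be safe can be sketched with a containment check (an illustration using the eqiad ranges from this task, not our actual configuration code):

```python
import ipaddress

# kube-proxy's "is this a pod IP?" test is a CIDR membership check.
# Both the current /18 pool and the adjacent /18 we'd add as a second
# Calico IPPool are subnets of the widened /17, so traffic from existing
# pods keeps matching after the cluster CIDR is extended.
cluster_cidr = ipaddress.ip_network("10.67.128.0/17")
old_pool = ipaddress.ip_network("10.67.128.0/18")
new_pool = ipaddress.ip_network("10.67.192.0/18")

print(old_pool.subnet_of(cluster_cidr))  # True
print(new_pool.subnet_of(cluster_cidr))  # True
```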

Change #1094489 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] k8s: temp. enforce maximum cluster size

https://gerrit.wikimedia.org/r/1094489

Change #1094489 merged by CDanis:

[operations/puppet@production] k8s: temp. enforce maximum cluster size

https://gerrit.wikimedia.org/r/1094489

Mentioned in SAL (#wikimedia-operations) [2024-11-25T14:19:40Z] <claime> disable puppet and kubelet on wikikube-worker13[13-28].eqiad.wmnet for ip exhaustion T375845

Mentioned in SAL (#wikimedia-operations) [2024-11-25T14:20:29Z] <claime> Manually deleting wikikube-worker13[13-20].eqiad.wmnet for ip exhaustion T375845

Change #1097392 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] Revert "wikikube: Add wikikube-worker13[13-28]"

https://gerrit.wikimedia.org/r/1097392

Change #1097392 merged by Clément Goubert:

[operations/puppet@production] Revert "wikikube: Add wikikube-worker13[13-28]"

https://gerrit.wikimedia.org/r/1097392

We're not expecting any more replacements/expansions for wikikube this FY, so we can switch to the /17 with T341984: Update Kubernetes clusters to 1.31.

Change #1121438 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Update policy for K8s BGP to allow a wider range of v4 prefixes

https://gerrit.wikimedia.org/r/1121438

Change #1121438 merged by Cathal Mooney:

[operations/homer/public@master] Update policy for K8s BGP to allow a wider range of v4 prefixes

https://gerrit.wikimedia.org/r/1121438

FYI I've updated the prefix-list on our switches and routers in eqiad/codfw from the old /18 to the wider /17 network.

So whenever we have wikikube hosts announcing ranges from the upper-half of the /17 they'll be accepted ok.

JMeybohm raised the priority of this task from Medium to High.Mar 5 2025, 6:10 PM

Change #1161930 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] Update codfw eqiad pod ip range

https://gerrit.wikimedia.org/r/1161930

Change #1161948 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/deployment-charts@master] admin_ng: Change codfw pod ip range to 10.194.128.0/17

https://gerrit.wikimedia.org/r/1161948

Change #1161930 merged by Kamila Součková:

[operations/puppet@production] Update codfw pod ip range

https://gerrit.wikimedia.org/r/1161930

Change #1161948 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Change codfw pod ip range to 10.194.128.0/17

https://gerrit.wikimedia.org/r/1161948

Is there anything remaining to do on this task? Looks like we have enough space now after the change in the pod ip range?

Is there anything remaining to do on this task? Looks like we have enough space now after the change in the pod ip range?

Not for you, no. We have the outstanding eqiad cluster upgrade plus IP range change scheduled for after the next switchover.

Change #1191647 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/deployment-charts@master] admin_ng: Change eqiad pod ip range to 10.67.128.0/17

https://gerrit.wikimedia.org/r/1191647

Change #1191652 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] Update eqiad pod ip range

https://gerrit.wikimedia.org/r/1191652

Change #1191671 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] taskgen: Update calico IPPool check

https://gerrit.wikimedia.org/r/1191671

Change #1191652 merged by Clément Goubert:

[operations/puppet@production] Update eqiad pod ip range

https://gerrit.wikimedia.org/r/1191652

Change #1191647 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Change eqiad pod ip range to 10.67.128.0/17

https://gerrit.wikimedia.org/r/1191647

Clement_Goubert claimed this task.
Clement_Goubert subscribed.

This has now been fixed with the upgrade in T405703.

Change #1191671 merged by Clément Goubert:

[operations/puppet@production] taskgen: Update calico IPPool check

https://gerrit.wikimedia.org/r/1191671

Change #1199848 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] k8s::cluster_config: Update max number of hosts

https://gerrit.wikimedia.org/r/1199848

Change #1199848 merged by Kamila Součková:

[operations/puppet@production] k8s::cluster_config: Update max number of hosts

https://gerrit.wikimedia.org/r/1199848