
Re-evaluate ip pools for ml-serve-{eqiad,codfw}
Closed, ResolvedPublic

Description

We currently have two IP pools assigned for each DC/cluster:

  • a /24 subnet (254 IPs) for K8s svc ips
  • a /23 subnet (510 IPs) for K8s pod ips
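
For reference, a quick way to check where the two ranges currently live (hedged: the resource name assumes Calico's Kubernetes-datastore CRDs):

# Pod pool(s) are Calico IPPool objects:
kubectl get ippools.crd.projectcalico.org -o custom-columns=NAME:.metadata.name,CIDR:.spec.cidr
# The svc range is not a Calico pool; it is set on kube-apiserver via --service-cluster-ip-range
# (see further down in this task).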

When we allocated the ranges it was not entirely clear how Knative and Istio worked, so we used the standard k8s configuration. We have recently discovered that Knative revisions, created upon each change to an InferenceService resource (basically a deployment for the ML team), hold a svc IP address until they are cleaned up. We applied a change to limit the number of non-active revisions kept to three (to allow the use of Knative features like incremental rollout, canary and A/B testing, etc.), but just to support the ORES models we'll have to allocate ~100 pods, which may very easily translate into 300 svc IP allocations.
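
As a rough way to see how revisions translate into svc IPs (a sketch, assuming the serving.knative.dev/revision label that Knative sets on the per-revision Services):

kubectl get revisions.serving.knative.dev -A --no-headers | wc -l
# Services (and hence svc IPs) backing those revisions:
kubectl get svc -A -l serving.knative.dev/revision --no-headers | wc -l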

This task should evaluate new IP ranges (if possible) for the ML use case, and apply the new subnets to Calico's IPPool settings (even in an invasive way, since we are not live yet).

Maybe the pod IPs could keep their /23, but the svc pool would probably be better off as a /22 (to have extra room for experiments etc.). Any thoughts?

Last but not least, we should come up with a procedure to change Calico's IPPools.
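
For the pod-IP side, a possible outline looks like the following (a sketch only: pool names and the manifest file are placeholders, and the svc range is a kube-apiserver flag rather than a Calico pool, as discussed later in this task):

kubectl apply -f new-ippool.yaml    # 1. add an IPPool with the new CIDR
kubectl patch ippools.crd.projectcalico.org old-pool --type merge -p '{"spec":{"disabled":true}}'    # 2. stop new allocations from the old pool
# 3. recreate the pods (or the whole cluster) so they pick up IPs from the new pool
kubectl delete ippools.crd.projectcalico.org old-pool    # 4. drop the old pool once it is drained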

Event Timeline

On ml-serve-eqiad (halfway through loading ORES pods):

root@deploy1002:~# kubectl get svc -A  |grep 10. | wc -l
200

Is there currently any kind of auto-expire/auto-clean of old revisions? If not, does kserve have such functionality built-in somewhere? That might tide us over until we have a good plan for migrating to a bigger pool. I figure that in the long term, model #s will only increase, so we need a bigger pool even with swift expiry of old versions.

Merged this morning https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/764799

If this works we should keep only 2 revisions for each pod; we can revisit the cleaning policy anytime. I hope it works :)
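
For reference, upstream Knative Serving exposes this kind of limit via the config-gc ConfigMap in the knative-serving namespace. A minimal sketch (key names follow recent upstream releases and may differ from what the merged change actually does, since we configure this via deployment-charts):

kubectl -n knative-serving patch configmap config-gc --type merge -p '{"data":{"max-non-active-revisions":"2"}}'
# watch the revision count shrink once GC kicks in:
kubectl get revisions.serving.knative.dev -A --no-headers | wc -l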

@akosiaris we'd need an expert opinion on this :)

Afaics from puppet, we configure the svc IP range in kube-apiserver's defaults, and from https://github.com/kubernetes/kubernetes/issues/104088 it seems that Kubernetes doesn't support multiple ranges (even if the docs for --service-cluster-ip-range mention that a max of two dual-stack CIDRs is allowed). Should we allocate a separate /22 and wipe the ML clusters (or at least partially, on etcd etc.) to reallocate IPs in the new range? Or is there a better option?
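
Two quick ways to confirm the svc range actually in use (the flag value in puppet is authoritative; by default the kubernetes.default Service gets the first IP of the range):

ps -C kube-apiserver -o args= | tr ' ' '\n' | grep -- --service-cluster-ip-range
kubectl -n default get svc kubernetes -o jsonpath='{.spec.clusterIP}'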

wow... svc IPs exceeding the number of pods is a scenario I had never imagined.

We never had to do this (change service-cluster-ip-range) and as far as I know it is not a supported operation. There is https://github.com/kubernetes-retired/kube-aws/issues/1307, which describes a similar (actually even more involved) operation, but it is already weird. Theoretically speaking, we could try deleting all services, stopping all daemons, and restarting kube-apiserver with the new pool (in theory it should only have to update a single internal k8s object in etcd, a RangeAllocation) and see what happens, but the safe way is probably to reinitialize the cluster as you say. If I had the time, I'd do both, just in the interest of learning something, but I'd rely on the cluster reinitialization for actually getting to a state I would trust.
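
To make that experimental path concrete, a hedged sketch (not a recommendation; the etcd key is where the serviceips RangeAllocation normally lives):

kubectl get ns --no-headers | awk '{print $1}' | xargs -I{} kubectl -n {} delete svc --all    # drop every Service
# stop the control plane / kubelets, then inspect the allocator state in etcd:
ETCDCTL_API=3 etcdctl get /registry/ranges/serviceips
# restart kube-apiserver with the new --service-cluster-ip-range and recreate the Services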

On the plus side, now that some experience has been gained with Kubeflow, it's also a good time to evaluate whether the number of pod IPs is sufficient for future goals, and resize that pool too if needed.

Thanks a lot Alex, I'll open a task to reinit both clusters :( (no idea how to do it, I'll read up on it)

The use case of kserve is that every model gets its own pod (one or more, depending on the extra features added, like transformers/explainers that have extra dedicated pods). In addition to that, Knative will assign one or more svc IPs to define routes (default, canary, etc.). If we keep the history of Knative revisions to a max of 2 per pod, we may still end up with a lot of svc IPs allocated. So ideally, to be future proof, I'd say that we could:

  • assign a /21 for pods (2,046 IPs)
  • assign a /20 for svcs (4,094 IPs)

There should be enough space in the /18 subnets that we assigned to K8s (one for eqiad and one for codfw, IPv4/IPv6). I am wondering, though, whether we'd prefer to allocate something like a dedicated /18 for ML k8s, separating this use case from Wikikube. In the near future we'll also have to create IP space for the ML/Data-Engineering cluster, which will run various workloads (Kubeflow and some Analytics-related jobs like Airflow, mostly), so maybe dedicated /18s (or smaller) for each cluster could be the way to go, to avoid stepping on each other's toes.
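
Quick arithmetic check on the sizes above (usable hosts = 2^(32-prefix) - 2) and on how both pools would fit inside a /18:

echo $(( 2**(32-21) - 2 ))    # /21 for pods -> 2046
echo $(( 2**(32-20) - 2 ))    # /20 for svcs -> 4094
echo $(( 2**(32-18) ))        # a /18 holds 16384 addresses, so /20 + /21 (6144) fit with room to spare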

Lemme know your thoughts :)

We have it laid out here: T277191, with quite a bit of detail (including the commands to clear out etcd). But it would be good to document it somewhere better. We've also been meaning to turn it into a cookbook, since this is going to be a recurring process.

There are 2 /16s in Netbox that are reserved for Kubernetes (I predicted an increase in usage back in early 2021). Those are 10.67.0.0/16 and 10.194.0.0/16, for eqiad and codfw respectively (I did no such reservation for IPv6; we have plenty of IPv6 addresses anyway). That's 4 /18s per DC: feel free to pick one and split it up into svc and pod pools as you see fit.
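
For reference, the /18 boundaries inside those reserved /16s fall on multiples of 64 in the third octet:

for third in 0 64 128 192; do echo "10.67.${third}.0/18   10.194.${third}.0/18"; done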

Adding @ayounsi and @cmooney in case that plan (and reservation) was flawed.

@elukey detailed me the situation over IRC, thanks!

@akosiaris those reserved prefixes make sense to me and you're welcome to sub-allocate them.

Out of curiosity, could we keep the pod IPs dual-stacked, but use IPv6 only for service IPs? Since all our prod servers are dual-stacked, barring special cases it should just work?

Many thanks!

In the future, yes, but in the current version we run everywhere (1.16) IPv6 support is still in "alpha" state. We first need to upgrade to 1.20 (the first version where it is fully supported) or 1.21 (the first version where IPv6/IPv4 dual stack is enabled by default), and then start meddling with adding proper IPv6 support to anything other than just our pods.
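
For when we get there, a hedged sketch of the dual-stack form of the flag (the CIDRs are placeholders, not proposed allocations):

kube-apiserver --service-cluster-ip-range=<ipv4-svc-cidr>,<ipv6-svc-cidr> ...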

I put the smaller staging allocation at the end to avoid fragmentation (at least for now; in my experience it can't be avoided forever). Similarly, the Train/DSE range is "flipped" (/21 first) to avoid fragmentation between it and the preceding prod ranges. If sufficiently smaller ranges are needed in EQIAD for future projects, they should follow the same scheme as the staging ranges in CODFW (allocate from the end, trying to avoid fragmentation with the same alternating-sizes pattern as for prod/train).

@akosiaris @ayounsi if you have time could you please review what Tobias proposed above? If everything is in line with best practices, we (ML) will proceed with the (re)initialization of our clusters :)

Hey,

I can't see any problem with the above. Avoiding fragmentation is worth doing so we have as sane a plan as possible, but I note the existing ranges are further divided, with hosts announcing /26 blocks, so either way the impact on our core routing tables is the same.

Let's see if @ayounsi has any other thoughts but it makes sense to me :)

Nothing to add :)

@elukey good luck with the (re)initialization

Thanks a lot everybody for the help!