
Investigate ModelMesh architecture
Open, Needs Triage, Public

Description

We want to investigate how ModelMesh could be implemented on our infrastructure to support multi-model serving.
In the current KServe deployment on LiftWing, each model has its own pod, so although scaling is possible, resource needs (CPU, memory) grow linearly with the number of model servers.
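To make the linear-scaling point concrete, here is a back-of-the-envelope comparison of per-model pods versus multi-model pods. All numbers (model size, per-pod server overhead) are illustrative assumptions, not measured LiftWing figures:

```python
# Rough comparison of memory footprint for one-model-per-pod vs.
# packing several models into one pod. Figures are hypothetical.

def pods_needed(num_models: int, models_per_pod: int) -> int:
    """Number of pods required to host num_models."""
    return -(-num_models // models_per_pod)  # ceiling division

def total_memory_gb(num_models: int, models_per_pod: int,
                    model_mem_gb: float, pod_overhead_gb: float) -> float:
    """Total memory: model weights plus a fixed per-pod server overhead."""
    pods = pods_needed(num_models, models_per_pod)
    return num_models * model_mem_gb + pods * pod_overhead_gb

# 30 models, 2 GB each, 1 GB of server overhead per pod (illustrative)
current = total_memory_gb(30, 1, 2.0, 1.0)   # one model per pod, as today
meshed = total_memory_gb(30, 10, 2.0, 1.0)   # ten models per pod
print(current, meshed)  # 90.0 63.0
```

The model weights themselves are a fixed cost either way; the savings come from amortizing the per-pod server overhead across many models.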
The key benefits of such an approach for us would be:

  • multiple models per pod => fewer pods => less memory needed
  • it can potentially make it easier to use GPUs for inference. The scenario we want to explore is placing the models that require a GPU in the same pod. Feasibility depends on the size of each model (whether it fits in GPU memory) and on how long loading a model of that size takes relative to the latency requirements we may have.
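For reference, KServe selects ModelMesh per InferenceService via a deployment-mode annotation. A minimal sketch is below; the model name, model format, and storage URI are placeholders, not our actual configuration:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-model                       # placeholder name
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn                       # placeholder model format
      storageUri: s3://models/example       # placeholder storage location
```

With this mode, the ModelMesh controller decides which serving-runtime pod loads the model, rather than each InferenceService getting its own dedicated pod.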

The main purpose of this task is for the team to get acquainted with the architecture and decide whether this could be a future implementation, rather than implementing it right away, as the feature is still in alpha.

image.png (704×1 px, 112 KB)

Note: The documentation states that at this point only gRPC calls are supported for this version of ModelMesh, so we should also follow up on REST support.
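Since only gRPC is supported, clients would speak the KServe v2 inference protocol (ModelInferRequest/ModelInferResponse). A rough sketch of the request shape as a plain Python dict, assuming the v2 field names; in a real client this would be a protobuf message built from stubs generated from the v2 proto, and the model name and tensor values here are made up:

```python
# Shape of a KServe v2 inference request, expressed as a plain dict.
# Field names follow the v2 protocol; names and values are illustrative.

def build_infer_request(model_name: str, values: list) -> dict:
    """Assemble a v2-protocol-style request for one FP32 input tensor."""
    return {
        "model_name": model_name,
        "inputs": [
            {
                "name": "input-0",            # tensor name (placeholder)
                "datatype": "FP32",           # v2 protocol datatype string
                "shape": [1, len(values)],    # batch of one
                "contents": {"fp32_contents": values},
            }
        ],
    }

req = build_infer_request("example-model", [0.1, 0.2, 0.3])
print(req["inputs"][0]["shape"])  # [1, 3]
```

Evaluating whether this protocol fits our existing REST-based callers is part of the REST follow-up mentioned above.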