As an engineer,
I'd like to deploy one of the latest LLMs available on Hugging Face, gemma-27b-it, in order to create a more concrete path for deploying new models as they are released.
As part of this investigation I want to figure out:
- The limitations of our infrastructure when deploying bigger models: what resources should we provision, and at what point do we run out (CPU throttling, out-of-memory errors)?
- In most cases a new model must first be supported by the transformers library, and in turn by the huggingfaceserver module of KServe. If an inference optimization engine is used (e.g. vLLM), it needs to be updated as well. This adds an extra burden: these packages depend on each other, and the chain of updates can break if one is not updated in time. For example, although transformers releases frequently, KServe follows a roughly six-month release cycle. We need a streamlined process for updating the Hugging Face service. That may include maintaining our own forks, but it should be sustainable enough for the team to maintain.
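To make the resource-provisioning question concrete, a deployment could start from a KServe InferenceService manifest like the sketch below. The model id, GPU count, and CPU/memory figures are assumptions to be tuned during the investigation, not validated values; the `huggingface` model format and the `--model_id`/`--model_name` args follow KServe's Hugging Face serving runtime conventions.

```yaml
# Sketch only: resource figures and model id are placeholders to be
# validated against our cluster (watch for CPU throttling / OOM kills).
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: gemma-27b-it
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=gemma
        - --model_id=google/gemma-27b-it   # assumed HF repo id
      resources:
        requests:
          cpu: "6"
          memory: 60Gi
          nvidia.com/gpu: "2"
        limits:
          cpu: "6"
          memory: 60Gi
          nvidia.com/gpu: "2"
```

Setting requests equal to limits gives the pod a guaranteed QoS class, which makes OOM and throttling behaviour easier to reason about while measuring where the infrastructure breaks down.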
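Because the transformers → huggingfaceserver → vLLM chain can silently drift, a small version-audit helper run inside the serving image could make the dependency state visible before each update. This is a hypothetical sketch (the package list is an assumption about what our image actually ships):

```python
from importlib import metadata


def audit_versions(packages):
    """Report the installed version of each package in the serving
    dependency chain, flagging anything that is missing."""
    report = []
    for pkg in packages:
        try:
            report.append(f"{pkg}=={metadata.version(pkg)}")
        except metadata.PackageNotFoundError:
            report.append(f"{pkg} (not installed)")
    return report


if __name__ == "__main__":
    # Assumed dependency chain of the Hugging Face service.
    for line in audit_versions(["transformers", "vllm", "kserve"]):
        print(line)
```

Running this in CI against a candidate image would show at a glance whether a transformers bump outpaced the pinned KServe/vLLM versions.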