
Investigate deployment of gemma2 on LiftWing
Closed, ResolvedPublic3 Estimated Story Points

Description

As an engineer,
I'd like to deploy one of the latest LLMs available on Hugging Face, gemma-2-27b-it, in order to establish a more concrete path for deploying models as they come out.
As part of this investigation I want to figure out:

  • The limitations of our infrastructure when deploying bigger models. What resources should we provision, and at what point do we run out (CPU throttling, out-of-memory errors)?
  • In most cases new models first need to be supported by the transformers library and, in turn, by the huggingfaceserver module of kserve. If an inference optimization engine is used (e.g. vLLM), it needs to be updated as well. This adds overhead because these packages depend on one another, and the chain of updates can break if any one of them is not updated in time. For example, although the transformers package releases often, the kserve package follows a six-month release cycle. We need a streamlined process for updating the huggingface service. That may include maintaining our own forks, but it should be sustainable enough for the team to keep up with long-term.
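To make the first question concrete, a hypothetical values sketch of the kind of resource provisioning this investigation needs to pin down. The key names are illustrative, not our actual deployment-charts schema:

```yaml
# Hypothetical sketch -- key names are illustrative, not the real
# deployment-charts values layout. A 27B-parameter model in bfloat16 needs
# roughly 27e9 params * 2 bytes ~= 54 GB for the weights alone, before the
# KV cache and activation memory, which is what drives the numbers below.
predictor:
  resources:
    requests:
      cpu: "8"
      memory: 70Gi
      amd.com/gpu: "1"
    limits:
      cpu: "8"
      memory: 70Gi
      amd.com/gpu: "1"
```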

Event Timeline

Change #1051806 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: deploy gemma2-27b-it on ml-staging

https://gerrit.wikimedia.org/r/1051806

Change #1051806 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: deploy gemma2-27b-it on ml-staging

https://gerrit.wikimedia.org/r/1051806

Change #1052051 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: update hf image

https://gerrit.wikimedia.org/r/1052051

Change #1052051 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update hf image

https://gerrit.wikimedia.org/r/1052051

gemma2-27b-it has been deployed on LiftWing staging:

time curl "https://inference-staging.svc.codfw.wmnet:30443/openai/v1/completions" \
  -H "Host: gemma2-27b-it.experimental.wikimedia.org" \
  -H "Content-Type: application/json" \
  -X POST \
  -d '{"model": "gemma2", "prompt": "Write me a poem about Machine Learning.", "stream": false, "max_tokens": 50}'

{
  "id": "93a2066f-4852-43db-869b-ddb88e1cc18d",
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "logprobs": null,
      "text": "\n\nIn realms of data, vast and deep,\nWhere patterns hide and secrets sleep,\nA new intelligence takes flight,\nMachine Learning, a beacon bright.\n\nFrom numbers and trends, it learns to see,\nThe hidden truths, the"
    }
  ],
  "created": 1720084299,
  "model": "gemma2",
  "system_fingerprint": null,
  "object": "text_completion",
  "usage": {
    "completion_tokens": 50,
    "prompt_tokens": 9,
    "total_tokens": 59
  }
}
real	0m8.794s
user	0m0.050s
sys	0m0.005s
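For reference, the completion text and token accounting can be pulled out of a response body like the one above with a few lines of Python (the payload here is an abbreviated sample of that response, not a live call):

```python
import json

# Abbreviated sample of the response body shown above (not a live call).
body = '''
{"id": "93a2066f-4852-43db-869b-ddb88e1cc18d",
 "choices": [{"finish_reason": "length", "index": 0, "logprobs": null,
              "text": "\\n\\nIn realms of data, vast and deep,"}],
 "model": "gemma2", "object": "text_completion",
 "usage": {"completion_tokens": 50, "prompt_tokens": 9, "total_tokens": 59}}
'''

resp = json.loads(body)
completion = resp["choices"][0]["text"]  # the generated text
usage = resp["usage"]                    # token accounting

print(completion.strip())
print(usage["total_tokens"])             # 59 = 9 prompt + 50 completion
```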

The model card mentions that if the dtype is not specified the weights will be upcast to float32 (the original weights are bfloat16), but if we let that happen we end up with an empty response. I haven't yet figured out why. When running the model without the kserve code (plain transformers, using the example posted on the model card) it works fine.
The fix is to set the dtype explicitly when starting the service via the cmd arg --dtype bfloat16; with that, it runs as expected.
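For reference, the change amounts to passing that flag to the huggingfaceserver launcher. A sketch of the invocation; --dtype comes from the comment above, while the other flag names are assumptions based on the kserve huggingfaceserver CLI:

```shell
# Sketch: start the kserve huggingfaceserver with dtype pinned to bfloat16,
# avoiding the float32 upcast that produced empty responses.
# --model_id / --model_name are assumptions, not copied from our charts.
python -m huggingfaceserver \
    --model_id=google/gemma-2-27b-it \
    --model_name=gemma2 \
    --dtype=bfloat16
```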

Thanks for this work @isarantopoulos!

I have two questions:

  • Would it be possible to fine-tune a classifier using this deployment? I mean not using the model through prompts, but directly training it for specific tasks.
  • In the case of prompts, would it be possible to have interactive sessions? For example, sending a prompt and, based on the response, giving another instruction where the model remembers the previous interaction?

Hey @diego!

  • The specific deployment would only work to serve the model.
  • I haven't looked into that yet, but it is something we'll definitely work on. Right now I think you'd have to pass the previous conversation with each request, but ideally we'd keep the context on the server side.
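On the second point, a minimal sketch of the client-side workaround: the service stays stateless and the caller re-sends the whole conversation each turn, formatted with Gemma-style turn markers. The template details are an assumption based on the model card, not something our deployment enforces:

```python
# Sketch: stateless multi-turn prompting. The client keeps the history and
# rebuilds the full prompt each turn using Gemma-style turn markers
# (<start_of_turn> / <end_of_turn>); the exact template is an assumption.
def build_prompt(history, new_user_message):
    """history: list of (role, text) tuples with role in {"user", "model"}."""
    turns = []
    for role, text in history:
        turns.append(f"<start_of_turn>{role}\n{text}<end_of_turn>")
    turns.append(f"<start_of_turn>user\n{new_user_message}<end_of_turn>")
    turns.append("<start_of_turn>model\n")  # cue the model to answer
    return "\n".join(turns)

history = [
    ("user", "Write me a poem about Machine Learning."),
    ("model", "In realms of data, vast and deep, ..."),
]
prompt = build_prompt(history, "Now make it rhyme in couplets.")
# `prompt` would go in the "prompt" field of the completions request, and the
# model's reply would be appended to `history` before the next turn.
```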

These are topics and questions we'd like to work on, but in a second phase. At the moment we are focused on how we deploy and serve these types of models: figuring out how to utilize our hardware (AMD GPUs) and how to update the service itself, as things are moving fast.
We can start adding topics to a backlog so that we collectively decide where to focus. We'll follow up on that!

We have deployed gemma2 using the latest patch release of transformers (the release that introduced support for gemma2). We are now on v4.42.3, which includes some newer fixes.

In order to use this version we used our WMF kserve fork to work around the strict requirements set in the huggingfaceserver, described in this issue. We have also opened a related patch to solve this upstream, so that we don't need to wait for a new kserve release each time to deploy new models already supported by transformers.
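For reference, a hypothetical sketch of what working around the pin looks like at install time. The fork location and branch are placeholders, not our actual repository layout:

```shell
# Hypothetical: install huggingfaceserver from a fork whose packaging
# metadata relaxes the transformers upper bound, then pull in the
# transformers release that supports gemma2. Fork path and branch are
# placeholders.
pip install "git+https://github.com/<wmf-fork>/kserve.git@<branch>#subdirectory=python/huggingfaceserver"
pip install "transformers==4.42.3"
```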