
2024 Q4 Goal: A HuggingFace 7B LLM is hosted on ml-staging on Lift Wing, powered by a GPU
Open, Needs Triage, Public

Event Timeline

  • GPU order for the first 2x-GPU chassis is close to complete. There are some supply issues with the chassis, so the question is whether we want to use an upgraded chassis for the ml-staging server.

Update: We have Mistral-7B-Instruct hosted on ml-staging, running on CPU and using the PyTorch base image we created. A simple request takes approx. 30s (we haven't run extensive tests yet).
We are facing some issues using the GPU with this Docker image at the moment, as documented in T362984: GPU errors in hf image in ml-staging.
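
For reference, a minimal sketch of what a simple request against the staged model could look like, assuming a KServe-style v1 predict endpoint; the URL and payload shape are assumptions for illustration, not the actual Lift Wing API:

```
import requests

# Hypothetical endpoint; the real ml-staging host and model path differ.
URL = "https://inference-staging.example.wmnet/v1/models/mistral-7b-instruct:predict"

# Payload shape is an assumption (KServe v1 "instances" convention).
payload = {"instances": [{"prompt": "What is Lift Wing?", "max_new_tokens": 64}]}

# CPU inference is slow (~30s per request), so use a generous timeout.
resp = requests.post(URL, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json())
```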

Decision point: Do we upgrade ROCm drivers?

Aiko is getting up to speed with how HF sets up its inference endpoints and may be able to adapt that into our own HF server.
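
As a rough idea of what adapting this into our own HF server could look like, here is a minimal sketch using KServe's Python SDK to wrap a transformers text-generation pipeline; the class name, model id, and payload shape are illustrative assumptions, not the actual implementation:

```
from kserve import Model, ModelServer
from transformers import pipeline

class HFTextGenModel(Model):
    """Sketch: a custom KServe model wrapping a HF text-generation pipeline."""

    def __init__(self, name: str):
        super().__init__(name)
        self.generator = None
        self.load()

    def load(self):
        # Model id is illustrative; on Lift Wing the weights would come
        # from our own storage rather than the HF hub.
        self.generator = pipeline("text-generation",
                                  model="mistralai/Mistral-7B-Instruct-v0.2")
        self.ready = True

    def predict(self, payload: dict, headers: dict = None) -> dict:
        prompt = payload["instances"][0]["prompt"]
        out = self.generator(prompt, max_new_tokens=64)
        return {"predictions": [out[0]["generated_text"]]}

if __name__ == "__main__":
    ModelServer().start([HFTextGenModel("mistral-7b-instruct")])
```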

We have a theory that the ROCm drivers from the Debian package are not required.
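
If that theory holds, the user-space ROCm libraries bundled with the PyTorch wheel should be enough, with only the amdgpu kernel driver coming from the host. A quick sketch to introspect what the container actually sees:

```
import torch

# On a ROCm build of PyTorch, torch.version.hip is set and the
# CUDA-compatible API is backed by HIP.
print("torch:", torch.__version__)
print("hip runtime:", torch.version.hip)          # None on CUDA-only builds
print("gpu visible:", torch.cuda.is_available())  # needs only the kernel driver
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```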

Update:

  • Wait for the vendor (Supermicro) to finalize the order of the 2x chassis for ml-staging.
  • Chris's guess is that ml-staging will be installed at the end of the quarter.

Update:

As part of the task T362984: GPU errors in hf image in ml-staging, we have also experimented with different versions of pytorch (2.2.1, 2.3.0) and rocm (5.6, 5.7, 6.0), and we are still hitting the same issue.
To clarify: the GPU works properly with pytorch 2.0.1 and rocm 5.4.2, but these versions are too old to be used with the huggingfaceserver.
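
For context, the per-combination check is essentially a small GPU smoke test like the sketch below; on a working stack the matmul succeeds, and on the broken combinations it surfaces the error tracked in T362984:

```
import torch

# Minimal GPU smoke test: run a small matmul on the device.
# ROCm builds of PyTorch expose the GPU through the cuda API.
assert torch.cuda.is_available(), "GPU not visible to PyTorch"

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
c = a @ b
torch.cuda.synchronize()  # force the kernel to actually run
print("matmul ok:", tuple(c.shape), "on", torch.cuda.get_device_name(0))
```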

Update:

  • Still can't use the GPU with ROCm, but we figured out what the bug is: it will be fixed once the host OS is upgraded to Bookworm.
  • Next step is to upgrade ml-staging to Bookworm, then test.
  • Working on upgrading the HF image to newer versions with ROCm 6.0. Tested them, they work, and a patch will be posted.
  • Goal is to utilize the GPU so we can deploy models from HuggingFace.
  • Mistral is crashlooping; startup checks usually allow 5m, so we bumped them to 10m, but it didn't help (see the load-time sketch after this list).
  • A BERT model works, so it is likely a Mistral-specific issue.
  • The kubelet partition increase for the install phase is in review.
  • ml-staging1001 is now on Bookworm; dragonfly (distributed downloading of S3 artifacts) needs to be bumped.
  • With Bookworm, there are no longer GPU drivers on the base node (besides Debian kernel support); the driver/library code lives in the Docker images instead.
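
On the crashlooping above: one way to sanity-check whether even a 10m startup window is in the right ballpark is to time the model load in isolation, e.g. with a rough sketch like this (the model id is illustrative; on Lift Wing the weights come from our own storage):

```
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

# Rough timing of the model load, to sanity-check the startup-probe budget.
MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative model id

t0 = time.time()
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
print(f"loaded in {time.time() - t0:.0f}s")  # compare against the 5m/10m window
```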