Goal 1: Non-technical users can make a request to a Hugging Face Large Language Model that is fast in production.
Open, Needs Triage, Public

Event Timeline

Infra

  • Setting up the puppet roles
  • Can't commit puppet roles until the machines are there
  • Reached out to vendor

Software side

  • Spike with FlashAttention (version 2?)
  • Hoping for results this week
calbon renamed this task from "Goal 1: Non-technical users can make a request to a Hugging Face Large Language Model that uses an inference optimization engine in production." to "Goal 1: Non-technical users can make a request to a Hugging Face Large Language Model that is fast in production." (Aug 13 2024, 2:33 PM)
  • GPU hosts are racked but not set up yet
  • Software side is moving more slowly
  • Continuing work on vllm as an inference optimization engine. Recent updates in kserve allow the use of versions later than 0.4.2, which will resolve the version discrepancies between kserve, torch, and vllm.
  • GPU hosts are up and running in production!