
Host open source LLM (bloom, etc.) on Lift Wing
Closed, Declined (Public)

Description

One of the ways we can help out the community is hosting an open source LLM for them.

Details

Related patches (repo @ branch: lines +/-):

  • operations/deployment-charts @ master: +2 -2
  • operations/deployment-charts @ master: +39 -1
  • machinelearning/liftwing/inference-services @ main: +97 -4
  • operations/deployment-charts @ master: +44 -26
  • operations/deployment-charts @ master: +3 -1
  • operations/deployment-charts @ master: +2 -2
  • operations/deployment-charts @ master: +18 -0
  • machinelearning/liftwing/inference-services @ main: +2 -0
  • machinelearning/liftwing/inference-services @ main: +1 -0
  • machinelearning/liftwing/inference-services @ main: +13 -4
  • machinelearning/liftwing/inference-services @ main: +6 -1
  • operations/deployment-charts @ master: +30 -0
  • operations/deployment-charts @ master: +1 -1
  • machinelearning/liftwing/inference-services @ main: +1 -1
  • machinelearning/liftwing/inference-services @ main: +7 -3
  • operations/deployment-charts @ master: +1 -1
  • operations/deployment-charts @ master: +16 -0
  • machinelearning/liftwing/inference-services @ main: +141 -0
  • integration/config @ master: +15 -0

Event Timeline

Change 919293 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] LLM: model server example with bloom

https://gerrit.wikimedia.org/r/919293

Change 919345 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: deploy Bloom-560m model on Lift Wing

https://gerrit.wikimedia.org/r/919345

Change 919347 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[integration/config@master] inference-services: add bloom pipelines

https://gerrit.wikimedia.org/r/919347

Change 919347 merged by jenkins-bot:

[integration/config@master] inference-services: add bloom pipelines

https://gerrit.wikimedia.org/r/919347

@MoritzMuehlenhoff Hi! We are trying to host an LLM on our infrastructure, and one of the candidates is BLOOM. The license is very permissive, but it does impose some restrictions (see the bottom of the page):

https://huggingface.co/spaces/bigscience/license

Context is https://bigscience.huggingface.co/blog/the-bigscience-rail-license

Some restrictions are very high level, so I am not 100% sure if the model is ok to use. Other models come with Apache 2.0 (like https://huggingface.co/bigscience/mt0-base), which is definitely better, but I am wondering where to draw the line between "acceptable" and "not acceptable" in this context. Lemme know your thoughts :)

Usual IANAL disclaimer ahead: if this were a software license, it would not meet the standard required by the OSI. They cover this e.g. in the FAQ at https://opensource.org/faq/#evil, and one infamous example is the JSON license (http://www.json.org/license.html), for which https://lwn.net/Articles/707510/ is a nice writeup. That said, such restrictions might not be fully enforceable (I have no idea whether "You agree not to use the Model or Derivatives of the Model" is a binding restriction).

But in general, my recommendation would be to talk to the actual lawyers/Legal next. The policies for deploying LLMs on our infrastructure are uncharted territory. Random questions:

  • Do we expect these to follow a FLOSS license, or are there better licenses for ML models?
  • If we don't require the four freedoms defined for software, what are our expectations for models?
  • If a FLOSS license refers to "source code", what is the equivalent for an ML model? The full training data set?
  • What are our expectations for modifiability? For example, do we want to (and can we) maintain local variations of a model?

Maybe Legal can make an assessment of whether it's fine to go ahead with BLOOM, and then actual policies can evolve as we build things.

Thanks @MoritzMuehlenhoff for your valuable input! We have some way to go before we figure out what we are going to do about licensing for models developed elsewhere that we want to deploy on our infra, and especially for those that we want to make publicly available.
I'm dropping here a (very) interesting blog post on these kinds of licenses: https://huggingface.co/blog/open_rail

Change 919293 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] LLM: model server example with bloom

https://gerrit.wikimedia.org/r/919293

Change 919345 merged by Elukey:

[operations/deployment-charts@master] ml-services: deploy Bloom-560m model on Lift Wing

https://gerrit.wikimedia.org/r/919345

Bloom-560m has been deployed on Lift Wing staging in the experimental namespace and can be accessed like this:

curl https://inference-staging.svc.codfw.wmnet:30443/v1/models/bloom-560m:predict -X POST -i -H "Host: bloom-560m.experimental.wikimedia.org" -d '{"prompt": "Once upon a time ", "result_length": 50}'

It takes ~8 seconds to get a response from the above call, and that goes up to ~15s if you double the result_length to 100.
Next things to do:

  • Improve the inference code and the way that results are generated
  • Do some load testing to get a better understanding of how latency relates to input and requested output length.
  • Try other models

Change 921363 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] feat: change bloom model token output sampling

https://gerrit.wikimedia.org/r/921363

Change 921363 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] feat: change bloom model token output sampling

https://gerrit.wikimedia.org/r/921363

Change 921366 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: upgrade bloom model with newer image

https://gerrit.wikimedia.org/r/921366

Change 921366 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: upgrade bloom model with newer image

https://gerrit.wikimedia.org/r/921366

Change 921368 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] fix: call class attribute

https://gerrit.wikimedia.org/r/921368

Change 921368 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] fix: call class attribute

https://gerrit.wikimedia.org/r/921368

Change 921371 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: fix bloom model inference

https://gerrit.wikimedia.org/r/921371

Change 921371 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: fix bloom model inference

https://gerrit.wikimedia.org/r/921371

Change 922583 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: deploy bloom-3b model

https://gerrit.wikimedia.org/r/922583

Added the above patch to deploy the bloom-3b model:
https://huggingface.co/bigscience/bloom-3b

Since it requires additional resources I also increased the limitRanges; however, I don't have access to see the available resources (kubectl describe node is forbidden - will open a ticket with SRE).
@elukey In the meantime, is this going to be ok? (16GB for this pod)
Additionally, I want to add a separate deployment of the 560m model with more resources to see if it helps reduce latency.

Change 922583 merged by Elukey:

[operations/deployment-charts@master] ml-services: deploy bloom-3b model

https://gerrit.wikimedia.org/r/922583

Successfully deployed the bloom-3b model.
It can be queried like this:

curl https://inference-staging.svc.codfw.wmnet:30443/v1/models/bloom-3b:predict -X POST -i -H "Host: bloom-3b.experimental.wikimedia.org" -d '{"prompt": "The quick brown fox", "result_length": 10}'

Moving forward we'll have to determine a set of tests to run as a benchmark.
I'm also dropping here some links related to inference optimization:
https://huggingface.co/blog/bloom-inference-optimization
https://huggingface.co/blog/bloom-inference-pytorch-scripts

This is an interesting issue related to the resources (CPU memory) needed when loading a model: the model is loaded twice in memory, resulting in huge requirements. For example, while loading the falcon-7b model, which is ~15GB, memory profiling of the script I use for inference shows ~30GB of RAM being used.
Using the flag low_cpu_mem_usage=True when loading the model cuts this roughly in half, to 14-15GB.
The issue is that something similar happens while generating outputs, and memory usage skyrockets to 30GB.
I'm pasting some profiling examples before and after using the aforementioned flag.

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     8        82.8 MiB     82.8 MiB           1   @profile
     9                                         def predict():
    10     82.8 MiB      0.0 MiB           1       model_path = "falcon-7b"
    11  30583.4 MiB  30500.7 MiB           1       model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True, trust_remote_code=True)
    12  30624.2 MiB     40.7 MiB           1       tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
    13  30624.2 MiB      0.0 MiB           1       prompt = "Once upon a time "
    14  30624.2 MiB      0.0 MiB           1       result_length = 100
    15  30624.2 MiB      0.0 MiB           1       start = time.time()
    16  30625.4 MiB      1.2 MiB           1       inputs = tokenizer(prompt, return_tensors="pt").to("cpu")
    17  30625.4 MiB      0.0 MiB           1       result_length = inputs["input_ids"].size()[1] + result_length
    25  30625.4 MiB  -1164.9 MiB           2       outputs = model.generate(inputs["input_ids"],
    26  30625.4 MiB      0.0 MiB           1                              max_length=result_length,
    27  30625.4 MiB      0.0 MiB           1                              do_sample=True,
    28  30625.4 MiB      0.0 MiB           1                              top_k=50,
    29  30625.4 MiB      0.0 MiB           1                              top_p=0.9
    30                                                                   )
    31                                         
    32                                             # print(tokenizer.decode(outputs[0]))
    33  29460.6 MiB  -1164.8 MiB           1       response = tokenizer.decode(outputs[0])
    34  29460.6 MiB      0.0 MiB           1       print(time.time() - start)
    35  29460.6 MiB      0.0 MiB           1       print(response)
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     8     84.2 MiB     84.2 MiB           1   @profile
     9                                         def predict():
    10     84.2 MiB      0.0 MiB           1       model_path = "falcon-7b"
    11  10467.2 MiB  10383.0 MiB           2       model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True,
    12     84.2 MiB      0.0 MiB           1                                                    trust_remote_code=True, low_cpu_mem_usage=True)
    13  10507.0 MiB     39.9 MiB           2       tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True,
    14  10467.2 MiB      0.0 MiB           1                                                 low_cpu_mem_usage=True)
    15  10507.0 MiB      0.0 MiB           1       prompt = "Once upon a time "
    16  10507.0 MiB      0.0 MiB           1       result_length = 100
    17  10507.0 MiB      0.0 MiB           1       start = time.time()
    18  10507.8 MiB      0.7 MiB           1       inputs = tokenizer(prompt, return_tensors="pt").to("cpu")
    19  10507.8 MiB      0.0 MiB           1       result_length = inputs["input_ids"].size()[1] + result_length
    20                                         
    21  30124.1 MiB  19616.3 MiB           2       outputs = model.generate(inputs["input_ids"],
    22  10507.8 MiB      0.0 MiB           1                              max_length=result_length,
    23  10507.8 MiB      0.0 MiB           1                              do_sample=True,
    24  10507.8 MiB      0.0 MiB           1                              top_k=50,
    25  10507.8 MiB      0.0 MiB           1                              top_p=0.9
    26                                                                   )
    27                                         
    28  30124.1 MiB      0.0 MiB           1       response = tokenizer.decode(outputs[0])

Will continue this investigation and will try to deploy falcon-7b-instruct, which seems like a strong ready-to-use base model compared to falcon-7b and other smaller versions of open LLMs.

Change 926507 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] feat: reduce llm memory footprint

https://gerrit.wikimedia.org/r/926507

Change 926507 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] feat: reduce llm memory footprint

https://gerrit.wikimedia.org/r/926507

Change 927611 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: deploy LLM model falcon-7b-instruct

https://gerrit.wikimedia.org/r/927611

Change 927733 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] fix: add missing requirements for falcon-7b model and enable GPU support

https://gerrit.wikimedia.org/r/927733

Change 927733 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] fix: add missing requirements for falcon-7b model and enable GPU support

https://gerrit.wikimedia.org/r/927733

Change 927983 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] blubber: add libdrm-amdgpu1 to bloom's docker image

https://gerrit.wikimedia.org/r/927983

Change 927983 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] blubber: add libdrm-amdgpu1 to bloom's docker image

https://gerrit.wikimedia.org/r/927983

Change 927993 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] feat: add spawn method for cpu and gpu

https://gerrit.wikimedia.org/r/927993

Change 927993 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] feat: add spawn method for cpu and gpu

https://gerrit.wikimedia.org/r/927993

We have various errors at the moment, but this one seems to be the issue when bootstrapping bloom-3b:

Traceback (most recent call last):
  File "/srv/bloom/model-server/model.py", line 66, in <module>
    kserve.ModelServer(workers=1).start([model])
  File "/opt/lib/python/site-packages/kserve/model_server.py", line 148, in start
    asyncio.run(servers_task())
  File "/usr/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/opt/lib/python/site-packages/kserve/model_server.py", line 146, in servers_task
    await asyncio.gather(*servers)
  File "/opt/lib/python/site-packages/kserve/model_server.py", line 140, in serve
    server.start()
  File "/usr/lib/python3.9/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.9/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/lib/python3.9/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/usr/lib/python3.9/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "/opt/lib/python/site-packages/torch/multiprocessing/reductions.py", line 261, in reduce_tensor
    event_sync_required) = storage._share_cuda_()
  File "/opt/lib/python/site-packages/torch/storage.py", line 943, in _share_cuda_
    return self._untyped_storage._share_cuda_(*args, **kwargs)
RuntimeError: HIP error: invalid device pointer
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.

Change 928076 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: add gpu support for bloom-560m model

https://gerrit.wikimedia.org/r/928076

Change 928076 merged by Elukey:

[operations/deployment-charts@master] ml-services: add gpu support for bloom-560m model

https://gerrit.wikimedia.org/r/928076

Change 928085 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: debug HIP for AMD GPU usage

https://gerrit.wikimedia.org/r/928085

Change 928578 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: update bloom image

https://gerrit.wikimedia.org/r/928578

Change 928578 merged by Elukey:

[operations/deployment-charts@master] ml-services: update bloom image

https://gerrit.wikimedia.org/r/928578

We have successfully deployed bloom-560m with and without GPU on Lift Wing 🎉
Preliminary results show an out-of-the-box (without additional inference optimization) latency reduction of 7x-10x. For example, for a requested result length of 200 tokens, the CPU version takes ~45 seconds while the GPU one takes ~4 seconds.
A more detailed comparison will follow.
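
For context, the CPU and GPU deployments run the same generation code; only the target device differs. A minimal sketch of the pattern (illustrative only, not the exact Lift Wing model-server code; on the AMD GPUs the ROCm build of PyTorch exposes the device through the regular torch.cuda API):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "bloom-560m"  # hypothetical local path to a bloom-560m checkout

# torch.cuda.is_available() is also True on ROCm builds of PyTorch, where the
# "cuda" device transparently maps to the AMD (HIP) GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, local_files_only=True).to(device)

inputs = tokenizer("Once upon a time ", return_tensors="pt").to(device)
outputs = model.generate(inputs["input_ids"], max_length=200, do_sample=True, top_k=50, top_p=0.9)
print(tokenizer.decode(outputs[0]))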

Change 928085 abandoned by Ilias Sarantopoulos:

[operations/deployment-charts@master] ml-services: debug HIP for AMD GPU usage

Reason:

not needed

https://gerrit.wikimedia.org/r/928085

Change 927611 merged by Elukey:

[operations/deployment-charts@master] ml-services: deploy LLM model falcon-7b-instruct with GPU

https://gerrit.wikimedia.org/r/927611

Change 930847 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] llm: add the ability to facilitate various Open Source LLMs

https://gerrit.wikimedia.org/r/930847

Change 931290 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: deploy NLLB model

https://gerrit.wikimedia.org/r/931290

Change 930847 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] llm: add the ability to facilitate various Open Source LLMs

https://gerrit.wikimedia.org/r/930847

Change 931290 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: deploy NLLB model

https://gerrit.wikimedia.org/r/931290

Change 931575 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: fix nllb default input parameter

https://gerrit.wikimedia.org/r/931575

Change 931575 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: fix nllb default input parameter

https://gerrit.wikimedia.org/r/931575

The model https://huggingface.co/facebook/nllb-200-distilled-600M has been deployed on Lift Wing with and without GPU. In this raw form we see an average response time of ~1-3 seconds with GPU and 3-15 seconds without, both depending on input size. Don't fixate on these times, as a more detailed analysis will follow.
In order to support more open source LLMs, which may use different transformers Python model classes, we have extended the base LLM class with a new class named NLLB that can be used for machine translation tasks.
I am currently working on supporting loading models using 8-bit integers, to better document serving times of the different variants.
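
As an illustration of the 8-bit direction, loading the distilled NLLB checkpoint with int8 weights and running a single translation could look like the sketch below (an assumption about the approach, not the exact inference-services code; load_in_8bit requires the bitsandbytes and accelerate packages plus a GPU):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_PATH = "nllb-200-distilled-600M"  # local checkout of facebook/nllb-200-distilled-600M

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, src_lang="eng_Latn")
# load_in_8bit=True quantizes the linear layers to int8 via bitsandbytes,
# cutting the weight memory footprint roughly in half compared to fp16.
model = AutoModelForSeq2SeqLM.from_pretrained(
    MODEL_PATH,
    device_map="auto",
    load_in_8bit=True,
)

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    # force the decoder to start with the target-language token
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"),
    max_length=64,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))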

@isarantopoulos Did you consider optimizing bloom or nllb for inference? Even if we have GPUs, inference optimization can save a lot of compute resources.

In https://people.wikimedia.org/~santhosh/bloom I have uploaded CT2-optimized bloom models - it would be interesting to see the improvement.

Regarding the NLLB model (I am not sure whether it can be called an LLM): for the machine translation service we (the Language team) host, we use a CT2-optimized model. It gives translations in under 2 seconds without a GPU.

@santhosh Hi! One side note about how we run NLLB - this is the code that creates a simple model server. For the moment it is just a proof of concept to demonstrate that we can run LLMs etc. on GPUs, but the gist of it is just to follow a skeleton (a minimal sketch follows the list):

  • preprocess
  • process
  • postprocess
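
A minimal sketch of that skeleton with KServe (illustrative only; the class name and paths here are hypothetical, KServe names the middle step predict, and the real model server lives in the machinelearning/liftwing/inference-services repo):

import kserve
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "/mnt/models"  # conventional path populated by the storage-initializer


class LLMModel(kserve.Model):  # hypothetical class name
    def __init__(self, name: str):
        super().__init__(name)
        self.model = None
        self.tokenizer = None
        self.load()

    def load(self):
        self.tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
        self.model = AutoModelForCausalLM.from_pretrained(MODEL_DIR)
        self.ready = True

    def preprocess(self, payload: dict, headers=None) -> dict:
        # Validate/normalize the request, e.g. {"prompt": ..., "result_length": ...}
        return payload

    def predict(self, payload: dict, headers=None) -> dict:
        inputs = self.tokenizer(payload["prompt"], return_tensors="pt")
        max_length = inputs["input_ids"].size()[1] + payload.get("result_length", 50)
        outputs = self.model.generate(inputs["input_ids"], max_length=max_length)
        return {"response": self.tokenizer.decode(outputs[0])}

    def postprocess(self, result: dict, headers=None) -> dict:
        return result


if __name__ == "__main__":
    model = LLMModel("bloom-560m")
    kserve.ModelServer(workers=1).start([model])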

By default there is an extra container, called storage-initializer, that pulls a binary from Swift and makes it available on the pod. One thing that has puzzled me in these past months is that MinT could have easily been added to Lift Wing, but for a lot of reasons we decided not to. My worry is that we keep creating solutions to the same problems in different teams, for example the code that pulls a binary from Swift. I am totally aware that moving MinT from wikikube to Lift Wing now may be overkill, but for future services it would be nice to increase the synergies between our teams :)

@santhosh Hi! At the moment we are basically doing POCs with GPUs and LLMs on Lift Wing in order to procure GPUs in the upcoming months. Next quarter we plan to work more on this.
We are aware of the great work done by the Content Translation team on NLLB with CTranslate2 (I have gone through the process and tested it locally), but at the moment we are just exploring the out-of-the-box options that transformer-based models offer, for any model. In that sense we are working on quantization without first converting the model, but will definitely look into it in the future.
Regarding NLLB, it is definitely an LLM. Whether a model is a Large Language Model or not depends on its number of parameters, and models with hundreds of millions to billions of parameters fall into that category regardless of their inference time. NLLB also has several variants, from 600M up to 3.3B params.

> One thing that has puzzled me in these past months is that MinT could have easily been added to Lift Wing, but for a lot of reasons we decided not to.

We had to do this switch from an AWS server (paid) to our own system in the quickest manner possible (to save monthly bills that exceeded our budget). At one point Lift Wing was considered, but Alex suggested going ahead with our independent service. Looking back, I think that was the right call. These days we update the codebase very frequently and have full control over the repository; we deploy several times a week without much dependency on other teams; our configuration system, which supports hundreds of languages and their combinations with multiple MT models, has grown into a complex system based on community feedback; we are able to set up dashboards and monitors for performance fine-tuning; and we were able to run the system easily on local laptops for testing and debugging. We also made it easily hostable by third parties as a simple Python server. So I guess it is a trade-off now: Lift Wing could have helped achieve all of these, but it would have demanded a lot of time and support from the Lift Wing team.

> Looking back, I think that was the right call. These days we update the codebase very frequently and have full control over the repository; we deploy several times a week without much dependency on other teams; our configuration system, which supports hundreds of languages and their combinations with multiple MT models, has grown into a complex system based on community feedback; we are able to set up dashboards and monitors for performance fine-tuning; and we were able to run the system easily on local laptops for testing and debugging.

All of the above would have been easily supported by Lift Wing. Committing to our inference-services repository would have needed some coordination at first, but the MinT use case would have been a nice way to evolve Lift Wing's support even more. I had a chat with Alexandros at the time; it wasn't really clear what your team's goals were, so an independent service seemed ok. What I am trying to say is that in the future ML and Content Translation should work more together on new services, since there is a lot of overlap. For example, you are now trying to fix the model binary import issue, and Lift Wing solved that problem from the start.
KServe is also very easy to test locally, see https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/KServe.

> We also made it easily hostable by third parties as a simple Python server. So I guess it is a trade-off now: Lift Wing could have helped achieve all of these, but it would have demanded a lot of time and support from the Lift Wing team.

The third-party use case is definitely interesting, but again I think the KServe model server would have worked as well, since it is a simple Python model server that follows a standard path scheme (one that is also being standardized across multiple model servers these months). We'd happily support the use case :)