One of the ways we can help out the community is by hosting an open source LLM for them.
Event Timeline
Change 919293 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[machinelearning/liftwing/inference-services@main] LLM: model server example with bloom
Change 919345 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: deploy Bloom-560m model on Lift Wing
Change 919347 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[integration/config@master] inference-services: add bloom pipelines
Change 919347 merged by jenkins-bot:
[integration/config@master] inference-services: add bloom pipelines
Mentioned in SAL (#wikimedia-releng) [2023-05-12T15:52:41Z] <hashar> Reloaded Zuul for https://gerrit.wikimedia.org/r/c/integration/config/+/919347/ | T333861
@MoritzMuehlenhoff Hi! We are trying to host an LLM on our infrastructure, and one of the candidates is BLOOM. The license is very permissive but it does impose some restrictions (see the bottom of the page):
https://huggingface.co/spaces/bigscience/license
Context is https://bigscience.huggingface.co/blog/the-bigscience-rail-license
Some restrictions are very high level, so I am not 100% sure whether the model is ok to use. Other models come with Apache 2.0 (like https://huggingface.co/bigscience/mt0-base), which is definitely better, but I am wondering where to draw the line between "acceptable" and "not acceptable" in this context. Lemme know your thoughts :)
Usual IANAL disclaimer ahead: if this were a software license it would not meet the standard required by OSI. They cover this e.g. in the FAQ at https://opensource.org/faq/#evil, and one infamous example is the JSON license (http://www.json.org/license.html), for which https://lwn.net/Articles/707510/ is a nice writeup. That said, the restrictions might not be fully enforceable (I have no idea whether "You agree not to use the Model or Derivatives of the Model" is a binding restriction).
But in general, my recommendation would be to talk to the actual lawyers/Legal next. The policies for deploying LLMs in our infrastructure are uncharted territory. Random questions:
- Do we expect these to follow a FLOSS license, or are there better licenses for ML models?
- If we don't require the four freedoms defined for software, what are our expectations for models?
- If a FLOSS license refers to "source code", what is the equivalent for an ML model, the full training data set?
- What are our expectations for modifiability, e.g. do we want to (and can we) maintain local deviations of a model?
Maybe Legal can make an assessment whether it's fine to go ahead with BLOOM and then actual policies can evolve as we build things.
Thanks @MoritzMuehlenhoff for your valuable input! We have some way to go before we figure out what we are going to do about licensing for models developed elsewhere that we want to deploy on our infra, especially those that we want to make publicly available.
I'm dropping here a (very) interesting blog post on these kinds of licenses: https://huggingface.co/blog/open_rail
Change 919293 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] LLM: model server example with bloom
Change 919345 merged by Elukey:
[operations/deployment-charts@master] ml-services: deploy Bloom-560m model on Lift Wing
Bloom-560m has been deployed on Lift Wing staging in the experimental namespace and can be accessed like this:
curl https://inference-staging.svc.codfw.wmnet:30443/v1/models/bloom-560m:predict -X POST -i -H "Host: bloom-560m.experimental.wikimedia.org" -d '{"prompt": "Once upon a time ", "result_length": 50}'
It takes ~8 seconds to get a response from the above call, and ~15s if you double result_length to 100.
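For reference, the gist of the generation step looks roughly like this (a minimal transformers sketch, not the actual inference-services code; I'm assuming here that result_length maps to the generate() token budget):

```python
# Minimal sketch of the generation step (illustrative only, not the actual Lift Wing server code).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

def generate(prompt: str, result_length: int) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    # Greedy decoding; latency grows with both the prompt length and result_length.
    outputs = model.generate(inputs["input_ids"], max_length=result_length)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generate("Once upon a time ", 50))
```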
Next things to do:
- Improve the inference code and the way that results are generated
- Do some load testing to get a better understanding of how latency relates to input and requested output length.
- Try other models
Change 921363 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[machinelearning/liftwing/inference-services@main] feat: change bloom model token output sampling
Change 921363 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] feat: change bloom model token output sampling
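For anyone following along, a sketch of what such a change can look like with transformers.generate, assuming a switch from greedy decoding to sampled decoding (parameter values are illustrative, not necessarily the ones in the patch):

```python
# Sketch of sampled (rather than greedy) token selection (illustrative values only).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

inputs = tokenizer("Once upon a time ", return_tensors="pt")
outputs = model.generate(
    inputs["input_ids"],
    max_length=50,
    do_sample=True,    # sample tokens from the distribution instead of always taking the argmax
    top_k=50,          # illustrative values, not necessarily those used in the patch
    top_p=0.95,
    temperature=0.8,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```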
Change 921366 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: upgrade bloom model with newer image
Change 921366 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: upgrade bloom model with newer image
Change 921368 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[machinelearning/liftwing/inference-services@main] fix: call class attribute
Change 921368 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] fix: call class attribute
Change 921371 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: fix bloom model inference
Change 921371 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: fix bloom model inference
Change 922583 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: deploy bloom-3b model
Added the above patch to deploy the bloom-3b model:
https://huggingface.co/bigscience/bloom-3b
Since it requires additional resources I also increased the limitRanges; however, I don't have access to see the available resources (kubectl describe node is forbidden - I'll open a ticket with SRE).
@elukey In the meantime, is this going to be ok? (16GB for this pod)
Additionally, I want to add a separate deployment of the 560m model with more resources to see if it helps reduce latency.
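As a rough sanity check on the 16GB figure (assuming fp32 weights; the dtype actually used by the deployment may differ):

```python
# Back-of-the-envelope memory estimate for the bloom-3b weights (assumption: fp32, 4 bytes/param).
params = 3_000_000_000
bytes_per_param = 4                                       # fp32; roughly halve for fp16/bf16
weights_gib = params * bytes_per_param / 1024**3
print(f"~{weights_gib:.1f} GiB for the weights alone")    # ~11.2 GiB, so 16GB leaves some headroom
```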
Change 922583 merged by Elukey:
[operations/deployment-charts@master] ml-services: deploy bloom-3b model
Successfully deployed the bloom-3b model.
It can be queried like this:
curl https://inference-staging.svc.codfw.wmnet:30443/v1/models/bloom-3b:predict -X POST -i -H "Host: bloom-3b.experimental.wikimedia.org" -d '{"prompt": "The quick brown fox", "result_length": 10}'
Moving forward we'll have to determine a set of tests to run as a benchmark.
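As a starting point, something as simple as timing the staging endpoint over a range of output lengths would already tell us a lot (a rough sketch; URL and payload as in the curl calls above):

```python
# Rough latency probe against the bloom-3b staging endpoint (same URL/Host as the curl example above).
import time
import requests

URL = "https://inference-staging.svc.codfw.wmnet:30443/v1/models/bloom-3b:predict"
HOST = "bloom-3b.experimental.wikimedia.org"

for result_length in (10, 50, 100):
    start = time.monotonic()
    response = requests.post(
        URL,
        headers={"Host": HOST},
        json={"prompt": "The quick brown fox", "result_length": result_length},
        timeout=120,
    )
    response.raise_for_status()
    print(f"result_length={result_length}: {time.monotonic() - start:.1f}s")
```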
I'm also dropping here some links related to inference optimization:
https://huggingface.co/blog/bloom-inference-optimization
https://huggingface.co/blog/bloom-inference-pytorch-scripts