
isarantopoulos (Ilias Sarantopoulos)
Machine Learning/MLOps Engineer

User Details

User Since
Nov 1 2022, 12:34 PM (85 w, 4 d)
Availability
Available
LDAP User
Ilias Sarantopoulos
MediaWiki User
ISarantopoulos-WMF [ Global Accounts ]

Recent Activity

Fri, Jun 21

isarantopoulos added a comment to T354870: Deploy 7b parameter models from HF.

Following up on the vllm support issue from https://phabricator.wikimedia.org/T365246#9826503:
with the installation of the new MI210, which seems to be supported by vllm (according to the official docs), we are now in a position to test the vllm backend with huggingface.
Following the official docs I am exploring 2 alternatives:

Fri, Jun 21, 3:32 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos added a comment to T360455: Add Article Quality Model to LiftWing.

Preprocessing now works. For this POC we used the following endpoint https://en.wikipedia.org/w/rest.php/v1/revision/12345/html:

curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/articlequality:predict" -X POST -d '{"rev_id": 12345, "lang": "en"}' -H  "Host: articlequality.experimental.wikimedia.org"
{"rev_id":12345,"lang":"en","normalized_features":[0.25854384507523304,0.0,0.27142566329646467,0.0,0.0,0.0,0.0,false,false]}

Before we move this to production we should also figure out the way to use it with the Rest Gateway.
Now the only thing that is left is to load the model and run the above features through the predict function.
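
The request flow above can be sketched in Python. This is a minimal illustration, not the service's code: the helper names `revision_html_url` and `predict_request` are hypothetical, and generalizing the REST URL to other language codes is an assumption (only the `en` URL appears above). The predict URL, Host header, and payload fields are copied from the curl call.

```python
import json

# Pattern generalized from the single "en" URL quoted above (assumption).
REST_TEMPLATE = "https://{lang}.wikipedia.org/w/rest.php/v1/revision/{rev_id}/html"
PREDICT_URL = (
    "https://inference-staging.svc.codfw.wmnet:30443"
    "/v1/models/articlequality:predict"
)


def revision_html_url(lang: str, rev_id: int) -> str:
    """Build the REST URL that returns the HTML of a revision."""
    return REST_TEMPLATE.format(lang=lang, rev_id=rev_id)


def predict_request(lang: str, rev_id: int) -> dict:
    """Assemble the parts of the POST request shown in the curl example."""
    return {
        "url": PREDICT_URL,
        "headers": {"Host": "articlequality.experimental.wikimedia.org"},
        "body": json.dumps({"rev_id": rev_id, "lang": lang}),
    }
```

The dictionary returned by `predict_request` can be handed to any HTTP client; nothing here performs a network call.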

Fri, Jun 21, 1:55 PM · Patch-For-Review, Content-Transform-Team, Research, Machine-Learning-Team
isarantopoulos committed rMLIS9cf65791c109: articlequality: add FORCE_HTTP env var.
articlequality: add FORCE_HTTP env var
Fri, Jun 21, 1:28 PM
isarantopoulos committed rMLIS2ace8de4eeae: articlequality: add force_http option.
articlequality: add force_http option
Fri, Jun 21, 10:10 AM

Thu, Jun 20

isarantopoulos added a comment to T354870: Deploy 7b parameter models from HF.

Got the same error with vllm==0.4.3 so I'll try to follow the documentation and see if anyone else is experiencing this issue.

WARNING 06-20 15:46:14 config.py:1155] Casting torch.bfloat16 to torch.float16.
INFO 06-20 15:46:14 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/mnt/models', speculative_config=None, tokenizer='/mnt/models', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/mnt/models)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-20 15:46:15 selector.py:56] Using ROCmFlashAttention backend.
INFO 06-20 15:46:15 selector.py:56] Using ROCmFlashAttention backend.
INFO 06-20 15:46:35 model_runner.py:146] Loading model weights took 14.9595 GB
2024-06-20 15:46:35.322 1 kserve ERROR [__main__.py:<module>():254] Failed to start model server: name 'vllm_ops' is not defined
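
A `NameError` like the one above can appear in Python when a name is bound only by an import inside a `try` block, the import fails silently, and the name is used later. The sketch below reproduces that general pattern with hypothetical names (`_hypothetical_compiled_ops`, `fast_ops`). It is not vllm's code, and whether this is the actual cause of the error in the log is an assumption.

```python
try:
    # Hypothetical extension module, assumed to be absent so the import fails.
    import _hypothetical_compiled_ops as fast_ops
except ImportError:
    # The failure is swallowed, so the name `fast_ops` is never bound.
    pass


def run_kernel(x: float) -> float:
    # Fails with NameError at call time, long after the failed import.
    return fast_ops.scale(x)


def try_run(x: float) -> str:
    """Call run_kernel and report the NameError message instead of crashing."""
    try:
        run_kernel(x)
    except NameError as exc:
        return str(exc)
    return "ok"
```

Logging the `ImportError` instead of passing silently would surface the root cause at startup rather than after the model weights have loaded.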
Thu, Jun 20, 3:48 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos added a comment to T354870: Deploy 7b parameter models from HF.

I bumped into a fork of the vllm project by ROCm which also has different releases as well as a flash-attention implementation for ROCm.
I'm trying vllm 0.4.3, and if it fails I'll go with the official instructions for [[ https://docs.vllm.ai/en/latest/getting_started/amd-installation.html | vllm and ROCm ]]. They recommend building vllm after installing torch-rocm, which we already do since we're using the base image. The only difference in requirements-rocm.txt is that it requires ray==2.10.0, which we already have (pytest-asyncio as well, but I doubt it is needed to run anything). I'll try both ways and provide an update.

Thu, Jun 20, 3:42 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos committed rMLIS1b0ac4aa9215: huggingface: bump vllm to 0.4.3.
huggingface: bump vllm to 0.4.3
Thu, Jun 20, 3:25 PM
isarantopoulos created P65221 (An Untitled Masterwork).
Thu, Jun 20, 9:52 AM

Wed, Jun 19

isarantopoulos closed T366015: Add pydantic validation to revertrisk model in liftwing-python package as Resolved.
Wed, Jun 19, 1:00 PM · Machine-Learning-Team
isarantopoulos closed T365246: Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0), a subtask of T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU, as Resolved.
Wed, Jun 19, 1:00 PM · Goal, Machine-Learning-Team
isarantopoulos closed T365246: Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0) as Resolved.
Wed, Jun 19, 1:00 PM · Machine-Learning-Team
isarantopoulos closed T365842: Allow setting huggingfaceserver cmd args from deployment-charts as Resolved.
Wed, Jun 19, 1:00 PM · Machine-Learning-Team
isarantopoulos moved T358744: Deploy RR-language-agnostic batch version to prod from In Progress to 2023-2024 Q4 Done on the Machine-Learning-Team board.
Wed, Jun 19, 12:59 PM · Machine-Learning-Team
isarantopoulos moved T366250: Test Revert Risk model with the transparent config from In Progress to 2023-2024 Q4 Done on the Machine-Learning-Team board.
Wed, Jun 19, 12:59 PM · Machine-Learning-Team
isarantopoulos moved T365246: Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0) from In Progress to 2023-2024 Q4 Done on the Machine-Learning-Team board.
Wed, Jun 19, 12:55 PM · Machine-Learning-Team
isarantopoulos added a member for Machine-Learning-Team: klausman.
Wed, Jun 19, 12:51 PM
isarantopoulos added a member for Machine-Learning-Team: achou.
Wed, Jun 19, 12:50 PM
isarantopoulos added a member for Machine-Learning-Team: isarantopoulos.
Wed, Jun 19, 12:50 PM
isarantopoulos added a comment to T354870: Deploy 7b parameter models from HF.

I have deployed llama3-8B-instruct on ml-staging.
Making a request using the OpenAI API completions endpoint:

Wed, Jun 19, 8:06 AM · Patch-For-Review, Machine-Learning-Team
isarantopoulos added a comment to T357986: Use Huggingface model server image for HF LLMs.

Huggingface image is now shipped with v0.13.0 of kserve and this is the one we are using. This task is considered done and this is the summary:

Wed, Jun 19, 7:44 AM · Patch-For-Review, Machine-Learning-Team
isarantopoulos added a comment to T365246: Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0).

The model server has been successfully upgraded to kserve v0.13.0 and uses the pytorch 2.3.0 - rocm 6.0 base image.

Wed, Jun 19, 7:26 AM · Machine-Learning-Team

Tue, Jun 18

isarantopoulos moved T366772: Solve revscoring models increased latencies for big revision sizes from Unsorted to 2023-2024 Q4 Done on the Machine-Learning-Team board.
Tue, Jun 18, 3:49 PM · Machine-Learning-Team
isarantopoulos added a comment to T359140: 2024 Q4: Users can "pip install liftwing" and access 20% of models.

liftwing package version 0.1.0 has been released on PyPI - https://pypi.org/project/liftwing/

Tue, Jun 18, 12:09 PM · Goal, Machine-Learning-Team

Mon, Jun 17

isarantopoulos added a comment to T360455: Add Article Quality Model to LiftWing.

I'm trying to migrate the following example request to be used by the service:

Mon, Jun 17, 4:38 PM · Patch-For-Review, Content-Transform-Team, Research, Machine-Learning-Team
isarantopoulos added a comment to T360455: Add Article Quality Model to LiftWing.

Dummy version has been deployed on ml-staging-codfw experimental. It is just a dummy service that returns the json input passed in the POST request.

curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/articlequality:predict" -X POST -d '{"rev_id": 12345}' -H  "Host: articlequality.experimental.wikimedia.org"
Mon, Jun 17, 12:40 PM · Patch-For-Review, Content-Transform-Team, Research, Machine-Learning-Team

Fri, Jun 14

isarantopoulos committed rMLIS33d1e6ae6707: ci: add blubber for articlequality.
ci: add blubber for articlequality
Fri, Jun 14, 2:09 PM
isarantopoulos updated the task description for T367293: Update blubber version in docker images.
Fri, Jun 14, 9:28 AM · Machine-Learning-Team

Thu, Jun 13

isarantopoulos added a comment to T359140: 2024 Q4: Users can "pip install liftwing" and access 20% of models.

All revscoring models have been added in the attached Pull Request and will be included in v0.1 of the package.

Thu, Jun 13, 4:49 PM · Goal, Machine-Learning-Team
isarantopoulos added a comment to T367048: Investigate kserve 0.13.0 upgrade.

I have manually tested the new kserve version with the huggingface image and the bert model that is deployed in the experimental namespace.

Thu, Jun 13, 4:18 PM · Machine-Learning-Team

Wed, Jun 12

isarantopoulos updated the task description for T367293: Update blubber version in docker images.
Wed, Jun 12, 1:12 PM · Machine-Learning-Team
isarantopoulos updated the task description for T367293: Update blubber version in docker images.
Wed, Jun 12, 1:10 PM · Machine-Learning-Team
isarantopoulos created T367293: Update blubber version in docker images.
Wed, Jun 12, 1:09 PM · Machine-Learning-Team
isarantopoulos committed rMLIS72ec0b1c0cfb: articlequality: initial commit.
articlequality: initial commit
Wed, Jun 12, 12:56 PM
isarantopoulos added a comment to T359140: 2024 Q4: Users can "pip install liftwing" and access 20% of models.

We are focusing on adding all the models available through the API Gateway. The package can also work with the internal endpoints, but first we will include the models that are publicly available.

Wed, Jun 12, 9:55 AM · Goal, Machine-Learning-Team
isarantopoulos moved T366015: Add pydantic validation to revertrisk model in liftwing-python package from Ready To Go to 2023-2024 Q4 Done on the Machine-Learning-Team board.
Wed, Jun 12, 9:37 AM · Machine-Learning-Team
isarantopoulos committed rMLISb17060cc8ae5: ci: .gitignore(s) only top level /models dir.
ci: .gitignore(s) only top level /models dir
Wed, Jun 12, 7:43 AM

Tue, Jun 11

isarantopoulos closed T366801: Use local tls proxy for Lift Wing staging (inference-staging) as Resolved.
Tue, Jun 11, 2:48 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos moved T366801: Use local tls proxy for Lift Wing staging (inference-staging) from Unsorted to 2023-2024 Q4 Done on the Machine-Learning-Team board.
Tue, Jun 11, 2:45 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos closed T366772: Solve revscoring models increased latencies for big revision sizes as Declined.

Covered by other tasks (https://phabricator.wikimedia.org/T363336)

Tue, Jun 11, 2:45 PM · Machine-Learning-Team
isarantopoulos moved T367048: Investigate kserve 0.13.0 upgrade from Unsorted to In Progress on the Machine-Learning-Team board.
Tue, Jun 11, 2:41 PM · Machine-Learning-Team
isarantopoulos claimed T367048: Investigate kserve 0.13.0 upgrade.
Tue, Jun 11, 2:40 PM · Machine-Learning-Team
isarantopoulos set the point value for T367048: Investigate kserve 0.13.0 upgrade to 3.
Tue, Jun 11, 2:40 PM · Machine-Learning-Team
isarantopoulos added a comment to T359140: 2024 Q4: Users can "pip install liftwing" and access 20% of models.

We made request validation optional and it is now really simple to add support for a new model to the package.
We have also added metadata (again optional) for each model. The user can get the list of available models by running python -m liftwing (relevant PR).
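
The idea of optional validation with per-model metadata can be sketched as follows. This is an illustrative stand-in, not the liftwing package's API: `ModelSpec`, `REGISTRY`, `require_rev_id`, and the `new-model` entry are hypothetical, and a plain function replaces pydantic to keep the sketch dependency-free.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ModelSpec:
    """Describes one model; validator and metadata are both optional."""
    name: str
    validator: Optional[Callable[[dict], None]] = None
    metadata: dict = field(default_factory=dict)


def require_rev_id(payload: dict) -> None:
    """Example validator: reject payloads without an integer rev_id."""
    if not isinstance(payload.get("rev_id"), int):
        raise ValueError("rev_id must be an integer")


# Hypothetical registry: adding a model is a single entry.
REGISTRY = {
    "revertrisk": ModelSpec("revertrisk", validator=require_rev_id,
                            metadata={"description": "example entry"}),
    "new-model": ModelSpec("new-model"),  # no validation, no metadata
}


def prepare_request(model: str, payload: dict) -> dict:
    """Validate the payload only if the model declares a validator."""
    spec = REGISTRY[model]
    if spec.validator is not None:
        spec.validator(payload)
    return payload


def list_models() -> list:
    """Return the registered model names in sorted order."""
    return sorted(REGISTRY)
```

With this layout, a model without a validator accepts any payload unchanged, which is what makes adding a new model cheap.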

Tue, Jun 11, 2:00 PM · Goal, Machine-Learning-Team
isarantopoulos added a comment to T360455: Add Article Quality Model to LiftWing.

Adding a todo list of tasks:

Tue, Jun 11, 12:55 PM · Patch-For-Review, Content-Transform-Team, Research, Machine-Learning-Team
isarantopoulos committed rMLIS0df3628a6a54: huggingface: kserve 0.13.0.
huggingface: kserve 0.13.0
Tue, Jun 11, 9:48 AM
isarantopoulos edited subtasks for T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services, added: Unknown Object (Task); removed: T366772: Solve revscoring models increased latencies for big revision sizes, T365971: Tweak partman recipe for ML k8s workers.
Tue, Jun 11, 6:43 AM · Goal, Machine-Learning-Team
isarantopoulos removed a parent task for T365971: Tweak partman recipe for ML k8s workers: T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services.
Tue, Jun 11, 6:43 AM · Machine-Learning-Team
isarantopoulos removed a parent task for T366772: Solve revscoring models increased latencies for big revision sizes: T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services.
Tue, Jun 11, 6:43 AM · Machine-Learning-Team
isarantopoulos removed a subtask for T366772: Solve revscoring models increased latencies for big revision sizes: T349274: Apply multi-processing to preprocess() in isvcs that suffer from high latency.
Tue, Jun 11, 6:43 AM · Machine-Learning-Team
isarantopoulos edited parent tasks for T349274: Apply multi-processing to preprocess() in isvcs that suffer from high latency, added: Unknown Object (Task); removed: T366772: Solve revscoring models increased latencies for big revision sizes.
Tue, Jun 11, 6:43 AM · Machine-Learning-Team
isarantopoulos removed a subtask for T366772: Solve revscoring models increased latencies for big revision sizes: Unknown Object (Task).
Tue, Jun 11, 6:42 AM · Machine-Learning-Team

Mon, Jun 10

isarantopoulos created T367048: Investigate kserve 0.13.0 upgrade.
Mon, Jun 10, 12:18 PM · Machine-Learning-Team

Fri, Jun 7

isarantopoulos added a comment to T349274: Apply multi-processing to preprocess() in isvcs that suffer from high latency.

Applied multiprocessing to eswiki-damaging and viwiki-reverted but only for large revisions (to avoid cpu throttling)
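
The size-gated dispatch described above can be sketched like this. It is a minimal illustration under stated assumptions, not the model server's code: the threshold value, `extract_features`, `is_large`, and `preprocess` are all hypothetical, and the real cutoff is not given in the comment.

```python
from concurrent.futures import Executor
from typing import Optional

# Hypothetical cutoff; the source does not state the real threshold.
LARGE_REVISION_BYTES = 500_000


def extract_features(text: str) -> dict:
    """Stand-in for the CPU-heavy preprocessing step."""
    return {"chars": len(text), "words": len(text.split())}


def is_large(text: str, threshold: int = LARGE_REVISION_BYTES) -> bool:
    """Decide by encoded size whether a revision counts as large."""
    return len(text.encode("utf-8")) >= threshold


def preprocess(text: str, pool: Optional[Executor] = None,
               threshold: int = LARGE_REVISION_BYTES) -> dict:
    """Offload only large revisions to the pool; run small ones inline."""
    if pool is not None and is_large(text, threshold):
        return pool.submit(extract_features, text).result()
    return extract_features(text)
```

In a real server the pool would likely be a `concurrent.futures.ProcessPoolExecutor` created once at startup. Small revisions bypass it, so they do not pay the inter-process overhead.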

Fri, Jun 7, 10:26 AM · Machine-Learning-Team

Thu, Jun 6

isarantopoulos created T366801: Use local tls proxy for Lift Wing staging (inference-staging).
Thu, Jun 6, 1:54 PM · Patch-For-Review, Machine-Learning-Team
isarantopoulos added a subtask for T366772: Solve revscoring models increased latencies for big revision sizes: T349274: Apply multi-processing to preprocess() in isvcs that suffer from high latency.
Thu, Jun 6, 10:11 AM · Machine-Learning-Team
isarantopoulos edited parent tasks for T349274: Apply multi-processing to preprocess() in isvcs that suffer from high latency, added: T366772: Solve revscoring models increased latencies for big revision sizes; removed: Unknown Object (Task).
Thu, Jun 6, 10:11 AM · Machine-Learning-Team
isarantopoulos removed a subtask for T366772: Solve revscoring models increased latencies for big revision sizes: T349274: Apply multi-processing to preprocess() in isvcs that suffer from high latency.
Thu, Jun 6, 10:10 AM · Machine-Learning-Team
isarantopoulos edited parent tasks for T349274: Apply multi-processing to preprocess() in isvcs that suffer from high latency, added: Unknown Object (Task); removed: T366772: Solve revscoring models increased latencies for big revision sizes.
Thu, Jun 6, 10:10 AM · Machine-Learning-Team
isarantopoulos renamed T366772: Solve revscoring models increased latencies for big revision sizes from Solve revscoring models hanging isvcs for big revision sizes to Solve revscoring models increased latencies for big revision sizes.
Thu, Jun 6, 8:27 AM · Machine-Learning-Team
isarantopoulos added a subtask for T366772: Solve revscoring models increased latencies for big revision sizes: Unknown Object (Task).
Thu, Jun 6, 8:27 AM · Machine-Learning-Team
isarantopoulos added a parent task for T366772: Solve revscoring models increased latencies for big revision sizes: T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services.
Thu, Jun 6, 7:00 AM · Machine-Learning-Team
isarantopoulos added a subtask for T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services: T366772: Solve revscoring models increased latencies for big revision sizes.
Thu, Jun 6, 7:00 AM · Goal, Machine-Learning-Team
isarantopoulos added a subtask for T366772: Solve revscoring models increased latencies for big revision sizes: T349274: Apply multi-processing to preprocess() in isvcs that suffer from high latency.
Thu, Jun 6, 6:59 AM · Machine-Learning-Team
isarantopoulos added a parent task for T349274: Apply multi-processing to preprocess() in isvcs that suffer from high latency: T366772: Solve revscoring models increased latencies for big revision sizes.
Thu, Jun 6, 6:59 AM · Machine-Learning-Team
isarantopoulos created T366772: Solve revscoring models increased latencies for big revision sizes.
Thu, Jun 6, 6:56 AM · Machine-Learning-Team

Wed, Jun 5

isarantopoulos committed rMLIS53511a2a3559: revscoring_model: inspect mw-api-cache for MP preprocess (authored by elukey).
revscoring_model: inspect mw-api-cache for MP preprocess
Wed, Jun 5, 12:49 PM
isarantopoulos added a comment to T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services.
  • Identified that the latency issues are caused by revscoring preprocessing code when scoring large revisions ({T363336#9850901}). The team is focused on tackling the issue by enabling multiprocessing for problematic model servers and/or limiting the content passed to revscoring.
Wed, Jun 5, 8:47 AM · Goal, Machine-Learning-Team

Tue, Jun 4

isarantopoulos created P64026 (An Untitled Masterwork).
Tue, Jun 4, 4:19 PM
isarantopoulos added a comment to T359140: 2024 Q4: Users can "pip install liftwing" and access 20% of models.

We have added request payload validation with pydantic and are currently adding more models to the package.

Tue, Jun 4, 3:20 PM · Goal, Machine-Learning-Team
isarantopoulos moved T356045: Test revertrisk-multilingual with GPU from Blocked to Ready To Go on the Machine-Learning-Team board.
Tue, Jun 4, 2:38 PM · Machine-Learning-Team
isarantopoulos moved T333804: Add meaningful access logs to KServe's pods from Blocked to Backlog/Lift Wing on the Machine-Learning-Team board.
Tue, Jun 4, 2:36 PM · Patch-For-Review, Epic, Machine-Learning-Team
isarantopoulos moved T360455: Add Article Quality Model to LiftWing from Unsorted to Ready To Go on the Machine-Learning-Team board.
Tue, Jun 4, 2:32 PM · Patch-For-Review, Content-Transform-Team, Research, Machine-Learning-Team
isarantopoulos assigned T366250: Test Revert Risk model with the transparent config to achou.
Tue, Jun 4, 2:28 PM · Machine-Learning-Team
isarantopoulos moved T366250: Test Revert Risk model with the transparent config from Unsorted to In Progress on the Machine-Learning-Team board.
Tue, Jun 4, 2:28 PM · Machine-Learning-Team
isarantopoulos moved T366298: Move all isvcs to the transparent config from Unsorted to Backlog/Lift Wing on the Machine-Learning-Team board.
Tue, Jun 4, 2:27 PM · Machine-Learning-Team
isarantopoulos moved T360455: Add Article Quality Model to LiftWing from Backlog/Lift Wing to Unsorted on the Machine-Learning-Team board.
Tue, Jun 4, 2:24 PM · Patch-For-Review, Content-Transform-Team, Research, Machine-Learning-Team
isarantopoulos moved T366379: Special:ORESModels doesnt work in night theme from Unsorted to Watching on the Machine-Learning-Team board.
Tue, Jun 4, 2:23 PM · ORES, Machine-Learning-Team, FY2023-24-WE 2.1 Typography and palette customizations

Mon, May 27

isarantopoulos added a comment to T365246: Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0).

After defining --backend=huggingface in the entrypoint command, the server starts properly, but I'm getting an error when I make a request

Mon, May 27, 5:48 PM · Machine-Learning-Team
isarantopoulos added a comment to T366015: Add pydantic validation to revertrisk model in liftwing-python package.

This is the relevant Pull Request : https://github.com/wikimedia/liftwing-python/pull/5

Mon, May 27, 4:33 PM · Machine-Learning-Team
isarantopoulos moved T364089: Have problem with migrating to LiftWing from ores from Watching to 2023-2024 Q4 Done on the Machine-Learning-Team board.
Mon, May 27, 4:27 PM · Machine-Learning-Team
isarantopoulos created T366015: Add pydantic validation to revertrisk model in liftwing-python package.
Mon, May 27, 4:27 PM · Machine-Learning-Team

Fri, May 24

isarantopoulos created T365842: Allow setting huggingfaceserver cmd args from deployment-charts.
Fri, May 24, 4:54 PM · Machine-Learning-Team
isarantopoulos added a parent task for T365166: Update Pytorch base image to 2.3.0: T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU.
Fri, May 24, 4:47 PM · Machine-Learning-Team
isarantopoulos added a subtask for T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU: T365166: Update Pytorch base image to 2.3.0.
Fri, May 24, 4:47 PM · Goal, Machine-Learning-Team
isarantopoulos added a parent task for T365246: Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0): T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU.
Fri, May 24, 4:47 PM · Machine-Learning-Team
isarantopoulos added a subtask for T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU: T365246: Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0).
Fri, May 24, 4:47 PM · Goal, Machine-Learning-Team
isarantopoulos created T365834: Append wikitech link and contact info to revscoring model servers.
Fri, May 24, 4:23 PM · Machine-Learning-Team
isarantopoulos added a comment to T357986: Use Huggingface model server image for HF LLMs.

Task T365253: Allow Kubernetes workers to be deployed on Bookworm fixed the issue mentioned above in ml-staging-codfw. After that, the bert model works perfectly, while we're having issues with Mistral (more info in {9826605}), probably related to the lack of full support in vllm for MI100.

 time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/bert:predict" -X POST -d '{"instances": ["The capital of france is [MASK]."] }' -H  "Host: bert.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1
{"predictions":["paris"]}
real	0m1.113s
user	0m0.019s
sys	0m0.008s

Previous requests using CPU were taking ~10s.

Fri, May 24, 4:20 PM · Patch-For-Review, Machine-Learning-Team

May 23 2024

isarantopoulos added a comment to T365246: Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0).

Currently investigating whether MI 100 (gfx908) is supported by vllm after all. Although the documentation mentioned above says that it isn't, there are mentions and PRs that seem to support it.
If it doesn't work we'll have to go with the huggingface backend instead of vllm, but we would lose a lot of improvements, mostly in speed.

May 23 2024, 4:30 PM · Machine-Learning-Team
isarantopoulos added a comment to T365246: Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0).

Currently getting a CrashLoopBackoff in the pod with the updated image. However, there is something I missed during the update: when it comes to ROCm support, the latest vllm doesn't support MI 100.

Requirements
OS: Linux
May 23 2024, 4:08 PM · Machine-Learning-Team
isarantopoulos committed rMLIS5ab989e8bf29: huggingface: upgrade kserve to 0.13-rc0.
huggingface: upgrade kserve to 0.13-rc0
May 23 2024, 2:52 PM
isarantopoulos awarded T363191: Test if we can avoid ROCm debian packages on k8s nodes a Yellow Medal token.
May 23 2024, 9:24 AM · Machine-Learning-Team
isarantopoulos moved T365166: Update Pytorch base image to 2.3.0 from In Progress to 2023-2024 Q4 Done on the Machine-Learning-Team board.
May 23 2024, 3:34 AM · Machine-Learning-Team
isarantopoulos closed T365166: Update Pytorch base image to 2.3.0 as Resolved.
May 23 2024, 3:34 AM · Machine-Learning-Team

May 22 2024

isarantopoulos added a comment to T365166: Update Pytorch base image to 2.3.0.

We had forgotten the .pip dir inside the docker image, which increased its size by more than 2GB (the size of the packages, since compressed torch is really big by itself).
The new image is now 13.5GB (2.5GB when compressed), which allows us to publish it in our docker registry.

May 22 2024, 4:01 PM · Machine-Learning-Team
isarantopoulos moved T365246: Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0) from Ready To Go to In Progress on the Machine-Learning-Team board.
May 22 2024, 3:59 PM · Machine-Learning-Team
isarantopoulos moved T365166: Update Pytorch base image to 2.3.0 from Ready To Go to In Progress on the Machine-Learning-Team board.
May 22 2024, 3:59 PM · Machine-Learning-Team

May 21 2024

isarantopoulos added a comment to T364089: Have problem with migrating to LiftWing from ores.

@AgnesAbah have you managed to resolve the issue?
As Kosta mentioned, the issue isn't related to Lift Wing but to the MediaWiki Action API.

May 21 2024, 4:19 PM · Machine-Learning-Team
isarantopoulos added a comment to T365166: Update Pytorch base image to 2.3.0.

As it turns out the above approach won't cut it. Even without the dependencies the compressed image with pytorch 2.3.0 and rocm 6.0 is 4.36GB.
This is the list of packages under /opt/lib/site-packages

functorch  
torch  
torch-2.3.0+rocm6.0.dist-info  
torchgen
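
To see which of these entries dominates the image size, a sketch like the following could be used. It relies only on the standard library; the function names `dir_size_bytes` and `package_sizes` are hypothetical, and this is not the method used to obtain the figures quoted in this comment.

```python
import os


def dir_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files below `path`, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def package_sizes(site_packages: str) -> list:
    """Return (entry, bytes) pairs for each top-level entry, largest first."""
    sizes = []
    for entry in os.listdir(site_packages):
        full = os.path.join(site_packages, entry)
        if os.path.isdir(full):
            size = dir_size_bytes(full)
        else:
            size = os.path.getsize(full)
        sizes.append((entry, size))
    return sorted(sizes, key=lambda item: item[1], reverse=True)
```

Calling `package_sizes("/opt/lib/site-packages")` inside the image would list the entries above ordered by on-disk size.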

It also seems that torch-ROCm by itself is ~12GB, so it is indeed getting bigger and bigger:

May 21 2024, 1:53 PM · Machine-Learning-Team
isarantopoulos added a comment to T365166: Update Pytorch base image to 2.3.0.

Images seem to be getting more bloated, so I am exploring the option of installing pytorch-rocm with the --no-dependencies option and handling dependencies manually, either in the production images repo or on the inference services side. It is a long shot, but I think it is worth trying from our side, at least to rule it out if it can't be done.
Whether this approach is feasible or not will depend on:

May 21 2024, 12:29 PM · Machine-Learning-Team