
Upgrade AMD GPU + torch version of ML Labs machines
Closed, ResolvedPublic5 Estimated Story Points

Description

The torch version on ml-lab1001 is 2.4.1+rocm6.1, while the current stable release is 2.9.1. I'll share my use case, as it likely covers at least a few of the core concerns:

I was seeking to fine-tune a ModernBERT-family model (this one, but the issues trigger with any model in the family). The issues encountered:

  • That particular model doesn't have a safetensors version, so if you try to load the torch version, you get the error below. This can be fixed by referring to a safetensors version that's available in an MR on Hugging Face, but it's kind of hacky:
ValueError: Due to a serious vulnerability issue in `torch.load`, even with `weights_only=True`, we now require users to upgrade torch to at least v2.6 in order to use the function. This version restriction does not apply when loading files with safetensors.
See the vulnerability report here https://nvd.nist.gov/vuln/detail/CVE-2025-32434
  • Once you switch to safetensors, you hit a compilation error about a missing #include <Python.h> header. This can be avoided by doing the following (as suggested by the error message):
import torch._dynamo
torch._dynamo.config.suppress_errors = True
  • Once you get past both of those errors, the model does train, but it only spits out 0s. This gets caught when evaluating the model, where you get the following exception (and inspecting the predictions shows that they're all 0s). Some sleuthing suggested that this might be due to the old torch version (details):
ValueError: Input contains NaN

All this can be replicated with /home/isaacj/mmbert-peacock/training_peacock_mmbert.ipynb on ml-lab1001. I was able to get a mBERT model to train just fine so I don't think it's the code/data.
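As an aside, the torch >= 2.6 requirement behind that first error is a plain (major, minor) version gate. A minimal sketch of such a check in pure Python (the helper name is mine, not from torch; the version strings are the ones from this task):

```python
def meets_min_torch(version: str, minimum=(2, 6)) -> bool:
    """Check a torch version string such as '2.4.1+rocm6.1' against a (major, minor) floor."""
    core = version.split("+")[0]  # drop the local build tag, e.g. '+rocm6.1'
    major, minor = (int(part) for part in core.split(".")[:2])
    return (major, minor) >= tuple(minimum)

# ml-lab1001's torch predates the torch.load requirement; 2.6.0+rocm6.1 satisfies it
print(meets_min_torch("2.4.1+rocm6.1"))  # False
print(meets_min_torch("2.6.0+rocm6.1"))  # True
```

Anything below 2.6 trips the torch.load ValueError unless the weights are shipped as safetensors.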

Event Timeline

Tagging you @Trokhymovych as I think you mentioned having (similar?) issues with a Qwen re-ranking model as well that seemed to relate to the torch version?

gkyziridis set the point value for this task to 5.
Update

I think the issue is a version incompatibility between torch and ROCm (PyTorch 2.4.1 is very old for this family of models).
Since ml-lab1001 is gone right now, I am using the ml-lab1002 machine.
I managed to build a new Docker image using docker-registry.wikimedia.org/bookworm:20251207 as the base image.
I installed torch 2.6.0+rocm6.1 using this Dockerfile:

# Base image
FROM docker-registry.wikimedia.org/bookworm:20251207

# Set root user
USER root

# Proxy build arguments
ENV http_proxy=http://webproxy:8080
ENV https_proxy=http://webproxy:8080
ENV no_proxy=127.0.0.1,::1,localhost,.wmnet,.wikimedia.org,.wikipedia.org,.wikibooks.org,.wikiquote.org,.wiktionary.org,.wikisource.org,.wikispecies.org,.wikiversity.org,.wikidata.org,.mediawiki.org,.wikinews.org,.wikivoyage.org

# Install dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    python3.11 python3.11-venv python3.11-distutils python3-pip \
    build-essential git curl ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Set python3.11 as default python
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1

# Create a virtual environment
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Upgrade pip inside venv
RUN pip install --upgrade pip

# Install PyTorch with ROCm support
# RUN pip install torch==2.9.1+rocm6.1 --index-url https://download.pytorch.org/whl/rocm6.1
RUN pip install torch==2.6.0+rocm6.1 --index-url https://download.pytorch.org/whl/rocm6.1

# Install libraries
RUN pip install transformers accelerate safetensors datasets

# Optional: Install common ML packages
RUN pip install numpy scipy pandas matplotlib scikit-learn

# Set ROCm environment variables
ENV ROCM_PATH=/opt/rocm
ENV PATH=$ROCM_PATH/bin:$PATH
ENV LD_LIBRARY_PATH=$ROCM_PATH/lib:$ROCM_PATH/lib64:$LD_LIBRARY_PATH

# Set working directory
WORKDIR /workspace

# Default command
CMD ["python"]

I built the image using: docker build --network=host -t torch_rocm3 .

I ran the container using: docker run --rm -it --network=host torch_rocm3 (this does not attach any GPU).
I trained the model with a short script and tested it, as shown in the following paste:

$ docker run --rm -it --network=host torch_rocm3

Python 3.11.2 (main, Apr 28 2025, 14:11:48) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> from transformers import (
...     AutoTokenizer,
...     AutoModelForSequenceClassification,
...     Trainer,
...     TrainingArguments
... )
>>>
>>> print(torch.__version__)
2.6.0+rocm6.1
>>>
>>> MODEL = "answerdotai/ModernBERT-base"
>>> DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
>>>
>>> print(f"Device: {DEVICE}")
Device: cpu
>>>
>>> tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer_config.json: 20.8kB [00:00, 53.1MB/s]
tokenizer.json: 2.13MB [00:00, 115MB/s]
special_tokens_map.json: 100%|███████████████| 694/694 [00:00<00:00, 4.88MB/s]
>>>
>>> model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at answerdotai/ModernBERT-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
>>>
>>> texts = ["hello world", "modernbert test"]
>>> labels = [0, 1]
>>>
>>> batch = tokenizer(
...     texts,
...     padding=True,
...     truncation=True,
...     return_tensors="pt"
... )
>>> batch["labels"] = torch.tensor(labels)
>>>
>>> train_data = []
>>> for i in range(len(labels)):
...     train_data.append({
...         "input_ids": batch["input_ids"][i],
...         "attention_mask": batch["attention_mask"][i],
...         "labels": batch["labels"][i]
...     })
...
>>> args = TrainingArguments(
...     output_dir="./out",
...     per_device_train_batch_size=2,
...     max_steps=1,  # just 1 training step
...     report_to="none",
... )
>>> trainer = Trainer(
...     model=model,
...     args=args,
...     train_dataset=train_data
... )
>>>
>>> trainer.train()
{'train_runtime': 1.686, 'train_samples_per_second': 1.186, 'train_steps_per_second': 0.593, 'train_loss': 0.6049032807350159, 'epoch': 1.0}
100%|███████████████| 1/1 [00:01<00:00, 1.69s/it]
TrainOutput(global_step=1, training_loss=0.6049032807350159, metrics={'train_runtime': 1.686, 'train_samples_per_second': 1.186, 'train_steps_per_second': 0.593, 'total_flos': 6655426680.0, 'train_loss': 0.6049032807350159, 'epoch': 1.0})
>>>
>>> print("\nRunning prediction...")
Running prediction...
>>>
>>> with torch.no_grad():
...     outputs = model(batch["input_ids"].to(DEVICE), attention_mask=batch["attention_mask"].to(DEVICE))
...     logits = outputs.logits.cpu()
>>>
>>> all_zero = torch.all(logits == 0)
>>> has_nan = torch.isnan(logits).any()
>>>
>>> print("\n=== RESULTS ===")
>>> print("Logits:")
>>> print(logits)
>>> print("All zeros? :", bool(all_zero))
>>> print("Contains NaNs?", bool(has_nan))

=== RESULTS ===
Logits:
tensor([[-0.2067, -1.2938],
        [-0.9808,  0.4442]])
All zeros? : False
Contains NaNs? False

So we can see that with the above image the model returns a normal prediction outcome (not zeros), but this run used the CPU, not a GPU.
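The all-zeros/NaN check at the end of the paste can be factored into one helper. A minimal sketch over plain Python lists (the helper name is hypothetical, not part of the scripts above), so the same test logic can be read without a torch install:

```python
import math

def logits_look_sane(logits):
    """Flag the two failure modes from this task: all-zero logits and NaN entries."""
    flat = [x for row in logits for x in row]
    all_zero = all(x == 0.0 for x in flat)
    has_nan = any(math.isnan(x) for x in flat)
    return not (all_zero or has_nan)

# The CPU run above: normal, non-zero logits
print(logits_look_sane([[-0.2067, -1.2938], [-0.9808, 0.4442]]))  # True
# The failure mode originally seen with torch 2.4.1 on ml-lab1001
print(logits_look_sane([[0.0, 0.0], [0.0, 0.0]]))                 # False
```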

Attaching GPU to the docker image in ml-lab1002:
This is the Dockerfile.gpu:

# Base image
FROM docker-registry.wikimedia.org/bookworm:20251207

# Set root user
USER root

# Proxy build arguments
ENV http_proxy=http://webproxy:8080
ENV https_proxy=http://webproxy:8080
ENV no_proxy=127.0.0.1,::1,localhost,.wmnet,.wikimedia.org,.wikipedia.org,.wikibooks.org,.wikiquote.org,.wiktionary.org,.wikisource.org,.wikispecies.org,.wikiversity.org,.wikidata.org,.mediawiki.org,.wikinews.org,.wikivoyage.org

# Install dependencies including Python headers
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    python3.11 python3.11-venv python3.11-distutils python3.11-dev python3-pip \
    build-essential git curl ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Set python3.11 as default python
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1

# Create a virtual environment
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Upgrade pip inside venv
RUN pip install --upgrade pip

# Install PyTorch with ROCm support
# RUN pip install torch==2.9.1+rocm6.1 --index-url https://download.pytorch.org/whl/rocm6.1
RUN pip install torch==2.6.0+rocm6.1 --index-url https://download.pytorch.org/whl/rocm6.1

# Install ML packages
RUN pip install numpy pandas transformers accelerate safetensors datasets

# Set ROCm environment variables
ENV ROCM_PATH=/opt/rocm
ENV PATH=$ROCM_PATH/bin:$PATH
ENV LD_LIBRARY_PATH=$ROCM_PATH/lib:$ROCM_PATH/lib64:$LD_LIBRARY_PATH

# Set working directory
WORKDIR /workspace

# Default command
CMD ["python"]

# Build
# docker build --network=host -f Dockerfile.gpu -t torch_rocm_gpu .

I built it using docker build --network=host -f Dockerfile.gpu -t torch_rocm_gpu .; the build was successful.

To run it with the GPUs attached, I ran:

$ docker run --rm --network=host -it \
--device=/dev/kfd --device=/dev/dri \
--group-add=$(getent group video | cut -d: -f3) \
--group-add=$(getent group render | cut -d: -f3) \
--ipc=host \
--security-opt seccomp=unconfined \
torch_rocm_gpu

But I am receiving the following error, which points to a GPU kernel fault or an incompatibility between ROCm 6.1 and this GPU:

:0:rocdevice.cpp :2881: 15722138585593 us: [pid:1 tid:0x7faccf3ff6c0] Callback: Queue 0x7faa34100000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
There is also this message:

>>> print(torch.cuda.is_available()) 
amdgpu.ids: No such file or directory
amdgpu.ids: No such file or directory
True

Here are the results:

$ docker run --rm --network=host -it \
    --device=/dev/kfd --device=/dev/dri \
    --group-add=$(getent group video | cut -d: -f3) \
    --group-add=$(getent group render | cut -d: -f3) \
    --ipc=host \
    --security-opt seccomp=unconfined \
    torch_rocm_gpu

# Python 3.11.2 (main, Apr 28 2025, 14:11:48) [GCC 12.2.0] on linux
# Type "help", "copyright", "credits" or "license" for more information.
import os

os.environ["TORCH_DISABLE_JIT"] = "1"
os.environ["TORCHINDUCTOR_DISABLE"] = "1"
os.environ["TORCH_COMPILE_DISABLE"] = "1"

import torch._dynamo
torch._dynamo.config.suppress_errors = True

import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments
)

print("PyTorch version:", torch.__version__)
# PyTorch version: 2.6.0+rocm6.1
print("CUDA/ROCm available:", torch.cuda.is_available())
# amdgpu.ids: No such file or directory
# amdgpu.ids: No such file or directory
# CUDA/ROCm available: True
print("Number of GPUs:", torch.cuda.device_count())
# Number of GPUs: 2
print("Current GPU name:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "None")
# Current GPU name: AMD Radeon Graphics
print(torch.version.hip)
# 6.1.40091-a8dbc0c19

# MODEL = "answerdotai/ModernBERT-base"
MODEL = "jhu-clsp/mmBERT-base"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running on {device}")
# Running on cuda

tokenizer = AutoTokenizer.from_pretrained(MODEL)
# tokenizer_config.json: 20.8kB [00:00, 47.0MB/s]
# tokenizer.json: 2.13MB [00:00, 113MB/s]
# special_tokens_map.json: 100%|███████████████| 694/694 [00:00<00:00, 4.95MB/s]

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, num_labels=2
).to(device)
# config.json: 1.19kB [00:00, 4.65MB/s]
# amdgpu.ids: No such file or directory
# amdgpu.ids: No such file or directory
# model.safetensors: 100%|███████████████| 599M/599M [00:01<00:00, 435MB/s]
# Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at answerdotai/ModernBERT-base and are newly initialized: ['classifier.bias', 'classifier.weight']
# You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

texts = [f"example sentence {i}" for i in range(2050)]
labels = [i % 2 for i in range(2050)]

batch = tokenizer(
    texts,
    padding=True,
    truncation=True,
    return_tensors="pt"
)
batch["labels"] = torch.tensor(labels)

train_data = []
for i in range(len(labels)):
    train_data.append({
        "input_ids": batch["input_ids"][i],
        "attention_mask": batch["attention_mask"][i],
        "labels": batch["labels"][i]
    })

args = TrainingArguments(
    output_dir="./out",
    per_device_train_batch_size=2,
    max_steps=1,  # just 1 training step
    report_to="none",
    no_cuda=not torch.cuda.is_available(),  # auto disable if no GPU
)

not torch.cuda.is_available()
# >>> False
not False
# >>> True

with torch.no_grad():
    outputs = model(
        batch["input_ids"].to(device),
        attention_mask=batch["attention_mask"].to(device)
    )
    logits = outputs.logits.cpu()

:0:rocdevice.cpp :2881: 15722138585593 us: [pid:1 tid:0x7faccf3ff6c0] Callback: Queue 0x7faa34100000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016

Thanks @gkyziridis for digging into this! Out of curiosity, why not jump to the current stable versions (2.9.1 for torch and 6.4 for AMD)? I see you commented that line out in the initial file that at least had torch at 2.9.1.

I was actually trying to take small steps and gradually upgrade the torch/ROCm drivers; I did not want to jump directly to the latest versions.
Nevertheless, this combination of versions seems to fix the issue using the GPU image, so your curiosity is in super good shape and pointing in the correct direction :P.

Results

Dockerfile.gpu:

# Base image
FROM docker-registry.wikimedia.org/bookworm:20251207

# Set root user
USER root

# Proxy build arguments
ENV http_proxy=http://webproxy:8080
ENV https_proxy=http://webproxy:8080
ENV no_proxy=127.0.0.1,::1,localhost,.wmnet,.wikimedia.org,.wikipedia.org,.wikibooks.org,.wikiquote.org,.wiktionary.org,.wikisource.org,.wikispecies.org,.wikiversity.org,.wikidata.org,.mediawiki.org,.wikinews.org,.wikivoyage.org

# Install dependencies including Python headers
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    python3.11 python3.11-venv python3.11-distutils python3.11-dev python3-pip \
    build-essential git curl ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Set python3.11 as default python
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1

# Create a virtual environment
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Upgrade pip inside venv
RUN pip install --upgrade pip

# Install PyTorch with ROCm support
RUN pip install torch==2.9.1+rocm6.4 --index-url https://download.pytorch.org/whl/rocm6.4

# Install ML packages
RUN pip install numpy pandas transformers accelerate safetensors datasets

# Set ROCm environment variables
ENV ROCM_PATH=/opt/rocm
ENV PATH=$ROCM_PATH/bin:$PATH
ENV LD_LIBRARY_PATH=$ROCM_PATH/lib:$ROCM_PATH/lib64:$LD_LIBRARY_PATH

# Set working directory
WORKDIR /workspace

# Default command
CMD ["python"]

Build it using:

docker build --network=host -f Dockerfile.gpu -t torch_rocm_gpu2964 .

Run it using:

$ docker run --rm --network=host -it \
--device=/dev/kfd --device=/dev/dri \
--group-add=$(getent group video | cut -d: -f3) \
--group-add=$(getent group render | cut -d: -f3) \
--ipc=host \
--security-opt seccomp=unconfined \
torch_rocm_gpu2964

Train model dummy script:

import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments
)


print("PyTorch version:", torch.__version__)
# PyTorch version: 2.9.1+rocm6.4

print("CUDA/ROCm available:", torch.cuda.is_available())
# /opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory
# /opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory
# CUDA/ROCm available: True

print("Number of GPUs:", torch.cuda.device_count())
# Number of GPUs: 2

print("Current GPU name:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "None")
# Current GPU name: AMD Radeon Graphics

print(torch.version.hip)
# 6.4.43484-123eb5128


MODEL = "jhu-clsp/mmBERT-base"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running on {device}")
# Running on cuda

tokenizer = AutoTokenizer.from_pretrained(MODEL)
# tokenizer_config.json: 46.4kB [00:00, 104MB/s]
# tokenizer.json: 100%|███████████████| 17.5M/17.5M [00:00<00:00, 17.6MB/s]
# special_tokens_map.json: 100%|███████████████| 636/636 [00:00<00:00, 4.35MB/s]

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, num_labels=2
).to(device)
# config.json: 1.19kB [00:00, 4.96MB/s]
# pytorch_model.bin: 100%|███████████████| 1.23G/1.23G [00:02<00:00, 416MB/s]
# model.safetensors: 100%|███████████████| 1.23G/1.23G
# Some weights of ModernBertForSequenceClassification were not initialized from the model checkpoint at jhu-clsp/mmBERT-base and are newly initialized: ['classifier.bias', 'classifier.weight']
# You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

texts = [f"example sentence {i}" for i in range(2050)]
labels = [i % 2 for i in range(2050)]

batch = tokenizer(
    texts,
    padding=True,
    truncation=True,
    return_tensors="pt"
)
batch["labels"] = torch.tensor(labels)

train_data = []
for i in range(len(labels)):
    train_data.append({
        "input_ids": batch["input_ids"][i],
        "attention_mask": batch["attention_mask"][i],
        "labels": batch["labels"][i]
    })


args = TrainingArguments(
    output_dir="./out",
    per_device_train_batch_size=2,
    max_steps=1,  # just 1 training step
    report_to="none",
    no_cuda=not torch.cuda.is_available(),  # auto disable if no GPU
)

with torch.no_grad():
    outputs = model(
        batch["input_ids"].to(device),
        attention_mask=batch["attention_mask"].to(device)
    )
    logits = outputs.logits.cpu()
# /opt/venv/lib/python3.11/site-packages/torch/_inductor/compile_fx.py:312: UserWarning: TensorFloat32 tensor cores for float32 matrix
# multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
#   warnings.warn(

all_zero = torch.all(logits == 0)
has_nan = torch.isnan(logits).any()

print("\n=== RESULTS ===")
# === RESULTS ===

print(logits)
# tensor([[-2.1368,  0.9964],
#         [-1.7954,  0.6751],
#         [-2.5328,  1.1171],
#         [-2.3959,  0.8501],
#         [-2.4323,  0.8784],
#         [-2.3603,  0.8249]])

print("All zeros? :", bool(all_zero))
# All zeros? : False

print("Contains NaNs?", bool(has_nan))
# Contains NaNs? False

Nevertheless, this combination of versions seems to fix the issue using the GPU image, so your curiosity is in a super good shape towards the correct direction :P.

Haha, always happy to be accidentally helpful :) Once it's deployed on ml-lab1002, happy to test but definitely looking promising!


Hey @Isaac, since this torch/ROCm version combination is working, we will install it on ml-lab1002.
In the meantime, you can use the scripts from the comment https://phabricator.wikimedia.org/T410663#11446759 and start experimenting with the model and drivers until we install them on the ml-lab1002 machine.

Keep in mind that there is now an ml-build1001.eqiad.wmnet machine (the old ml-lab1001), where we will work in a container fashion: build a Docker image with the torch/ROCm versions we want -> install packages -> work inside the container.
I created this Phabricator task, https://phabricator.wikimedia.org/T412357, so we can keep track of the progress.