The torch version on ml-lab1001 is 2.4.1+rocm6.1, while the current release is 2.9.1. I'll share my use case, as it likely covers at least a few of the core concerns:
I was trying to fine-tune a ModernBERT-family model (this one, but the issues trigger with any model in the family). The issues I encountered:
- That particular model doesn't have a safetensors version, so if you try to load the torch checkpoint, you get the error below. This can be fixed by pointing at a safetensors version that's available in an MR on HuggingFace, but it's kinda hacky:
ValueError: Due to a serious vulnerability issue in `torch.load`, even with `weights_only=True`, we now require users to upgrade torch to at least v2.6 in order to use the function. This version restriction does not apply when loading files with safetensors. See the vulnerability report here https://nvd.nist.gov/vuln/detail/CVE-2025-32434
- Once you switch to safetensors, you get a `#include <Python.h>` exception (a torch.compile/dynamo build error). This can be avoided by doing the following (as suggested by the error message):
import torch._dynamo
torch._dynamo.config.suppress_errors = True
- Once you get through both of those errors, the model does train, but it only spits out 0s. This gets caught when evaluating the model, where you get the following exception (inspecting the predictions confirms they're all 0s). Some sleuthing suggests this might be due to the old torch version (details):
ValueError: Input contains NaN
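The `ValueError` above comes from the metric library's input validation, so checking the predictions with numpy first gives a clearer signal than letting the metric call crash (variable names here are illustrative, standing in for the output of `Trainer.predict()`):

```python
import numpy as np

# Stand-in for the logits coming out of evaluation.
preds = np.array([[0.0, float("nan")], [0.0, 0.0]])

# Surface the real problem before any metric computation runs.
nan_count = int(np.isnan(preds).sum())
if nan_count:
    print(f"{nan_count} NaN value(s) in predictions")

# The other symptom: after replacing NaNs, argmax yields all zeros.
labels = np.nan_to_num(preds).argmax(axis=-1)
all_zero = bool((labels == 0).all())
```

This kind of guard is what caught the issue here: the NaNs and the all-zero labels show up together, pointing at the forward pass rather than the metric code.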
All this can be replicated with /home/isaacj/mmbert-peacock/training_peacock_mmbert.ipynb on ml-lab1001. I was able to get an mBERT model to train just fine, so I don't think it's the code/data.