While deploying Hugging Face models in ml-staging (T357986: Use Huggingface model server image for HF LLMs), we ran into an issue when trying to use the GPU for inference.
The image used is machinelearning-liftwing-inference-services-huggingface:2024-04-08-110759-publish, which is based on amd-pytorch21:2.1.2rocm5.5-1, i.e. PyTorch 2.1.2 built against ROCm 5.5. Keep in mind that the ROCm drivers installed on the node are version 5.4.2, so this version mismatch is a potential cause of the problem.
To reproduce the issue, we can attach a shell to the running container and execute the following in a Python console, which gives us:
```
>>> import torch
>>> torch.cuda.is_available()
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
False
```
The strace of the above call reveals the following:
```
openat(AT_FDCWD, "/dev/dri/renderD128", O_RDWR|O_CLOEXEC) = 7
fstat(7, {st_mode=S_IFCHR|0666, st_rdev=makedev(0xe2, 0x80), ...}) = 0
stat("/sys/dev/char/226:128/device/drm", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
access("/dev/dri/card128", F_OK) = -1 ENOENT (No such file or directory)
access("/dev/dri/renderD128", F_OK) = -1 EPERM (Operation not permitted)
ioctl(7, DRM_IOCTL_GET_CLIENT, 0x7fffeea94eb0) = -1 EACCES (Permission denied)
write(2, "amdgpu_device_initialize: amdgpu"..., 58) = 58
```
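The strace shows that the render node `/dev/dri/renderD128` is visible inside the container but the process is denied read/write access (EPERM/EACCES), which is why libdrm's `amdgpu_device_initialize` fails. As a quick way to narrow this down from inside the container, here is a minimal sketch of a permission check; `check_render_node` is a hypothetical helper, not part of any existing tooling:

```python
import os
import stat

def check_render_node(path="/dev/dri/renderD128"):
    """Return a short diagnosis for a DRM render node, approximating the
    checks libdrm performs before amdgpu_device_initialize succeeds."""
    if not os.path.exists(path):
        # Node not exposed inside the container at all
        return "missing"
    st = os.stat(path)
    if not stat.S_ISCHR(st.st_mode):
        # Unexpected: a render node should be a character device
        return "not-a-char-device"
    if not os.access(path, os.R_OK | os.W_OK):
        # Matches the EPERM/EACCES seen in the strace above: the node
        # exists but the container process cannot open it for read/write
        return "no-rw-access"
    return "ok"

if __name__ == "__main__":
    print(check_render_node())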