
GPU errors in hf image in ml-staging
Open, Needs Triage · Public · BUG REPORT

Description

While deploying huggingface models in ml-staging in T357986: Use Huggingface model server image for HF LLMs we experienced an issue when trying to use the GPU for running inference.
The image used is machinelearning-liftwing-inference-services-huggingface:2024-04-08-110759-publish which is based on amd-pytorch21:2.1.2rocm5.5-1.

This means pytorch version 2.1.2 and ROCm 5.5. Keep in mind that the ROCm drivers installed on the node are 5.4.2, so this version mismatch could be a potential cause of the problem.
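As a quick sanity check of this mismatch, the ROCm/HIP version the PyTorch wheel was built against can be printed from the container's Python console; a minimal sketch (torch.version.hip is None on non-ROCm builds):

import torch

# ROCm/HIP version this PyTorch build was compiled against (None on CPU/CUDA builds).
print("torch:", torch.__version__)
print("built against HIP/ROCm:", torch.version.hip)
# Whether PyTorch can actually reach the GPU through that stack.
print("GPU available:", torch.cuda.is_available())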

To reproduce, we can attach a shell to the running container and execute the following in a Python console. This gives us:

>>> import torch
>>> torch.cuda.is_available()
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
False

The strace for the above reveals the following:

openat(AT_FDCWD, "/dev/dri/renderD128", O_RDWR|O_CLOEXEC) = 7
fstat(7, {st_mode=S_IFCHR|0666, st_rdev=makedev(0xe2, 0x80), ...}) = 0
stat("/sys/dev/char/226:128/device/drm", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
access("/dev/dri/card128", F_OK)        = -1 ENOENT (No such file or directory)
access("/dev/dri/renderD128", F_OK)     = -1 EPERM (Operation not permitted)
ioctl(7, DRM_IOCTL_GET_CLIENT, 0x7fffeea94eb0) = -1 EACCES (Permission denied)
write(2, "amdgpu_device_initialize: amdgpu"..., 58) = 58
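The same failure can be reproduced without going through PyTorch at all; a minimal sketch mirroring the syscalls above (the device path is the one from the strace):

import os

dev = "/dev/dri/renderD128"

# open() on the render node succeeds, as the openat() in the strace shows...
fd = os.open(dev, os.O_RDWR | os.O_CLOEXEC)
print("open ok, fd =", fd)

# ...while access(F_OK) on the very same path returns False (the underlying
# access() syscall fails with EPERM in the strace), which is what makes
# amdgpu_device_initialize bail out.
print("access F_OK:", os.access(dev, os.F_OK))
os.close(fd)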

Event Timeline

isarantopoulos renamed this task from "GPU errors in ml-staging" to "GPU errors in hf image in ml-staging". (Fri, Apr 19, 2:16 PM)
isarantopoulos updated the task description.

Debian Bookworm has a different version of the libdrm-amdgpu1 package, as we can see in the repository. We could try Bullseye to see if this solves the problem. The issue is that even if it does, we will still need to solve this again in the future when we upgrade the Debian version.

Change #1023414 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/docker-images/production-images@master] amdpytorch21: use bullseye as pytorch base image

https://gerrit.wikimedia.org/r/1023414

At the moment we have tried the combinations in the following matrix. Success/fail refers to whether the GPU was successfully used with pytorch.

OS       | pytorch version | pytorch ROCm build | Success/fail
bullseye | 2.0.1           | 5.4.2              | success
bookworm | 2.1.2           | 5.5                | fail

To reflect our discussion in the team meeting: although trying Bullseye with pytorch 2.1.2-rocm5.5 would be a good test, even if it works we'd end up in the same place in the future, since downgrading the OS version would only be temporary.
Instead, we discussed the following options (please add if there are more):

  • Update ROCm drivers on the node to ROCm 5.5
  • Remove the ROCm-related drivers from the node and test whether we can rely only on the drivers shipped in the image, so that we can control/upgrade ROCm versions by maintaining the pytorch base image.

Quick clarification - there are currently two places where we use ROCm-specific libs:

  1. On every stat node and k8s node with a GPU, we deploy the Debian packages containing the whole ROCm software stack. For the drivers, we rely on the ones shipped with the Linux kernel.
  2. Via Pytorch on our base Docker images.

I am almost convinced that we could avoid deploying the ROCm Debian packages on k8s nodes, since at the moment they shouldn't be used. The k8s GPU plugin on those nodes exposes the device to containers, and it just needs the kernel to recognize the GPU. So we can already upgrade ROCm in the Docker image separately from the k8s workers, and to avoid confusion we should probably try to remove the GPU drivers on our k8s nodes and test. I'll open a task for it!

Change #1023414 abandoned by Ilias Sarantopoulos:

[operations/docker-images/production-images@master] amdpytorch21: use bullseye as pytorch base image

Reason:

https://gerrit.wikimedia.org/r/1023414

Change #1026952 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] amd/pytorch21: update ROCm drivers to 5.7

https://gerrit.wikimedia.org/r/1026952

Change #1026952 merged by Elukey:

[operations/docker-images/production-images@master] amd/pytorch21: update ROCm drivers to 5.7

https://gerrit.wikimedia.org/r/1026952

This should be the diff between the libdrm bullseye (2.4.104) and bookworm (2.4.114) versions:

https://salsa.debian.org/xorg-team/lib/libdrm/-/compare/libdrm-2.4.104-1...libdrm-2.4.114-1?from_project_id=4296&straight=false

I don't see any mention of amdgpu_get_auth, but amdgpu_device_get_fd is added and it may be related to the error mentioned in the task's description.

Let's see if ROCm 5.7 works better with libdrm :)

== Step 2: publishing ==
Successfully published image docker-registry.discovery.wmnet/amd-pytorch21:2.1.2rocm5.7-1

Change #1028393 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] huggingface: upgrade base image

https://gerrit.wikimedia.org/r/1028393

== Step 2: publishing ==
Successfully published image docker-registry.discovery.wmnet/amd-pytorch21:2.1.2rocm5.7-1

Sigh, too early :( It turns out that the Docker layer with the ROCm packages is now ~5GB compressed, so it failed to be uploaded to the registry (the limit is 4GB). No idea why, since overall the Docker layer is smaller than the Pytorch 2.2 one. Maybe some binary/non-compressible data is present in 2.1 and not in 2.2?

Ah wow, my bad! I inspected the Docker image and it contains a ton of Nvidia binaries. I'll review the install procedure again, really sneaky.

Change #1028519 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] amd-pytorch21: fix the ROCm version

https://gerrit.wikimedia.org/r/1028519

Change #1028519 merged by Elukey:

[operations/docker-images/production-images@master] amd-pytorch21: fix the ROCm version

https://gerrit.wikimedia.org/r/1028519

Change #1028393 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] huggingface: upgrade base image

https://gerrit.wikimedia.org/r/1028393

Change #1028839 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: update hugging face's Docker image

https://gerrit.wikimedia.org/r/1028839

Change #1028839 merged by Elukey:

[operations/deployment-charts@master] ml-services: update hugging face's Docker image

https://gerrit.wikimedia.org/r/1028839

Still seeing the old issue with ROCm 5.6:

amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)                                                                      
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)

A lot of useful info in https://en.wikipedia.org/wiki/Direct_Rendering_Manager, which also explains DRM-Auth and what it does.

Tried https://wikitech.wikimedia.org/wiki/Machine_Learning/AMD_GPU#Reset_the_GPU_state and killed/restarted the mistral pod, just as a test to see if anything was in a weird state, but same error.

I have also rebooted ml-staging2001, to see if anything was in a weird state, same error.

Change #1029218 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/debs/amd-k8s-device-plugin@master] Release new version

https://gerrit.wikimedia.org/r/1029218

Change #1029619 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: Update Docker image for nllb-gpu

https://gerrit.wikimedia.org/r/1029619

Change #1029619 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: Update Docker image for nllb-gpu

https://gerrit.wikimedia.org/r/1029619

Change #1029625 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: add nllb-gpu to ml-staging

https://gerrit.wikimedia.org/r/1029625

Change #1029625 merged by Elukey:

[operations/deployment-charts@master] ml-services: add nllb-gpu to ml-staging

https://gerrit.wikimedia.org/r/1029625

On ml-staging2001 I checked the pod's details (via docker inspect) and found:

"Devices": [
    {
        "PathOnHost": "/dev/kfd",
        "PathInContainer": "/dev/kfd",
        "CgroupPermissions": "rw"
    },
    {
        "PathOnHost": "/dev/dri/card1",
        "PathInContainer": "/dev/dri/card1",
        "CgroupPermissions": "rw"
    },
    {
        "PathOnHost": "/dev/dri/renderD128",
        "PathInContainer": "/dev/dri/renderD128",
        "CgroupPermissions": "rw"
    }
],

Plus the Linux permissions are ok in the container, namely "other" (so including the user somebody) is able to read/write the right devices (renderD128 and kfd). The same config and permissions are present for the nllb-200-gpu-predictor-default pod that runs on ml-serve1001, which does not show this problem.
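For reference, the in-container permission check described above boils down to something like this sketch (paths as mounted in the pod):

import os
import stat

for dev in ("/dev/kfd", "/dev/dri/renderD128"):
    st = os.stat(dev)
    print(
        dev,
        stat.filemode(st.st_mode),                               # e.g. crw-rw-rw-
        "owner=%d:%d" % (st.st_uid, st.st_gid),
        "major:minor=%d:%d" % (os.major(st.st_rdev), os.minor(st.st_rdev)),
    )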

At this point it seems to me that the issue may lie in the combination of pytorch 2.1 + ROCm 5.6 + MI100 GPU + kernel 5.10.209.

I had a chat with Ilias about trying the pytorch 2.2 image, which ships ROCm 5.7 (the last of the 5.x series) and may help (Pytorch upstream may also have needed some fixes, but I can't find GitHub issues related to that).

I have also made a test: deploy nllb-gpu to staging, using the MI100 GPU (removing it from mistral-7b). It works fine, I can see the GPU usable without errors.

At this point I'd test the Pytorch 2.2 base image to see if we get a different result; if not, it is likely the base image itself doing something wrong.

Change #1029218 merged by Elukey:

[operations/debs/amd-k8s-device-plugin@master] Release new version

https://gerrit.wikimedia.org/r/1029218

Regarding the nllb-gpu deployment: we successfully tested it when we first obtained the MI100. The deployment was removed at some point as it wasn't being used and we wanted to start working on the GPU with the huggingface image. The same holds for the article-descriptions model server.
However, both have only been tested with torch 2.0.1-rocm5.4.2.
So let's try a different combination. KServe upstream recently updated the huggingfaceserver to support torch 2.3.0, which will also be available in KServe 0.13. This means that we can test ROCm 5.7 with it.
I am working towards updating this.

Another thing we can try in order to figure out the issue is to test the existing llm image with a more recent torch + ROCm version.

Change #1030059 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] llm: bump torch and rocm version

https://gerrit.wikimedia.org/r/1030059

Change #1030084 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] llm: update to Bookworm

https://gerrit.wikimedia.org/r/1030084

Change #1030084 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] llm: update to Bookworm

https://gerrit.wikimedia.org/r/1030084

I got an error when trying the llm image locally with bullseye-torch2.3.0-rocm5.7 (related patch):

Traceback (most recent call last):
  File "/srv/app/llm/model.py", line 9, in <module>
    import torch
  File "/opt/lib/python/site-packages/torch/__init__.py", line 237, in <module>
    from torch._C import *  # noqa: F403
ImportError: libamdhip64.so: cannot enable executable stack as shared object requires: Invalid argument

Haven't found any related open issue, so I'm currently testing different pytorch versions to see if the issue persists.

Since the torch._C import indicates compiled C code, my best guess here is that it's trying to run x86_64 code on an M1/Arm CPU, which obviously would fail.

Change #1030059 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] llm: bump torch and rocm 5.7 versions (2.2.1-rocm5.7)

https://gerrit.wikimedia.org/r/1030059

It seems not to be related to the OS, since nllb-gpu on Bookworm ran fine on ml-staging2001 (with the GPU).

Something worth noting: I tried to set up a simple experiment for both nllb and mistral, namely running the following via kubectl exec and stracing it on the host:

import torch; import time; time.sleep(15); torch.cuda.is_available()

For nllb I see the following:

[pid 3043297] openat(AT_FDCWD, "/dev/dri/renderD129", O_RDWR|O_CLOEXEC) = 7
[pid 3043297] fstat(7, {st_mode=S_IFCHR|0666, st_rdev=makedev(0xe2, 0x81), ...}) = 0
[pid 3043297] stat("/sys/dev/char/226:129/device/drm", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
[pid 3043297] ioctl(7, DRM_IOCTL_VERSION, 0x7a66580) = 0
[pid 3043297] ioctl(7, DRM_IOCTL_VERSION, 0x7a66580) = 0
[pid 3043297] fcntl(7, F_DUPFD_CLOEXEC, 0) = 8
[pid 3043297] ioctl(8, DRM_IOCTL_AMDGPU_INFO or DRM_IOCTL_SIS_FB_FREE, 0x7ffeec8b1380) = 0
[pid 3043297] ioctl(8, DRM_IOCTL_AMDGPU_INFO or DRM_IOCTL_SIS_FB_FREE, 0x7ffeec8b1350) = 0
[pid 3043297] ioctl(8, DRM_IOCTL_AMDGPU_INFO or DRM_IOCTL_SIS_FB_FREE, 0x7ffeec8b1350) = 0
[..]
[pid 3043297] stat("/sys/dev/char/226:129/device/drm", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
[pid 3043297] ioctl(5, DRM_IOCTL_VERSION, 0x8889730) = 0
[pid 3043297] ioctl(5, DRM_IOCTL_VERSION, 0x8889730) = 0
[pid 3043297] fcntl(5, F_DUPFD_CLOEXEC, 0) = 7
[pid 3043297] ioctl(7, DRM_IOCTL_AMDGPU_INFO or DRM_IOCTL_SIS_FB_FREE, 0x7ffeec8b1550) = 0
[pid 3043297] ioctl(7, DRM_IOCTL_AMDGPU_INFO or DRM_IOCTL_SIS_FB_FREE, 0x7ffeec8b1520) = 0
[pid 3043297] ioctl(7, DRM_IOCTL_AMDGPU_INFO or DRM_IOCTL_SIS_FB_FREE, 0x7ffeec8b1520) = 0

But there is no access("/dev/dri/renderD129", F_OK), for example (renderD129 is the device mounted on ml-serve1001).

So I tried the following:

kubectl exec nllb-200-gpu-predictor-default-00008-deployment-645655978ddcdd7 -n llm -- /usr/bin/python3 -c "import os; import time; print(os.access('/dev/dri/renderD129', os.F_OK))"
False

Now I am wondering if Pytorch 2.{1,2} and/or ROCm 5.x have changed something that triggers an access syscall.
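One way to narrow this down would be to take PyTorch out of the picture and call libdrm directly; a hedged sketch via ctypes, assuming libdrm_amdgpu.so.1 is loadable inside the container (it must be present somewhere, since the amdgpu_device_initialize error comes from it):

import ctypes
import os

lib = ctypes.CDLL("libdrm_amdgpu.so.1", use_errno=True)

fd = os.open("/dev/dri/renderD128", os.O_RDWR | os.O_CLOEXEC)
major = ctypes.c_uint32()
minor = ctypes.c_uint32()
handle = ctypes.c_void_p()

# int amdgpu_device_initialize(int fd, uint32_t *major, uint32_t *minor,
#                              amdgpu_device_handle *handle);
ret = lib.amdgpu_device_initialize(fd, ctypes.byref(major),
                                   ctypes.byref(minor), ctypes.byref(handle))
print("amdgpu_device_initialize returned", ret)  # 0 on success, negative on failure
os.close(fd)

If this fails the same way under strace, the access() call comes from libdrm itself rather than from PyTorch.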

Change #1031001 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: test nllb image with torch221-rocm5.7

https://gerrit.wikimedia.org/r/1031001

Change #1031001 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: test nllb image with torch221-rocm5.7

https://gerrit.wikimedia.org/r/1031001

Ok finally something that is consistent: NLLB with pytorch 2.2.1 and ROCm 5.7 shows:

amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)

Re-ran the kubectl exec test and I see:

[pid 1673023] access("/dev/dri/renderD128", F_OK) = -1 EPERM (Operation not permitted)

What we know:

  • It shouldn't be the kernel or the OS version/packages (like libdrm)
  • It seems to be something related to Pytorch 2.1 and 2.2, coupled with ROCm 5.6/5.7 (the ones that we tested)
  • It works with Pytorch 2.0 and ROCm 5.4

From https://man7.org/linux/man-pages/man2/access.2.html

access() checks whether the calling process can access the file pathname. If pathname is a symbolic link, it is dereferenced.

The mode specifies the accessibility check(s) to be performed, and is either the value F_OK, or a mask consisting of the bitwise OR of one or more of R_OK, W_OK, and X_OK. F_OK tests for the existence of the file. R_OK, W_OK, and X_OK test whether the file exists and grants read, write, and execute permissions, respectively.

The check is done using the calling process's real UID and GID, rather than the effective IDs as is done when actually attempting an operation (e.g., open(2)) on the file. Similarly, for the root user, the check uses the set
of permitted capabilities rather than the set of effective capabilities; and for non-root users, the check uses an empty set of capabilities.

This allows set-user-ID programs and capability-endowed programs to easily determine the invoking user's authority. In other words, access() does not answer the "can I read/write/execute this file?" question. It answers a slightly different question: "(assuming I'm a setuid binary) can the user who invoked me read/write/execute this file?", which gives set-user-ID programs the possibility to prevent malicious users from causing them to read files which users shouldn't be able to read.

If the calling process is privileged (i.e., its real UID is zero), then an X_OK check is successful for a regular file if execute permission is enabled for any of the file owner, group, or other.

While we're not using suid/sgid binaries, this makes me wonder if being in a container changes the outcome of open() vs. access(), because the in-container user is not the same as the one at the base OS level (user namespaces).
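A quick way to test that hypothesis from inside the pod is to print the process's real and effective IDs next to the access() results, along the lines of the checks below:

import os

dev = "/dev/dri/renderD128"
print("real uid:gid =", os.getuid(), os.getgid())
print("effective uid:gid =", os.geteuid(), os.getegid())
for name, mode in (("F_OK", os.F_OK), ("R_OK", os.R_OK),
                   ("W_OK", os.W_OK), ("X_OK", os.X_OK)):
    print(name, "->", os.access(dev, mode))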

This is totally strange:

root@deploy1002:~# kubectl exec nllb-200-gpu-predictor-00001-deployment-7fc4b4f798-9sv28 -n experimental  -- ls -l /dev/dri
total 0
crw-rw---- 1 root video 226,   1 May 13 15:38 card1
crw-rw-rw- 1 root   106 226, 128 May 13 15:38 renderD128

root@deploy1002:~# kubectl exec nllb-200-gpu-predictor-00001-deployment-7fc4b4f798-9sv28 -n experimental  -- /usr/bin/python3 -c "import os; print(os.access('/dev/dri/renderD128', os.F_OK))"
False
root@deploy1002:~# kubectl exec nllb-200-gpu-predictor-00001-deployment-7fc4b4f798-9sv28 -n experimental  -- /usr/bin/python3 -c "import os; print(os.access('/dev/dri/renderD128', os.R_OK))"
True
root@deploy1002:~# kubectl exec nllb-200-gpu-predictor-00001-deployment-7fc4b4f798-9sv28 -n experimental  -- /usr/bin/python3 -c "import os; print(os.access('/dev/dri/renderD128', os.W_OK))"
True
root@deploy1002:~# kubectl exec nllb-200-gpu-predictor-00001-deployment-7fc4b4f798-9sv28 -n experimental  -- /usr/bin/python3 -c "import os; print(os.access('/dev/dri/renderD128', os.X_OK))"
False

So F_OK returns False, as if the file did not exist, but the rest is consistent with our permissions.

Change #1031020 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] llm: bump torch and rocm 5.7 versions (2.2.1-rocm5.7)

https://gerrit.wikimedia.org/r/1031020

Change #1031020 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] llm: bump torch and rocm 6.0 versions (2.3.0-rocm6.0)

https://gerrit.wikimedia.org/r/1031020

Tested llm image for nllb-200 with pytorch 2.3.0 and rocm 6.0 and got the same errors:

amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)
amdgpu_device_initialize: amdgpu_get_auth (1) failed (-1)

I manually changed the image in the experimental namespace in ml-staging and tested this. After the test I reverted it back to the previous one, so the deployment now still has pytorch 2.2.1 and rocm 5.7.

Janis from ServiceOps suggested that maybe seccomp or AppArmor are playing a role in this.

jayme@ml-staging2001:~$ aa-exec -p docker-default -- /usr/bin/python3 -c "import os; print(os.access('/dev/dri/renderD128', os.F_OK))"
True

This tells us that AppArmor shouldn't be the culprit; now we should find a way to test whether removing seccomp profiles makes our life better. In this list, which should be what Docker uses by default, we have both access and ioctl allowed, but maybe some arguments are not ok (like F_OK)?
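One way to check whether a seccomp filter is in effect at all for the server process is the Seccomp field in /proc/<pid>/status (0 = disabled, 1 = strict, 2 = filter); a small sketch to run inside the container:

def seccomp_mode(pid="self"):
    # Returns the value of the Seccomp field from /proc/<pid>/status.
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("Seccomp:"):
                return int(line.split()[1])
    return None

print("seccomp mode for PID 1:", seccomp_mode(1))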

Following an advice from Janis, I tried on ml-staging2001:

sudo docker run --cap-drop all --rm -it --device /dev/dri/renderD128 --security-opt no-new-privileges --entrypoint /bin/bash 8df6203550c2

This should mimic what we run with k8s, and the result is:

>>> import os
>>> os.access('/dev/dri/renderD128', os.F_OK)
True

The default seccomp profile (what Docker ships by default) is applied; I also tried with --security-opt apparmor=docker-default and got the same result.

Not sure what is playing a role here, but we are getting closer.

Even better:

sudo docker run --rm -it --device /dev/dri/renderD128 --security-opt=no-new-privileges --cap-drop ALL --entrypoint /usr/bin/python3 8df6203550c2 -c "import os; import time; print(os.access('/dev/dri/renderD128', os.F_OK)); time.sleep(500)"

Two more data points that don't help at all:

jayme@ml-staging2001:~$ sudo docker exec -it --user 0 k8s_kserve-container_nllb-200-gpu-predictor-00007-deployment-678689d65f-f8xfx_experimental_32d35f31-d95c-48e2-b8c7-f8345eb699e7_0 /usr/bin/python3 -c 'import os; print("real: %d:%d, effective: %d:%d, result: %s" % (os.getuid(),os.getgid(),os.geteuid(),os.getegid(),os.access("/dev/dri/renderD128", os.F_OK)))'
real: 0:0, effective: 0:0, result: False
jayme@ml-staging2001:~$ sudo docker exec -it --user 0 --privileged k8s_kserve-container_nllb-200-gpu-predictor-00007-deployment-678689d65f-f8xfx_experimental_32d35f31-d95c-48e2-b8c7-f8345eb699e7_0 /usr/bin/python3 -c 'import os; print("real: %d:%d, effective: %d:%d, result: %s" % (os.getuid(),os.getgid(),os.geteuid(),os.getegid(),os.access("/dev/dri/renderD128", os.F_OK)))'
real: 0:0, effective: 0:0, result: False

You got me @elukey :-p
For reasons I have not yet tried to understand, the mknod cgroup permission is the culprit. Without it, the access() call fails:

jayme@ml-staging2001:~$ sudo docker run --rm -it --device /dev/dri/renderD128:/dev/dri/renderD128:rw --security-opt=no-new-privileges --cap-drop ALL --entrypoint /usr/bin/python3 8df6203550c2 -c 'import os; import time; print("real: %d:%d, effective: %d:%d, result: %s" % (os.getuid(),os.getgid(),os.geteuid(),os.getegid(),os.access("/dev/dri/renderD128", os.F_OK))); time.sleep(500)'
real: 65533:65533, effective: 65533:65533, result: False

The device plugin maps all devices in rw mode, while Docker uses rwm by default (https://github.com/ROCm/k8s-device-plugin/blob/736a6ef17a3f9570c5fd81fec73cce50e849e272/cmd/k8s-device-plugin/main.go#L231)
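For reference, the device whitelist that the container's cgroup actually got can be inspected on the host with something like the sketch below (assuming cgroup v1; the cgroup path is a placeholder, the real one depends on the kubelet/Docker cgroup layout):

# Run as root on the k8s worker; <pod-uid> and <container-id> are placeholders.
CGROUP = "/sys/fs/cgroup/devices/kubepods/pod<pod-uid>/<container-id>"

with open(CGROUP + "/devices.list") as f:
    for line in f:
        # Entries look like "c 226:128 rw": type, major:minor and the rwm
        # flags -- 'm' (mknod) is the permission the plugin does not request.
        print(line.strip())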

Opened https://github.com/ROCm/k8s-device-plugin/issues/65

Thanks a lot Janis <3

Next steps:

  • Wait for upstream to get back to us, but it could take days.
  • Figure out why mknod capabilities are needed in the cgroup to make access(F_OK) work. We could add a custom patch to the k8s-device-plugin to allow the extra permissions when attaching the GPU to the container (diverging from upstream), but without knowing exactly why this is needed, it seems like a potentially wrong way to go.

Finally we found the issue, see https://github.com/ROCm/k8s-device-plugin/issues/65#issuecomment-2115414637

The only option seems to be to upgrade ml-staging2001 to Bookworm, but of course we'd be the first ones to do it (since the rest of the clusters run on Bullseye). The upgrade would also solve T363191, so we could experiment with two things at the same time.

I am inclined to open another task to track the upgrade on a k8s worker node to Bookworm.