Review and test the AMD GPU kubernetes plugin
Closed, Resolved (Public)

Description

Thanks to several people, we now have the following settings in the DSE cluster:

  1. k8s 1.23
  2. A ROCm-compatible AMD GPU (thanks Ben!)

We should now be able to review and test https://github.com/RadeonOpenCompute/k8s-device-plugin

Main things to check:

  1. The security model of the plugin seems to require a daemonset deployed on all nodes with high privileges. We should follow up with ServiceOps to understand what best practice we should follow.
  2. Is the support for labeling ok? Not all nodes will have GPUs, so we'll need to be able to schedule GPU workloads only on the nodes that have one.
  3. Does it play well with ROCm drivers?
  4. Should we use the upstream helm chart or something different?

Ideally at the end we should be able to run a simple app using the GPU on a DSE cluster pod.

Details

Repo                                        Branch      Lines +/-
operations/puppet                           production  +26 -13
operations/puppet                           production  +13 -0
operations/docker-images/production-images  master      +7 -1
operations/docker-images/production-images  master      +7 -1
operations/docker-images/production-images  master      +13 -6
operations/puppet                           production  +1 -0
operations/puppet                           production  +2 -0
operations/docker-images/production-images  master      +8 -2
operations/docker-images/production-images  master      +9 -4
operations/docker-images/production-images  master      +24 -2
operations/docker-images/production-images  master      +7 -1
operations/docker-images/production-images  master      +7 -1
operations/docker-images/production-images  master      +10 -1
operations/docker-images/production-images  master      +10 -2
operations/docker-images/production-images  master      +30 -0
operations/debs/amd-k8s-device-plugin       master      +89 -0
operations/docker-images/production-images  master      +71 -0
operations/puppet                           production  +6 -0

Event Timeline

Change 908210 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::dse_k8s::worker: add AMD GPU support

https://gerrit.wikimedia.org/r/908210

Change 908210 merged by Elukey:

[operations/puppet@production] role::dse_k8s::worker: add AMD GPU support

https://gerrit.wikimedia.org/r/908210

Change 908792 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] Add new images to support AMD GPUs on k8s

https://gerrit.wikimedia.org/r/908792

Just had a chat with @JMeybohm; we now have a better understanding of how a device plugin works. Starting point:

https://v1-23.docs.kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-registration

So the two Go binaries that the daemonset runs are basically two gRPC services that register with the Kubelet, and that the Kubelet can call to discover and access devices. The containers have to mount /var/lib/kubelet/device-plugins, which at the moment requires root capabilities, so it makes more sense to just package the Go binaries in a simple deb package and run them as systemd units alongside the kubelet one on the k8s nodes with a GPU.

I've requested a new repository in:
https://www.mediawiki.org/wiki/Gerrit/New_repositories/Requests

Change 908792 abandoned by Elukey:

[operations/docker-images/production-images@master] Add new images to support AMD GPUs on k8s

Reason:

https://gerrit.wikimedia.org/r/908792

Manually tested the binary on dse-k8s-worker1001:

I0414 14:48:57.821469 2034547 main.go:305] AMD GPU device plugin for Kubernetes
I0414 14:48:57.821500 2034547 main.go:305] ./usr/bin/k8s-device-plugin version 
I0414 14:48:57.821505 2034547 main.go:305] hwloc: _VERSION: 2.9.0, _API_VERSION: 0x00020800, _COMPONENT_ABI: 7, Runtime: 0x00020400
I0414 14:48:57.821523 2034547 manager.go:42] Starting device plugin manager
I0414 14:48:57.821529 2034547 manager.go:46] Registering for system signal notifications
I0414 14:48:57.821668 2034547 manager.go:52] Registering for notifications of filesystem changes in device plugin directory
I0414 14:48:57.822408 2034547 manager.go:60] Starting Discovery on new plugins
I0414 14:48:57.822418 2034547 manager.go:66] Handling incoming signals
I0414 14:48:57.822428 2034547 manager.go:71] Received new list of plugins: [gpu]
I0414 14:48:57.822478 2034547 manager.go:110] Adding a new plugin "gpu"
I0414 14:48:57.822488 2034547 plugin.go:64] gpu: Starting plugin server
I0414 14:48:57.822493 2034547 plugin.go:94] gpu: Starting the DPI gRPC server
I0414 14:48:57.822750 2034547 plugin.go:112] gpu: Serving requests...
I0414 14:49:07.824083 2034547 plugin.go:128] gpu: Registering the DPI with Kubelet
I0414 14:49:07.824663 2034547 plugin.go:140] gpu: Registration for endpoint amd.com_gpu
I0414 14:49:07.827041 2034547 amdgpu.go:100] /sys/module/amdgpu/drivers/pci:amdgpu/0000:3d:00.0
I0414 14:49:07.827079 2034547 amdgpu.go:100] /sys/module/amdgpu/drivers/pci:amdgpu/0000:da:00.0
I0414 14:49:07.888209 2034547 main.go:149] Watching GPU with bus ID: 0000:3d:00.0 NUMA Node: [0]
I0414 14:49:07.888265 2034547 main.go:149] Watching GPU with bus ID: 0000:da:00.0 NUMA Node: [1]

I see the related unix sockets being created:

elukey@dse-k8s-worker1001:~$ sudo ls -l /var/lib/kubelet/device-plugins
total 4
srwxr-xr-x 1 root root   0 Apr 14 14:48 amd.com_gpu
-rw-r--r-- 1 root root   0 Apr 13 09:17 DEPRECATION
-rw------- 1 root root 124 Apr 14 14:49 kubelet_internal_checkpoint
srwxr-xr-x 1 root root   0 Apr 13 09:17 kubelet.sock
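
The same check can be scripted. The following is only a minimal sketch, assuming the default directory shown in the listing above; the function name list_plugin_sockets is made up for illustration and is not part of the plugin or of the Kubelet. It returns the names of the Unix sockets in the directory, skipping regular files such as DEPRECATION and kubelet_internal_checkpoint.

```python
import os
import stat


def list_plugin_sockets(directory="/var/lib/kubelet/device-plugins"):
    """Return the sorted names of the Unix sockets found in `directory`.

    Hypothetical helper for illustration only. Reading the real directory
    most likely needs root, as the `sudo ls` above suggests.
    """
    sockets = []
    for entry in os.scandir(directory):
        # Look at the entry itself rather than at a symlink target.
        mode = entry.stat(follow_symlinks=False).st_mode
        if stat.S_ISSOCK(mode):
            sockets.append(entry.name)
    return sorted(sockets)
```

On a worker where the plugin has registered, calling list_plugin_sockets() as root would be expected to include both amd.com_gpu and kubelet.sock.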

One of the possible tests could be:
https://github.com/RadeonOpenCompute/k8s-device-plugin/blob/master/example/pod/alexnet-gpu.yaml

Change 909177 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/debs/amd-k8s-device-plugin@master] Add initial debianizazion

https://gerrit.wikimedia.org/r/909177

Change 909196 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] Add a simple Docker image to test AMD GPUs

https://gerrit.wikimedia.org/r/909196

Change 909177 merged by Elukey:

[operations/debs/amd-k8s-device-plugin@master] Add initial debianizazion

https://gerrit.wikimedia.org/r/909177

Change 909196 merged by Elukey:

[operations/docker-images/production-images@master] Add a simple Docker image to test AMD GPUs

https://gerrit.wikimedia.org/r/909196

Change 909256 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] amd-gpu-tester: set tensorflow-rocm version

https://gerrit.wikimedia.org/r/909256

Change 909256 merged by Elukey:

[operations/docker-images/production-images@master] amd-gpu-tester: set tensorflow-rocm version

https://gerrit.wikimedia.org/r/909256

Mentioned in SAL (#wikimedia-operations) [2023-04-17T14:14:30Z] <elukey> upload amd-k8s-device-plugin deb (1.25.2.3-1) to bullseye-wikimedia - T333009

Change 909304 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] amd-gpu-tester: add ROCm suite packages

https://gerrit.wikimedia.org/r/909304

Change 909304 merged by Elukey:

[operations/docker-images/production-images@master] amd-gpu-tester: add ROCm suite packages

https://gerrit.wikimedia.org/r/909304

Change 909313 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] amd-gpu-tester: reduce the ROCm packages installed

https://gerrit.wikimedia.org/r/909313

Change 909313 merged by Elukey:

[operations/docker-images/production-images@master] amd-gpu-tester: reduce the ROCm packages installed

https://gerrit.wikimedia.org/r/909313

Change 909604 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] amd-gpu-tester: add rccl package (ROCm suite)

https://gerrit.wikimedia.org/r/909604

Change 909604 merged by Elukey:

[operations/docker-images/production-images@master] amd-gpu-tester: add rccl package (ROCm suite)

https://gerrit.wikimedia.org/r/909604

Change 909609 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] amd-gpu-tester: improve run-test.sh and add nobody to the 'render' group

https://gerrit.wikimedia.org/r/909609

Change 909609 merged by Elukey:

[operations/docker-images/production-images@master] amd-gpu-tester: improve run-test.sh and add nobody to the 'render' group

https://gerrit.wikimedia.org/r/909609

Building a test image is turning out to be a difficult job, mainly due to the permissions of the devices exposed by the k8s plugin. For example, on dse-k8s-worker1001 we have the following devices:

elukey@dse-k8s-worker1001:~$ ls -l /dev/kfd
crw-rw---- 1 root render 242, 0 Apr 13 09:16 /dev/kfd

elukey@dse-k8s-worker1001:~$ ls -l /dev/dri/
total 0
drwxr-xr-x 2 root root        140 Apr 13 09:17 by-path
crw-rw---- 1 root video  226,   0 Apr 13 09:16 card0
crw-rw---- 1 root video  226,   1 Apr 13 09:17 card1
crw-rw---- 1 root video  226,   2 Apr 13 09:17 card2
crw-rw---- 1 root render 226, 128 Apr 13 09:17 renderD128
crw-rw---- 1 root render 226, 129 Apr 13 09:17 renderD129

IIUC from https://rocmdocs.amd.com/_/downloads/en/latest/pdf/ a user needs to be in the render POSIX group to access KFD (Kernel Fusion Driver) and the renderDXXX DRI devices when using the GPU. The main problem is that the AMD k8s plugin exposes the devices to the container keeping the same uid/gid, which results in:

root@alexnet-tf-gpu-pod:/# ls -l /dev/kfd 
crw-rw---- 1 root 106 242, 0 Apr 18 15:58 /dev/kfd

root@alexnet-tf-gpu-pod:/# ls -l /dev/dri/renderD128 
crw-rw---- 1 root 106 226, 128 Apr 18 15:58 /dev/dri/renderD128

The gid 106 is the render group on dse-k8s-worker, so I am not really sure how to map this number to a group in the container's OS to grant proper access.
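
To make the problem concrete, here is a rough sketch of the classic owner/group/other check applied to the device nodes above. It is a simplification under stated assumptions: it ignores ACLs, capabilities other than root's override, and the device cgroup, and the function name can_open_rw plus the uid 65534 used for nobody are illustrative choices rather than values taken from the test image. The relevant point is that the comparison is made on numeric ids, so what matters is whether the process carries gid 106, not whether the container knows a group called render.

```python
import stat


def can_open_rw(st_mode, st_uid, st_gid, uid, gids):
    """Approximate the owner/group/other permission check for read+write.

    Hypothetical helper: `uid` and `gids` describe the calling process,
    the `st_*` values describe the device node.
    """
    if uid == 0:
        return True  # root is assumed to bypass the mode bits
    if uid == st_uid:
        needed = stat.S_IRUSR | stat.S_IWUSR
    elif st_gid in gids:
        needed = stat.S_IRGRP | stat.S_IWGRP
    else:
        needed = stat.S_IROTH | stat.S_IWOTH
    return (st_mode & needed) == needed


# /dev/kfd as seen in the pod: crw-rw---- root:106
KFD_MODE, KFD_UID, KFD_GID = 0o660, 0, 106

# An unprivileged user without gid 106 is refused...
print(can_open_rw(KFD_MODE, KFD_UID, KFD_GID, uid=65534, gids={65534}))
# ...while the same user with gid 106 among its groups is allowed.
print(can_open_rw(KFD_MODE, KFD_UID, KFD_GID, uid=65534, gids={65534, 106}))
```

Under this model, either the container process needs gid 106 among its groups, or the mode of the device nodes has to be relaxed on the host.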

Change 909968 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] amd_gpu: add udev rules to bypass the 'render' group

https://gerrit.wikimedia.org/r/909968

Change 909969 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role:dse_k8s::worker: set allow_gpu_broader_access

https://gerrit.wikimedia.org/r/909969

Change 909970 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] amd-gpu-tester: add more ROCm packages

https://gerrit.wikimedia.org/r/909970

Change 909970 merged by Elukey:

[operations/docker-images/production-images@master] amd-gpu-tester: add more ROCm packages

https://gerrit.wikimedia.org/r/909970

Change 910419 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] amd-gpu-tester: replace rocblas with rocblas-dev

https://gerrit.wikimedia.org/r/910419

Change 910419 merged by Elukey:

[operations/docker-images/production-images@master] amd-gpu-tester: replace rocblas with rocblas-dev

https://gerrit.wikimedia.org/r/910419

Change 909968 merged by Elukey:

[operations/puppet@production] amd_gpu: add udev rules to bypass the 'render' group

https://gerrit.wikimedia.org/r/909968

Change 909969 merged by Elukey:

[operations/puppet@production] role:dse_k8s::worker: set allow_gpu_broader_access

https://gerrit.wikimedia.org/r/909969

Change 912240 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::builder: add ml-runner user

https://gerrit.wikimedia.org/r/912240

Change 910743 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] amd-gpu-tester: reduce image size

https://gerrit.wikimedia.org/r/910743

I opened T335177 since the amd-gpu-tester image size got up to 14G (uncompressed), and nginx on the docker registry nodes has a limit of 2G (compressed), so the upload request eventually times out.

After a chat with Janis I fell back to another solution: mount /opt/rocm-5.4.0 into the pod in read-only mode, removing all the ROCm packages from the docker image.
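
A pod spec for this approach could look roughly like the sketch below, which builds the manifest as a Python dict and prints it as JSON (kubectl accepts JSON manifests as well as YAML). Several things here are assumptions and not taken from the actual patches: the use of a hostPath volume for the read-only mount, the pod and container names, and the image reference, which is a placeholder. The amd.com/gpu resource name is the one the upstream plugin advertises.

```python
import json


def gpu_test_pod(image, rocm_path="/opt/rocm-5.4.0", gpus=1):
    """Build a pod manifest that requests AMD GPUs and mounts ROCm read-only.

    Hypothetical helper: names and the hostPath choice are illustrative.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "amd-gpu-test"},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "tester",
                    "image": image,
                    # Extended resource exposed by the AMD device plugin.
                    "resources": {"limits": {"amd.com/gpu": gpus}},
                    "volumeMounts": [
                        {"name": "rocm", "mountPath": rocm_path, "readOnly": True}
                    ],
                }
            ],
            # Assumption: the ROCm tree is taken from the node's filesystem.
            "volumes": [
                {"name": "rocm", "hostPath": {"path": rocm_path, "type": "Directory"}}
            ],
        },
    }


# Placeholder image reference, not the real registry path.
print(json.dumps(gpu_test_pod("registry.example.invalid/amd-gpu-tester:latest"), indent=2))
```

The trade-off is that the image no longer carries the ROCm libraries, so the pod only works on nodes where the same ROCm version is installed under that path.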

Change 912240 merged by Elukey:

[operations/puppet@production] role::builder: add ml-runner user

https://gerrit.wikimedia.org/r/912240

Change 910743 merged by Elukey:

[operations/docker-images/production-images@master] amd-gpu-tester: reduce image size

https://gerrit.wikimedia.org/r/910743

Change 912300 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] amd-gpu-tester: add libelf-dev to the package list

https://gerrit.wikimedia.org/r/912300

Change 912300 merged by Elukey:

[operations/docker-images/production-images@master] amd-gpu-tester: add libelf-dev to the package list

https://gerrit.wikimedia.org/r/912300

Change 912313 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] amd-gpu-tester: add librdm

https://gerrit.wikimedia.org/r/912313

Change 912313 merged by Elukey:

[operations/docker-images/production-images@master] amd-gpu-tester: add librdm

https://gerrit.wikimedia.org/r/912313

Finally I was able to run the alexnet tensorflow test on a DSE GPU:

TensorFlow:  2.11
Model:       alexnet
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  512 global
             512 per device
Num batches: 100
Num epochs:  0.04
Devices:     ['/gpu:0']    <=========================================
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Step	Img/sec	total_loss
1	images/sec: 1190.3 +/- 0.0 (jitter = 0.0)	7.263
10	images/sec: 1183.2 +/- 2.1 (jitter = 9.8)	7.263
20	images/sec: 1177.6 +/- 2.0 (jitter = 9.5)	7.263
30	images/sec: 1177.7 +/- 1.6 (jitter = 9.6)	7.263
40	images/sec: 1177.7 +/- 1.3 (jitter = 8.5)	7.263
50	images/sec: 1177.1 +/- 1.2 (jitter = 9.5)	7.263
60	images/sec: 1176.4 +/- 1.2 (jitter = 9.6)	7.263
70	images/sec: 1175.8 +/- 1.1 (jitter = 9.4)	7.263
80	images/sec: 1175.3 +/- 1.0 (jitter = 10.2)	7.263
90	images/sec: 1174.7 +/- 1.0 (jitter = 10.5)	7.263
100	images/sec: 1174.3 +/- 0.9 (jitter = 10.2)	7.263
----------------------------------------------------------------
total images/sec: 1174.12
----------------------------------------------------------------

Change 912336 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::amd_gpu: add support for the K8s device plugin on DSE

https://gerrit.wikimedia.org/r/912336

Change 912336 merged by Elukey:

[operations/puppet@production] profile::amd_gpu: add support for the K8s device plugin on DSE

https://gerrit.wikimedia.org/r/912336

The task is done: we have successfully configured and run a job on a GPU on DSE! All the configs are also puppetized, so we can apply the same to any Lift Wing node at any time (if they get a GPU).