Review and test the AMD GPU kubernetes plugin
Closed, Resolved (Public)

Authored By

	elukey
	Mar 24 2023, 3:30 PM

Description

Thanks to several people, we now have the following in place on the DSE cluster:

k8s 1.23
A ROCm-compatible AMD GPU (thanks Ben!)

We should now be able to review and test https://github.com/RadeonOpenCompute/k8s-device-plugin

Main things to check:

The security model of the plugin seems to require a daemonset deployed on all nodes with high privileges. We should follow up with ServiceOps to understand what best practice we should follow.
Is the support for labeling ok? Not all nodes will have GPUs, so we'll need to be able to schedule the pods that need a GPU only on the nodes that have one.
Does it play well with ROCm drivers?
Should we use the upstream helm chart or something different?

Ideally at the end we should be able to run a simple app using the GPU on a DSE cluster pod.

Details

Subject	Repo	Branch	Lines +/-
amd_gpu: add udev rules to bypass the 'render' group	operations/puppet	production	+26 -13
profile::amd_gpu: add support for the K8s device plugin on DSE	operations/puppet	production	+13 -0
amd-gpu-tester: add librdm	operations/docker-images/production-images	master	+7 -1
amd-gpu-tester: add libelf-dev to the package list	operations/docker-images/production-images	master	+7 -1
amd-gpu-tester: reduce image size	operations/docker-images/production-images	master	+13 -6
role::builder: add ml-runner user	operations/puppet	production	+1 -0
role:dse_k8s::worker: set allow_gpu_broader_access	operations/puppet	production	+2 -0
amd-gpu-tester: replace rocblas with rocblas-dev	operations/docker-images/production-images	master	+8 -2
amd-gpu-tester: add more ROCm packages	operations/docker-images/production-images	master	+9 -4
amd-gpu-tester: improve run-test.sh and add nobody to the 'render' group	operations/docker-images/production-images	master	+24 -2
amd-gpu-tester: add rccl package (ROCm suite)	operations/docker-images/production-images	master	+7 -1
amd-gpu-tester: reduce the ROCm packages installed	operations/docker-images/production-images	master	+7 -1
amd-gpu-tester: add ROCm suite packages	operations/docker-images/production-images	master	+10 -1
amd-gpu-tester: set tensorflow-rocm version	operations/docker-images/production-images	master	+10 -2
Add a simple Docker image to test AMD GPUs	operations/docker-images/production-images	master	+30 -0
Add initial debianizazion	operations/debs/amd-k8s-device-plugin	master	+89 -0
Add new images to support AMD GPUs on k8s	operations/docker-images/production-images	master	+71 -0
role::dse_k8s::worker: add AMD GPU support	operations/puppet	production	+6 -0

Related Objects

Status	Assigned	Task
Open	None	T333462 Experiment with GPUs in the Machine Learning infrastructure
Resolved	elukey	T327923 Investigate procuring and installing GPUs on Lift Wing
Resolved	elukey	T333009 Review and test the AMD GPU kubernetes plugin

Event Timeline

elukey created this task.Mar 24 2023, 3:30 PM

elukey moved this task from Unsorted to New Projects to review on the Machine-Learning-Team board.

elukey added a parent task: T333462: Experiment with GPUs in the Machine Learning infrastructure.Mar 29 2023, 3:52 PM

Change 908210 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::dse_k8s::worker: add AMD GPU support

https://gerrit.wikimedia.org/r/908210

gerritbot added a project: Patch-For-Review.Apr 12 2023, 9:34 AM

Change 908210 merged by Elukey:

[operations/puppet@production] role::dse_k8s::worker: add AMD GPU support

https://gerrit.wikimedia.org/r/908210

Maintenance_bot removed a project: Patch-For-Review.Apr 12 2023, 1:30 PM

elukey claimed this task.Apr 13 2023, 3:46 PM

elukey moved this task from New Projects to review to In Progress on the Machine-Learning-Team board.

Change 908792 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] Add new images to support AMD GPUs on k8s

https://gerrit.wikimedia.org/r/908792

gerritbot added a project: Patch-For-Review.Apr 14 2023, 10:07 AM

Just had a chat with @JMeybohm, and we now have a better understanding of how a device plugin works. Starting point:

https://v1-23.docs.kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-registration

So the two Go binaries that the daemonset runs are basically two gRPC services that register with the Kubelet, and that the Kubelet can call to discover and access devices. The containers have to mount /var/lib/kubelet/device-plugins, which at the moment requires root capabilities, so it makes more sense to package the Go binaries in a simple deb package and run them as systemd units alongside the kubelet one on the k8s nodes with a GPU.

I've requested a new repository in:
https://www.mediawiki.org/wiki/Gerrit/New_repositories/Requests

Change 908792 abandoned by Elukey:

[operations/docker-images/production-images@master] Add new images to support AMD GPUs on k8s

Reason:

https://gerrit.wikimedia.org/r/908792

Maintenance_bot removed a project: Patch-For-Review.Apr 14 2023, 12:10 PM

Manually tested the binary on dse-k8s-worker1001:

I0414 14:48:57.821469 2034547 main.go:305] AMD GPU device plugin for Kubernetes
I0414 14:48:57.821500 2034547 main.go:305] ./usr/bin/k8s-device-plugin version 
I0414 14:48:57.821505 2034547 main.go:305] hwloc: _VERSION: 2.9.0, _API_VERSION: 0x00020800, _COMPONENT_ABI: 7, Runtime: 0x00020400
I0414 14:48:57.821523 2034547 manager.go:42] Starting device plugin manager
I0414 14:48:57.821529 2034547 manager.go:46] Registering for system signal notifications
I0414 14:48:57.821668 2034547 manager.go:52] Registering for notifications of filesystem changes in device plugin directory
I0414 14:48:57.822408 2034547 manager.go:60] Starting Discovery on new plugins
I0414 14:48:57.822418 2034547 manager.go:66] Handling incoming signals
I0414 14:48:57.822428 2034547 manager.go:71] Received new list of plugins: [gpu]
I0414 14:48:57.822478 2034547 manager.go:110] Adding a new plugin "gpu"
I0414 14:48:57.822488 2034547 plugin.go:64] gpu: Starting plugin server
I0414 14:48:57.822493 2034547 plugin.go:94] gpu: Starting the DPI gRPC server
I0414 14:48:57.822750 2034547 plugin.go:112] gpu: Serving requests...
I0414 14:49:07.824083 2034547 plugin.go:128] gpu: Registering the DPI with Kubelet
I0414 14:49:07.824663 2034547 plugin.go:140] gpu: Registration for endpoint amd.com_gpu
I0414 14:49:07.827041 2034547 amdgpu.go:100] /sys/module/amdgpu/drivers/pci:amdgpu/0000:3d:00.0
I0414 14:49:07.827079 2034547 amdgpu.go:100] /sys/module/amdgpu/drivers/pci:amdgpu/0000:da:00.0
I0414 14:49:07.888209 2034547 main.go:149] Watching GPU with bus ID: 0000:3d:00.0 NUMA Node: [0]
I0414 14:49:07.888265 2034547 main.go:149] Watching GPU with bus ID: 0000:da:00.0 NUMA Node: [1]

I see the related unix sockets being created:

elukey@dse-k8s-worker1001:~$ sudo ls -l /var/lib/kubelet/device-plugins
total 4
srwxr-xr-x 1 root root   0 Apr 14 14:48 amd.com_gpu
-rw-r--r-- 1 root root   0 Apr 13 09:17 DEPRECATION
-rw------- 1 root root 124 Apr 14 14:49 kubelet_internal_checkpoint
srwxr-xr-x 1 root root   0 Apr 13 09:17 kubelet.sock
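
For anyone repeating this check, here is a minimal Python sketch (not part of the patches on this task) that lists the Unix sockets present in a directory, which is the same thing the `ls -l` above shows by eye. The function name `list_plugin_sockets` is made up for illustration; the directory to inspect is the one from the listing above, and reading it usually requires root.

```python
import os
import stat


def list_plugin_sockets(directory):
    """Return the sorted names of the Unix domain sockets found in `directory`."""
    sockets = []
    for entry in os.scandir(directory):
        # Do not follow symlinks: we only care about real socket files.
        mode = entry.stat(follow_symlinks=False).st_mode
        if stat.S_ISSOCK(mode):
            sockets.append(entry.name)
    return sorted(sockets)
```

Calling `list_plugin_sockets("/var/lib/kubelet/device-plugins")` on the worker should include `amd.com_gpu` once the plugin has started, going by the listing above.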

One of the possible tests could be:
https://github.com/RadeonOpenCompute/k8s-device-plugin/blob/master/example/pod/alexnet-gpu.yaml
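
A rough sketch of what such a test pod boils down to, written as a Python dict dumped to JSON (kubectl accepts JSON manifests as well as YAML). This is an illustration, not the manifest used here: the resource name `amd.com/gpu` is the one I understand the upstream example to request, and the image reference is a placeholder.

```python
import json


def gpu_test_pod(name, image, gpus=1):
    """Build a minimal Pod manifest that asks the scheduler for `gpus` AMD GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Extended resources advertised by a device plugin go under limits.
                    # "amd.com/gpu" is assumed from the upstream example.
                    "resources": {"limits": {"amd.com/gpu": gpus}},
                }
            ],
        },
    }


# Placeholder pod name and image, for illustration only.
manifest = gpu_test_pod("gpu-test-pod", "registry.example.invalid/amd-gpu-tester:latest")
print(json.dumps(manifest, indent=2))
```

Since only nodes running the device plugin advertise the resource, a pod that requests it should only be schedulable on GPU nodes, which is relevant to the labeling question in the description.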

Change 909177 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/debs/amd-k8s-device-plugin@master] Add initial debianizazion

https://gerrit.wikimedia.org/r/909177

gerritbot added a project: Patch-For-Review.Apr 17 2023, 7:25 AM

Change 909196 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] Add a simple Docker image to test AMD GPUs

https://gerrit.wikimedia.org/r/909196

Change 909177 merged by Elukey:

[operations/debs/amd-k8s-device-plugin@master] Add initial debianizazion

https://gerrit.wikimedia.org/r/909177

elukey mentioned this in rADMK305aaa012da7: Add initial debianizazion.Apr 17 2023, 1:16 PM

Change 909196 merged by Elukey:

[operations/docker-images/production-images@master] Add a simple Docker image to test AMD GPUs

https://gerrit.wikimedia.org/r/909196

Maintenance_bot removed a project: Patch-For-Review.Apr 17 2023, 1:30 PM

Change 909256 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] amd-gpu-tester: set tensorflow-rocm version

https://gerrit.wikimedia.org/r/909256

gerritbot added a project: Patch-For-Review.Apr 17 2023, 1:32 PM

Change 909256 merged by Elukey:

[operations/docker-images/production-images@master] amd-gpu-tester: set tensorflow-rocm version

https://gerrit.wikimedia.org/r/909256

Maintenance_bot removed a project: Patch-For-Review.Apr 17 2023, 2:10 PM

Mentioned in SAL (#wikimedia-operations) [2023-04-17T14:14:30Z] <elukey> upload amd-k8s-device-plugin deb (1.25.2.3-1) to bullseye-wikimedia - T333009

Change 909304 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] amd-gpu-tester: add ROCm suite packages

https://gerrit.wikimedia.org/r/909304

Change 909304 merged by Elukey:

[operations/docker-images/production-images@master] amd-gpu-tester: add ROCm suite packages

https://gerrit.wikimedia.org/r/909304

Maintenance_bot removed a project: Patch-For-Review.Apr 17 2023, 4:11 PM

Change 909313 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] amd-gpu-tester: reduce the ROCm packages installed

https://gerrit.wikimedia.org/r/909313

Change 909313 merged by Elukey:

[operations/docker-images/production-images@master] amd-gpu-tester: reduce the ROCm packages installed

https://gerrit.wikimedia.org/r/909313

Maintenance_bot removed a project: Patch-For-Review.Apr 17 2023, 5:10 PM

Change 909604 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] amd-gpu-tester: add rccl package (ROCm suite)

https://gerrit.wikimedia.org/r/909604

gerritbot added a project: Patch-For-Review.Apr 18 2023, 8:11 AM

Change 909604 merged by Elukey:

[operations/docker-images/production-images@master] amd-gpu-tester: add rccl package (ROCm suite)

https://gerrit.wikimedia.org/r/909604

Maintenance_bot removed a project: Patch-For-Review.Apr 18 2023, 8:15 AM

Change 909609 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] amd-gpu-tester: improve run-test.sh and add nobody to the 'render' group

https://gerrit.wikimedia.org/r/909609

gerritbot added a project: Patch-For-Review.Apr 18 2023, 8:59 AM

Change 909609 merged by Elukey:

[operations/docker-images/production-images@master] amd-gpu-tester: improve run-test.sh and add nobody to the 'render' group

https://gerrit.wikimedia.org/r/909609

Maintenance_bot removed a project: Patch-For-Review.Apr 18 2023, 10:14 AM

Building a test image is turning out to be a difficult job, mainly due to the permissions of the devices exposed by the k8s plugin. For example, on dse-k8s-worker1001 we have the following devices:

elukey@dse-k8s-worker1001:~$ ls -l /dev/kfd
crw-rw---- 1 root render 242, 0 Apr 13 09:16 /dev/kfd

elukey@dse-k8s-worker1001:~$ ls -l /dev/dri/
total 0
drwxr-xr-x 2 root root        140 Apr 13 09:17 by-path
crw-rw---- 1 root video  226,   0 Apr 13 09:16 card0
crw-rw---- 1 root video  226,   1 Apr 13 09:17 card1
crw-rw---- 1 root video  226,   2 Apr 13 09:17 card2
crw-rw---- 1 root render 226, 128 Apr 13 09:17 renderD128
crw-rw---- 1 root render 226, 129 Apr 13 09:17 renderD129

IIUC from https://rocmdocs.amd.com/_/downloads/en/latest/pdf/ a user needs to be in the render POSIX group to access KFD (Kernel Fusion Driver) and the renderDXXX DRI devices when using the GPU. The main problem is that the AMD k8s plugin exposes the devices to the container keeping the same uid/gid, so we end up with:

root@alexnet-tf-gpu-pod:/# ls -l /dev/kfd 
crw-rw---- 1 root 106 242, 0 Apr 18 15:58 /dev/kfd

root@alexnet-tf-gpu-pod:/# ls -l /dev/dri/renderD128 
crw-rw---- 1 root 106 226, 128 Apr 18 15:58 /dev/dri/renderD128

The gid 106 is the render group on the dse-k8s-worker node, but it has no matching group inside the container, so I am not really sure how to map this number to the OS in the container to grant proper access.
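
To make the problem concrete, here is a small Python sketch of the classic POSIX mode-bit check, using the numbers from the listings above (mode `crw-rw----`, owner root, gid 106). It is a simplification that ignores ACLs and capabilities, and `can_open_rw` is a name made up for this example. The point is that the kernel only compares numeric ids, so what matters is whether the process has gid 106 among its groups, not whether the container knows a group called render.

```python
import stat


def can_open_rw(mode, owner_uid, owner_gid, uid, gids):
    """Simplified POSIX check for read+write access (no ACLs, no capabilities)."""
    if uid == 0:
        # root bypasses the mode bits, assuming it keeps the usual capabilities.
        return True
    if uid == owner_uid:
        wanted = stat.S_IRUSR | stat.S_IWUSR
    elif owner_gid in gids:
        wanted = stat.S_IRGRP | stat.S_IWGRP
    else:
        wanted = stat.S_IROTH | stat.S_IWOTH
    return (mode & wanted) == wanted


# /dev/kfd as seen in the container: crw-rw---- root:106
# An unprivileged user (uid/gid 65534 used as an example) without gid 106:
print(can_open_rw(0o660, 0, 106, 65534, {65534}))       # False
# The same user with the numeric gid 106 among its groups:
print(can_open_rw(0o660, 0, 106, 65534, {65534, 106}))  # True
```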

Opened https://github.com/RadeonOpenCompute/k8s-device-plugin/issues/39 upstream to get some feedback.

Change 909968 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] amd_gpu: add udev rules to bypass the 'render' group

https://gerrit.wikimedia.org/r/909968

Change 909969 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role:dse_k8s::worker: set allow_gpu_broader_access

https://gerrit.wikimedia.org/r/909969

Change 909970 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] amd-gpu-tester: add more ROCm packages

https://gerrit.wikimedia.org/r/909970

Change 909970 merged by Elukey:

[operations/docker-images/production-images@master] amd-gpu-tester: add more ROCm packages

https://gerrit.wikimedia.org/r/909970

Change 910419 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] amd-gpu-tester: replace rocblas with rocblas-dev

https://gerrit.wikimedia.org/r/910419

Change 910419 merged by Elukey:

[operations/docker-images/production-images@master] amd-gpu-tester: replace rocblas with rocblas-dev

https://gerrit.wikimedia.org/r/910419

Change 909968 merged by Elukey:

[operations/puppet@production] amd_gpu: add udev rules to bypass the 'render' group

https://gerrit.wikimedia.org/r/909968

Change 909969 merged by Elukey:

[operations/puppet@production] role:dse_k8s::worker: set allow_gpu_broader_access

https://gerrit.wikimedia.org/r/909969

Maintenance_bot removed a project: Patch-For-Review.Apr 20 2023, 2:14 PM

elukey mentioned this in T327923: Investigate procuring and installing GPUs on Lift Wing.Apr 20 2023, 3:10 PM

Change 912240 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::builder: add ml-runner user

https://gerrit.wikimedia.org/r/912240

gerritbot added a project: Patch-For-Review.Apr 26 2023, 9:06 AM

Change 910743 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] amd-gpu-tester: reduce image size

https://gerrit.wikimedia.org/r/910743

I opened T335177 since the amd-gpu-tester image size grew to 14G (uncompressed), and nginx on the docker registry nodes has a limit of 2G (compressed), so the upload request eventually times out.

After a chat with Janis I fell back to another solution: mount /opt/rocm-5.4.0 into the pod in read-only mode and remove all the ROCm packages from the Docker image.
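
A minimal sketch of what the read-only mount could look like in a pod spec, again as a Python dict. Using a `hostPath` volume is my assumption about the mechanism; the volume name `rocm`, the helper `with_rocm_mount` and the sample pod are made up for illustration.

```python
import copy


def with_rocm_mount(pod, rocm_dir="/opt/rocm-5.4.0"):
    """Return a copy of `pod` with `rocm_dir` mounted read-only from the host in every container."""
    pod = copy.deepcopy(pod)
    spec = pod["spec"]
    # Assumption: the host directory is exposed through a hostPath volume.
    spec.setdefault("volumes", []).append(
        {"name": "rocm", "hostPath": {"path": rocm_dir, "type": "Directory"}}
    )
    for container in spec["containers"]:
        container.setdefault("volumeMounts", []).append(
            {"name": "rocm", "mountPath": rocm_dir, "readOnly": True}
        )
    return pod


# Placeholder pod, for illustration only.
sample_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "gpu-test-pod"},
    "spec": {"containers": [{"name": "tester", "image": "registry.example.invalid/amd-gpu-tester:latest"}]},
}
mounted = with_rocm_mount(sample_pod)
print(mounted["spec"]["containers"][0]["volumeMounts"])
```

The trade-off is that the image then depends on the ROCm version installed on the host, so the two need to be upgraded together.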

Change 912240 merged by Elukey:

[operations/puppet@production] role::builder: add ml-runner user

https://gerrit.wikimedia.org/r/912240

Change 910743 merged by Elukey:

[operations/docker-images/production-images@master] amd-gpu-tester: reduce image size

https://gerrit.wikimedia.org/r/910743

Maintenance_bot removed a project: Patch-For-Review.Apr 26 2023, 1:30 PM

Change 912300 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] amd-gpu-tester: add libelf-dev to the package list

https://gerrit.wikimedia.org/r/912300

Change 912300 merged by Elukey:

[operations/docker-images/production-images@master] amd-gpu-tester: add libelf-dev to the package list

https://gerrit.wikimedia.org/r/912300

Maintenance_bot removed a project: Patch-For-Review.Apr 26 2023, 2:10 PM

Change 912313 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] amd-gpu-tester: add librdm

https://gerrit.wikimedia.org/r/912313

Change 912313 merged by Elukey:

[operations/docker-images/production-images@master] amd-gpu-tester: add librdm

https://gerrit.wikimedia.org/r/912313

Maintenance_bot removed a project: Patch-For-Review.Apr 26 2023, 3:10 PM

Finally I was able to run the alexnet tensorflow test on a DSE GPU:

TensorFlow:  2.11
Model:       alexnet
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  512 global
             512 per device
Num batches: 100
Num epochs:  0.04
Devices:     ['/gpu:0']    <=========================================
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Step	Img/sec	total_loss
1	images/sec: 1190.3 +/- 0.0 (jitter = 0.0)	7.263
10	images/sec: 1183.2 +/- 2.1 (jitter = 9.8)	7.263
20	images/sec: 1177.6 +/- 2.0 (jitter = 9.5)	7.263
30	images/sec: 1177.7 +/- 1.6 (jitter = 9.6)	7.263
40	images/sec: 1177.7 +/- 1.3 (jitter = 8.5)	7.263
50	images/sec: 1177.1 +/- 1.2 (jitter = 9.5)	7.263
60	images/sec: 1176.4 +/- 1.2 (jitter = 9.6)	7.263
70	images/sec: 1175.8 +/- 1.1 (jitter = 9.4)	7.263
80	images/sec: 1175.3 +/- 1.0 (jitter = 10.2)	7.263
90	images/sec: 1174.7 +/- 1.0 (jitter = 10.5)	7.263
100	images/sec: 1174.3 +/- 0.9 (jitter = 10.2)	7.263
----------------------------------------------------------------
total images/sec: 1174.12
----------------------------------------------------------------

Change 912336 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::amd_gpu: add support for the K8s device plugin on DSE

https://gerrit.wikimedia.org/r/912336

gerritbot added a project: Patch-For-Review.Apr 26 2023, 4:30 PM

Change 912336 merged by Elukey:

[operations/puppet@production] profile::amd_gpu: add support for the K8s device plugin on DSE

https://gerrit.wikimedia.org/r/912336

The task is done: we have successfully configured and run a job on a GPU on DSE! All the configs are also puppetized, so we can apply the same to any Lift Wing node anytime (if they get a GPU).

elukey mentioned this in T295661: Upgrade ROCm to 5.4.Apr 26 2023, 4:39 PM

elukey moved this task from In Progress to Blocked on the Machine-Learning-Team board.

elukey moved this task from Blocked to Complete Q3 2022/23 on the Machine-Learning-Team board.

Maintenance_bot removed a project: Patch-For-Review.Apr 26 2023, 5:10 PM

elukey closed this task as Resolved.May 15 2023, 3:14 PM

BTullis mentioned this in T329360: Upgrade stat1008 to bullseye.May 16 2023, 4:40 PM
