Thanks to several people, we now have the following in place in the DSE cluster:
- k8s 1.23
- A ROCm-compatible AMD GPU (thanks Ben!)
We should now be able to review and test https://github.com/RadeonOpenCompute/k8s-device-plugin
Main things to check:
- The security model of the plugin seems to require a DaemonSet deployed on all nodes with high privileges. We should follow up with ServiceOps to understand which best practices to follow.
- Is the support for labeling ok? Not all nodes will have GPUs, so we'll need to be able to schedule GPU workloads only on the nodes that have one.
- Does it play well with ROCm drivers?
- Should we use the upstream Helm chart or something different?
Ideally, at the end we should be able to run a simple app that uses the GPU in a pod on the DSE cluster.
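As a starting point for that final test, here is a minimal sketch in Python that generates a pod manifest requesting one GPU. It assumes the plugin advertises the GPU under the `amd.com/gpu` extended resource name, as described in the plugin's README. The pod name, the container image and the `rocminfo` command are placeholders chosen for illustration, not something already agreed on, and the optional node selector is only needed if we end up relying on node labels.

```python
import json

# Extended resource name advertised by the AMD device plugin (per its README).
GPU_RESOURCE = "amd.com/gpu"


def gpu_test_pod(name, image, gpus=1, node_selector=None):
    """Build a Pod manifest (as a dict) that requests `gpus` AMD GPUs.

    `name`, `image` and `node_selector` are caller-supplied; nothing here
    is specific to the DSE cluster.
    """
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Placeholder command: rocminfo lists the ROCm agents
                    # visible inside the container.
                    "command": ["rocminfo"],
                    # Extended resources are requested via limits.
                    "resources": {"limits": {GPU_RESOURCE: gpus}},
                }
            ],
        },
    }
    if node_selector:
        # Optional: pin the pod to nodes carrying specific labels.
        pod["spec"]["nodeSelector"] = dict(node_selector)
    return pod


if __name__ == "__main__":
    # Hypothetical pod name and image, for illustration only.
    print(json.dumps(gpu_test_pod("rocm-smoke-test", "rocm/rocm-terminal"), indent=2))
```

Since kubectl accepts JSON manifests, the output could be piped to `kubectl apply -f -`. A pod that requests an extended resource should only be scheduled on nodes that advertise it, so this may already cover part of the scheduling concern above, but that is worth verifying as part of the review.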