
Test if we can avoid ROCm debian packages on k8s nodes
Closed, Resolved · Public · 3 Estimated Story Points

Description

In T362984 we are investigating some issues with GPUs on k8s, and a question came up: do we need to deploy the ROCm Debian packages on k8s nodes, or can we just rely on the libs shipped with PyTorch and similar?

Rationale: on k8s nodes we deploy ROCm libs (~10G) that shouldn't be used, since the only GPU workloads run in containers. We have a device plugin that exposes the GPU device to the kubelet (which in turn exposes it to the containers that request it), but it relies only on the Linux kernel driver to recognize the device, nothing more.
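As a sanity check of that assumption, the plugin should only need the kernel-level bits; something like the following (a rough sketch, device node names can vary per host) should show the amdgpu driver and its device nodes even with no ROCm userspace installed:

# Check that the amdgpu kernel module is loaded and that the device nodes the
# k8s device plugin hands to containers exist; none of this needs ROCm userspace.
lsmod | grep amdgpu
ls -l /dev/kfd /dev/dri/renderD*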

We should try to remove the ROCm libs from ml-staging2001 and test if we can just use the ROCm libs bundled with PyTorch. This would simplify our life a lot, since we'd only need to update our internal APT repos for the training infra (stat nodes etc.., i.e. the bare-metal nodes that actually use those packages).
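A minimal sketch of the kind of check we could run from inside a container on the node (assuming a PyTorch ROCm build; torch.version.hip is only populated on ROCm builds):

# Inside a PyTorch ROCm container; the ROCm user-space libs come bundled with
# the PyTorch wheel/image, so no ROCm Debian packages are needed on the host.
python3 -c "import torch; print(torch.version.hip, torch.cuda.is_available(), torch.cuda.device_count())"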

Event Timeline

The only issue that I see from Puppet is that prometheus::node_amd_rocm uses rocm-smi to get info about which GPUs to monitor.

elukey@stat1010:~$ dpkg -S rocm-smi 
rocm-smi-lib: /opt/rocm-5.4.0/bin/rocm-smi

elukey@stat1010:~$ apt-cache show rocm-smi-lib | grep Depends
Depends: python3, rocm-core

This is not good: rocm-smi requires rocm-core, so installing the package will pull in more stuff (basically most of the other packages). We need to find a way to monitor the GPUs without rocm-smi, if possible.

https://packages.debian.org/bookworm/rocm-smi
https://packages.debian.org/source/bookworm/rocm-smi-lib

The above are probably a good drop-in replacement, but they are only available from Bookworm onward and we are on Bullseye :(
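If all we need from rocm-smi is figuring out which GPUs are present, one possible alternative (just a sketch, not tested) is to rely on what the kernel driver already exposes via PCI/sysfs; actual metrics would still need another source (e.g. the hwmon entries the amdgpu driver creates):

# List AMD PCI devices (vendor ID 0x1002) without any ROCm userspace installed.
lspci -nn -d 1002:
# Or via sysfs, which the amdgpu driver populates for each card.
grep -l 0x1002 /sys/class/drm/card*/device/vendor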

After a chat with Tobias, we are going to test this:

  • disable puppet on ml-staging2001
  • remove all ROCm packages
  • reboot
  • test running a pod requiring a GPU and make sure that it works (see the sketch after this list)
  • run Puppet (which will redeploy the packages)
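For the GPU pod step, a minimal sketch of the smoke test (amd.com/gpu is the resource name advertised by the AMD device plugin; the image name below is just a placeholder, not one of our real images):

# Apply a throwaway pod that requests one GPU and runs a PyTorch check.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: rocm-check
      image: some-registry/rocm-pytorch:latest   # placeholder image
      command: ["python3", "-c", "import torch; print(torch.cuda.is_available())"]
      resources:
        limits:
          amd.com/gpu: 1
EOF
# Once the pod has run, expect "True" if the container can see the GPU.
kubectl logs gpu-smoke-test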

If the test works, we could drop those ROCm packages entirely when we move to Bookworm.

isarantopoulos moved this task from Unsorted to Ready To Go on the Machine-Learning-Team board.

Change #1032506 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::amd_gpu: refactor configurations for k8s nodes

https://gerrit.wikimedia.org/r/1032506

In order to solve this task and T362984 we should upgrade to Bookworm, but we'd be the first ones to test it.

So far:

  • amd-k8s-device-plugin was copied to bookworm
  • kubelet is present for bookworm (another version though)
  • rsyslog-kubernetes is not present in bookworm-wikimedia, so we'll need to build it.

Change #1032506 merged by Elukey:

[operations/puppet@production] profile::amd_gpu: refactor configurations for k8s nodes

https://gerrit.wikimedia.org/r/1032506

Change #1032765 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Skip ROCm packages for ml-staging2001

https://gerrit.wikimedia.org/r/1032765

Correction: rsyslog-kubernetes is now shipped by Debian, so we are good. The kubelet package that I looked up on apt.wikimedia.org is not the right one; we need kubernetes-node, which is not present for bookworm-wikimedia :(

We are also missing calico and istio-cni.
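A quick way to spot-check what is still missing, assuming a Bookworm host with apt.wikimedia.org configured (the package names here are from memory and may not match the exact ones we ship):

# "Unable to locate package" means it still has to be built or copied to
# bookworm-wikimedia.
for p in kubernetes-node calico-cni istio-cni rsyslog-kubernetes; do
  echo "== $p"; apt-cache policy "$p"
done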

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host ml-staging2001.codfw.wmnet with OS bookworm

Change #1032765 merged by Elukey:

[operations/puppet@production] Skip ROCm packages for ml-staging2001

https://gerrit.wikimedia.org/r/1032765

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host ml-staging2001.codfw.wmnet with OS bookworm completed:

  • ml-staging2001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405221522_elukey_2123217_ml-staging2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Everything seems to work as expected; the ROCm packages are not needed!
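For the record, a couple of quick checks along these lines confirm both sides of the result, i.e. no ROCm debs on the host and the GPU still advertised to the kubelet (the resource name depends on the device plugin, amd.com/gpu in our case):

# On the reimaged host: no ROCm Debian packages should show up.
dpkg -l | grep -i rocm
# From a host with kubectl access to ml-staging: the GPU should still appear in
# the node's capacity/allocatable resources.
kubectl describe node ml-staging2001.codfw.wmnet | grep 'amd.com/gpu'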