This task is blocked until the related rack/setup/deploy one is completed.
There are 6 workers with GPUs: an-worker1096 -> an-worker1101.
Currently we have only an-worker1096 running in the Hadoop cluster, but without any GPU configured. This was mostly done to test the 4.19 Linux kernel on Stretch with Hadoop workloads, and it didn't reveal any issue. The next steps should be something like the following:
- add an APT component for rocm33 to stretch-wikimedia, and pull/copy packages into it. The packages that we currently use are built by AMD for Ubuntu Xenial, which is more in line with Buster than Stretch, so some tests are needed.
- deploy the AMD stack on an-worker1097 with an ad-hoc role that deploys only the minimum needed for a working GPU environment (so not role::analytics_cluster::hadoop::worker like the other workers, otherwise the node would try to join the cluster as soon as the first Puppet run completes).
- test tensorflow-rocm or similar on an-worker1097 and make sure that the drivers/tools/etc. work on Stretch.
- deploy https://gerrit.wikimedia.org/r/630861 or something along those lines
- add support for gpu to an-worker1096
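For the APT component step above, the import could look roughly like this on the APT server (a sketch: the reprepro invocation is standard, but the component layout, filenames, and sources.list line are assumptions to be adapted):

```
# On the APT server, after defining component/rocm33 in conf/distributions
# (the .deb filename below is illustrative):
reprepro -C component/rocm33 includedeb stretch-wikimedia rocm-dev_3.3.0_amd64.deb

# On the target host, enable the component and install:
echo 'deb http://apt.wikimedia.org/wikimedia stretch-wikimedia component/rocm33' | \
  sudo tee /etc/apt/sources.list.d/rocm33.list
sudo apt-get update && sudo apt-get install rocm-dev
```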
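For the tensorflow-rocm test step above, a few quick checks should tell whether the stack works (a sketch: the tool paths follow the ROCm 3.3 install layout, and the tf.config call assumes a 2.x tensorflow-rocm):

```
/opt/rocm-3.3.0/bin/rocminfo   # should list the GPU agent, not fail with an HSA error
/opt/rocm-3.3.0/bin/rocm-smi   # basic card status (temperature, clocks, usage)
python3 -c 'import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))'
```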
@klausman would this be something interesting for you to do while working on other tasks? It seems in line with all the work that you have done on stat100x, and it would give you some exposure to the Hadoop Puppet configs etc. Let me know :)
I manually installed the rocm packages on an-worker1097, rebooted, and tested with a simple tensorflow script, but the GPU is not recognized. I think that Stretch is missing the render group; this is the result (see also the diff with stat1008):
```
elukey@an-worker1097:~$ ls -l /dev/kfd
crw------- 1 root root 242, 0 Sep 30 14:38 /dev/kfd

elukey@stat1008:~$ ls -l /dev/kfd
crw-rw---- 1 root render 240, 0 Sep 16 12:08 /dev/kfd
```
```
elukey@an-worker1097:~$ /opt/rocm-3.3.0/bin/rocminfo
ROCk module is loaded
elukey is member of video group
hsa api call failure at: /data/jenkins-workspace/compute-rocm-rel-3.3/rocminfo/rocminfo.cc:1102
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary
resources. This error may also occur when the core runtime library needs to spawn threads or
create internal OS-specific events.
```
@MoritzMuehlenhoff for this use case I'd try to add component/systemd241, which should bring in the render group (since we also have automation to automatically add analytics-privatedata-users to it). Otherwise we could create a simple udev rule to make kfd accessible by analytics-privatedata-users? Something like (taken from upstream examples):
```
echo 'SUBSYSTEM=="kfd", KERNEL=="kfd", TAG+="uaccess", GROUP="analytics-privatedata-users"' | sudo tee /etc/udev/rules.d/70-kfd.rules
```
```
elukey@an-worker1097:~$ cat /etc/udev/rules.d/70-kfd.rules
SUBSYSTEM=="kfd", KERNEL=="kfd", TAG+="uaccess", GROUP="video"
```
And it worked fine (I added myself to the video group beforehand, of course).
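For reference, the rule can be applied without a full reboot via the standard udev workflow (a sketch; the group shown on /dev/kfd should match whatever GROUP the rule sets):

```
sudo udevadm control --reload-rules && sudo udevadm trigger
ls -l /dev/kfd   # should now show the group configured in the rule
```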
All nodes joined the cluster; now we only need to reboot them (one by one) to enable the GPUs (some settings require a reboot).
After this we'll need to find a way to target the GPU workers in Yarn, possibly with node labels? Will open a new task.
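If we go the node-labels route, the basic workflow would be something like the following (a sketch: it assumes yarn.node-labels.enabled=true and a label store configured on the ResourceManager; the label name and the hostname are illustrative):

```
# Define a non-exclusive GPU label and attach it to a GPU worker:
yarn rmadmin -addToClusterNodeLabels "GPU(exclusive=false)"
yarn rmadmin -replaceLabelsOnNode "an-worker1096.eqiad.wmnet=GPU"

# Verify:
yarn cluster --list-node-labels
```

Queues/applications could then request the GPU label in their resource requests, while non-exclusive labeling keeps the nodes usable for regular containers too.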