Description
This task is blocked until the related rack/setup/deploy task is completed.
Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | None | T244211 Analytics Hardware for Fiscal Year 2019/2020 |
| Resolved | | Ottomata | T243521 Hadoop Hardware Orders FY2019-2020 |
| Resolved | | elukey | T255138 Put 6 GPU-based Hadoop worker in service |
Event Timeline
Change 630861 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add profile::hadoop::worker::gpu to Hadoop workers' role
There are 6 workers with GPUs: an-worker1096->1101
Currently we have only an-worker1096 running in the Hadoop cluster, but without any GPU configured. This was mostly done to test the 4.19 Linux kernel on Stretch with Hadoop workloads, which didn't reveal any issues. The next steps should be something like the following:
- add an APT component for rocm33 to stretch-wikimedia, and pull/copy packages to it. The current packages that we use are built by AMD for Ubuntu Xenial, which is more in line with Buster than Stretch, so some testing is needed.
- deploy the AMD stack on an-worker1097 with an ad-hoc role that only deploys the minimum needed to have a GPU environment working correctly (so not role::analytics_cluster::hadoop::worker like the other workers, otherwise the node would try to join the cluster as soon as Puppet completes its first run).
- test tensorflow-rocm or similar on an-worker1097 and make sure that the drivers/tools/etc. work on Stretch (see the sketch after this list).
- deploy https://gerrit.wikimedia.org/r/630861 or something along those lines
- add support for gpu to an-worker1096
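As a rough idea of the kind of smoke test I have in mind for the tensorflow-rocm bullet (assuming a tensorflow-rocm 2.x build installed for the system python3; exact package and ROCm install path may differ):

```
# Should list at least one GPU device once the ROCm stack and /dev/kfd permissions are in place.
# (tf.config.list_physical_devices needs tensorflow >= 2.1, otherwise use the experimental variant.)
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

# rocm-smi ships with the ROCm stack and should show the card without involving TF at all.
/opt/rocm-3.3.0/bin/rocm-smi
```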
@klausman would it be something interesting for you to do while working on other tasks? It seems in line with all the work that you have done on stat100x, and it will give you some info about the Hadoop puppet configs etc. Let me know :)
Change 631153 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Allow deployment of AMD ROCm drivers on Stretch
Change 631153 merged by Elukey:
[operations/puppet@production] Allow deployment of AMD ROCm drivers on Stretch
I manually installed the ROCm packages on an-worker1097, rebooted and tested with a simple tensorflow script, but the GPU was not recognized. I think that Stretch is missing the render group; this is the result (see also the diff with stat1008):
```
elukey@an-worker1097:~$ ls -l /dev/kfd
crw------- 1 root root 242, 0 Sep 30 14:38 /dev/kfd

elukey@stat1008:~$ ls -l /dev/kfd
crw-rw---- 1 root render 240, 0 Sep 16 12:08 /dev/kfd
```
```
elukey@an-worker1097:~$ /opt/rocm-3.3.0/bin/rocminfo
ROCk module is loaded
elukey is member of video group
hsa api call failure at: /data/jenkins-workspace/compute-rocm-rel-3.3/rocminfo/rocminfo.cc:1102
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
```
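For reference, a quick way to check the missing-render-group theory (nothing fancy, just getent; the render group is what newer udev/systemd, as shipped on Buster, uses for /dev/kfd and the DRI render nodes):

```
# On stat1008 (Buster) this should return the group entry;
# on an-worker1097 (Stretch) it is expected to return nothing.
getent group render || echo "render group not present"
```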
@MoritzMuehlenhoff for this use case I'd try to add component/systemd241, which should bring in the render group (since we also have automation to automatically add analytics-privatedata-users to it). Otherwise we could create a simple udev rule to make kfd accessible to analytics-privatedata-users? Something like (taken from upstream examples):
```
echo 'SUBSYSTEM=="kfd", KERNEL=="kfd", TAG+="uaccess", GROUP="analytics-privatedata-users"' | sudo tee /etc/udev/rules.d/70-kfd.rules
```
Any preference?
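For the udev option, the full sequence would be something like the following (standard udevadm usage, nothing WMF-specific; the group name is just the proposal above, pending the outcome of this discussion):

```
# after writing /etc/udev/rules.d/70-kfd.rules as above, re-apply the udev rules
# to the already-existing device node (a reboot would achieve the same)
sudo udevadm control --reload-rules
sudo udevadm trigger
ls -l /dev/kfd   # the group should now match the one set in the rule
```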
Tried with:
```
elukey@an-worker1097:~$ cat /etc/udev/rules.d/70-kfd.rules
SUBSYSTEM=="kfd", KERNEL=="kfd", TAG+="uaccess", GROUP="video"
```
And it worked fine (I added myself to the video group beforehand, of course).
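For completeness, this is roughly the kind of check that should now pass on an-worker1097 (same rocminfo call that failed above; not a verbatim transcript):

```
sudo usermod -a -G video elukey       # group membership only applies to new login sessions
ls -l /dev/kfd                        # group should now be "video"
/opt/rocm-3.3.0/bin/rocminfo | head   # should start listing the HSA agents instead of failing
                                      # with HSA_STATUS_ERROR_OUT_OF_RESOURCES
```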
Change 631425 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Generalize profile::statistics::gpu
Change 631425 merged by Elukey:
[operations/puppet@production] Generalize profile::statistics::gpu
Change 631444 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] amd_rocm: add rock-dkms package
Change 631444 merged by Elukey:
[operations/puppet@production] amd_rocm: add rock-dkms package
Change 631447 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add an-worker1097 as hadoop worker
Change 631447 merged by Elukey:
[operations/puppet@production] Add an-worker1097 as hadoop worker
Change 631452 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] amd_rocm Prometheus script: hard-code Py3.7 usage
Change 631452 merged by Klausman:
[operations/puppet@production] amd_rocm Prometheus script: hard-code Py3.7 usage
Change 631461 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set an-worker109[78] as Hadoop workers
Change 631461 merged by Elukey:
[operations/puppet@production] Set an-worker109[78] as Hadoop workers
Change 631683 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set an-worker110[0-2] as Hadoop workers
Change 631683 merged by Elukey:
[operations/puppet@production] Set an-worker110[0-2] as Hadoop workers
All nodes joined the cluster; now we only need to reboot them (one by one) to enable the GPUs (some settings need a reboot).
After this we'll need to find a way to use the GPU workers in YARN, possibly with node labels? Will open a new task.
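To give an idea of what "labels" would mean here, this is just the generic YARN node-labels mechanism, nothing decided or deployed yet (hostname shown purely as an example):

```
# yarn.node-labels.enabled must be set to true (plus a label store path) in yarn-site.xml,
# then labels are managed from the ResourceManager, e.g.:
yarn rmadmin -addToClusterNodeLabels "GPU"
yarn rmadmin -replaceLabelsOnNode "an-worker1096.eqiad.wmnet=GPU"
# applications/queues can then target the GPU partition via a node-label expression
```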
Change 630861 abandoned by Elukey:
[operations/puppet@production] Add profile::hadoop::worker::gpu to Hadoop workers' role
Reason:
Last step before closing is to reboot the workers that don't yet have the /dev/kfd device working.
Change 633766 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] sre.hadoop.reboot-workers: allow to limit workers to reboot
Change 633766 merged by Elukey:
[operations/cookbooks@master] sre.hadoop.reboot-workers: allow to limit workers to reboot
Mentioned in SAL (#wikimedia-operations) [2020-10-14T14:56:06Z] <elukey> drain + reboot an-worker109[8,9] to pick up GPU settings - T255138
Mentioned in SAL (#wikimedia-operations) [2020-10-14T15:29:03Z] <elukey> drain + reboot an-worker110[1,2] to pick up GPU settings - T255138
Mentioned in SAL (#wikimedia-operations) [2020-10-14T15:59:02Z] <elukey> drain + reboot an-worker1100 to pick up GPU settings - T255138