
Put 6 GPU-based Hadoop workers in service
Closed, ResolvedPublic

Description

This task is blocked until the related rack/setup/deploy one is completed.

Event Timeline

elukey created this task.Jun 11 2020, 1:30 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJun 11 2020, 1:30 PM
Aklapper removed a project: Analytics.Jul 4 2020, 7:59 AM

Change 630861 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add profile::hadoop::worker::gpu to Hadoop workers' role

https://gerrit.wikimedia.org/r/630861

elukey added subscribers: klausman, razzi.

There are 6 workers with GPUs: an-worker1096->1101

Currently only an-worker1096 is running in the Hadoop cluster, without any GPU configured. This was mostly done to test the 4.19 Linux kernel on Stretch with Hadoop workloads, which didn't reveal any issues. The next steps should be something like the following:

  1. add an APT component for rocm33 to stretch-wikimedia, and pull/copy packages into it. The current packages that we use are built by AMD for Ubuntu Xenial, which is closer to Buster than to Stretch, so some testing is needed.
  2. deploy the AMD stack on an-worker1097 with an ad-hoc role that deploys only the minimum needed for a working GPU environment (so not role::analytics_cluster::hadoop::worker like the other workers, otherwise the node would try to join the cluster as soon as the first Puppet run completes).
  3. test tensorflow-rocm or similar on an-worker1097 and make sure that the drivers/tools/etc. work on Stretch.
  4. deploy https://gerrit.wikimedia.org/r/630861 or something along those lines
  5. add GPU support to an-worker1096
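The verification in step 3 could start with a quick device check before involving tensorflow-rocm at all. A minimal sketch, assuming ROCm 3.3.0 as used on these hosts (the check_kfd helper is hypothetical, not an existing tool):

```shell
# Sketch of a pre-flight check for step 3. check_kfd is a hypothetical helper;
# the rocm-3.3.0 path matches the version used on these hosts.
check_kfd() {
    kfd="${1:-/dev/kfd}"
    if [ -e "$kfd" ] && [ -r "$kfd" ] && [ -w "$kfd" ]; then
        echo "ok: $kfd is accessible by $(id -un)"
    else
        echo "fail: $kfd missing or not accessible by $(id -un)"
        return 1
    fi
}

# Only call rocminfo if the device node looks usable and the binary exists.
if check_kfd /dev/kfd && command -v /opt/rocm-3.3.0/bin/rocminfo >/dev/null 2>&1; then
    /opt/rocm-3.3.0/bin/rocminfo | head -n 5
fi
```

If this passes, a tensorflow-rocm import test on the same host is the natural next step.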

@klausman would this be something interesting for you to pick up while working on other tasks? It seems in line with all the work you have done on stat100x, and it would give you some exposure to the Hadoop Puppet configs. Let me know :)

Change 631153 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Allow deployment of AMD ROCm drivers on Stretch

https://gerrit.wikimedia.org/r/631153

Change 631153 merged by Elukey:
[operations/puppet@production] Allow deployment of AMD ROCm drivers on Stretch

https://gerrit.wikimedia.org/r/631153

I manually installed the rocm packages on an-worker1097, rebooted, and tested with a simple tensorflow script, but the GPU was not recognized. I think Stretch is missing the render group; this is the result (note the difference from stat1008):

elukey@an-worker1097:~$ ls -l /dev/kfd
crw------- 1 root root 242, 0 Sep 30 14:38 /dev/kfd

elukey@stat1008:~$ ls -l /dev/kfd
crw-rw---- 1 root render 240, 0 Sep 16 12:08 /dev/kfd
elukey@an-worker1097:~$ /opt/rocm-3.3.0/bin/rocminfo
ROCk module is loaded
elukey is member of video group
hsa api call failure at: /data/jenkins-workspace/compute-rocm-rel-3.3/rocminfo/rocminfo.cc:1102
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

@MoritzMuehlenhoff for this use case I'd try to add component/systemd241, which should bring in the render group (since we also have automation to automatically add analytics-privatedata-users to it). Otherwise we could create a simple udev rule to make /dev/kfd accessible to analytics-privatedata-users? Something like this (taken from upstream examples):

echo 'SUBSYSTEM=="kfd", KERNEL=="kfd", TAG+="uaccess", GROUP="analytics-privatedata-users"' | sudo tee /etc/udev/rules.d/70-kfd.rules

Any preference?

Tried with:

elukey@an-worker1097:~$ cat /etc/udev/rules.d/70-kfd.rules
SUBSYSTEM=="kfd", KERNEL=="kfd", TAG+="uaccess", GROUP="video"

And it worked fine (I added myself to the video group beforehand, of course).
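For anyone reproducing this: the rule only grants access to users who are actually in the owning group. A small sketch to check that (the in_group helper is made up for illustration):

```shell
# Hypothetical helper: check whether a user is in a given group. The udev
# rule above only grants access to members of the "video" group.
in_group() {
    id -nG "$1" | tr ' ' '\n' | grep -qx "$2"
}

if in_group "$(id -un)" video; then
    echo "$(id -un) is in the video group"
else
    echo "$(id -un) is NOT in the video group (add with: sudo usermod -aG video <user>)"
fi
```

Note that group changes only take effect on a new login session.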

Change 631425 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Generalize profile::statistics::gpu

https://gerrit.wikimedia.org/r/631425

Change 631425 merged by Elukey:
[operations/puppet@production] Generalize profile::statistics::gpu

https://gerrit.wikimedia.org/r/631425

Change 631444 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] amd_rocm: add rock-dkms package

https://gerrit.wikimedia.org/r/631444

Change 631444 merged by Elukey:
[operations/puppet@production] amd_rocm: add rock-dkms package

https://gerrit.wikimedia.org/r/631444

Change 631447 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add an-worker1097 as hadoop worker

https://gerrit.wikimedia.org/r/631447

Change 631447 merged by Elukey:
[operations/puppet@production] Add an-worker1097 as hadoop worker

https://gerrit.wikimedia.org/r/631447

Change 631452 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] amd_rocm Prometheus script: hard-code Py3.7 usage

https://gerrit.wikimedia.org/r/631452

Change 631452 merged by Klausman:
[operations/puppet@production] amd_rocm Prometheus script: hard-code Py3.7 usage

https://gerrit.wikimedia.org/r/631452

Change 631461 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set an-worker109[78] as Hadoop workers

https://gerrit.wikimedia.org/r/631461

Change 631461 merged by Elukey:
[operations/puppet@production] Set an-worker109[78] as Hadoop workers

https://gerrit.wikimedia.org/r/631461

Change 631683 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set an-worker110[0-2] as Hadoop workers

https://gerrit.wikimedia.org/r/631683

Change 631683 merged by Elukey:
[operations/puppet@production] Set an-worker110[0-2] as Hadoop workers

https://gerrit.wikimedia.org/r/631683

elukey added a comment.Oct 2 2020, 9:22 AM

All nodes have joined the cluster; now we only need to reboot them (one by one) to enable the GPUs (some settings require a reboot).

After this we'll need to find a way to target the GPU workers in YARN, possibly with node labels? I'll open a new task.
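One way the node-label idea could look, sketched as a dry run that just prints the upstream YARN admin commands (the GPU(exclusive=false) label and the an-worker1096-1101 host list are assumptions; the actual approach was deferred to the new task):

```shell
# Dry-run sketch of labelling the GPU workers with a YARN node label.
# Nothing here talks to a real cluster; it only prints the commands.
label_gpu_workers() {
    for host in an-worker1096 an-worker1097 an-worker1098 \
                an-worker1099 an-worker1100 an-worker1101; do
        echo "yarn rmadmin -replaceLabelsOnNode \"${host}=GPU\""
    done
}

echo 'yarn rmadmin -addToClusterNodeLabels "GPU(exclusive=false)"'
label_gpu_workers
```

Queues could then be given access to the GPU label so that only GPU-aware jobs land on these six workers.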

Change 630861 abandoned by Elukey:
[operations/puppet@production] Add profile::hadoop::worker::gpu to Hadoop workers' role

Reason:

https://gerrit.wikimedia.org/r/630861

The last step before closing is to reboot the workers that don't yet have the /dev/kfd device working.

Change 633766 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/cookbooks@master] sre.hadoop.reboot-workers: allow to limit workers to reboot

https://gerrit.wikimedia.org/r/633766

Change 633766 merged by Elukey:
[operations/cookbooks@master] sre.hadoop.reboot-workers: allow to limit workers to reboot

https://gerrit.wikimedia.org/r/633766

Mentioned in SAL (#wikimedia-operations) [2020-10-14T14:56:06Z] <elukey> drain + reboot an-worker109[8,9] to pick up GPU settings - T255138

Mentioned in SAL (#wikimedia-operations) [2020-10-14T15:29:03Z] <elukey> drain + reboot an-worker110[1,2] to pick up GPU settings - T255138

Mentioned in SAL (#wikimedia-operations) [2020-10-14T15:59:02Z] <elukey> drain + reboot an-worker1100 to pick up GPU settings - T255138

elukey claimed this task.Oct 14 2020, 4:15 PM
elukey triaged this task as Medium priority.
elukey set Final Story Points to 13.
elukey moved this task from Next Up to Done on the Analytics-Kanban board.
elukey moved this task from Q2 2020/2021 to Done on the Analytics-Clusters board.Oct 27 2020, 4:40 PM
fdans closed this task as Resolved.Oct 29 2020, 9:23 PM