Page MenuHomePhabricator

Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models
Open, LowPublic

Description

From @ellery:

I have been repeatedly running into computational bottlenecks training machine learning models over the last few months. Almost all popular ML libraries offer GPU support, which can speed up model training by orders of magnitude. I was debating asking for a new personal machine with a powerful GPU, but I think it makes the most sense to install a new GPU on one of the stat machines. Its far cheaper, and we can all share the resource. I talked to Andrew and he thinks it's relatively easy to install. The current blocker is to get funding. The current top of the line Nvidia GPU is $1200.

The request has been reviewed by @Nuria and myself, is approved on my end and can be covered on Research budget.

@mark: any concern/additional questions from your team?

@ellery @Ottomata: can you guys 1) add the specs, 2) confirm with DC-Ops if the desired hardware physically fits in the box, 3) clarify on which stat machine you would like to have it installed.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Opened T216226 to discuss hw requirements for the new GPU (everybody interested please subscribe/chime-in!), let's use this task to debug the drivers on stat1005 :)

Thanks to Moritz we have buster back on stat1005. I did the following:

  • Added radeon.cik_support=0 amdgpu.cik_support=1 to grub.cfg so we can reboot anytime and get the drivers working properly
  • Followed the Roc install guide with Linux kernel drivers, namely installing rocm-dev rocm-libs miopen-hip cxlactivitylogger (last two needed by tensorflow)
  • Added udev rules echo 'SUBSYSTEM=="kfd", KERNEL=="kfd", TAG+="uaccess", GROUP="video"' | sudo tee /etc/udev/rules.d/70-kfd.rules and added myself/Erik's usernames in the video group (to avoid using sudo to use kfd).
  • Added AMD's debian repo manually (and disabled puppet to prevent them to be wiped, going to fix it asap).

/opt/rocm/bin/rocminfo works, but /opt/rocm/opencl/bin/x86_64/clinfo leads to https://phabricator.wikimedia.org/T148843#4950469.

https://github.com/RadeonOpenCompute/ROCm/issues/640 seems describing a similar issue (but with the ROCm drivers), it might help.

Change 491263 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Disable notifications for stat1005 while testing

https://gerrit.wikimedia.org/r/491263

Change 491263 merged by Elukey:
[operations/puppet@production] Disable notifications for stat1005 while testing

https://gerrit.wikimedia.org/r/491263

Very promising:

https://github.com/RadeonOpenCompute/ROCm/issues/702#issuecomment-461982554

As noted in #691 and #640, Hawaii GPUs (such as your FirePro W8100) are currently broken on ROCm 2.0. We are working to fix this, but these fixes did not make it in to ROCm 2.1. If you want to stick with ROCm 2.1, I would recommend removing your Hawaii GPU for now. Alternately, you can roll back to a previous version (e.g. using the roc-1.9.2 branch of our Experimental ROC installation scripts).

elukey added a comment.EditedFeb 18 2019, 6:46 PM

Tried to purge rocm-dev 2.1 and install 1.9.2, same problem: /opt/rocm/opencl/bin/x86_64/clinfo hangs and when I hit control-C (not successfully) I can see the following in dmesg.

[   90.690958] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
[   90.699698] PGD 0 P4D 0
[   90.702524] Oops: 0000 [#1] SMP PTI
[   90.706416] CPU: 3 PID: 2581 Comm: clinfo Not tainted 4.19.0-2-amd64 #1 Debian 4.19.16-1
[   90.715445] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.4.3 01/17/2017
[   90.723913] RIP: 0010:amdgpu_ib_schedule+0x50/0x550 [amdgpu]
[   90.730227] Code: 89 f5 49 89 fe 48 89 54 24 08 0f b6 87 1c 02 00 00 48 85 c9 0f 84 68 03 00 00 48 8b b9 d0 00 00 00 48 8b 51 10 48 89 7c 24 18 <48> 8b 7a 38 48 89 3c 24 84 c0 0f 84 ac 04 00 00 48 83 7c 24 18 00
[   90.751183] RSP: 0018:ffffb3cb87733a90 EFLAGS: 00010286
[   90.757012] RAX: 0000000000000001 RBX: ffff9f6a9a056800 RCX: ffff9f6a9a056800
[   90.764975] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
[   90.772937] RBP: 0000000000000001 R08: ffffb3cb87733b08 R09: ffff9f6a9a056800
[   90.780898] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9f6298b30000
[   90.788860] R13: 0000000000ffd000 R14: ffff9f6298b34f88 R15: ffff9f6299f7b000
[   90.796823] FS:  00007efca6e2e700(0000) GS:ffff9f6a9f640000(0000) knlGS:0000000000000000
[   90.805853] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   90.812264] CR2: 0000000000000038 CR3: 000000092a40a005 CR4: 00000000003606e0
[   90.820226] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   90.828189] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   90.836151] Call Trace:
[   90.838883]  ? __kmalloc+0x177/0x210
[   90.842932]  amdgpu_amdkfd_submit_ib+0xb3/0x170 [amdgpu]
[   90.848880]  deallocate_vmid.isra.10+0xe6/0x100 [amdkfd]
[   90.854817]  destroy_queue_nocpsch_locked+0x185/0x1c0 [amdkfd]
[   90.861332]  process_termination_nocpsch+0x61/0x130 [amdkfd]
[   90.867653]  kfd_process_dequeue_from_all_devices+0x3b/0x50 [amdkfd]
[   90.874749]  kfd_process_notifier_release+0xe4/0x170 [amdkfd]
[   90.881164]  __mmu_notifier_release+0x42/0xc0
[   90.886025]  exit_mmap+0x33/0x180
[   90.889724]  ? __get_user_8+0x21/0x2b
[   90.893807]  ? kmem_cache_free+0x1a7/0x1d0
[   90.898377]  mmput+0x54/0x130
[   90.901688]  do_exit+0x284/0xb20
[   90.905289]  do_group_exit+0x3a/0xa0
[   90.909279]  get_signal+0x27c/0x590
[   90.913173]  do_signal+0x36/0x6d0
[   90.916866]  ? check_preempt_curr+0x7a/0x90
[   90.921537]  ? __do_page_fault+0x26c/0x4f0
[   90.926109]  exit_to_usermode_loop+0x89/0xf0
[   90.930874]  do_syscall_64+0xfd/0x100
[   90.934960]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   90.940595] RIP: 0033:0x7efcad018fbc
[   90.944586] Code: Bad RIP value.

This is the result of attaching gdb with thread apply all bt to the handing clinfo process (high CPU usage):

[.. cut up to Thread 42 due to the same stacktrace as the one below ..]

Thread 4 (Thread 0x7f35f65b7700 (LWP 5433)):
#0  0x00007f35fcf61fbc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f35fb577aad in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#2  0x00007f35fb53126a in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#3  0x00007f35fcf5bfa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#4  0x00007f35fce8a80f in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 3 (Thread 0x7f35f6db8700 (LWP 5432)):
#0  0x00007f35fcf61fbc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f35fb577aad in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#2  0x00007f35fb530f12 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#3  0x00007f35fcf5bfa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#4  0x00007f35fce8a80f in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 2 (Thread 0x7f35f80a7700 (LWP 5431)):
#0  0x00007f35fce81757 in ioctl () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f35f87cb0b8 in ?? () from /opt/rocm/lib/libhsakmt.so.1
#2  0x00007f35f87c4c8f in hsaKmtWaitOnMultipleEvents () from /opt/rocm/lib/libhsakmt.so.1
#3  0x00007f35f8a36453 in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#4  0x00007f35f8a211a6 in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#5  0x00007f35f8a31ae2 in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#6  0x00007f35f89f6637 in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#7  0x00007f35fcf5bfa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#8  0x00007f35fce8a80f in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 1 (Thread 0x7f35fcd8e740 (LWP 5396)):
#0  0x00007f35fce71bf7 in sched_yield () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f35f8a09225 in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#2  0x00007f35f8a01f26 in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#3  0x00007f35f8a0ba9e in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#4  0x00007f35f8a3e155 in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#5  0x00007f35f8a3e86e in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#6  0x00007f35f8a421cd in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#7  0x00007f35f8a1b147 in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#8  0x00007f35f921c60a in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#9  0x00007f35f921e4ac in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#10 0x00007f35f91f1c5f in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#11 0x00007f35f91e3ace in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#12 0x00007f35f91eef92 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#13 0x00007f35f9213104 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#14 0x00007f35f9213f95 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#15 0x00007f35f91ec9d3 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#16 0x00007f35f91eb137 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#17 0x00007f35f91c138a in clIcdGetPlatformIDsKHR () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#18 0x00007f35fcf9198b in ?? () from /opt/rocm/opencl/lib/x86_64/libOpenCL.so.1
#19 0x00007f35fcf93907 in ?? () from /opt/rocm/opencl/lib/x86_64/libOpenCL.so.1
#20 0x00007f35fcf63947 in __pthread_once_slow () from /lib/x86_64-linux-gnu/libpthread.so.0
#21 0x00007f35fcf91f21 in clGetPlatformIDs () from /opt/rocm/opencl/lib/x86_64/libOpenCL.so.1
#22 0x000000000040f617 in ?? ()
#23 0x0000000000407c12 in ?? ()
#24 0x00007f35fcdb509b in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#25 0x000000000040e6d1 in ?? ()

Output of top -H -p 5396

5396 elukey    20   0  258.9g  72152  50876 R  99.9   0.1   3:53.34 clinfo
5431 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5432 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5433 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5434 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5435 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5436 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5437 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5438 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5439 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5440 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5441 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5442 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5443 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5444 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5445 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5446 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5447 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5448 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5449 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5450 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5451 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5452 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5453 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5454 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5455 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5456 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5457 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5458 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5459 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5460 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5461 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5462 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5463 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5464 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5465 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5466 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5467 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5468 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5469 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5470 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5471 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo

The strace of the process basically consuming one CPU (70+% system) shows this:

sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
...

In the dmesg I can see a lot of:

[Tue Feb 19 07:29:16 2019] amdgpu: [powerplay]
                            failed to send message 282 ret is 254

But still no null pointer exception. Then if I hit control+C to the clinfo process I get BUG: unable to handle kernel NULL pointer dereference at 0000000000000038.

https://github.com/RadeonOpenCompute/ROCm/issues/482 is a very similar problem, so I tried a couple of suggestions in here:

  • export HSA_ENABLE_SDMA=0; cliinfo -> still hangs
  • compile /opt/rocm/hip/bin/hipInfo -> hangs
  • compile /opt/rocm/hip/samples/0_Intro/square -> returns the following and hangs: ### HCC STATUS_CHECK Error: HSA_STATUS_ERROR_OUT_OF_RESOURCES (0x1008) at file:mcwamp_hsa.cpp line:1218

The user reporting the GH issue was running NixOS and "solved" the issue installing Ubuntu. At this point it seems to me that the issue might lie in either Kernel drivers or in rocm itself (running on non fully supported platforms).

What do you think about opening a GH issue to ROCm first to (hopefully) get some feedback?

What do you think about opening a GH issue to ROCm first to (hopefully) get some feedback?

Sounds good, but I wouldn't hold my breath that we'll see usable support for the current gfx7, I think it makes sense to focus on getting a new GPU.

elukey added a comment.EditedFeb 20 2019, 12:23 PM

Quick update:

amdgpu.dc=0 (set to 1 by default on 4.17+) fixes the following errors:

[Tue Feb 19 08:35:06 2019] amdgpu: [powerplay] Failed to retrieve minimum clocks.
[Tue Feb 19 08:35:06 2019] amdgpu: [powerplay] Error in phm_get_clock_info
[Tue Feb 19 08:35:06 2019] [drm:dc_create [amdgpu]] *ERROR* DC: Number of connectors is zero!

amdgpu.dpm=1 leads to:

[Wed Feb 20 12:15:15 2019] [drm:amdgpu_vce_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 12 test failed
[Wed Feb 20 12:15:15 2019] [drm:amdgpu_device_init.cold.28 [amdgpu]] *ERROR* hw_init of IP block <vce_v2_0> failed -110
[Wed Feb 20 12:15:15 2019] amdgpu 0000:04:00.0: amdgpu_device_ip_init failed
[Wed Feb 20 12:15:15 2019] amdgpu 0000:04:00.0: Fatal error during GPU init

The following is still happening:

[Wed Feb 20 12:20:16 2019] amdgpu: [powerplay]
                            failed to send message 282 ret is 254

Updates from https://github.com/RadeonOpenCompute/ROCm/issues/714#issuecomment-465666946 are not encouraging, gfx701 is a dead end so we should buy a new card asap :(

elukey moved this task from In Progress to Stalled on the User-Elukey board.Feb 25 2019, 8:44 AM
Nuria renamed this task from GPU upgrade for stats machine to Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models.Mar 14 2019, 6:53 PM
elukey changed the task status from Open to Stalled.Mar 28 2019, 9:21 AM
elukey changed the status of subtask T216226: GPU upgrade for stat1005 from Open to Stalled.

All the info tracked in T216226. We are going to buy a AMD Radeon Pro WX 9100 16GB. Setting this task pending procurement of the new hardware.

elukey changed the task status from Stalled to Open.Apr 2 2019, 5:15 PM
elukey closed subtask T216226: GPU upgrade for stat1005 as Resolved.

Vega GPU mounted on stat1005, it looks good from a first round of tests!

https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/389#issuecomment-479082013 says that tensorflow-rocm should be available for Python 3.7 soon on pypi, another good news :)

hacks abound, but basically:

  • Added deb [arch=amd64] http://repo.radeon.com/rocm/apt/debian/ xenial main to sources.list.d/rocm.list
  • upgraded all the rocm packages to latest version
  • Built a python 3.6 virtualenv on stat1007 including tensorflow-rocm and tensorflow_hub
  • Copied virtualenv to stat1005

This appears to successfully run a few different models I tried: tf mnist demo, elmo from tf_hub, and miriam's image quality model.

No clue on performance yet, to do that we need to setup some model we might care about with a performant data input pipeline(tfrecords, as opposed to placeholders and feed_dict's most likely)

Nuria added a subscriber: Gilles.Apr 2 2019, 8:02 PM

ping @Miriam @Gilles so they know the status of this.

elukey added a comment.EditedApr 3 2019, 7:42 AM

I think that we should move away from hacks done up to now and start adding to puppet the config that we are using.

Things do to from the SRE side:

  • add a new component for Debian Buster for the rocm packages, and see what to do with https://github.com/RadeonOpenCompute/ROCm#closed-source-components (it seems only one package, hopefully it is not strictly needed).
  • modify the gpu-testers group to avoid sudo rules and add more people to it (Miriam/Gilles are the first ones that comes up in my mind). In this way more people will have access to stat1005 and will be able to provide feedback.
  • see if it is possible to make an apt component like https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/480041/ for Debian buster, to have python 3.6 available (3.7 support should come soon but not sure when).

About the closed source package:

elukey@stat1005:~$ apt-cache rdepends hsa-ext-rocr-dev
hsa-ext-rocr-dev
Reverse Depends:
  hsa-rocr-dev
  rocm-dev
  hcc

That should be straightforward, python3.6 was only recently removed from Debian testing and we should be able to simply fetch the packages from snapshot.debian.org:
https://packages.qa.debian.org/p/python3.6/news/20190330T191602Z.html

elukey added a comment.Apr 3 2019, 8:37 AM

@EBernhardson I think that the most pressing point now is to decide/test if we need hsa-ext-rocr-dev (the only package containing binary only libs). I gathered some info about the package:

https://github.com/RadeonOpenCompute/ROCR-Runtime/issues/33#issuecomment-422174043
https://github.com/RadeonOpenCompute/ROCm/issues/267#issuecomment-422172140

IIUC this library is aimed for direct image support in OpenCL, not sure if this is mandatory for us or not. From what upstream says the whole set of libs should work without any issue, so we could simply import all the packages from the ROCM repos to our reprepro and then add a dummy package for hsa-ext-rocr-dev to satisfy deb dependencies (see rdepends output above).

Thanks @EBernhardson and all!!. Would a CNN finetuning task, using few thousand images only as input, work as a training task for testing performance?

Thanks @EBernhardson and all!!. Would a CNN finetuning task, using few thousand images only as input, work as a training task for testing performance?

Something like that sounds great. Mainly I want to ensure that training is generating reasonable outputs, the only training test I've done so far is the mnist example which is very much a toy problem. Image's might also allow us to test if the hsa-ext-rocr-dev package brought up by @elukey is going to be necessary, as it seems that is an image support lib of some sort.

Miriam added a comment.Apr 3 2019, 7:55 PM

OK, I can prepare a task for this, or we can start from something like this maybe?
https://gist.github.com/omoindrot/dedc857cdc0e680dfb1be99762990c9c/

Nuria added a comment.Apr 3 2019, 8:19 PM

I wonder if we can use fashion mist to benchmark: https://research.zalando.com/welcome/mission/research-projects/fashion-mnist/ as there seems to be a bunch of people using it as a replacement of original mist. Could we set up here training with/without GPU?
https://www.tensorflow.org/guide/using_gpu

elukey added a comment.Apr 4 2019, 8:05 AM

@EBernhardson on stat1005 mivisionx was causing a broken apt, so after reading https://github.com/RadeonOpenCompute/ROCm/issues/350#issuecomment-370100523 I have done the following:

  • removed mivisionx to fix apt
  • removed miopen-hip
  • installed mivisionx (that brings in miopen-opencl, the one conflicting with miopen-hip for header files)

Change 501156 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] admin: add gpu-users group and assign it to stat1005

https://gerrit.wikimedia.org/r/501156

EBernhardson added a comment.EditedApr 4 2019, 4:54 PM

With the changes in packages now trying to run any model returns:

ebernhardson@stat1005:~$ /home/ebernhardson/tf_venv/bin/python /home/ebernhardson/mnist.py
ImportError: /opt/rocm/lib/libMIOpen.so.1: version `MIOPEN_HIP_1' not found (required by /home/ebernhardson/tf_venv/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so)

Not sure which packages we want exactly, I would mostly just be doing trial and error.

elukey added a comment.Apr 4 2019, 4:58 PM

A solution could be to remove mivisionx (not sure if needed) and restore miopen-hip

per https://github.com/RadeonOpenCompute/ROCm/issues/703#issuecomment-462598966

that means no, miopen-opencl functionality is not supported within TF.

Removed miopen-opencl and mivisionx, installed miopen-hip and training looks to be working now.

For benchmarking I poked around some more and found https://github.com/lambdal/lambda-tensorflow-benchmark (related: https://lambdalabs.com/blog/best-gpu-tensorflow-2080-ti-vs-v100-vs-titan-v-vs-1080-ti-benchmark/). I've started up that set of benchmarks and will report back with results. These are all image tasks afaict, will basically compare loss and images/sec to get an idea about how the card is working out.

I've noticed while doing this we might want to setup some new prometheus metrics to collect. In particular /opt/rocm/bin/rocm-smi reports gpu temp, power usage, fan% and gpu usage %. These all might be useful to record in prometheus.

EBernhardson added a comment.EditedApr 4 2019, 11:05 PM

Synthetic benchmarks of runtime performance of CNN training in images/sec between CPU and WX9100. This essentially confirms what we already know, that even a GPU that is not top of the line is an order of magnitude faster than training on cpu. Distributed training isn't a linear speedup, so it would likely take a significant portion of the hadoop cluster to achieve the same runtime performance as a single GPU. It's good to get a verification that the GPU is mostly working in this configuration. Note also that the current case can only fit a single gpu, but ideally future hardware would be purchased with the ability to fit at least 2 cards, or possibly 4 cards, in a single server.

Comparing these numbers to nvidia cards, the WX9100 seems to be around 40-80% the speed of an nvidia 1080ti, depending on which network is being trained. The 1080ti is about 50% of nvidia's top of the line datacenter card.

ConfigE5-2640-CPUE5-2640-WX9100Speedup
resnet504.14137.6033.2x
resnet1521.8046.1225.6x
inception34.0768.1316.7x
inception42.1129.2313.8x
vgg162.5954.1220.9x
alexnet40.05964.7224.1x
ssd3001.9645.8123.4x

Comparing loss, this isn't the loss on a test set (there is no test set afaict, these are synthetic data benchmarks), things are roughly similar. The loss is simply the mean loss of the last 10 batches of a 100 mini-batch training. Not sure why three of the cpu benchmarks have nan loss, they were nan from beginning to end. The other numbers are reasonable enough it doesn't seem worthwhile to dig into the nan cpu results.

ConfigE5-2640-CPUE5-2640-WX9100
resnet508.238.102
resnet1529.9310.132
inception37.4377.409
inception47.9027.695
vgg16nan?7.250
alexnetnan?7.200
ssd300nan?686.462
elukey added a comment.Apr 5 2019, 6:54 AM

tensorflow-rocm 1.13.1 available for Python 3.7 on PyPi! https://pypi.org/project/tensorflow-rocm/1.13.1/#files

elukey added a comment.Apr 5 2019, 1:38 PM

@EBernhardson question for you - while working on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/501156/ (to allow Miriam and Gilles to ssh to stat1005) I wondered if the gpu-testers group (with full root perms) is still needed or not. I'd like to start putting a bit more puppet automation to the host, to leave the "hacky/testing" environment and transforming it a bit more to a production service. Does it make sense?

Doesn't seem to be needed anymore, feel free to start moving this to a more production configuration.

Change 501156 abandoned by Elukey:
admin: add gpu-users group and assign it to stat1005

Reason:
Had a chat with Erik, going to modify gpu-testers directly

https://gerrit.wikimedia.org/r/501156

Change 501575 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] admin: remove sudo permissions from gpu-testers and add users to it

https://gerrit.wikimedia.org/r/501575

Change 501580 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::statistics::gpu: add common statistics packages

https://gerrit.wikimedia.org/r/501580

Change 501580 merged by Elukey:
[operations/puppet@production] role::statistics::gpu: add common statistics packages

https://gerrit.wikimedia.org/r/501580

Change 501589 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::cluster::packages::statistics: add myspell guards

https://gerrit.wikimedia.org/r/501589

Change 501589 merged by Elukey:
[operations/puppet@production] profile::analytics::cluster::packages::statistics: add myspell guards

https://gerrit.wikimedia.org/r/501589

Change 501600 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::an::cluster::pkgs::statistics: add better handling of myspell pkgs

https://gerrit.wikimedia.org/r/501600

Change 501600 merged by Elukey:
[operations/puppet@production] profile::an::cluster::pkgs::statistics: add better handling of myspell pkgs

https://gerrit.wikimedia.org/r/501600

Change 501608 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] ores::base: fix package requires for Debian Buster

https://gerrit.wikimedia.org/r/501608

Change 501621 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Fix more common packages deployed to Buster based Analytics nodes

https://gerrit.wikimedia.org/r/501621

Change 501621 merged by Elukey:
[operations/puppet@production] Fix more common packages deployed to Buster based Analytics nodes

https://gerrit.wikimedia.org/r/501621

Change 501632 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Fix more common packages for Analytics hosts for Debian Buster

https://gerrit.wikimedia.org/r/501632

Change 501632 merged by Elukey:
[operations/puppet@production] Fix more common packages for Analytics hosts for Debian Buster

https://gerrit.wikimedia.org/r/501632

Change 501635 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Fix last common packages for Analytics hosts for Debian Buster

https://gerrit.wikimedia.org/r/501635

Change 501635 merged by Elukey:
[operations/puppet@production] Fix last common packages for Analytics hosts for Debian Buster

https://gerrit.wikimedia.org/r/501635

elukey added a comment.Apr 5 2019, 4:58 PM

The long list of patches above was needed to allow to deploy the common set of packages that all stat/notebook boxes have (so excluding hadoop client stuff). Created a code change to reduce permissions for gpu-testers and add Miriam/Gilles to it in https://gerrit.wikimedia.org/r/501575.

elukey added a comment.Apr 6 2019, 5:33 PM

The https://rocm.github.io/ROCmInstall.html module lists among the things to do the following:

  • echo 'SUBSYSTEM=="kfd", KERNEL=="kfd", TAG+="uaccess", GROUP="video"' | sudo tee /etc/udev/rules.d/70-kfd.rules
  • add the video group to adduser.conf (to automatically add new non system users to the video group when they are created)

I added originally only the former, and now I am wondering what is the best way forward for production. One thing that I tried to do is changing the video group with wikidev in the udev rule, but after reload/trigger a simple example of GPU usage doesn't seem working. The /dev/kfd device is correctly getting the right permissions, but I noticed that the frame buffer /dev/fd0 is set with root:video (and 660) so probably both needs to be changed?

The solution suggested by upstream, namely using adduser.conf, could be an option as well to avoid any issue in the future, but not sure what's best.

Change 501575 merged by Elukey:
[operations/puppet@production] admin: remove sudo permissions from gpu-testers and add users to it

https://gerrit.wikimedia.org/r/501575

Change 501608 merged by Elukey:
[operations/puppet@production] ores::base: fix package requires for Debian Buster

https://gerrit.wikimedia.org/r/501608

Change 502233 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Rely only on ores::base for common packages deployed to Analytics misc

https://gerrit.wikimedia.org/r/502233

Change 502341 had a related patch set uploaded (by Dr0ptp4kt; owner: Dr0ptp4kt):
[operations/puppet@production] Add dr0ptp4kt to gpu-testers

https://gerrit.wikimedia.org/r/502341

Hi, I'm requesting access to gpu-testers as well in order to begin validating model building.

The https://rocm.github.io/ROCmInstall.html module lists among the things to do the following:

  • echo 'SUBSYSTEM=="kfd", KERNEL=="kfd", TAG+="uaccess", GROUP="video"' | sudo tee /etc/udev/rules.d/70-kfd.rules

The uaccess rule doesn't really matter to us, is grants access to locally logged in users. That is useful for e.g. a desktop system so that the logged in user can access the DRI devices of the GPU to play games.

  • add the video group to adduser.conf (to automatically add new non system users to the video group when they are created)

We already maintain group information in puppet, how about we simply use "gpu-testers" for the initial tests and mid term when the GPUs are added to Hadoop nodes we can pick the most appropriate analytics group (or add a new one if only a subset of Hadoop users should be able to access them)

After some tests with Moritz we did the following:

  • install the latest version of systemd on stat1005, in which the video group is basically replaced with the render one. We decided to keep udev rules as standard as possible, adding later on some puppet glue to put users like gpu-testers into render automatically. For the moment it will be done manually by me when a new user is added.
  • tried to remove the hsa-ext-rocr-dev package, but it is tightly coupled with a lot of other packages (as its rdepends suggests). I tried to ask to upstream via https://github.com/RadeonOpenCompute/ROCR-Runtime/issues/33#issuecomment-481219299 what is the best practice for open-source only packages. I tried to download hsa-ext-rocr-dev and didn't find any license attached to the binary libs (control file doesn't say anything either).
  • after re-reading https://github.com/RadeonOpenCompute/ROCm/issues/267#issuecomment-422172140 I noticed that hsa-ext-rocr-dev impacts image processing for OpenCL, and IIUC @Gilles might be affected for his tests.

@EBernhardson FYI I had to reinstall some packages due to the above tests, if anything is weird/broken let me know!

elukey added a comment.EditedApr 9 2019, 1:06 PM

Ok I found a simple and hacky way to test the removal of hsa-ext-rocr-dev:

elukey@stat1005:~$ dpkg -L hsa-ext-rocr-dev
/opt
/opt/rocm
/opt/rocm/hsa
/opt/rocm/hsa/lib
/opt/rocm/hsa/lib/libhsa-ext-image64.so.1.1.9
/opt/rocm/hsa/lib/libhsa-runtime-tools64.so.1.1.9
/opt/rocm/hsa/lib/libhsa-ext-image64.so.1
/opt/rocm/hsa/lib/libhsa-runtime-tools64.so.1

sudo rm /opt/rocm/hsa/lib/libhsa-ext-image64.so.1.1.9 /opt/rocm/hsa/lib/libhsa-runtime-tools64.so.1.1.9 /opt/rocm/hsa/lib/libhsa-ext-image64.so.1 /opt/rocm/hsa/lib/libhsa-runtime-tools64.so.1

Confirmed with:

elukey@stat1005:~$ /opt/rocm/opencl/bin/x86_64/clinfo  | grep -i image
  Image support:				 No
  Max read/write image args:			 0
  Extensions:					 cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program
elukey@stat1005:~$ sudo rm /opt/rocm/hsa/lib/libhsa-ext-image64.so.1.1.9 /opt/rocm/hsa/lib/libhsa-runtime-tools64.so.1.1.9 /opt/rocm/hsa/lib/libhsa-ext-image64.so.1 /opt/rocm/hsa/lib/libhsa-runtime-tools64.so.1

Let's see what breaks! :)

@Gilles really curious to know if you'll have issues with image processing with OpenCL!

Change 502233 merged by Elukey:
[operations/puppet@production] Rely only on ores::base for common packages deployed to Analytics misc

https://gerrit.wikimedia.org/r/502233

Change 502341 merged by Elukey:
[operations/puppet@production] Add dr0ptp4kt to gpu-testers

https://gerrit.wikimedia.org/r/502341

elukey added a comment.Apr 9 2019, 5:22 PM

Hi, I'm requesting access to gpu-testers as well in order to begin validating model building.

Just added you to stat1005! Please keep in mind that we are still in a testing phase, things are not yet fully productionized so you'll likely encounter some issues while running your tests. Please report back on the task so we can try to fix them as soon as possible :)

HI All,

I quickly tested a simple training task on stat1005, fineutning a network to categorize images into 2 categories, using 2000 images, 1k per class.
After a few issues of missing libraries promptly fixed by @elukey, it worked very well.
The training went smooth, and Tensorflow was actully using the GPU, although not extensively, 0-20%, as the task was not complex :
(from Luca)

ROCm System Management Interface

GPU Temp AvgPwr SCLK MCLK PCLK Fan Perf PwrCap SCLK OD MCLK OD GPU%

1 24.0c 14.0W 852Mhz 167Mhz 8.0GT/s, x16 14.9% auto 170.0W 0% 0% 9%

End of ROCm SMI Log

Results are comparable with models trained on CPUs.
Next, I would like to test a more complex task, and measure how much we gain in performance between GPU and CPU.

So excited, thanks all for this amazing effort!

Nuria added a comment.Apr 11 2019, 9:05 PM

Next, I would like to test a more complex task, and measure how much we gain in performance between GPU and CPU.

+1

Opened https://github.com/RadeonOpenCompute/ROCm/issues/761 to upstream to see if they can remove the explicit dependencies in Debian packages to hsa-ext-rocr-dev (the only remaining closed source package).

Had an interesting chat with Gilles today about his use case. Thumbor is able to offload some functionalities like smart cropping to the GPU via OpenCL, so the plan would be to install thumbor on stat1005 and see if/how the AMD GPU will behave. This would be a very good test to see if hsa-ext-rocr-dev impacts the GPU work or not.

Nuria added a comment.Apr 19 2019, 2:56 PM

so the plan would be to install thumbor on stat1005

How would we compare the run on 1005 with other thumbor runs @Gilles ?

elukey added a comment.EditedApr 19 2019, 2:59 PM

I think the idea would be to run it with/without GPU active and see the differences in performance (IIUC). Ideally one possible outcome could be to see if Thumbor could leverage GPU accelleration on its hosts.

dr0ptp4kt added a comment.EditedApr 19 2019, 5:05 PM

(Detour)

@Nuria the other day I mentioned my project around use of DeepSpeech.

On my GTX 1080 at home, with a smallish set of about 900 WAV files of 5 seconds length or less the model training is 20+ times faster on GPU than without; according to nvidia-smi it was pegging out the RAM and cores on the GPU more or less.

There's some assembly required in that repo [1], probably too much to be practical in our environment. But if I get some spare time here or someone wanted to partner up, I'd be interested in digging into the feasibility of trying to make this run with ROCm components to see if the performance gains are comparable (or more exaggerated given the hardware) for the AMD card. It would be fun to run this on a bigger dataset (20 GB of MP3s) which has way more than 900 files. I was running this on spinning disk, although we can probably think of disk access as a constant, non-variable parameter.

I'm not saying we strictly need it, but if we could find time to pursue this line further would it be possible to get Docker [2] and git-lfs [3] installed on stat1005 if deemed beneficial / necessary?

[1] If you want to use the Docker container expressed in the repo the problems are amplified. The Dockerfile is running older dependencies, yet they work with tensorflow-gpu if you launch with the Nvidia --runtime inside of a bare metal Ubuntu 18.04 LTS environment; it's just that it calls out to external servers for the image pre-build plus you need git-lfs to get datasets post-build (or you need to pull those in pre-instantiation). Without Docker, it also works, but at least in a pristine WSL Ubuntu 16.04 case some binaries, although able to be complied from open source, weren't readily available from Canonical APT (I can see potential ways to work around that, but don't want to promise anything and haven't walked the make scripts nor surveyed all of the policy-permitted APT packages).

[2] For better isolation of (non-GPU) components and general ease of setup.

[3] I saw Aaron's ticket about git-lfs on the 1006/1007 boxes. In this GPU testing case on stat1005, git-lfs would be nice, but I think SCP can be made to work, too. Ultimately the dataset and package linkages need to be done inside the cluster, not directly via the internet.

Change 505694 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::cluster::packages::statistics: install git-lfs on buster

https://gerrit.wikimedia.org/r/505694

Change 505694 merged by Elukey:
[operations/puppet@production] profile::analytics::cluster::packages::statistics: install git-lfs on buster

https://gerrit.wikimedia.org/r/505694

(Detour)
@Nuria the other day I mentioned my project around use of DeepSpeech.
On my GTX 1080 at home, with a smallish set of about 900 WAV files of 5 seconds length or less the model training is 20+ times faster on GPU than without; according to nvidia-smi it was pegging out the RAM and cores on the GPU more or less.
There's some assembly required in that repo [1], probably too much to be practical in our environment. But if I get some spare time here or someone wanted to partner up, I'd be interested in digging into the feasibility of trying to make this run with ROCm components to see if the performance gains are comparable (or more exaggerated given the hardware) for the AMD card. It would be fun to run this on a bigger dataset (20 GB of MP3s) which has way more than 900 files. I was running this on spinning disk, although we can probably think of disk access as a constant, non-variable parameter.
I'm not saying we strictly need it, but if we could find time to pursue this line further would it be possible to get Docker [2] and git-lfs [3] installed on stat1005 if deemed beneficial / necessary?

Deployed git-lfs also on stat1005! I am personally in favor of supporting any testing work, but I have a couple of comments:

  • stat1005 is currently in a "testing" phase but eventually (I hope very soon) it should become a regular production host dedicated to the research team (as it was originally meant to be when we started this work). So I am all for building/training/etc.. models to see if RocM works but I also want to finally deliver this host to the Research team since it has been promised ages ago :)
  • Installing Docker would be rather problematic in my opinion from the security point of view. The SRE team is doing an extremely complex work for Kubernetes to ensure that our internal Docker registry is kept as secure as possible, and for example random images from the Internet are not allowed IIUC. The stat1005 host is inside the Analytics VLAN, so "close" to our most precious private data, and more care on this front is needed.

To conclude: I am all for supporting testing (deploying packages, trying to build, etc..) but I would not pursue the Docker road (for security concerns as explained above) and I'd need at some point to move stat1005 to the Research team to unblock their future projects.

Last but not the least: in T220698 I asked to the SRE team to investigate if the same model of GPU that runs on stat1005 could be deployed on other stat/notebook hosts. Ideally if we could get a couple more on other stat boxes it would be good to ease the testing for multiple teams/people. Our final dream is to run GPUs directly on Hadoop worker nodes but there is a ton of work to do before even figuring out that if is possible or not :)

Not sure if I have answered to your questions, if not please reach out to me, happy to help!

A reason why the SRE team is very strict in what Docker images are allowed: https://news.ycombinator.com/item?id=19763413

@elukey thanks for the follow up here. No need to block on me for the GPU. Fully agreed on the need for a secure supply chain.

Recap of what it has been done so fare in various (sub) tasks:

The goal is to finalize the tests with stat1005 (especially the ones for Thumbor/OpenCL) and eventually add stat1005 back into the pool of statistics nodes available to use.

Reporting some info from https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/559:

  • it seems that there is no guarantee of ABI compatibility between tensorflow-rocm versions and ROCm versions. For example, in order to use tf-rocm 1.13.4+ we'll need to use ROCm 2.6. We are currently using ROCm 2.5 since 2.6 made the GPU not working.
  • upstream asked more info about ROCm 2.6 on our environment, so I had a chat with @Miriam and she will restart testing next week. This will allow me to upgrade to ROCm 2.6 again, and see if anything can be reported upstream. Ideally in a few days 2.6 should be either working or a patch for a newer version that fixes our use case will be found by upstream.

If you need to use the gpu on stat1005 urgently please let me know.

Change 524095 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::statistics::gpu: upgrade to ROCm 2.6

https://gerrit.wikimedia.org/r/524095

Change 524095 merged by Elukey:
[operations/puppet@production] profile::statistics::gpu: upgrade to ROCm 2.6

https://gerrit.wikimedia.org/r/524095

Restarted from a clean state as indicated by upstream, and tensorflow-rocm 1.14.0 on ROCm 2.6 seems to work with basic tests now. I am a bit confused but.. better than hours of debugging :)

I'll work with Miriam (and whoever is interested) to test a bit more stat1005, then the plan is to apply a puppet role to it and allow everybody to use the GPU.

Summary of the current state and results achieved:

  • We added the puppet automation to import AMD ROCm drivers and packages to allow any host running a GPU to be configured correctly. At the moment only stat1005 has a GPU but in the future we hope more.
  • Miriam tested Tensorflow with basic examples and it seems working fine, but more accurate tests are coming (see subtasks).
  • On stat1005 every user that wants to be able to use the GPU needs to be added to the render posix group. The solution that we found as temporary measure is that all the users in gpu-testers (a group defined by puppet and that people can ask access to) are added to render by default. Eventually every user in analytics groups will be added to the render group transparently.
  • We are testing in https://phabricator.wikimedia.org/T229347 Spark 2 for Debian Buster, since stat1005 needs up to date OS+kernel to run (the other analytics hosts are still running Debian Stretch). This is the last step to complete before adding stat1005 back to analytics users as Hadoop client and GPU-powered node. We should be able to do it before the end of this current quarter.
  • All the info documented in https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU

We are testing in https://phabricator.wikimedia.org/T229347 Spark 2 for Debian Buster, since stat1005 needs up to date OS+kernel to run (the other analytics hosts are still running Debian Stretch). This is the last step to complete before adding stat1005 back to analytics users as Hadoop client and GPU-powered node. We should be able to do it before the end of this current quarter.

There are some issues with Debian Buster and Java 11 we need to investigate, so I think we can't commit to finishing this by end of quarter now. :(

Upgraded stat1005 with ROCm 2.7.1, from my tests everything looks good. Please use tensorflow-rocm 1.14.1 otherwise your scripts will fail!