
Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models
Open, Low, Public

Description

From @ellery:

I have been repeatedly running into computational bottlenecks training machine learning models over the last few months. Almost all popular ML libraries offer GPU support, which can speed up model training by orders of magnitude. I was debating asking for a new personal machine with a powerful GPU, but I think it makes the most sense to install a new GPU on one of the stat machines. It's far cheaper, and we can all share the resource. I talked to Andrew and he thinks it's relatively easy to install. The current blocker is getting funding. The current top-of-the-line Nvidia GPU is $1200.

The request has been reviewed by @Nuria and myself, is approved on my end and can be covered on Research budget.

@mark: any concern/additional questions from your team?

@ellery @Ottomata: can you guys 1) add the specs, 2) confirm with DC-Ops that the desired hardware physically fits in the box, and 3) clarify which stat machine you would like to have it installed on?

Related Objects

Event Timeline

  1. The null pointer is due to some code paths for Hawaii GPU cards (like ours), which are not supported by upstream. In this case, I'd propose to buy another supported AMD GPU card and restart from there (we need to verify some hw things, like the CPU supporting PCIe atomics and a PCIe x16 slot being available on the target host; hopefully stat1005 will be ok).

Shouldn't we do that anyway? The purpose of these tests is to assess whether it's a suitable setup for ML, but given that the current GFX7 GPU is a 1.5-year-old model which isn't fully supported by ROCm, it will skew the tests. Also, if we decide to buy more GPUs after successful tests we'll only be able to buy GFX8 or GFX9 cards anyway.

Yep, definitely, I am all for it. I hoped to get this GPU working beforehand to have a vague idea about what card we needed (and in what quantity), but we could simply buy a new one and test it as well. It would be great to get a contact at AMD to find a model that hopefully works, but I'm not sure if that's possible.

https://rocm.github.io/ROCmInstall.html#supported-gpus should serve as a useful enough base to select a new GPU I guess (we'll need to figure out what stat1005 supports, though)

I'd stick with:

GFX9 GPUs
“Vega 10” chips, such as on the AMD Radeon RX Vega 64 and Radeon Instinct MI25
“Vega 7nm” chips

The price for the Vega 7nm (https://www.amd.com/en/graphics/servers-radeon-instinct-deep-learning) is probably too high for buying something just to test (~$10k from what I can see now), but the other ones look reasonable.

Constraints for stat1005:

As described in the next section, GFX8 GPUs require PCI Express 3.0 (PCIe 3.0) with support for PCIe atomics. This requires both CPU and motherboard support. GFX9 GPUs, by default, also require PCIe 3.0 with support for PCIe atomics, but they can operate in most cases without this capability.

https://rocm.github.io/ROCmInstall.html#supported-cpus

Mentions that GFX9 cards do not strictly require PCIe 3.0 with atomics support, but it would be great to have them in my opinion.

Nuria added a comment.Feb 14 2019, 3:52 PM

Giving my approval here to buy a new GPU card. I need to consult with @elukey when it comes to budget, but I think we could use part of the money for the Druid expansion that we have not yet expensed.

elukey added a comment.EditedFeb 14 2019, 4:32 PM

Need to triple check with somebody else, but from the inventory stat1005 is a Dell PowerEdge R730, which should be equipped with an Intel Xeon E5-2600 v4, which in turn should support PCIe 3.0 and atomic ops (http://www.dakel.com/pdf/xeon-e5-brief.pdf).

What does everybody think about creating a task to get quotes for an AMD Radeon RX Vega 64 card? @EBernhardson thoughts? :)

As long as it fits in the case, a high-end consumer GPU from AMD should be just fine. The most important spec for choosing will probably be the amount of memory; more is (almost) always better. The sticking point might be length: consumer cards are typically 10.5" long, which might be too long for our case? Need to find out how much space is in there.

IIRC during the last procurement task Rob took care of all the aspects related to power consumption and space, so in theory we should be covered by expert hands for this part!

I looked back over things and it looks like stat1005 is in an R470 case; they advertise compatibility with several full-size Nvidia cards, so length is probably OK. For cards the choices seem to be:

Radeon VII 16GB: This is a brand new card released this month; 16GB on a consumer card is almost unheard of (and Google is still laughing with their 64GB TPUs). We might almost be too early: retail is $700, but I'm not seeing the 16GB version anywhere yet. A random internet blog claims "the stocks for AMD’s new 7nm Radeon VII graphics card were very limited with less than 5,000 units available globally at launch". There might also be more driver problems; it looks like the firmware for the Radeon VII similarly only landed in linux-firmware a couple of weeks ago: https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/?id=8bf607ca9bb42bc26bd4c8d574d88b6b6935c47d

RX Vega 64 8GB: This is AMD's previous top-of-the-line consumer card; these should be available almost anywhere. This is a perfectly acceptable card, but double the memory on the next step up is huge.

EBernhardson added a comment.EditedFeb 14 2019, 5:48 PM

Looks like one more option, a workstation card from AMD: the Vega Frontier has 16GB of memory with very similar compute to the Vega 64. Unfortunately this seems to have been discontinued; the current choice for 16GB of memory is the AMD Radeon Pro WX 9100. This is about double the price of a Vega 64, with the primary difference being double the memory. Whether that is important for this test is hard to say. My limited understanding is that many of the models publicly available for fine-tuning are built to fit a training batch in either 8, 12 or 16GB of memory. 8GB might be restrictive in some cases, but I don't have enough experience to say when exactly.

Need to triple check with somebody else, but from the inventory stat1005 is a Dell PowerEdge R730

I look back over things and it looks like stat1005 is in an R470 case

Looking more, here is the packing slip for stat1005: T162700#3248287
This has the FirePro S9150, and it was shipped in a Dell PowerEdge R730 chassis.

https://rocm.github.io/hardware.html is very handy. As said previously, I'd stick with a GFX9 card, I'd say an RX Vega 64 or Radeon VII 16GB; both are supported.

I had a chat with Rob and there are a lot of caveats to keep in mind, namely that we would be lucky to find a card "small" enough to fit into a PowerEdge chassis that does not (possibly) require extra power to work. IIUC the card that we have now was bought with stat1005, and it is part of Dell's GPU offering, namely a set of cards that are either integrated or built specifically for PowerEdge:

https://www.dell.com/learn/us/en/04/campaigns/poweredge-gpu#campaignTabs-1

Sadly the AMD offering is only the Hawaii series (GFX7, not supported by upstream but only "enabled"), so this option is probably not the best one.

I am going to open a task to discuss what to buy and link it to this task :)

Opened T216226 to discuss hw requirements for the new GPU (everybody interested, please subscribe/chime in!); let's use this task to debug the drivers on stat1005 :)

Thanks to Moritz we have Buster back on stat1005. I did the following:

  • Added radeon.cik_support=0 amdgpu.cik_support=1 to grub.cfg so we can reboot anytime and get the drivers working properly
  • Followed the ROCm install guide with the Linux kernel drivers, namely installing rocm-dev rocm-libs miopen-hip cxlactivitylogger (the last two are needed by TensorFlow)
  • Added the udev rule echo 'SUBSYSTEM=="kfd", KERNEL=="kfd", TAG+="uaccess", GROUP="video"' | sudo tee /etc/udev/rules.d/70-kfd.rules and added my and Erik's usernames to the video group (to avoid needing sudo to use kfd).
  • Added AMD's Debian repo manually (and disabled Puppet to prevent it from being wiped; going to fix this asap).

/opt/rocm/bin/rocminfo works, but /opt/rocm/opencl/bin/x86_64/clinfo leads to https://phabricator.wikimedia.org/T148843#4950469.

https://github.com/RadeonOpenCompute/ROCm/issues/640 seems to describe a similar issue (but with the ROCm drivers); it might help.

Change 491263 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Disable notifications for stat1005 while testing

https://gerrit.wikimedia.org/r/491263

Change 491263 merged by Elukey:
[operations/puppet@production] Disable notifications for stat1005 while testing

https://gerrit.wikimedia.org/r/491263

Very promising:

https://github.com/RadeonOpenCompute/ROCm/issues/702#issuecomment-461982554

As noted in #691 and #640, Hawaii GPUs (such as your FirePro W8100) are currently broken on ROCm 2.0. We are working to fix this, but these fixes did not make it in to ROCm 2.1. If you want to stick with ROCm 2.1, I would recommend removing your Hawaii GPU for now. Alternately, you can roll back to a previous version (e.g. using the roc-1.9.2 branch of our Experimental ROC installation scripts).

elukey added a comment.EditedFeb 18 2019, 6:46 PM

Tried to purge rocm-dev 2.1 and install 1.9.2, same problem: /opt/rocm/opencl/bin/x86_64/clinfo hangs, and when I hit Ctrl-C (unsuccessfully) I can see the following in dmesg.

[   90.690958] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
[   90.699698] PGD 0 P4D 0
[   90.702524] Oops: 0000 [#1] SMP PTI
[   90.706416] CPU: 3 PID: 2581 Comm: clinfo Not tainted 4.19.0-2-amd64 #1 Debian 4.19.16-1
[   90.715445] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.4.3 01/17/2017
[   90.723913] RIP: 0010:amdgpu_ib_schedule+0x50/0x550 [amdgpu]
[   90.730227] Code: 89 f5 49 89 fe 48 89 54 24 08 0f b6 87 1c 02 00 00 48 85 c9 0f 84 68 03 00 00 48 8b b9 d0 00 00 00 48 8b 51 10 48 89 7c 24 18 <48> 8b 7a 38 48 89 3c 24 84 c0 0f 84 ac 04 00 00 48 83 7c 24 18 00
[   90.751183] RSP: 0018:ffffb3cb87733a90 EFLAGS: 00010286
[   90.757012] RAX: 0000000000000001 RBX: ffff9f6a9a056800 RCX: ffff9f6a9a056800
[   90.764975] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
[   90.772937] RBP: 0000000000000001 R08: ffffb3cb87733b08 R09: ffff9f6a9a056800
[   90.780898] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9f6298b30000
[   90.788860] R13: 0000000000ffd000 R14: ffff9f6298b34f88 R15: ffff9f6299f7b000
[   90.796823] FS:  00007efca6e2e700(0000) GS:ffff9f6a9f640000(0000) knlGS:0000000000000000
[   90.805853] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   90.812264] CR2: 0000000000000038 CR3: 000000092a40a005 CR4: 00000000003606e0
[   90.820226] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   90.828189] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   90.836151] Call Trace:
[   90.838883]  ? __kmalloc+0x177/0x210
[   90.842932]  amdgpu_amdkfd_submit_ib+0xb3/0x170 [amdgpu]
[   90.848880]  deallocate_vmid.isra.10+0xe6/0x100 [amdkfd]
[   90.854817]  destroy_queue_nocpsch_locked+0x185/0x1c0 [amdkfd]
[   90.861332]  process_termination_nocpsch+0x61/0x130 [amdkfd]
[   90.867653]  kfd_process_dequeue_from_all_devices+0x3b/0x50 [amdkfd]
[   90.874749]  kfd_process_notifier_release+0xe4/0x170 [amdkfd]
[   90.881164]  __mmu_notifier_release+0x42/0xc0
[   90.886025]  exit_mmap+0x33/0x180
[   90.889724]  ? __get_user_8+0x21/0x2b
[   90.893807]  ? kmem_cache_free+0x1a7/0x1d0
[   90.898377]  mmput+0x54/0x130
[   90.901688]  do_exit+0x284/0xb20
[   90.905289]  do_group_exit+0x3a/0xa0
[   90.909279]  get_signal+0x27c/0x590
[   90.913173]  do_signal+0x36/0x6d0
[   90.916866]  ? check_preempt_curr+0x7a/0x90
[   90.921537]  ? __do_page_fault+0x26c/0x4f0
[   90.926109]  exit_to_usermode_loop+0x89/0xf0
[   90.930874]  do_syscall_64+0xfd/0x100
[   90.934960]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   90.940595] RIP: 0033:0x7efcad018fbc
[   90.944586] Code: Bad RIP value.

This is the result of attaching gdb with thread apply all bt to the hanging clinfo process (high CPU usage):

[.. cut up to Thread 42 due to the same stacktrace as the one below ..]

Thread 4 (Thread 0x7f35f65b7700 (LWP 5433)):
#0  0x00007f35fcf61fbc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f35fb577aad in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#2  0x00007f35fb53126a in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#3  0x00007f35fcf5bfa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#4  0x00007f35fce8a80f in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 3 (Thread 0x7f35f6db8700 (LWP 5432)):
#0  0x00007f35fcf61fbc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f35fb577aad in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#2  0x00007f35fb530f12 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#3  0x00007f35fcf5bfa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#4  0x00007f35fce8a80f in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 2 (Thread 0x7f35f80a7700 (LWP 5431)):
#0  0x00007f35fce81757 in ioctl () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f35f87cb0b8 in ?? () from /opt/rocm/lib/libhsakmt.so.1
#2  0x00007f35f87c4c8f in hsaKmtWaitOnMultipleEvents () from /opt/rocm/lib/libhsakmt.so.1
#3  0x00007f35f8a36453 in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#4  0x00007f35f8a211a6 in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#5  0x00007f35f8a31ae2 in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#6  0x00007f35f89f6637 in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#7  0x00007f35fcf5bfa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#8  0x00007f35fce8a80f in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 1 (Thread 0x7f35fcd8e740 (LWP 5396)):
#0  0x00007f35fce71bf7 in sched_yield () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f35f8a09225 in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#2  0x00007f35f8a01f26 in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#3  0x00007f35f8a0ba9e in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#4  0x00007f35f8a3e155 in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#5  0x00007f35f8a3e86e in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#6  0x00007f35f8a421cd in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#7  0x00007f35f8a1b147 in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#8  0x00007f35f921c60a in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#9  0x00007f35f921e4ac in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#10 0x00007f35f91f1c5f in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#11 0x00007f35f91e3ace in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#12 0x00007f35f91eef92 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#13 0x00007f35f9213104 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#14 0x00007f35f9213f95 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#15 0x00007f35f91ec9d3 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#16 0x00007f35f91eb137 in ?? () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#17 0x00007f35f91c138a in clIcdGetPlatformIDsKHR () from /opt/rocm/opencl/lib/x86_64/libamdocl64.so
#18 0x00007f35fcf9198b in ?? () from /opt/rocm/opencl/lib/x86_64/libOpenCL.so.1
#19 0x00007f35fcf93907 in ?? () from /opt/rocm/opencl/lib/x86_64/libOpenCL.so.1
#20 0x00007f35fcf63947 in __pthread_once_slow () from /lib/x86_64-linux-gnu/libpthread.so.0
#21 0x00007f35fcf91f21 in clGetPlatformIDs () from /opt/rocm/opencl/lib/x86_64/libOpenCL.so.1
#22 0x000000000040f617 in ?? ()
#23 0x0000000000407c12 in ?? ()
#24 0x00007f35fcdb509b in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#25 0x000000000040e6d1 in ?? ()

Output of top -H -p 5396

5396 elukey    20   0  258.9g  72152  50876 R  99.9   0.1   3:53.34 clinfo
5431 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5432 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5433 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5434 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5435 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5436 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5437 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5438 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5439 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5440 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5441 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5442 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5443 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5444 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5445 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5446 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5447 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5448 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5449 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5450 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5451 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5452 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5453 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5454 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5455 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5456 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5457 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5458 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5459 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5460 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5461 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5462 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5463 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5464 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5465 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5466 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5467 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5468 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5469 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5470 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo
5471 elukey    20   0  258.9g  72152  50876 S   0.0   0.1   0:00.00 clinfo

The strace of the process that is basically consuming one CPU (70+% system) shows this:

sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
...

In the dmesg I can see a lot of:

[Tue Feb 19 07:29:16 2019] amdgpu: [powerplay]
                            failed to send message 282 ret is 254

But still no null pointer dereference. Then, if I send Ctrl+C to the clinfo process, I get BUG: unable to handle kernel NULL pointer dereference at 0000000000000038.

https://github.com/RadeonOpenCompute/ROCm/issues/482 is a very similar problem, so I tried a couple of suggestions in here:

  • export HSA_ENABLE_SDMA=0; clinfo -> still hangs
  • compile /opt/rocm/hip/bin/hipInfo -> hangs
  • compile /opt/rocm/hip/samples/0_Intro/square -> returns the following and hangs: ### HCC STATUS_CHECK Error: HSA_STATUS_ERROR_OUT_OF_RESOURCES (0x1008) at file:mcwamp_hsa.cpp line:1218

The user reporting the GH issue was running NixOS and "solved" the issue by installing Ubuntu. At this point it seems to me that the issue might lie either in the kernel drivers or in ROCm itself (running on not fully supported hardware).

What do you think about opening a GH issue to ROCm first to (hopefully) get some feedback?

What do you think about opening a GH issue to ROCm first to (hopefully) get some feedback?

Sounds good, but I wouldn't hold my breath that we'll see usable support for the current GFX7; I think it makes sense to focus on getting a new GPU.

elukey added a comment.EditedFeb 20 2019, 12:23 PM

Quick update:

amdgpu.dc=0 (set to 1 by default on 4.17+) fixes the following errors:

[Tue Feb 19 08:35:06 2019] amdgpu: [powerplay] Failed to retrieve minimum clocks.
[Tue Feb 19 08:35:06 2019] amdgpu: [powerplay] Error in phm_get_clock_info
[Tue Feb 19 08:35:06 2019] [drm:dc_create [amdgpu]] *ERROR* DC: Number of connectors is zero!

amdgpu.dpm=1 leads to:

[Wed Feb 20 12:15:15 2019] [drm:amdgpu_vce_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 12 test failed
[Wed Feb 20 12:15:15 2019] [drm:amdgpu_device_init.cold.28 [amdgpu]] *ERROR* hw_init of IP block <vce_v2_0> failed -110
[Wed Feb 20 12:15:15 2019] amdgpu 0000:04:00.0: amdgpu_device_ip_init failed
[Wed Feb 20 12:15:15 2019] amdgpu 0000:04:00.0: Fatal error during GPU init

The following is still happening:

[Wed Feb 20 12:20:16 2019] amdgpu: [powerplay]
                            failed to send message 282 ret is 254

Updates from https://github.com/RadeonOpenCompute/ROCm/issues/714#issuecomment-465666946 are not encouraging; gfx701 is a dead end, so we should buy a new card asap :(

elukey moved this task from In Progress to Stalled on the User-Elukey board.Feb 25 2019, 8:44 AM
Nuria renamed this task from GPU upgrade for stats machine to Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models.Mar 14 2019, 6:53 PM
elukey changed the status of subtask T216226: GPU upgrade for stat1005 from Open to Stalled.Mar 28 2019, 9:21 AM
elukey changed the task status from Open to Stalled.

All the info is tracked in T216226. We are going to buy an AMD Radeon Pro WX 9100 16GB. Setting this task to stalled pending procurement of the new hardware.

elukey changed the task status from Stalled to Open.

Vega GPU mounted on stat1005; it looks good from a first round of tests!

https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/389#issuecomment-479082013 says that tensorflow-rocm should be available for Python 3.7 soon on PyPI, which is more good news :)

hacks abound, but basically:

  • Added deb [arch=amd64] http://repo.radeon.com/rocm/apt/debian/ xenial main to sources.list.d/rocm.list
  • upgraded all the rocm packages to latest version
  • Built a python 3.6 virtualenv on stat1007 including tensorflow-rocm and tensorflow_hub
  • Copied virtualenv to stat1005

This appears to successfully run a few different models I tried: the TF MNIST demo, ELMo from tf_hub, and Miriam's image quality model.

No clue on performance yet; to measure that we need to set up a model we might care about with a performant data input pipeline (TFRecords, as opposed to placeholders and feed_dicts, most likely).
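A rough sketch of what such a tf.data/TFRecord input pipeline could look like (TF 1.x style, matching tensorflow-rocm; the file path, feature names and image size are made up for illustration):

import tensorflow as tf

# Hypothetical TFRecord shards and feature spec; adjust to the actual model.
filenames = ["/srv/ml/train-00000-of-00010.tfrecord"]
feature_spec = {
    "image": tf.FixedLenFeature([], tf.string),
    "label": tf.FixedLenFeature([], tf.int64),
}

def parse(example_proto):
    # Decode one serialized tf.Example into an image tensor and a label.
    parsed = tf.parse_single_example(example_proto, feature_spec)
    image = tf.image.decode_jpeg(parsed["image"], channels=3)
    image = tf.image.resize_images(image, [224, 224])
    return image, parsed["label"]

dataset = (tf.data.TFRecordDataset(filenames)
           .map(parse, num_parallel_calls=4)
           .shuffle(1024)
           .batch(32)
           .prefetch(2))  # keep the GPU fed while the CPU decodes the next batches

images, labels = dataset.make_one_shot_iterator().get_next()
# images/labels are graph tensors that can be wired straight into the model,
# instead of feeding numpy arrays through placeholders and feed_dict.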

Nuria added a subscriber: Gilles.Apr 2 2019, 8:02 PM

ping @Miriam @Gilles so they know the status of this.

elukey added a comment.EditedApr 3 2019, 7:42 AM

I think that we should move away from the hacks done up to now and start adding the config that we are using to Puppet.

Things to do from the SRE side:

  • add a new component for Debian Buster for the rocm packages, and see what to do with https://github.com/RadeonOpenCompute/ROCm#closed-source-components (it seems to be only one package; hopefully it is not strictly needed).
  • modify the gpu-testers group to avoid sudo rules and add more people to it (Miriam/Gilles are the first ones that come to mind). In this way more people will have access to stat1005 and will be able to provide feedback.
  • see if it is possible to make an apt component like https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/480041/ for Debian Buster, to have Python 3.6 available (3.7 support should come soon, but it's not clear when).

About the closed source package:

elukey@stat1005:~$ apt-cache rdepends hsa-ext-rocr-dev
hsa-ext-rocr-dev
Reverse Depends:
  hsa-rocr-dev
  rocm-dev
  hcc

That should be straightforward; python3.6 was only recently removed from Debian testing and we should be able to simply fetch the packages from snapshot.debian.org:
https://packages.qa.debian.org/p/python3.6/news/20190330T191602Z.html

@EBernhardson I think that the most pressing point now is to decide/test whether we need hsa-ext-rocr-dev (the only package containing binary-only libs). I gathered some info about the package:

https://github.com/RadeonOpenCompute/ROCR-Runtime/issues/33#issuecomment-422174043
https://github.com/RadeonOpenCompute/ROCm/issues/267#issuecomment-422172140

IIUC this library is aimed at direct image support in OpenCL; not sure if this is mandatory for us or not. From what upstream says the whole set of libs should work without any issue, so we could simply import all the packages from the ROCm repos into our reprepro and then add a dummy package for hsa-ext-rocr-dev to satisfy deb dependencies (see the rdepends output above).

Thanks @EBernhardson and all! Would a CNN fine-tuning task, using only a few thousand images as input, work as a training task for testing performance?

Thanks @EBernhardson and all! Would a CNN fine-tuning task, using only a few thousand images as input, work as a training task for testing performance?

Something like that sounds great. Mainly I want to ensure that training is generating reasonable outputs; the only training test I've done so far is the MNIST example, which is very much a toy problem. Images might also allow us to test whether the hsa-ext-rocr-dev package brought up by @elukey is going to be necessary, as it seems to be an image support lib of some sort.

Miriam added a comment.Apr 3 2019, 7:55 PM

OK, I can prepare a task for this, or we can start from something like this maybe?
https://gist.github.com/omoindrot/dedc857cdc0e680dfb1be99762990c9c/
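Not the gist itself, but a minimal sketch of the kind of two-class fine-tuning job that could be timed on stat1005, using tf.keras with ImageNet weights (the directory layout, image size and hyperparameters are made up; the pretrained weights would need to be reachable or pre-downloaded):

import tensorflow as tf

# Hypothetical directory with two class subfolders, ~1k images each.
train_dir = "/srv/ml/binary_image_task/train"

# Reuse ImageNet weights and only train a small classification head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])

# Stream images from disk in batches; rescale pixel values to [0, 1].
gen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1.0 / 255)
train_flow = gen.flow_from_directory(
    train_dir, target_size=(224, 224), batch_size=32, class_mode="binary")

model.fit_generator(train_flow, epochs=5)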

Nuria added a comment.Apr 3 2019, 8:19 PM

I wonder if we can use Fashion-MNIST to benchmark: https://research.zalando.com/welcome/mission/research-projects/fashion-mnist/ as there seems to be a bunch of people using it as a replacement for the original MNIST. Could we set up training with/without GPU here?
https://www.tensorflow.org/guide/using_gpu
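A rough sketch of how that with/without-GPU comparison could be scripted (Fashion-MNIST ships with tf.keras; the tiny CNN and the device strings are illustrative and may need adjusting for the ROCm build):

import time
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.fashion_mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype("float32") / 255.0

def train_on(device):
    # Build and train a small CNN pinned to the given device, return seconds.
    tf.keras.backend.clear_session()  # fresh graph for each run
    with tf.device(device):
        model = tf.keras.Sequential([
            tf.keras.layers.Conv2D(32, 3, activation="relu",
                                   input_shape=(28, 28, 1)),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(10, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        start = time.time()
        model.fit(x_train, y_train, batch_size=128, epochs=2, verbose=1)
        return time.time() - start

print("CPU seconds:", train_on("/cpu:0"))
print("GPU seconds:", train_on("/gpu:0"))  # assumes TF exposes the WX 9100 as gpu:0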

@EBernhardson on stat1005 mivisionx was causing a broken apt, so after reading https://github.com/RadeonOpenCompute/ROCm/issues/350#issuecomment-370100523 I have done the following:

  • removed mivisionx to fix apt
  • removed miopen-hip
  • installed mivisionx (that brings in miopen-opencl, the one conflicting with miopen-hip for header files)

Change 501156 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] admin: add gpu-users group and assign it to stat1005

https://gerrit.wikimedia.org/r/501156

EBernhardson added a comment.EditedApr 4 2019, 4:54 PM

With the package changes, trying to run any model now returns:

ebernhardson@stat1005:~$ /home/ebernhardson/tf_venv/bin/python /home/ebernhardson/mnist.py
ImportError: /opt/rocm/lib/libMIOpen.so.1: version `MIOPEN_HIP_1' not found (required by /home/ebernhardson/tf_venv/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so)

Not sure which packages we want exactly; I would mostly just be doing trial and error.

A solution could be to remove mivisionx (not sure if needed) and restore miopen-hip

per https://github.com/RadeonOpenCompute/ROCm/issues/703#issuecomment-462598966

that means no, miopen-opencl functionality is not supported within TF.

Removed miopen-opencl and mivisionx, installed miopen-hip, and training looks to be working now.

For benchmarking I poked around some more and found https://github.com/lambdal/lambda-tensorflow-benchmark (related: https://lambdalabs.com/blog/best-gpu-tensorflow-2080-ti-vs-v100-vs-titan-v-vs-1080-ti-benchmark/). I've started that set of benchmarks and will report back with results. These are all image tasks AFAICT; I will basically compare loss and images/sec to get an idea of how the card is working out.

I've noticed while doing this that we might want to set up some new Prometheus metrics to collect. In particular, /opt/rocm/bin/rocm-smi reports GPU temperature, power usage, fan % and GPU usage %. These would all be useful to record in Prometheus.
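For example, a tiny exporter along these lines could feed those values into Prometheus (a sketch only: it assumes the prometheus_client Python package, parses the plain rocm-smi table output whose layout may change between ROCm releases, and uses an arbitrary port):

import re
import subprocess
import time

from prometheus_client import Gauge, start_http_server

# Gauges for a few of the values rocm-smi prints.
TEMP = Gauge("amd_rocm_gpu_temperature_celsius", "GPU temperature")
POWER = Gauge("amd_rocm_gpu_power_watts", "GPU average power draw")
USAGE = Gauge("amd_rocm_gpu_usage_percent", "GPU utilisation")

def scrape():
    out = subprocess.check_output(["/opt/rocm/bin/rocm-smi"]).decode()
    for line in out.splitlines():
        # Data rows start with the GPU index, e.g.
        # "1  24.0c  14.0W  852Mhz ...  9%"
        m = re.match(r"\s*\d+\s+([\d.]+)c\s+([\d.]+)W\s+.*?([\d.]+)%\s*$", line)
        if m:
            TEMP.set(float(m.group(1)))
            POWER.set(float(m.group(2)))
            USAGE.set(float(m.group(3)))

if __name__ == "__main__":
    start_http_server(9101)  # hypothetical exporter port
    while True:
        scrape()
        time.sleep(30)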

EBernhardson added a comment.EditedApr 4 2019, 11:05 PM

Synthetic benchmarks of the runtime performance of CNN training, in images/sec, between the CPU and the WX9100. This essentially confirms what we already know: even a GPU that is not top of the line is an order of magnitude faster than training on CPU. Distributed training isn't a linear speedup, so it would likely take a significant portion of the Hadoop cluster to achieve the same runtime performance as a single GPU. It's good to get verification that the GPU is mostly working in this configuration. Note also that the current case can only fit a single GPU, but ideally future hardware would be purchased with the ability to fit at least 2, or possibly 4, cards in a single server.

Comparing these numbers to Nvidia cards, the WX9100 seems to be around 40-80% of the speed of an Nvidia 1080 Ti, depending on which network is being trained. The 1080 Ti is about 50% of Nvidia's top-of-the-line datacenter card.

| Config | E5-2640-CPU | E5-2640-WX9100 | Speedup |
| resnet50 | 4.14 | 137.60 | 33.2x |
| resnet152 | 1.80 | 46.12 | 25.6x |
| inception3 | 4.07 | 68.13 | 16.7x |
| inception4 | 2.11 | 29.23 | 13.8x |
| vgg16 | 2.59 | 54.12 | 20.9x |
| alexnet | 40.05 | 964.72 | 24.1x |
| ssd300 | 1.96 | 45.81 | 23.4x |

Comparing loss (this isn't the loss on a test set; there is no test set AFAICT, these are synthetic data benchmarks), things are roughly similar. The loss is simply the mean loss of the last 10 batches of a 100-mini-batch training run. Not sure why three of the CPU benchmarks have NaN loss; they were NaN from beginning to end. The other numbers are reasonable enough that it doesn't seem worthwhile to dig into the NaN CPU results.

| Config | E5-2640-CPU | E5-2640-WX9100 |
| resnet50 | 8.23 | 8.102 |
| resnet152 | 9.93 | 10.132 |
| inception3 | 7.437 | 7.409 |
| inception4 | 7.902 | 7.695 |
| vgg16 | nan? | 7.250 |
| alexnet | nan? | 7.200 |
| ssd300 | nan? | 686.462 |

tensorflow-rocm 1.13.1 available for Python 3.7 on PyPi! https://pypi.org/project/tensorflow-rocm/1.13.1/#files
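Once that wheel is installed in a virtualenv, a quick sanity check that TensorFlow actually sees the card (a sketch; the exact device description printed for the WX 9100 may differ):

import tensorflow as tf
from tensorflow.python.client import device_lib

# List every device TF can see; the ROCm GPU should show up as /device:GPU:0.
for dev in device_lib.list_local_devices():
    print(dev.name, dev.physical_device_desc)

# Alternatively, a one-liner check:
print("GPU available:", tf.test.is_gpu_available())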

@EBernhardson question for you: while working on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/501156/ (to allow Miriam and Gilles to ssh to stat1005) I wondered whether the gpu-testers group (with full root perms) is still needed or not. I'd like to start adding a bit more Puppet automation to the host, to leave the "hacky/testing" environment behind and turn it a bit more into a production service. Does that make sense?

Doesn't seem to be needed anymore, feel free to start moving this to a more production configuration.

Change 501156 abandoned by Elukey:
admin: add gpu-users group and assign it to stat1005

Reason:
Had a chat with Erik, going to modify gpu-testers directly

https://gerrit.wikimedia.org/r/501156

Change 501575 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] admin: remove sudo permissions from gpu-testers and add users to it

https://gerrit.wikimedia.org/r/501575

Change 501580 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::statistics::gpu: add common statistics packages

https://gerrit.wikimedia.org/r/501580

Change 501580 merged by Elukey:
[operations/puppet@production] role::statistics::gpu: add common statistics packages

https://gerrit.wikimedia.org/r/501580

Change 501589 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::cluster::packages::statistics: add myspell guards

https://gerrit.wikimedia.org/r/501589

Change 501589 merged by Elukey:
[operations/puppet@production] profile::analytics::cluster::packages::statistics: add myspell guards

https://gerrit.wikimedia.org/r/501589

Change 501600 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::an::cluster::pkgs::statistics: add better handling of myspell pkgs

https://gerrit.wikimedia.org/r/501600

Change 501600 merged by Elukey:
[operations/puppet@production] profile::an::cluster::pkgs::statistics: add better handling of myspell pkgs

https://gerrit.wikimedia.org/r/501600

Change 501608 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] ores::base: fix package requires for Debian Buster

https://gerrit.wikimedia.org/r/501608

Change 501621 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Fix more common packages deployed to Buster based Analytics nodes

https://gerrit.wikimedia.org/r/501621

Change 501621 merged by Elukey:
[operations/puppet@production] Fix more common packages deployed to Buster based Analytics nodes

https://gerrit.wikimedia.org/r/501621

Change 501632 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Fix more common packages for Analytics hosts for Debian Buster

https://gerrit.wikimedia.org/r/501632

Change 501632 merged by Elukey:
[operations/puppet@production] Fix more common packages for Analytics hosts for Debian Buster

https://gerrit.wikimedia.org/r/501632

Change 501635 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Fix last common packages for Analytics hosts for Debian Buster

https://gerrit.wikimedia.org/r/501635

Change 501635 merged by Elukey:
[operations/puppet@production] Fix last common packages for Analytics hosts for Debian Buster

https://gerrit.wikimedia.org/r/501635

The long list of patches above was needed to allow deploying the common set of packages that all stat/notebook boxes have (so excluding Hadoop client stuff). I created a code change to reduce permissions for gpu-testers and add Miriam/Gilles to it: https://gerrit.wikimedia.org/r/501575.

The https://rocm.github.io/ROCmInstall.html guide lists, among the things to do, the following:

  • echo 'SUBSYSTEM=="kfd", KERNEL=="kfd", TAG+="uaccess", GROUP="video"' | sudo tee /etc/udev/rules.d/70-kfd.rules
  • add the video group to adduser.conf (to automatically add new non system users to the video group when they are created)

I originally added only the former, and now I am wondering what the best way forward for production is. One thing that I tried was replacing the video group with wikidev in the udev rule, but after a reload/trigger a simple example of GPU usage doesn't seem to work. The /dev/kfd device correctly gets the right permissions, but I noticed that the frame buffer /dev/fb0 is set to root:video (and 660), so probably both need to be changed?

The solution suggested by upstream, namely using adduser.conf, could be an option as well to avoid any issues in the future, but I'm not sure what's best.

Change 501575 merged by Elukey:
[operations/puppet@production] admin: remove sudo permissions from gpu-testers and add users to it

https://gerrit.wikimedia.org/r/501575

Change 501608 merged by Elukey:
[operations/puppet@production] ores::base: fix package requires for Debian Buster

https://gerrit.wikimedia.org/r/501608

Change 502233 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Rely only on ores::base for common packages deployed to Analytics misc

https://gerrit.wikimedia.org/r/502233

Change 502341 had a related patch set uploaded (by Dr0ptp4kt; owner: Dr0ptp4kt):
[operations/puppet@production] Add dr0ptp4kt to gpu-testers

https://gerrit.wikimedia.org/r/502341

Hi, I'm requesting access to gpu-testers as well in order to begin validating model building.

The https://rocm.github.io/ROCmInstall.html guide lists, among the things to do, the following:

  • echo 'SUBSYSTEM=="kfd", KERNEL=="kfd", TAG+="uaccess", GROUP="video"' | sudo tee /etc/udev/rules.d/70-kfd.rules

The uaccess rule doesn't really matter to us; it grants access to locally logged-in users. That is useful for e.g. a desktop system, so that the logged-in user can access the DRI devices of the GPU to play games.

  • add the video group to adduser.conf (to automatically add new non system users to the video group when they are created)

We already maintain group information in Puppet; how about we simply use "gpu-testers" for the initial tests, and mid-term, when the GPUs are added to Hadoop nodes, we can pick the most appropriate analytics group (or add a new one if only a subset of Hadoop users should be able to access them)?

After some tests with Moritz we did the following:

  • installed the latest version of systemd on stat1005, in which the video group is basically replaced by the render one. We decided to keep the udev rules as standard as possible, adding some Puppet glue later on to put users like gpu-testers into render automatically. For the moment this will be done manually by me when a new user is added.
  • tried to remove the hsa-ext-rocr-dev package, but it is tightly coupled with a lot of other packages (as its rdepends suggests). I asked upstream via https://github.com/RadeonOpenCompute/ROCR-Runtime/issues/33#issuecomment-481219299 what the best practice is for open-source-only packages. I downloaded hsa-ext-rocr-dev and didn't find any license attached to the binary libs (the control file doesn't say anything either).
  • after re-reading https://github.com/RadeonOpenCompute/ROCm/issues/267#issuecomment-422172140 I noticed that hsa-ext-rocr-dev impacts image processing for OpenCL, and IIUC @Gilles might be affected in his tests.

@EBernhardson FYI I had to reinstall some packages due to the above tests; if anything is weird/broken let me know!

elukey added a comment.EditedApr 9 2019, 1:06 PM

OK, I found a simple and hacky way to test the removal of hsa-ext-rocr-dev:

elukey@stat1005:~$ dpkg -L hsa-ext-rocr-dev
/opt
/opt/rocm
/opt/rocm/hsa
/opt/rocm/hsa/lib
/opt/rocm/hsa/lib/libhsa-ext-image64.so.1.1.9
/opt/rocm/hsa/lib/libhsa-runtime-tools64.so.1.1.9
/opt/rocm/hsa/lib/libhsa-ext-image64.so.1
/opt/rocm/hsa/lib/libhsa-runtime-tools64.so.1

sudo rm /opt/rocm/hsa/lib/libhsa-ext-image64.so.1.1.9 /opt/rocm/hsa/lib/libhsa-runtime-tools64.so.1.1.9 /opt/rocm/hsa/lib/libhsa-ext-image64.so.1 /opt/rocm/hsa/lib/libhsa-runtime-tools64.so.1

Confirmed with:

elukey@stat1005:~$ /opt/rocm/opencl/bin/x86_64/clinfo  | grep -i image
  Image support:				 No
  Max read/write image args:			 0
  Extensions:					 cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program
elukey@stat1005:~$ sudo rm /opt/rocm/hsa/lib/libhsa-ext-image64.so.1.1.9 /opt/rocm/hsa/lib/libhsa-runtime-tools64.so.1.1.9 /opt/rocm/hsa/lib/libhsa-ext-image64.so.1 /opt/rocm/hsa/lib/libhsa-runtime-tools64.so.1

Let's see what breaks! :)

@Gilles really curious to know if you'll have issues with image processing with OpenCL!

Change 502233 merged by Elukey:
[operations/puppet@production] Rely only on ores::base for common packages deployed to Analytics misc

https://gerrit.wikimedia.org/r/502233

Change 502341 merged by Elukey:
[operations/puppet@production] Add dr0ptp4kt to gpu-testers

https://gerrit.wikimedia.org/r/502341

Hi, I'm requesting access to gpu-testers as well in order to begin validating model building.

Just added you to stat1005! Please keep in mind that we are still in a testing phase; things are not yet fully productionized, so you'll likely encounter some issues while running your tests. Please report back on the task so we can try to fix them as soon as possible :)

Hi all,

I quickly tested a simple training task on stat1005, fine-tuning a network to categorize images into 2 categories, using 2000 images, 1k per class.
After a few missing-library issues, promptly fixed by @elukey, it worked very well.
The training went smoothly, and TensorFlow was actually using the GPU, although not extensively (0-20%), as the task was not complex:
(from Luca)

ROCm System Management Interface

GPU Temp AvgPwr SCLK MCLK PCLK Fan Perf PwrCap SCLK OD MCLK OD GPU%

1 24.0c 14.0W 852Mhz 167Mhz 8.0GT/s, x16 14.9% auto 170.0W 0% 0% 9%

End of ROCm SMI Log

Results are comparable with models trained on CPUs.
Next, I would like to test a more complex task, and measure how much we gain in performance between GPU and CPU.

So excited, thanks all for this amazing effort!

Nuria added a comment.Apr 11 2019, 9:05 PM

Next, I would like to test a more complex task, and measure how much we gain in performance between GPU and CPU.

+1

Opened https://github.com/RadeonOpenCompute/ROCm/issues/761 with upstream to see if they can remove the explicit dependencies on hsa-ext-rocr-dev (the only remaining closed-source package) from the Debian packages.

Had an interesting chat with Gilles today about his use case. Thumbor is able to offload some functionality, like smart cropping, to the GPU via OpenCL, so the plan would be to install Thumbor on stat1005 and see if/how the AMD GPU behaves. This would be a very good test to see whether hsa-ext-rocr-dev impacts the GPU work or not.
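Before wiring up Thumbor, a quick way to check from Python whether the OpenCL runtime still advertises image support once hsa-ext-rocr-dev is gone (a sketch assuming the pyopencl package is available; it is not part of the ROCm packages installed above):

import pyopencl as cl

# Walk every OpenCL platform/device the ICD loader exposes and report whether
# image support (the functionality hsa-ext-rocr-dev provides) is advertised.
for platform in cl.get_platforms():
    for device in platform.get_devices():
        print(platform.name, "/", device.name,
              "image support:", bool(device.image_support))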

Nuria added a comment.Apr 19 2019, 2:56 PM

so the plan would be to install thumbor on stat1005

How would we compare the run on 1005 with other Thumbor runs, @Gilles?

elukey added a comment.EditedApr 19 2019, 2:59 PM

I think the idea would be to run it with/without the GPU active and see the differences in performance (IIUC). Ideally one possible outcome could be to see if Thumbor could leverage GPU acceleration on its hosts.

dr0ptp4kt added a comment.EditedApr 19 2019, 5:05 PM

(Detour)

@Nuria the other day I mentioned my project around use of DeepSpeech.

On my GTX 1080 at home, with a smallish set of about 900 WAV files of 5 seconds length or less the model training is 20+ times faster on GPU than without; according to nvidia-smi it was pegging out the RAM and cores on the GPU more or less.

There's some assembly required in that repo [1], probably too much to be practical in our environment. But if I get some spare time here or someone wanted to partner up, I'd be interested in digging into the feasibility of trying to make this run with ROCm components to see if the performance gains are comparable (or more exaggerated given the hardware) for the AMD card. It would be fun to run this on a bigger dataset (20 GB of MP3s) which has way more than 900 files. I was running this on spinning disk, although we can probably think of disk access as a constant, non-variable parameter.

I'm not saying we strictly need it, but if we could find time to pursue this line further would it be possible to get Docker [2] and git-lfs [3] installed on stat1005 if deemed beneficial / necessary?

[1] If you want to use the Docker container expressed in the repo the problems are amplified. The Dockerfile is running older dependencies, yet they work with tensorflow-gpu if you launch with the Nvidia --runtime inside of a bare metal Ubuntu 18.04 LTS environment; it's just that it calls out to external servers for the image pre-build, plus you need git-lfs to get datasets post-build (or you need to pull those in pre-instantiation). Without Docker, it also works, but at least in a pristine WSL Ubuntu 16.04 case some binaries, although able to be compiled from open source, weren't readily available from Canonical APT (I can see potential ways to work around that, but don't want to promise anything and haven't walked the make scripts nor surveyed all of the policy-permitted APT packages).

[2] For better isolation of (non-GPU) components and general ease of setup.

[3] I saw Aaron's ticket about git-lfs on the 1006/1007 boxes. In this GPU testing case on stat1005, git-lfs would be nice, but I think SCP can be made to work, too. Ultimately the dataset and package linkages need to be done inside the cluster, not directly via the internet.

Change 505694 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::cluster::packages::statistics: install git-lfs on buster

https://gerrit.wikimedia.org/r/505694

Change 505694 merged by Elukey:
[operations/puppet@production] profile::analytics::cluster::packages::statistics: install git-lfs on buster

https://gerrit.wikimedia.org/r/505694

(Detour)

@Nuria the other day I mentioned my project around use of DeepSpeech.

On my GTX 1080 at home, with a smallish set of about 900 WAV files of 5 seconds length or less the model training is 20+ times faster on GPU than without; according to nvidia-smi it was pegging out the RAM and cores on the GPU more or less.

There's some assembly required in that repo [1], probably too much to be practical in our environment. But if I get some spare time here or someone wanted to partner up, I'd be interested in digging into the feasibility of trying to make this run with ROCm components to see if the performance gains are comparable (or more exaggerated given the hardware) for the AMD card. It would be fun to run this on a bigger dataset (20 GB of MP3s) which has way more than 900 files. I was running this on spinning disk, although we can probably think of disk access as a constant, non-variable parameter.

I'm not saying we strictly need it, but if we could find time to pursue this line further would it be possible to get Docker [2] and git-lfs [3] installed on stat1005 if deemed beneficial / necessary?

Deployed git-lfs also on stat1005! I am personally in favor of supporting any testing work, but I have a couple of comments:

  • stat1005 is currently in a "testing" phase but eventually (I hope very soon) it should become a regular production host dedicated to the Research team (as it was originally meant to be when we started this work). So I am all for building/training/etc. models to see if ROCm works, but I also want to finally deliver this host to the Research team, since it was promised ages ago :)
  • Installing Docker would be rather problematic in my opinion from the security point of view. The SRE team is doing extremely complex work for Kubernetes to ensure that our internal Docker registry is kept as secure as possible, and for example random images from the Internet are not allowed IIUC. The stat1005 host is inside the Analytics VLAN, so "close" to our most precious private data, and more care on this front is needed.

To conclude: I am all for supporting testing (deploying packages, trying to build, etc.) but I would not pursue the Docker road (for the security concerns explained above), and at some point I'll need to move stat1005 to the Research team to unblock their future projects.

Last but not least: in T220698 I asked the SRE team to investigate whether the same GPU model that runs on stat1005 could be deployed on other stat/notebook hosts. Ideally, if we could get a couple more on other stat boxes it would ease testing for multiple teams/people. Our final dream is to run GPUs directly on Hadoop worker nodes, but there is a ton of work to do before even figuring out whether that is possible :)

Not sure if I have answered your questions; if not, please reach out to me, happy to help!

A reason why the SRE team is very strict about what Docker images are allowed: https://news.ycombinator.com/item?id=19763413

@elukey thanks for the follow up here. No need to block on me for the GPU. Fully agreed on the need for a secure supply chain.