
Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models
Closed, Resolved · Public

Description

From @ellery:

I have been repeatedly running into computational bottlenecks training machine learning models over the last few months. Almost all popular ML libraries offer GPU support, which can speed up model training by orders of magnitude. I was debating asking for a new personal machine with a powerful GPU, but I think it makes the most sense to install a new GPU on one of the stat machines. It's far cheaper, and we can all share the resource. I talked to Andrew and he thinks it's relatively easy to install. The current blocker is to get funding. The current top of the line Nvidia GPU is $1200.

The request has been reviewed by @Nuria and myself; it is approved on my end and can be covered by the Research budget.

@mark: any concern/additional questions from your team?

@ellery @Ottomata: can you guys 1) add the specs, 2) confirm with DC-Ops if the desired hardware physically fits in the box, 3) clarify on which stat machine you would like to have it installed.


Event Timeline


Quick update:

amdgpu.dc=0 (set to 1 by default on 4.17+) fixes the following errors:

[Tue Feb 19 08:35:06 2019] amdgpu: [powerplay] Failed to retrieve minimum clocks.
[Tue Feb 19 08:35:06 2019] amdgpu: [powerplay] Error in phm_get_clock_info
[Tue Feb 19 08:35:06 2019] [drm:dc_create [amdgpu]] *ERROR* DC: Number of connectors is zero!

amdgpu.dpm=1 leads to:

[Wed Feb 20 12:15:15 2019] [drm:amdgpu_vce_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 12 test failed
[Wed Feb 20 12:15:15 2019] [drm:amdgpu_device_init.cold.28 [amdgpu]] *ERROR* hw_init of IP block <vce_v2_0> failed -110
[Wed Feb 20 12:15:15 2019] amdgpu 0000:04:00.0: amdgpu_device_ip_init failed
[Wed Feb 20 12:15:15 2019] amdgpu 0000:04:00.0: Fatal error during GPU init

The following is still happening:

[Wed Feb 20 12:20:16 2019] amdgpu: [powerplay] failed to send message 282 ret is 254

Updates from https://github.com/RadeonOpenCompute/ROCm/issues/714#issuecomment-465666946 are not encouraging: gfx701 is a dead end, so we should buy a new card asap :(

Nuria renamed this task from GPU upgrade for stats machine to Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models.Mar 14 2019, 6:53 PM
elukey changed the task status from Open to Stalled.Mar 28 2019, 9:21 AM
elukey changed the status of subtask T216226: GPU upgrade for stat1005 from Open to Stalled.

All the info is tracked in T216226. We are going to buy an AMD Radeon Pro WX 9100 16GB. Setting this task to stalled pending procurement of the new hardware.

elukey changed the task status from Stalled to Open.Apr 2 2019, 5:15 PM
elukey closed subtask T216226: GPU upgrade for stat1005 as Resolved.

Vega GPU mounted on stat1005, it looks good from a first round of tests!

https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/389#issuecomment-479082013 says that tensorflow-rocm should be available for Python 3.7 soon on PyPI, which is more good news :)
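For reference, a first smoke test of the card from Python can be as small as this (a minimal sketch, assuming a virtualenv with tensorflow-rocm installed and the TF 1.x API):

# List the devices TensorFlow can see; a working ROCm setup should expose /device:GPU:0.
import tensorflow as tf
from tensorflow.python.client import device_lib

for device in device_lib.list_local_devices():
    print(device.name, device.device_type)

# Run a trivial op on the GPU to confirm it is actually usable.
with tf.device('/GPU:0'):
    product = tf.matmul(tf.ones([64, 64]), tf.ones([64, 64]))

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(product).sum())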

hacks abound, but basically:

  • Added deb [arch=amd64] http://repo.radeon.com/rocm/apt/debian/ xenial main to sources.list.d/rocm.list
  • upgraded all the rocm packages to latest version
  • Built a python 3.6 virtualenv on stat1007 including tensorflow-rocm and tensorflow_hub
  • Copied virtualenv to stat1005

This appears to successfully run a few different models I tried: the TF MNIST demo, ELMo from tf_hub, and Miriam's image quality model.

No data on performance yet; to measure that we need to set up a model we actually care about with a performant data input pipeline (TFRecords, most likely, as opposed to placeholders and feed_dicts).
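For reference, a TFRecord-based input pipeline in TF 1.x would look roughly like this (a sketch only; the file name and feature schema are made up):

# Sketch of a tf.data / TFRecord pipeline, as opposed to feeding batches through
# placeholders and feed_dict. 'train.tfrecord' and the feature keys are hypothetical.
import tensorflow as tf

def parse_example(serialized):
    features = tf.parse_single_example(serialized, {
        'image': tf.FixedLenFeature([784], tf.float32),
        'label': tf.FixedLenFeature([], tf.int64),
    })
    return features['image'], features['label']

dataset = (tf.data.TFRecordDataset(['train.tfrecord'])
           .map(parse_example, num_parallel_calls=4)
           .shuffle(buffer_size=10000)
           .batch(128)
           .prefetch(1))  # overlap input preparation with GPU compute

images, labels = dataset.make_one_shot_iterator().get_next()
# images/labels now feed the model graph directly, so the GPU is not starved
# waiting on Python-side feed_dict copies.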

I think we should move away from the hacks done so far and start adding the config we are using to puppet.

Things to do from the SRE side:

  • add a new component for Debian Buster for the ROCm packages, and see what to do with https://github.com/RadeonOpenCompute/ROCm#closed-source-components (it seems to be only one package; hopefully it is not strictly needed).
  • modify the gpu-testers group to drop the sudo rules and add more people to it (Miriam/Gilles are the first ones that come to mind). This way more people will have access to stat1005 and will be able to provide feedback.
  • see if it is possible to make an apt component like https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/480041/ for Debian Buster, to have Python 3.6 available (3.7 support should come soon, but not sure when).

About the closed source package:

elukey@stat1005:~$ apt-cache rdepends hsa-ext-rocr-dev
hsa-ext-rocr-dev
Reverse Depends:
  hsa-rocr-dev
  rocm-dev
  hcc

That should be straightforward, python3.6 was only recently removed from Debian testing and we should be able to simply fetch the packages from snapshot.debian.org:
https://packages.qa.debian.org/p/python3.6/news/20190330T191602Z.html

@EBernhardson I think that the most pressing point now is to decide/test if we need hsa-ext-rocr-dev (the only package containing binary only libs). I gathered some info about the package:

https://github.com/RadeonOpenCompute/ROCR-Runtime/issues/33#issuecomment-422174043
https://github.com/RadeonOpenCompute/ROCm/issues/267#issuecomment-422172140

IIUC this library is aimed at direct image support in OpenCL; not sure if this is mandatory for us or not. From what upstream says, the whole set of libs should work without any issue, so we could simply import all the packages from the ROCm repos into our reprepro and then add a dummy package for hsa-ext-rocr-dev to satisfy deb dependencies (see the rdepends output above).

Thanks @EBernhardson and all!! Would a CNN finetuning task, using only a few thousand images as input, work as a training task for testing performance?

> Thanks @EBernhardson and all!! Would a CNN finetuning task, using only a few thousand images as input, work as a training task for testing performance?

Something like that sounds great. Mainly I want to ensure that training is generating reasonable outputs; the only training test I've done so far is the mnist example, which is very much a toy problem. Images might also allow us to test whether the hsa-ext-rocr-dev package brought up by @elukey is going to be necessary, as it seems to be an image support lib of some sort.

OK, I can prepare a task for this, or we can start from something like this maybe?
https://gist.github.com/omoindrot/dedc857cdc0e680dfb1be99762990c9c/
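Roughly along these lines (a sketch only, assuming a directory of labelled images; the paths, base model and hyperparameters below are placeholders, not a worked-out proposal):

# Fine-tune a pretrained CNN with a new 2-class head on ~2000 images (TF 1.x keras API).
import tensorflow as tf

base = tf.keras.applications.ResNet50(include_top=False, weights='imagenet',
                                      input_shape=(224, 224, 3), pooling='avg')
base.trainable = False  # only train the new classification head at first

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(2, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# ~1k images per class under train/<class_name>/ (hypothetical layout).
train_gen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1. / 255)
train_data = train_gen.flow_from_directory('train/', target_size=(224, 224),
                                           batch_size=32, class_mode='sparse')

model.fit_generator(train_data, epochs=5)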

I wonder if we can use Fashion-MNIST to benchmark: https://research.zalando.com/welcome/mission/research-projects/fashion-mnist/ as there seem to be a bunch of people using it as a replacement for the original MNIST. Could we set up training here with/without GPU?
https://www.tensorflow.org/guide/using_gpu
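Something like the following could work for the with/without GPU comparison (a sketch; the tiny model and single epoch are only meant to give a relative timing, not a rigorous benchmark):

# Time one training epoch on Fashion-MNIST on CPU and on GPU (TF 1.x keras API).
import time
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.fashion_mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0

def train_on(device_name):
    with tf.device(device_name):
        model = tf.keras.Sequential([
            tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(10, activation='softmax'),
        ])
        model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
        start = time.time()
        model.fit(x_train, y_train, batch_size=128, epochs=1, verbose=0)
        return time.time() - start

print('CPU:', train_on('/CPU:0'), 'seconds')
print('GPU:', train_on('/GPU:0'), 'seconds')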

@EBernhardson on stat1005 mivisionx was causing a broken apt, so after reading https://github.com/RadeonOpenCompute/ROCm/issues/350#issuecomment-370100523 I have done the following:

  • removed mivisionx to fix apt
  • removed miopen-hip
  • installed mivisionx (that brings in miopen-opencl, the one conflicting with miopen-hip for header files)

Change 501156 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] admin: add gpu-users group and assign it to stat1005

https://gerrit.wikimedia.org/r/501156

With the package changes, trying to run any model now returns:

ebernhardson@stat1005:~$ /home/ebernhardson/tf_venv/bin/python /home/ebernhardson/mnist.py
ImportError: /opt/rocm/lib/libMIOpen.so.1: version `MIOPEN_HIP_1' not found (required by /home/ebernhardson/tf_venv/lib/python3.5/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so)

Not sure which packages we want exactly, I would mostly just be doing trial and error.

A solution could be to remove mivisionx (not sure if needed) and restore miopen-hip

Per https://github.com/RadeonOpenCompute/ROCm/issues/703#issuecomment-462598966, that means no: miopen-opencl functionality is not supported within TF.

Removed miopen-opencl and mivisionx, installed miopen-hip and training looks to be working now.

For benchmarking I poked around some more and found https://github.com/lambdal/lambda-tensorflow-benchmark (related: https://lambdalabs.com/blog/best-gpu-tensorflow-2080-ti-vs-v100-vs-titan-v-vs-1080-ti-benchmark/). I've started up that set of benchmarks and will report back with results. These are all image tasks afaict, will basically compare loss and images/sec to get an idea about how the card is working out.

I've noticed while doing this that we might want to set up some new prometheus metrics to collect. In particular, /opt/rocm/bin/rocm-smi reports GPU temp, power usage, fan % and GPU usage %. These all might be useful to record in prometheus.
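As an illustration, a node-local exporter could be as small as this (a hacky sketch: it shells out to rocm-smi and scrapes the human-readable output, whose column layout changes between ROCm versions, and the port number is arbitrary):

# Expose rocm-smi temperature / power / fan / utilisation as Prometheus gauges.
import re
import subprocess
import time

from prometheus_client import Gauge, start_http_server

GPU_TEMP = Gauge('amd_gpu_temperature_celsius', 'GPU temperature')
GPU_POWER = Gauge('amd_gpu_power_watts', 'GPU average power draw')
GPU_FAN = Gauge('amd_gpu_fan_percent', 'GPU fan speed')
GPU_USE = Gauge('amd_gpu_utilization_percent', 'GPU utilisation')

def scrape():
    out = subprocess.check_output(['/opt/rocm/bin/rocm-smi']).decode()
    for line in out.splitlines():
        if re.match(r'\s*\d+\s', line):           # data rows start with the GPU index
            temp = re.search(r'([\d.]+)c', line)
            power = re.search(r'([\d.]+)W', line)
            percents = re.findall(r'([\d.]+)%', line)
            if temp:
                GPU_TEMP.set(float(temp.group(1)))
            if power:
                GPU_POWER.set(float(power.group(1)))
            if percents:
                GPU_FAN.set(float(percents[0]))   # first % column is the fan
                GPU_USE.set(float(percents[-1]))  # last % column is GPU%

if __name__ == '__main__':
    start_http_server(9101)  # arbitrary port for this sketch
    while True:
        scrape()
        time.sleep(30)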

Synthetic benchmarks of the runtime performance of CNN training, in images/sec, comparing CPU and WX9100. This essentially confirms what we already know: even a GPU that is not top of the line is an order of magnitude faster than training on CPU. Distributed training isn't a linear speedup, so it would likely take a significant portion of the hadoop cluster to achieve the same runtime performance as a single GPU. It's good to get verification that the GPU is mostly working in this configuration. Note also that the current case can only fit a single GPU, but ideally future hardware would be purchased with the ability to fit at least 2, or possibly 4, cards in a single server.

Comparing these numbers to nvidia cards, the WX9100 seems to be around 40-80% the speed of an nvidia 1080ti, depending on which network is being trained. The 1080ti is about 50% of nvidia's top of the line datacenter card.

Config       E5-2640-CPU   E5-2640-WX9100   Speedup
resnet50     4.14          137.60           33.2x
resnet152    1.80          46.12            25.6x
inception3   4.07          68.13            16.7x
inception4   2.11          29.23            13.8x
vgg16        2.59          54.12            20.9x
alexnet      40.05         964.72           24.1x
ssd300       1.96          45.81            23.4x

Comparing loss (this isn't the loss on a test set; there is no test set afaict, these are synthetic-data benchmarks), things are roughly similar. The loss is simply the mean loss of the last 10 batches of a 100 mini-batch training run. Not sure why three of the CPU benchmarks have nan loss; they were nan from beginning to end. The other numbers are reasonable enough that it doesn't seem worthwhile to dig into the nan CPU results.

Config       E5-2640-CPU   E5-2640-WX9100
resnet50     8.23          8.102
resnet152    9.93          10.132
inception3   7.437         7.409
inception4   7.902         7.695
vgg16        nan?          7.250
alexnet      nan?          7.200
ssd300       nan?          686.462

@EBernhardson question for you - while working on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/501156/ (to allow Miriam and Gilles to ssh to stat1005) I wondered whether the gpu-testers group (with full root perms) is still needed or not. I'd like to start adding a bit more puppet automation to the host, to move away from the "hacky/testing" environment and turn it into more of a production service. Does that make sense?

Doesn't seem to be needed anymore, feel free to start moving this to a more production configuration.

Change 501156 abandoned by Elukey:
admin: add gpu-users group and assign it to stat1005

Reason:
Had a chat with Erik, going to modify gpu-testers directly

https://gerrit.wikimedia.org/r/501156

Change 501575 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] admin: remove sudo permissions from gpu-testers and add users to it

https://gerrit.wikimedia.org/r/501575

Change 501580 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::statistics::gpu: add common statistics packages

https://gerrit.wikimedia.org/r/501580

Change 501580 merged by Elukey:
[operations/puppet@production] role::statistics::gpu: add common statistics packages

https://gerrit.wikimedia.org/r/501580

Change 501589 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::cluster::packages::statistics: add myspell guards

https://gerrit.wikimedia.org/r/501589

Change 501589 merged by Elukey:
[operations/puppet@production] profile::analytics::cluster::packages::statistics: add myspell guards

https://gerrit.wikimedia.org/r/501589

Change 501600 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::an::cluster::pkgs::statistics: add better handling of myspell pkgs

https://gerrit.wikimedia.org/r/501600

Change 501600 merged by Elukey:
[operations/puppet@production] profile::an::cluster::pkgs::statistics: add better handling of myspell pkgs

https://gerrit.wikimedia.org/r/501600

Change 501608 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] ores::base: fix package requires for Debian Buster

https://gerrit.wikimedia.org/r/501608

Change 501621 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Fix more common packages deployed to Buster based Analytics nodes

https://gerrit.wikimedia.org/r/501621

Change 501621 merged by Elukey:
[operations/puppet@production] Fix more common packages deployed to Buster based Analytics nodes

https://gerrit.wikimedia.org/r/501621

Change 501632 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Fix more common packages for Analytics hosts for Debian Buster

https://gerrit.wikimedia.org/r/501632

Change 501632 merged by Elukey:
[operations/puppet@production] Fix more common packages for Analytics hosts for Debian Buster

https://gerrit.wikimedia.org/r/501632

Change 501635 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Fix last common packages for Analytics hosts for Debian Buster

https://gerrit.wikimedia.org/r/501635

Change 501635 merged by Elukey:
[operations/puppet@production] Fix last common packages for Analytics hosts for Debian Buster

https://gerrit.wikimedia.org/r/501635

The long list of patches above was needed to allow deploying the common set of packages that all stat/notebook boxes have (so excluding the Hadoop client stuff). Created a code change to reduce permissions for gpu-testers and add Miriam/Gilles to it in https://gerrit.wikimedia.org/r/501575.

The https://rocm.github.io/ROCmInstall.html guide lists, among the things to do, the following:

  • echo 'SUBSYSTEM=="kfd", KERNEL=="kfd", TAG+="uaccess", GROUP="video"' | sudo tee /etc/udev/rules.d/70-kfd.rules
  • add the video group to adduser.conf (to automatically add new non system users to the video group when they are created)

I originally added only the former, and now I am wondering what the best way forward for production is. One thing I tried was replacing the video group with wikidev in the udev rule, but after reload/trigger a simple example of GPU usage doesn't seem to work. The /dev/kfd device is correctly getting the right permissions, but I noticed that the frame buffer /dev/fd0 is set to root:video (and 660), so probably both need to be changed?

The solution suggested by upstream, namely using adduser.conf, could be an option as well to avoid any issues in the future, but I am not sure what's best.

Change 501575 merged by Elukey:
[operations/puppet@production] admin: remove sudo permissions from gpu-testers and add users to it

https://gerrit.wikimedia.org/r/501575

Change 501608 merged by Elukey:
[operations/puppet@production] ores::base: fix package requires for Debian Buster

https://gerrit.wikimedia.org/r/501608

Change 502233 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Rely only on ores::base for common packages deployed to Analytics misc

https://gerrit.wikimedia.org/r/502233

Change 502341 had a related patch set uploaded (by Dr0ptp4kt; owner: Dr0ptp4kt):
[operations/puppet@production] Add dr0ptp4kt to gpu-testers

https://gerrit.wikimedia.org/r/502341

Hi, I'm requesting access to gpu-testers as well in order to begin validating model building.

> The https://rocm.github.io/ROCmInstall.html guide lists, among the things to do, the following:
>
>   • echo 'SUBSYSTEM=="kfd", KERNEL=="kfd", TAG+="uaccess", GROUP="video"' | sudo tee /etc/udev/rules.d/70-kfd.rules

The uaccess rule doesn't really matter to us; it grants access to locally logged-in users. That is useful for e.g. a desktop system, so that the logged-in user can access the DRI devices of the GPU to play games.

>   • add the video group to adduser.conf (to automatically add new non system users to the video group when they are created)

We already maintain group information in puppet; how about we simply use "gpu-testers" for the initial tests, and mid term, when the GPUs are added to Hadoop nodes, we can pick the most appropriate analytics group (or add a new one if only a subset of Hadoop users should be able to access them)?

After some tests with Moritz we did the following:

  • installed the latest version of systemd on stat1005, in which the video group is basically replaced by the render one. We decided to keep the udev rules as standard as possible, adding some puppet glue later on to put users like gpu-testers into render automatically. For the moment I will do it manually when a new user is added.
  • tried to remove the hsa-ext-rocr-dev package, but it is tightly coupled with a lot of other packages (as its rdepends suggests). I asked upstream via https://github.com/RadeonOpenCompute/ROCR-Runtime/issues/33#issuecomment-481219299 what the best practice is for open-source-only installs. I downloaded hsa-ext-rocr-dev and didn't find any license attached to the binary libs (the control file doesn't say anything either).
  • after re-reading https://github.com/RadeonOpenCompute/ROCm/issues/267#issuecomment-422172140 I noticed that hsa-ext-rocr-dev impacts image processing for OpenCL, and IIUC @Gilles might be affected in his tests.

@EBernhardson FYI I had to reinstall some packages due to the above tests, if anything is weird/broken let me know!

Ok I found a simple and hacky way to test the removal of hsa-ext-rocr-dev:

elukey@stat1005:~$ dpkg -L hsa-ext-rocr-dev
/opt
/opt/rocm
/opt/rocm/hsa
/opt/rocm/hsa/lib
/opt/rocm/hsa/lib/libhsa-ext-image64.so.1.1.9
/opt/rocm/hsa/lib/libhsa-runtime-tools64.so.1.1.9
/opt/rocm/hsa/lib/libhsa-ext-image64.so.1
/opt/rocm/hsa/lib/libhsa-runtime-tools64.so.1

sudo rm /opt/rocm/hsa/lib/libhsa-ext-image64.so.1.1.9 /opt/rocm/hsa/lib/libhsa-runtime-tools64.so.1.1.9 /opt/rocm/hsa/lib/libhsa-ext-image64.so.1 /opt/rocm/hsa/lib/libhsa-runtime-tools64.so.1

Confirmed with:

elukey@stat1005:~$ /opt/rocm/opencl/bin/x86_64/clinfo  | grep -i image
  Image support:				 No
  Max read/write image args:			 0
  Extensions:					 cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program
elukey@stat1005:~$ sudo rm /opt/rocm/hsa/lib/libhsa-ext-image64.so.1.1.9 /opt/rocm/hsa/lib/libhsa-runtime-tools64.so.1.1.9 /opt/rocm/hsa/lib/libhsa-ext-image64.so.1 /opt/rocm/hsa/lib/libhsa-runtime-tools64.so.1

Let's see what breaks! :)
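The same check can also be scripted (a sketch; pyopencl is not part of the current setup, so this assumes installing it in a virtualenv just for the test):

# Print whether each OpenCL device reports image support, same as the clinfo grep above.
import pyopencl as cl

for platform in cl.get_platforms():
    for device in platform.get_devices():
        print(platform.name, '/', device.name,
              '-> image support:', bool(device.image_support))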

@Gilles really curious to know if you'll have issues with image processing with OpenCL!

Change 502233 merged by Elukey:
[operations/puppet@production] Rely only on ores::base for common packages deployed to Analytics misc

https://gerrit.wikimedia.org/r/502233

Change 502341 merged by Elukey:
[operations/puppet@production] Add dr0ptp4kt to gpu-testers

https://gerrit.wikimedia.org/r/502341

> Hi, I'm requesting access to gpu-testers as well in order to begin validating model building.

Just added you to stat1005! Please keep in mind that we are still in a testing phase, things are not yet fully productionized so you'll likely encounter some issues while running your tests. Please report back on the task so we can try to fix them as soon as possible :)

Hi all,

I quickly tested a simple training task on stat1005: fine-tuning a network to categorize images into 2 categories, using 2000 images, 1k per class.
After a few issues with missing libraries, promptly fixed by @elukey, it worked very well.
The training went smoothly, and TensorFlow was actually using the GPU, although not extensively (0-20%), as the task was not complex:
(from Luca)

ROCm System Management Interface

GPU  Temp   AvgPwr  SCLK    MCLK    PCLK          Fan    Perf  PwrCap  SCLK OD  MCLK OD  GPU%
1    24.0c  14.0W   852Mhz  167Mhz  8.0GT/s, x16  14.9%  auto  170.0W  0%       0%       9%

End of ROCm SMI Log

Results are comparable with models trained on CPUs.
Next, I would like to test a more complex task, and measure how much we gain in performance between GPU and CPU.

So excited, thanks all for this amazing effort!

> Next, I would like to test a more complex task, and measure how much we gain in performance between GPU and CPU.

+1

Opened https://github.com/RadeonOpenCompute/ROCm/issues/761 upstream to see if they can remove the explicit dependencies of the Debian packages on hsa-ext-rocr-dev (the only remaining closed-source package).

Had an interesting chat with Gilles today about his use case. Thumbor is able to offload some functionality, like smart cropping, to the GPU via OpenCL, so the plan would be to install Thumbor on stat1005 and see if/how the AMD GPU behaves. This would be a very good test to see whether hsa-ext-rocr-dev impacts the GPU work or not.

> so the plan would be to install Thumbor on stat1005

How would we compare the run on 1005 with other thumbor runs @Gilles ?

I think the idea would be to run it with/without the GPU active and see the differences in performance (IIUC). Ideally, one possible outcome could be to see if Thumbor could leverage GPU acceleration on its hosts.

(Detour)

@Nuria the other day I mentioned my project around use of DeepSpeech.

On my GTX 1080 at home, with a smallish set of about 900 WAV files of 5 seconds length or less the model training is 20+ times faster on GPU than without; according to nvidia-smi it was pegging out the RAM and cores on the GPU more or less.

There's some assembly required in that repo [1], probably too much to be practical in our environment. But if I get some spare time here or someone wanted to partner up, I'd be interested in digging into the feasibility of trying to make this run with ROCm components to see if the performance gains are comparable (or more exaggerated given the hardware) for the AMD card. It would be fun to run this on a bigger dataset (20 GB of MP3s) which has way more than 900 files. I was running this on spinning disk, although we can probably think of disk access as a constant, non-variable parameter.

I'm not saying we strictly need it, but if we could find time to pursue this line further would it be possible to get Docker [2] and git-lfs [3] installed on stat1005 if deemed beneficial / necessary?

[1] If you want to use the Docker container expressed in the repo, the problems are amplified. The Dockerfile is running older dependencies, yet they work with tensorflow-gpu if you launch with the Nvidia --runtime inside of a bare-metal Ubuntu 18.04 LTS environment; it's just that it calls out to external servers for the image pre-build, plus you need git-lfs to get datasets post-build (or you need to pull those in pre-instantiation). Without Docker, it also works, but at least in a pristine WSL Ubuntu 16.04 case some binaries, although they can be compiled from open source, weren't readily available from Canonical APT (I can see potential ways to work around that, but don't want to promise anything and haven't walked the make scripts nor surveyed all of the policy-permitted APT packages).

[2] For better isolation of (non-GPU) components and general ease of setup.

[3] I saw Aaron's ticket about git-lfs on the 1006/1007 boxes. In this GPU testing case on stat1005, git-lfs would be nice, but I think SCP can be made to work, too. Ultimately the dataset and package linkages need to be done inside the cluster, not directly via the internet.

Change 505694 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::cluster::packages::statistics: install git-lfs on buster

https://gerrit.wikimedia.org/r/505694

Change 505694 merged by Elukey:
[operations/puppet@production] profile::analytics::cluster::packages::statistics: install git-lfs on buster

https://gerrit.wikimedia.org/r/505694

> (Detour)
>
> @Nuria the other day I mentioned my project around use of DeepSpeech.
>
> On my GTX 1080 at home, with a smallish set of about 900 WAV files of 5 seconds length or less the model training is 20+ times faster on GPU than without; according to nvidia-smi it was pegging out the RAM and cores on the GPU more or less.
>
> There's some assembly required in that repo [1], probably too much to be practical in our environment. But if I get some spare time here or someone wanted to partner up, I'd be interested in digging into the feasibility of trying to make this run with ROCm components to see if the performance gains are comparable (or more exaggerated given the hardware) for the AMD card. It would be fun to run this on a bigger dataset (20 GB of MP3s) which has way more than 900 files. I was running this on spinning disk, although we can probably think of disk access as a constant, non-variable parameter.
>
> I'm not saying we strictly need it, but if we could find time to pursue this line further would it be possible to get Docker [2] and git-lfs [3] installed on stat1005 if deemed beneficial / necessary?

Deployed git-lfs also on stat1005! I am personally in favor of supporting any testing work, but I have a couple of comments:

  • stat1005 is currently in a "testing" phase, but eventually (I hope very soon) it should become a regular production host dedicated to the Research team (as it was originally meant to be when we started this work). So I am all for building/training/etc. models to see if ROCm works, but I also want to finally deliver this host to the Research team, since it was promised ages ago :)
  • Installing Docker would be rather problematic, in my opinion, from the security point of view. The SRE team is doing extremely complex work for Kubernetes to ensure that our internal Docker registry is kept as secure as possible; for example, random images from the Internet are not allowed IIUC. The stat1005 host is inside the Analytics VLAN, so "close" to our most precious private data, and more care on this front is needed.

To conclude: I am all for supporting testing (deploying packages, trying to build, etc.), but I would not pursue the Docker road (for the security concerns explained above), and at some point I'll need to move stat1005 to the Research team to unblock their future projects.

Last but not least: in T220698 I asked the SRE team to investigate whether the same model of GPU that runs on stat1005 could be deployed on other stat/notebook hosts. Ideally, if we could get a couple more on other stat boxes, it would ease the testing for multiple teams/people. Our final dream is to run GPUs directly on Hadoop worker nodes, but there is a ton of work to do before we can even figure out whether that is possible.

Not sure if I have answered your questions; if not, please reach out to me, happy to help!

A reason why the SRE team is very strict in what Docker images are allowed: https://news.ycombinator.com/item?id=19763413

@elukey thanks for the follow up here. No need to block on me for the GPU. Fully agreed on the need for a secure supply chain.

Recap of what has been done so far in various (sub)tasks:

The goal is to finalize the tests with stat1005 (especially the ones for Thumbor/OpenCL) and eventually add stat1005 back into the pool of statistics nodes available to use.

Reporting some info from https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/559:

  • it seems that there is no guarantee of ABI compatibility between tensorflow-rocm versions and ROCm versions. For example, in order to use tf-rocm 1.13.4+ we'll need ROCm 2.6. We are currently using ROCm 2.5, since 2.6 made the GPU stop working.
  • upstream asked for more info about ROCm 2.6 in our environment, so I had a chat with @Miriam and she will restart testing next week. This will allow me to upgrade to ROCm 2.6 again and see if anything can be reported upstream. Ideally, within a few days either 2.6 will be working or upstream will find a patch for a newer version that fixes our use case.

If you need to use the gpu on stat1005 urgently please let me know.
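For whoever tests after an upgrade, a quick sanity check like this (a sketch using the TF 1.x API) can catch a ROCm / tensorflow-rocm mismatch before any real debugging starts:

# Verify the installed tensorflow-rocm version and that the GPU is visible at all.
import tensorflow as tf

print('tensorflow-rocm version:', tf.__version__)

# Returns False when the wheel and the ROCm runtime do not agree (or the GPU is unusable).
if not tf.test.is_gpu_available():
    raise SystemExit('No GPU visible to TensorFlow: check the ROCm / tensorflow-rocm pairing.')
print('GPU visible, all good.')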

Change 524095 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::statistics::gpu: upgrade to ROCm 2.6

https://gerrit.wikimedia.org/r/524095

Change 524095 merged by Elukey:
[operations/puppet@production] profile::statistics::gpu: upgrade to ROCm 2.6

https://gerrit.wikimedia.org/r/524095

Restarted from a clean state as indicated by upstream, and tensorflow-rocm 1.14.0 on ROCm 2.6 seems to work with basic tests now. I am a bit confused, but... better than hours of debugging :)

I'll work with Miriam (and whoever is interested) to test stat1005 a bit more; then the plan is to apply a puppet role to it and allow everybody to use the GPU.

Summary of the current state and results achieved:

  • We added the puppet automation to import the AMD ROCm drivers and packages, so that any host with a GPU can be configured correctly. At the moment only stat1005 has a GPU, but we hope for more in the future.
  • Miriam tested TensorFlow with basic examples and it seems to work fine, but more accurate tests are coming (see subtasks).
  • On stat1005, every user who wants to use the GPU needs to be added to the render POSIX group. As a temporary measure, all the users in gpu-testers (a group defined in puppet that people can request access to) are added to render by default. Eventually every user in the analytics groups will be added to the render group transparently.
  • We are testing Spark 2 for Debian Buster in https://phabricator.wikimedia.org/T229347, since stat1005 needs an up-to-date OS and kernel (the other analytics hosts are still running Debian Stretch). This is the last step to complete before adding stat1005 back for analytics users as a Hadoop client and GPU-powered node. We should be able to do it before the end of the current quarter.
  • All the info is documented in https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU

> We are testing Spark 2 for Debian Buster in https://phabricator.wikimedia.org/T229347, since stat1005 needs an up-to-date OS and kernel (the other analytics hosts are still running Debian Stretch). This is the last step to complete before adding stat1005 back for analytics users as a Hadoop client and GPU-powered node. We should be able to do it before the end of the current quarter.

There are some issues with Debian Buster and Java 11 we need to investigate, so I think we can't commit to finishing this by end of quarter now. :(

Upgraded stat1005 to ROCm 2.7.1; from my tests everything looks good. Please use tensorflow-rocm 1.14.1, otherwise your scripts will fail!

Change 539022 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::statistics::explorer::gpu: allow analytics users to log in

https://gerrit.wikimedia.org/r/539022

Change 539022 merged by Elukey:
[operations/puppet@production] role::statistics::explorer::gpu: allow analytics users to log in

https://gerrit.wikimedia.org/r/539022

Given the fact that the GPU on stat1005 seems to work and we have documentation in place explaining how to use it, I'd be inclined to finally close this task :)