
GPUs are not correctly handling multitasking
Closed, Resolved · Public

Description

GPUs can handle more than one process only when the processes are not computationally demanding. Some observations:

  • Completing 2 or more "light" processes in parallel takes less time than completing the same number of processes in sequence, as expected.
  • A computationally light process launched when the GPU is already busy, or in parallel to other light processes, is slower than the same process with 100% of the GPU available (also expected).
  • This changes when processes are computationally demanding. It appears that the GPU cannot limit the resources that each process can use, so each process takes all the GPU resources it needs.
  • As a result, parallel heavy processes might request more than 100% of the GPU's computational capacity. In these cases, the GPU saturates and stops working, all running processes stall, and even after killing them, the saturated GPU keeps running at 100% capacity.
  • In these cases, when trying to launch a new process requiring GPU capacity, it is aborted with error messages such as:
2020-03-25 18:22:22.830582: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15306 MB memory) -> physical GPU (device: 0, name: Vega 10 XT [Radeon PRO WX 9100], pci bus id: 0000:3d:00.0)
2020-03-25 18:22:25.161591: E tensorflow/stream_executor/rocm/rocm_driver.cc:615] failed to allocate 14.95G (16049923584 bytes) from device: hipError_t(1002)
[...]
Memory access fault by GPU node-2 (Agent handle: 0x10a305e0) on address 0x7f4d336f5000. Reason: Page not present or supervisor privilege.
Fatal Python error: Aborted

Rebooting the machine is the only way we have found so far to solve this GPU saturation issue.

Event Timeline

We are still not sure what the issue is, but we decided to upgrade stat1008 according to T247082 to the latest upstream ROCm version before contacting the devs.

So we did a few tests with the latest ROCm version.

  • When the GPU saturates, there is no need to reboot, as killing the stalled processes is enough for the GPU to release the resources. This is a big improvement compared to the previous version!
  • We found that the saturation is related to a VRAM usage problem.
  • We found a TensorFlow-native way to have a process allocate GPU memory dynamically. Added to every TensorFlow script, it allows multiple users to run TensorFlow scripts on the GPU at the same time (see the sketch after this list). More info here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU#Configure_your_Tensorflow_script
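
For reference, the configuration described on that wikitech page is a memory-growth setting. The snippet below is a minimal sketch of that idea, assuming TensorFlow 2.x and that it runs before any op initializes the GPU; it is not the exact text of the guideline:

import tensorflow as tf

# Ask TensorFlow to grow VRAM allocations on demand instead of
# reserving (almost) all GPU memory at process start.
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    try:
        tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        # Memory growth must be set before the GPU has been initialized.
        print(e)

With this in place, each user's process only holds the VRAM it actually uses, which is what allows several TensorFlow scripts to share the card.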

This gives us a more stable GPU configuration on stat1008, which can be shared by many users, provided that they follow the TensorFlow script configuration guidelines. Yay!

klausman subscribed.

The recent update of the GPU kernel-side drivers to the upstream rock-dkms package seems to have resolved this issue (parallel jobs now seem to work just fine).

Closing this and adjusting https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU#Outstanding_issues accordingly.

klausman closed this task as Resolved.