
GPUs are not correctly handling multitasking
Closed, Resolved · Public

Description

GPUs can handle more than one process only when the processes are not computationally demanding. Some observations:

  • Completing 2 or more "light" processes in parallel takes less time than completing the same number of processes in sequence, as expected.
  • A computationally light process launched when the GPU is already busy, or in parallel to other light processes, is slower than the same process with 100% of the GPU available (also expected).
  • This changes when processes are computationally demanding. It appears that the GPU cannot limit the resources that each process can use, so each process takes all the GPU resources it needs.
  • As a result, parallel heavy processes might request more than 100% of the GPU's computational capacity. In these cases, the GPU saturates and stops working, all running processes stall, and even after killing them, the saturated GPU keeps running at 100% capacity.
  • In these cases, when trying to launch a new process requiring GPU capacity, it is aborted with error messages such as:
2020-03-25 18:22:22.830582: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15306 MB memory) -> physical GPU (device: 0, name: Vega 10 XT [Radeon PRO WX 9100], pci bus id: 0000:3d:00.0)
2020-03-25 18:22:25.161591: E tensorflow/stream_executor/rocm/rocm_driver.cc:615] failed to allocate 14.95G (16049923584 bytes) from device: hipError_t(1002)
[...]
Memory access fault by GPU node-2 (Agent handle: 0x10a305e0) on address 0x7f4d336f5000. Reason: Page not present or supervisor privilege.
Fatal Python error: Aborted

Rebooting the machine is the only way we have found so far to solve this GPU saturation issue.

Event Timeline

We are still not sure what the issue is, but we decided to upgrade stat1008 according to T247082 to the latest upstream ROCm version before contacting the devs.

So we did a few tests with the latest ROCm version.

  • When the GPU saturates, there is no need to reboot, as killing the stalled processes is enough for the GPU to release the resources. This is a big improvement compared to the previous version!
  • We found that the saturation is related to a VRAM usage problem.
  • We found a TensorFlow-native way to have a process allocate GPU memory dynamically. Added to every TensorFlow script, it allows multiple users to run TensorFlow scripts on the GPU at the same time (see the sketch after this list). More info here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU#Configure_your_Tensorflow_script
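
For reference, the configuration described on that wikitech page is a memory-growth setting. The snippet below is a minimal sketch of that idea, assuming TensorFlow 2.x and that it runs before any op initializes the GPU; it is not the exact text of the guideline:

import tensorflow as tf

# Ask TensorFlow to grow VRAM allocations on demand instead of
# reserving (almost) all GPU memory at process start.
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    try:
        tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        # Memory growth must be set before the GPU has been initialized.
        print(e)

With this in place, each user's process only holds the VRAM it actually uses, which is what allows several TensorFlow scripts to share the card.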

This gives us a more stable GPU configuration on stat1008, which can be shared by many users, provided that they follow the TensorFlow script configuration guidelines. Yay!

klausman subscribed.

The recent update of the GPU kernel-side drivers to the upstream rock-dkms package seems to have resolved this issue (parallel jobs now seem to work just fine).

Closing this and adjusting https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU#Outstanding_issues accordingly.

klausman closed this task as Resolved.