GPUs can handle more than one process only when the processes are not computationally demanding. Some observations:
- Completing 2 or more "light" processes in parallel takes less time than completing the same number of processes in sequence, as expected.
- A computationally light process launched when the GPU is already busy, or in parallel with other light processes, is slower than the same process running with 100% of the GPU available (also expected).
- This changes when processes are computationally demanding. The GPU does not appear to limit the resources each process can use; each process simply takes all the GPU resources it needs.
- As a result, parallel heavy processes may request more than 100% of the GPU's computational capacity. In these cases, the GPU saturates and stops working, all running processes stall, and even after killing them the saturated GPU keeps running at 100% capacity (a possible per-process mitigation is sketched at the end of this section).
- In these cases, any new process that requires GPU capacity is aborted with error messages such as:
2020-03-25 18:22:22.830582: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15306 MB memory) -> physical GPU (device: 0, name: Vega 10 XT [Radeon PRO WX 9100], pci bus id: 0000:3d:00.0)
2020-03-25 18:22:25.161591: E tensorflow/stream_executor/rocm/rocm_driver.cc:615] failed to allocate 14.95G (16049923584 bytes) from device: hipError_t(1002)
[...]
Memory access fault by GPU node-2 (Agent handle: 0x10a305e0) on address 0x7f4d336f5000. Reason: Page not present or supervisor privilege.
Fatal Python error: Aborted
Rebooting the machine is the only way we have found so far to recover from this GPU saturation.
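The failed 14.95 GB allocation in the log above is consistent with TensorFlow's default behavior of reserving nearly all device memory for each process. Below is a minimal sketch, untested on our ROCm setup, of how a single TensorFlow process can cap its own memory footprint so that several processes can share the card; the 4096 MB limit is an illustrative assumption, not a measured value, and this only addresses memory contention, not compute saturation.

```python
import tensorflow as tf

# Sketch: limit how much GPU memory this one process may claim.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    try:
        # Alternative A: allocate memory on demand instead of reserving it all upfront.
        # for gpu in gpus:
        #     tf.config.experimental.set_memory_growth(gpu, True)

        # Alternative B: hard-cap this process to a fixed slice of device memory
        # (4096 MB here is an arbitrary example value).
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)],
        )
    except RuntimeError as e:
        # GPU configuration must be set before TensorFlow initializes the device.
        print(e)
```

This must run before any operation touches the GPU; it does not prevent two heavy processes from oversubscribing the GPU's compute units, which is the saturation scenario described above.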