
Investigate procuring and installing GPUs on Lift Wing
Closed, ResolvedPublic

Description

What models do we need? What servers will be able to host them?

Event Timeline

Some ideas/constraints, etc.:

  1. We could get a new GPU that is 1:1 identical (in height/width/etc.) to the ones deployed on the Hadoop nodes, with similar power requirements. DE is trying to move the Hadoop ones to DSE in T318696.
  2. We need to be on K8s 1.23 before being able to expose a GPU to a pod, since https://github.com/RadeonOpenCompute/k8s-device-plugin supports only k8s 1.18+ (and we are on 1.16 now). See the sketch after this list for what such a pod spec looks like.
  3. If we get a new GPU, we need to make sure the chosen one is compatible with the AMD ROCm drivers (checking the website/docs/etc.). The ones offered by Dell are convenient, but at the time of checking they were old and not supported by ROCm.
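
As a concrete illustration of point 2: once the ROCm device plugin is running on a worker, a pod requests the GPU as an extended resource. A minimal sketch, assuming the amd.com/gpu resource name that the plugin advertises (pod and image names are just placeholders):

```yaml
# Minimal sketch: a pod requesting one AMD GPU via the ROCm k8s-device-plugin.
# The plugin advertises each GPU as the extended resource "amd.com/gpu".
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                # hypothetical name
spec:
  containers:
    - name: rocm-test
      image: rocm/pytorch       # example ROCm-enabled image
      resources:
        limits:
          amd.com/gpu: 1        # a whole GPU; this plugin offers no fractional sharing
```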

Super excited by this given that Research has been exploring more advanced transformer models that strongly benefit from GPUs, not just at training but at prediction time as well.

Maybe a very naive question, but how should I expect a GPU to work on Lift Wing? Would it only be available to one model at a time, or is there a way to share it effectively across models? I guess what I'm asking is: should I think of it as a really awesome single CPU core, a new awesome server with many cores, or something in between?

> Super excited by this given that Research has been exploring more advanced transformer models that strongly benefit from GPUs, not just at training but at prediction time as well.
>
> Maybe a very naive question, but how should I expect a GPU to work on Lift Wing? Would it only be available to one model at a time, or is there a way to share it effectively across models? I guess what I'm asking is: should I think of it as a really awesome single CPU core, a new awesome server with many cores, or something in between?

Totally the opposite: your question is valid and really interesting. I started looking into options when you mentioned it during the ML/Research meeting, and so far I have found that the Nvidia K8s plugin offers a feature called MIG (Multi-Instance GPU):

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/gpu-sharing.html

I still haven't figured out whether the AMD ROCm equivalent (https://github.com/RadeonOpenCompute/k8s-device-plugin) supports the same, but I'd guess they will need to add this feature since it is critical for k8s workloads. We'll do more research and experiments on this side; hopefully we'll be able to find a good compromise between GPU shareability and open-source drivers/tools :)
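
For reference, with MIG the card is partitioned into fixed-size instances that pods request as distinct extended resources. A minimal sketch, assuming the "mixed" strategy and the 1g.5gb A100 profile from Nvidia's docs (names are placeholders):

```yaml
# Sketch: a pod requesting a single MIG slice instead of a whole GPU.
# "1g.5gb" is one of the A100 MIG profiles (1 compute slice, 5 GB memory).
apiVersion: v1
kind: Pod
metadata:
  name: mig-example             # hypothetical name
spec:
  containers:
    - name: inference
      image: nvidia/cuda:11.8.0-base-ubuntu22.04
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1
```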

We didn't say it explicitly, but we encourage you and everybody interested to add links and suggestions to this task! The more info we have, the better!

elukey renamed this task from get a GPU on Lift Wing to Get a GPU on Lift Wing.Jan 31 2023, 7:55 AM

I'm trying to find out whether KServe supports sharing a GPU among pods/model servers (nothing conclusive so far).
What seems promising on this topic is the ModelMesh architecture, where multiple models share the same server. However, it is still in alpha, so I wouldn't count on it for the time being.

I had a chat with SRE today, and they pointed me to https://cloud.google.com/kubernetes-engine/docs/how-to/gpus-multi, so Nvidia's MIG technology is definitely able to share a single GPU across containers.

calbon renamed this task from Get a GPU on Lift Wing to Investigate procuring and installing two GPUs on Lift Wing.Feb 7 2023, 3:30 PM

Status: waiting for the GPUs to be moved from the Hadoop cluster to the DSE cluster, and seeing whether we can experiment on them.

> I had a chat with SRE today, and they pointed me to https://cloud.google.com/kubernetes-engine/docs/how-to/gpus-multi, so Nvidia's MIG technology is definitely able to share a single GPU across containers.

Super exciting -- that definitely would make these GPUs much more useful!

> We didn't say it explicitly, but we encourage you and everybody interested to add links and suggestions to this task! The more info we have, the better!

Awesome -- I'll raise this with the team too

I can report that we have the first two GPUs on the Lift Wing / DSE cluster.

We moved two cards from nodes in the Hadoop cluster to dse-k8s-worker1001, so we now have one K8S node with two GPUs.

[Attached photo: gpu-3.JPG (1×768 px, 159 KB)]

The plan is to move another two tomorrow, for a total of four at the moment.
We will leave the remaining two in the Hadoop cluster, unless the consensus is that they would be better elsewhere.

One way we intend to make these available is via Spark, e.g. https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#requesting-gpu-resources

However, I'm sure that there are other methods too, so hopefully we'll be able to work together on these.
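
To illustrate the Spark route, a minimal sketch of a SparkApplication requesting a GPU for its executors, following the user guide linked above (the application name, image, and job file are hypothetical):

```yaml
# Sketch: spark-on-k8s-operator job whose executors each request one AMD GPU.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: gpu-example                        # hypothetical name
spec:
  type: Python
  mode: cluster
  image: example/spark-rocm:latest         # hypothetical ROCm-enabled Spark image
  mainApplicationFile: local:///app/job.py # hypothetical job
  sparkVersion: "3.1.1"
  driver:
    cores: 1
    memory: 2g
  executor:
    instances: 1
    cores: 2
    memory: 4g
    gpu:
      name: amd.com/gpu                    # resource name advertised by the ROCm plugin
      quantity: 1
```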

Super interesting article found by Ilias:

https://journal.arrikto.com/gpu-virtualization-in-k8s-challenges-and-state-of-the-art-a1cafbcdd12b

I'm not sure whether https://docs.nvidia.com/datacenter/cloud-native/kubernetes/mig-k8s.html was taken into account (it seems newer, though; I'm not sure anybody has tested it with K8s).

Very interesting to see that GPU sharing in GKE happens only on a time basis (and with limitations): https://cloud.google.com/kubernetes-engine/docs/concepts/timesharing-gpus

elukey moved this task from Blocked to New Projects to review on the Machine-Learning-Team board.
elukey renamed this task from Investigate procuring and installing two GPUs on Lift Wing to Investigate procuring and installing GPUs on Lift Wing.Apr 20 2023, 1:22 PM
elukey updated the task description. (Show Details)

Some info gathered while reading docs :)

We have essentially two options:

  • AMD (GPUs compatible with ROCm)
  • Nvidia

As came up in T333009, the Nvidia plugin seems superior: it offers a way to share a single GPU between multiple pods. But the technology seems to be at a really early stage, as well summarized in:

https://towardsdatascience.com/how-to-increase-gpu-utilization-in-kubernetes-with-nvidia-mps-e680d20c3181

A few highlights:

  • About time-sharing (see the config sketch after these highlights):

However, constant switching among processes creates a computation time overhead. Also, time-slicing does not provide any level of memory isolation among the processes sharing a GPU, nor any memory allocation limits, which can lead to frequent Out-Of-Memory (OOM) errors.

  • About MIG (up to 7 virtual GPUs on top of a physical one, only for very expensive cards):

MIG is the GPU sharing approach that offers the highest level of isolation among processes. However, it lacks flexibility and it is compatible only with few GPU architectures (Ampere and Hopper).

  • About MPS (Nvidia's Multi-Process Service):

In MPS, however, client processes are not fully isolated from each other. Indeed, even though MPS allows to limit clients’ compute and memory resources, it does not provide error isolation and memory protection. This means that a client process can crash and cause the entire GPU to reset, impacting all other processes running on the GPU.
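
Back to the time-sharing highlight above: time-slicing in the Nvidia device plugin is enabled through a plugin config that makes one physical GPU appear as N schedulable replicas. A minimal sketch based on the plugin's documented config format (the replica count is arbitrary):

```yaml
# Sketch: NVIDIA k8s-device-plugin time-slicing config.
# One physical GPU is advertised as 4 schedulable replicas; note there is
# no memory isolation between the pods that end up sharing it.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
```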

The Nvidia plugin's drawbacks are also related to maintainability:

  • https://github.com/NVIDIA/k8s-device-plugin#prerequisites - there are some binary-only/proprietary packages that Nvidia wants deployed to enable the plugin, which probably doesn't follow our open-source policy (and also diverges a lot from our standards; for example, we are going to migrate away from Docker soon).
  • All the CUDA-related libraries and drivers are binary-only, with non-open-source licenses.

The AMD k8s plugin seems to work, but it doesn't offer any kind of GPU sharing yet (it may in the future, but I couldn't find any trace of it). On the plus side, the whole AMD stack is open source and compatible with our policies.

I found some cards to evaluate, a good compromise (in my opinion, please chime in if you have more options!) between price and performance:

The price ranges vary a lot, but these are the best options in my opinion. One important consideration: the above cards are PCIe 4.0, and we have only 3.0 slots on our ml-serve nodes, so bandwidth will be limited.

Some news on the AMD front: we successfully tested GPUs on K8s in T333009 (DSE cluster), and the KServe upstream folks suggested using inference batching to improve the throughput of GPU pods (we'll test it in T335480) as an alternative way to overcome the lack of a GPU-sharing solution on AMD.
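
A minimal sketch of what that batching experiment could look like, assuming KServe's documented batcher options (maxBatchSize/maxLatency); the service name and storage URI are hypothetical:

```yaml
# Sketch: enabling KServe's inference batcher so requests are grouped
# before hitting the GPU, trading a little latency for throughput.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: gpu-isvc-example                    # hypothetical name
spec:
  predictor:
    batcher:
      maxBatchSize: 32                      # batch up to 32 requests together
      maxLatency: 500                       # ...or flush after 500 ms
    model:
      modelFormat:
        name: pytorch
      storageUri: s3://example-bucket/model # hypothetical location
      resources:
        limits:
          amd.com/gpu: 1
```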

Found some interesting links about the work that Intel and AMD are doing to support concurrent access to the GPUs:

There seems to be no clear support in the kernel so far, and the work started a couple of years ago, so it doesn't look good for us; then again, the time taken to get approvals from the kernel community is sometimes long. It is good to know that the problem is not only AMD's but also Intel's, and that they are thinking of resolving it directly in the kernel, not with closed-source libraries/drivers like Nvidia.

> I found some cards to evaluate, a good compromise (in my opinion, please chime in if you have more options!) between price and performance:
>
> The price ranges vary a lot, but these are the best options in my opinion. One important consideration: the above cards are PCIe 4.0, and we have only 3.0 slots on our ml-serve nodes, so bandwidth will be limited.

More thoughts on the AMD front. The ROCm docs (like https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.4.3/page/Prerequisites.html) mention two architectures: CDNA and RDNA. The former is specialized for data centers and ML work, while the latter is specialized for gaming. IIUC the Radeon Pro W6800 that I mentioned above falls under the RDNA umbrella, which would make sense since it offers things like multiple ports for external monitors that we don't really need. Within the CDNA architecture I found two interesting GPUs:

  • AMD Instinct MI100 - Pros: 32GB of RAM, ROCm support - Cons: 3-year-old GPU, expensive (compared to the W6800 in Dell's store, though elsewhere it seems cheaper), PCIe 4.0 (all our nodes currently have only 3.x, so bandwidth would be limited).
  • AMD Instinct MI50 - Pros: 16GB of RAM, ROCm support, PCIe 3.x/4.x compatible, relatively inexpensive (compared to the W6800) - Cons: 5-year-old GPU
elukey claimed this task.

We are in the process of ordering an AMD Instinct MI100; we'll open new tasks to test it :)