
Move some GPUs from Hadoop to the DSE-K8S cluster
Closed, Resolved · Public

Description

GPU move - Stage 1

Stage 1 involves moving two GPU cards from the Hadoop cluster to the DSE-K8S cluster.
We are going to attempt to install both cards into a single host; a minimal command sketch follows this checklist.

  • Shut down an-worker1096
  • Shut down an-worker1097
  • Shut down dse-k8s-worker1001
  • Remove the GPU card from an-worker1096
  • Remove the GPU card from an-worker1097
  • Install both GPU cards into dse-k8s-worker1001
  • Retrieve the GPU Ready Configuration Cable Install Kit (470-ACQQ) from an-worker1096
  • Retrieve the GPU Ready Configuration Cable Install Kit (470-ACQQ) from an-worker1097
  • Boot all three servers
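For illustration only, a minimal sketch of the per-host commands involved, assuming standard Debian tooling; depooling, downtimes and the DC-Ops steps are handled separately and are not shown here:

# On each of an-worker1096, an-worker1097 and dse-k8s-worker1001, once the host
# has been depooled and downtimed (illustrative, not the exact WMF procedure):
sudo shutdown -h now

# After the physical move and power-on, confirm the cards are visible on dse-k8s-worker1001:
lspci | grep -i vga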

GPU move - Stage 2

Stage 2 involves moving another two GPU cards from the Hadoop cluster to the DSE-K8S cluster.
We are going to install both cards into a single host.

  • Shut down an-worker1098
  • Shut down an-worker1099
  • Shut down dse-k8s-worker1002
  • Remove the GPU card from an-worker1098
  • Remove the GPU card from an-worker1099
  • Install both GPU cards into dse-k8s-worker1002
  • Boot all three servers

Once this work is done, @BTullis will follow up with puppet changes to remove the GPU customization from an-worker109[8-9].

Original description below

Current status:

  • We have six AMD GPUs currently installed in the Hadoop cluster, but they are under-utilized.
  • We have a new Kubernetes cluster named DSE-K8S which we would like to be able to use for GPU based workloads.
  • Four of the hosts in this DSE-K8S cluster are supposedly GPU-ready, with all necessary cable kits in place.
  • The other four nodes are supposedly GPU-compatible, but are missing the cable kits.

See the following page for more information on the existing GPUs: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU
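For context, a hedged way to spot-check current utilisation of these cards on the Hadoop workers, assuming the ROCm command-line tools described on that page are installed; the rocm-smi flags shown are an assumption, not taken from this task:

# Run on one of the GPU-equipped Hadoop workers (illustrative only):
rocm-smi --showuse            # per-GPU utilisation percentage
rocm-smi --showmeminfo vram   # VRAM currently in use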

Desired status:

  • Four of the six existing GPU cards are removed from the Hadoop cluster
  • These four cards are installed in pairs to two of the dse-k8s-worker nodes
  • Any spare GPU cable kits are reclaimed from the Hadoop hosts from which the cards were removed (if feasible, convenient, and practical)

This will require the kind input and cooperation of the DC-Ops team, specifically ops-eqiad, to carry out the physical card moves. Data-Engineering can collaborate by shutting down and depooling the relevant servers to facilitate the work.

I have already spoken to representatives from the Research team, such as @leila and @Miriam, and I believe they are happy in principle with this hardware move. I also know that @achou is one of the main users of the existing cards in Hadoop, so they may have insights on how and when to proceed.

Ultimately, this is still an experiment focused on extracting value from the GPUs we already have and learning more about their compatibility. We know that in the longer term the DC-Ops team would prefer that we buy servers with GPUs fitted in the factory, but that hasn't been possible yet, so we would be keen to make this hardware move if it's at all practical.

Event Timeline

@BTullis can/should we just remove those nodes as Hadoop workers and reimage them as DSE workers?

We can probably do without their capacity for a while, and/or perhaps we can just replace them with non-GPU equivalent Hadoop workers?

Thanks Andrew. Definitely worth considering, but I think I'd prefer to look at moving the cards first, if that's feasible.

The main reason is that in order to do it properly, we would want to rename the hosts whilst reimaging:
https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging

This is a bit experimental and a bit fiddly. It's not too bad if you rename to a new name that has never been used before, but it's worse if you want to re-use an old name.
This makes it even more awkward if we ever wished to put them back as Hadoop workers.

What's your opinion @Cmjohnson ? Would you be willing to have a look at moving four of these GPUs to different servers?
We can arrange for plenty of downtime for these servers, so the work wouldn't have to be done within a tight maintenance window, if that helps.

Would you prefer me to fill out a hardware troubleshooting form instead of this ticket?

If you don't think it's workable then, as @Ottomata suggests, I can look instead at renaming and reimaging these servers to make them Kubernetes workers rather than Hadoop workers.

Hi @Jclark-ctr would you mind if we try to do some work on this one day next week? We can just start by trying to move two cards and see how it goes.
If that's ok with you, what day would be best? Did you say that you'd prefer to start the work at 9am EST? I can shut down the servers ahead of time without any issue.

I'll update the task description with the specifics of what I'd like us to try, which is just moving two cards. We're not going to be moving any chassis or external cables.

Any day next week except Monday

BTullis added a subscriber: leila.

Great! Let's book it in for Wednesday next week 9:00 EST - I'll shut down the three servers ahead of time.

Description updated above. If it works smoothly, we can replicate for a stage 2 and move another two cards. If it doesn't work smoothly, we can reassess.

Icinga downtime and Alertmanager silence (ID=ad975722-2d29-4e76-b155-59e38bc020f3) set by btullis@cumin1001 for 8:00:00 on 1 host(s) and their services with reason: Attempting to move some GPUs

an-worker1096.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=b637e9a2-c8cd-43d2-ab57-1acb06e6d236) set by btullis@cumin1001 for 8:00:00 on 1 host(s) and their services with reason: Attempting to move some GPUs

an-worker1097.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=a56d2950-9f5c-4f2c-8fc1-ddb7900637da) set by btullis@cumin1001 for 8:00:00 on 1 host(s) and their services with reason: Attempting to move some GPUs

dse-k8s-worker1001.eqiad.wmnet
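For reference, the sort of command that produces these downtime entries, run from a cumin host; the cookbook name and flags below are an assumption rather than a transcript from this task:

# Assumed invocation of the downtime cookbook from cumin1001 (not copied from the task):
sudo cookbook sre.hosts.downtime --hours 8 \
    --reason "Attempting to move some GPUs" \
    'an-worker1096.eqiad.wmnet'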

Removed 2 GPUs from an-worker1096 and an-worker1097, and reinstalled them in dse-k8s-worker1001.

Jclark-ctr reopened this task as Open.

Accidentally closed task

Thanks @Jclark-ctr, that's excellent.
I can confirm that both cards are detected correctly.

btullis@dse-k8s-worker1001:~$ lspci|grep VGA
03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. Integrated Matrox G200eW3 Graphics Controller (rev 04)
3d:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XT [Radeon PRO WX 9100]
da:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XT [Radeon PRO WX 9100]
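Beyond lspci, a hedged further check that the cards are usable by the driver and (if installed) the ROCm stack; the paths below are standard for amdgpu but are an assumption for this particular host:

# Device nodes exposed by the amdgpu/kfd driver:
ls -l /dev/kfd /dev/dri/renderD*

# If the ROCm userspace tools are installed, this should list both WX 9100 cards:
rocm-smi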

gpu-3.JPG (photo attachment, 159 KB)

As you mentioned, it's not ideal that the fan on one of the two cards blows directly up at the server's lid, but I guess we will just have to keep an eye on the temperature and change things around if it seems to be getting too hot under load.
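A hedged sketch of how the temperature could be watched, either via the ROCm tools or straight from sysfs; flags and paths are assumptions, not taken from this task:

# Spot-check GPU temperatures (assumes the ROCm tools are installed):
rocm-smi --showtemp

# Or read the amdgpu hwmon sensors directly (values are in millidegrees Celsius):
cat /sys/class/drm/card*/device/hwmon/hwmon*/temp*_input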

Same time tomorrow for stage 2 then? Moving another two cards into dse-k8s-worker1002. Once again, I'll update the description above and shut down the hosts in advance.

BTullis renamed this task from "Attempt to move some GPUs from Hadoop to the DSE-K8S cluster" to "Move some GPUs from Hadoop to the DSE-K8S cluster". Feb 8 2023, 4:24 PM
BTullis updated the task description. (Show Details)

Change 887807 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove the GPU configuration from an-worker109[67]

https://gerrit.wikimedia.org/r/887807

Mentioned in SAL (#wikimedia-analytics) [2023-02-09T12:01:32Z] <btullis> Shutting down an-worker109[89] and dse-k8s-worker1002 for another GPU move - T318696

Icinga downtime and Alertmanager silence (ID=49b5d5ab-a254-46d1-b90a-001be80f16b7) set by btullis@cumin1001 for 8:00:00 on 1 host(s) and their services with reason: Attempting to move some GPUs

an-worker1098.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=2de0632f-155c-4404-88de-ffa2c986cd08) set by btullis@cumin1001 for 8:00:00 on 1 host(s) and their services with reason: Attempting to move some GPUs

an-worker1099.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=ac0799f5-49e4-45fd-99e4-a3048068dc8a) set by btullis@cumin1001 for 8:00:00 on 1 host(s) and their services with reason: Attempting to move some GPUs

dse-k8s-worker1002.eqiad.wmnet

Removed GPUs from an-worker1098 and an-worker1099, and installed both GPUs into dse-k8s-worker1002.

Great! Thanks @Jclark-ctr, both cards are detected.

btullis@dse-k8s-worker1002:~$ lspci|grep VGA
03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. Integrated Matrox G200eW3 Graphics Controller (rev 04)
3d:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XT [Radeon PRO WX 9100]
da:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XT [Radeon PRO WX 9100]
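As a possible follow-up check once the cluster is configured for GPU scheduling, and assuming the AMD device plugin is deployed and advertises the usual amd.com/gpu resource (that setup is outside the scope of this task):

# Illustrative only, from a host with kubectl access to the dse-k8s cluster:
kubectl describe node dse-k8s-worker1002.eqiad.wmnet | grep -A 2 'amd.com/gpu'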

Resolving this ticket.

Change 887807 merged by Btullis:

[operations/puppet@production] Remove the GPU configuration from an-worker109[6-9]

https://gerrit.wikimedia.org/r/887807