
Move some GPUs from Hadoop to the DSE-K8S cluster
Closed, Resolved · Public

Description

GPU move - Stage 1

Stage 1 involves moving two GPU cards from the Hadoop cluster to the DSE-K8S cluster.
We are going to attempt to install both cards into a single host; a minimal command sketch follows this checklist.

  • Shut down an-worker1096
  • Shut down an-worker1097
  • Shut down dse-k8s-worker1001
  • Remove the GPU card from an-worker1096
  • Remove the GPU card from an-worker1097
  • Install both GPU cards into dse-k8s-worker1001
  • Retrieve the GPU Ready Configuration Cable Install Kit (470-ACQQ) from an-worker1096
  • Retrieve the GPU Ready Configuration Cable Install Kit (470-ACQQ) from an-worker1097
  • Boot all three servers
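For illustration only, a minimal sketch of the per-host commands involved, assuming standard Debian tooling; depooling, downtimes and the DC-Ops steps are handled separately and are not shown here:

# On each of an-worker1096, an-worker1097 and dse-k8s-worker1001, once the host
# has been depooled and downtimed (illustrative, not the exact WMF procedure):
sudo shutdown -h now

# After the physical move and power-on, confirm the cards are visible on dse-k8s-worker1001:
lspci | grep -i vga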

GPU move - Stage 2

Stage 2 involves moving another two GPU cards from the Hadoop cluster to the DSE-K8S cluster.
We are going to install both cards into a single host.

  • Shut down an-worker1098
  • Shut down an-worker1099
  • Shut down dse-k8s-worker1002
  • Remove the GPU card from an-worker1098
  • Remove the GPU card from an-worker1099
  • Install both GPU cards into dse-k8s-worker1002
  • Boot all three servers

Once this work is done, @BTullis will follow up with puppet changes to remove the GPU customization from an-worker109[8-9].

Original description below

Current status:

  • We have six AMD GPUs currently installed in the Hadoop cluster, but they are under-utilized.
  • We have a new Kubernetes cluster named DSE-K8S which we would like to be able to use for GPU based workloads.
  • Four of the hosts in this DSE-K8S cluster are supposedly GPU-ready, with all necessary cable kits in place.
  • The other four nodes are supposedly GPU-compatible, but are missing the cable kits.

See the following page for more information on the existing GPUs: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/AMD_GPU
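For context, a hedged way to spot-check current utilisation of these cards on the Hadoop workers, assuming the ROCm command-line tools described on that page are installed; the rocm-smi flags shown are an assumption, not taken from this task:

# Run on one of the GPU-equipped Hadoop workers (illustrative only):
rocm-smi --showuse            # per-GPU utilisation percentage
rocm-smi --showmeminfo vram   # VRAM currently in use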

Desired status:

  • Four of the six existing GPU cards are removed from the Hadoop cluster
  • These four cards are installed in pairs to two of the dse-k8s-worker nodes
  • Any spare GPU cable kits are reclaimed from the Hadoop hosts from which the cards were removed (if feasible, convenient, and practical)

This will require the kind input and cooperation of the DC-Ops team, specifically ops-eqiad, to carry out the physical card moves. Data-Engineering can collaborate by shutting down and depooling the relevant servers to facilitate the work.

I have already spoken to representatives from the Research team, such as @leila and @Miriam, and I believe they are happy in principle with this hardware move. I also know that @achou is one of the main users of the existing cards in Hadoop, so they may have insights on how and when to proceed.

Ultimately, this is still an experiment focused on extracting value from the GPUs we already have and learning more about their compatibility. We know that in the longer term the DC-Ops team would prefer that we buy servers with GPUs fitted in the factory, but that hasn't been possible yet, so we would be keen to make this hardware move if it's at all practical.

Event Timeline

@BTullis can/should we just remove those nodes as Hadoop workers and reimage them as DSE workers?

We can probably do without their capacity for a while, and/or perhaps we can just replace them with non-GPU equivalent Hadoop workers?

Thanks Andrew. Definitely worth considering, but I think I'd prefer to look at moving the cards first, if that's feasible.

The main reason is that in order to do it properly, we would want to rename the hosts whilst reimaging:
https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging

This is a bit experimental and a bit fiddly. It's not too bad if you rename to a new name that has never been used before, but it's worse if you want to re-use an old name.
This makes it even more awkward if we ever wished to put them back as Hadoop workers.

What's your opinion @Cmjohnson ? Would you be willing to have a look at moving four of these GPUs to different servers?
We can arrange for plenty of downtime for these servers, so the work wouldn't have to be done within a tight maintenance window, if that helps.

Would you prefer me to fill out a hardware troubleshooting form instead of this ticket?

If you don't think it's workable then, as @Ottomata suggests, I can look instead at renaming and reimaging these servers to make them Kubernetes workers rather than Hadoop workers.

Hi @Jclark-ctr would you mind if we try to do some work on this one day next week? We can just start by trying to move two cards and see how it goes.
If that's ok with you, what day would be best? Did you say that you'd prefer to start the work at 9am EST? I can shut down the servers ahead of time without any issue.

I'll update the task description with the specifics of what I'd like us to try, which is just moving two cards. We're not going to be moving any chassis or external cables.

Any day next week except Monday

BTullis added a subscriber: leila.

Great! Let's book it in for Wednesday next week 9:00 EST - I'll shut down the three servers ahead of time.

Description updated above. If it works smoothly, we can replicate for a stage 2 and move another two cards. If it doesn't work smoothly, we can reassess.

Icinga downtime and Alertmanager silence (ID=ad975722-2d29-4e76-b155-59e38bc020f3) set by btullis@cumin1001 for 8:00:00 on 1 host(s) and their services with reason: Attempting to move some GPUs

an-worker1096.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=b637e9a2-c8cd-43d2-ab57-1acb06e6d236) set by btullis@cumin1001 for 8:00:00 on 1 host(s) and their services with reason: Attempting to move some GPUs

an-worker1097.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=a56d2950-9f5c-4f2c-8fc1-ddb7900637da) set by btullis@cumin1001 for 8:00:00 on 1 host(s) and their services with reason: Attempting to move some GPUs

dse-k8s-worker1001.eqiad.wmnet
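For reference, the sort of command that produces these downtime entries, run from a cumin host; the cookbook name and flags below are an assumption rather than a transcript from this task:

# Assumed invocation of the downtime cookbook from cumin1001 (not copied from the task):
sudo cookbook sre.hosts.downtime --hours 8 \
    --reason "Attempting to move some GPUs" \
    'an-worker1096.eqiad.wmnet'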

Removed 2 GPUs from an-worker1096 and an-worker1097, and reinstalled them in dse-k8s-worker1001.

Jclark-ctr reopened this task as Open.

Accidentally closed task

Thanks @Jclark-ctr, that's excellent.
I can confirm that both cards are detected correctly.

btullis@dse-k8s-worker1001:~$ lspci|grep VGA
03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. Integrated Matrox G200eW3 Graphics Controller (rev 04)
3d:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XT [Radeon PRO WX 9100]
da:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XT [Radeon PRO WX 9100]
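Beyond lspci, a hedged further check that the cards are usable by the driver and (if installed) the ROCm stack; the paths below are standard for amdgpu but are an assumption for this particular host:

# Device nodes exposed by the amdgpu/kfd driver:
ls -l /dev/kfd /dev/dri/renderD*

# If the ROCm userspace tools are installed, this should list both WX 9100 cards:
rocm-smi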

gpu-3.JPG (photo attachment, 159 KB)

As you mentioned, it's not ideal that the fan on one of the two cards blows directly up at the server's lid, but I guess we will just have to keep an eye on the temperature and change things around if it seems to be getting too hot under load.
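A hedged sketch of how the temperature could be watched, either via the ROCm tools or straight from sysfs; flags and paths are assumptions, not taken from this task:

# Spot-check GPU temperatures (assumes the ROCm tools are installed):
rocm-smi --showtemp

# Or read the amdgpu hwmon sensors directly (values are in millidegrees Celsius):
cat /sys/class/drm/card*/device/hwmon/hwmon*/temp*_input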

Same time tomorrow for stage 2 then? Moving another two cards into dse-k8s-worker1002. Once again, I'll update the description above and shut down the hosts in advance.

BTullis renamed this task from "Attempt to move some GPUs from Hadoop to the DSE-K8S cluster" to "Move some GPUs from Hadoop to the DSE-K8S cluster". Feb 8 2023, 4:24 PM
BTullis updated the task description. (Show Details)

Change 887807 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove the GPU configuration from an-worker109[67]

https://gerrit.wikimedia.org/r/887807

Mentioned in SAL (#wikimedia-analytics) [2023-02-09T12:01:32Z] <btullis> Shutting down an-worker109[89] and dse-k8s-worker1002 for another GPU move - T318696

Icinga downtime and Alertmanager silence (ID=49b5d5ab-a254-46d1-b90a-001be80f16b7) set by btullis@cumin1001 for 8:00:00 on 1 host(s) and their services with reason: Attempting to move some GPUs

an-worker1098.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=2de0632f-155c-4404-88de-ffa2c986cd08) set by btullis@cumin1001 for 8:00:00 on 1 host(s) and their services with reason: Attempting to move some GPUs

an-worker1099.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=ac0799f5-49e4-45fd-99e4-a3048068dc8a) set by btullis@cumin1001 for 8:00:00 on 1 host(s) and their services with reason: Attempting to move some GPUs

dse-k8s-worker1002.eqiad.wmnet

Removed GPUs from an-worker1098 and an-worker1099, and installed both GPUs into dse-k8s-worker1002.

Great! Thanks @Jclark-ctr, both cards are detected.

btullis@dse-k8s-worker1002:~$ lspci|grep VGA
03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. Integrated Matrox G200eW3 Graphics Controller (rev 04)
3d:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XT [Radeon PRO WX 9100]
da:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XT [Radeon PRO WX 9100]
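As a possible follow-up check once the cluster is configured for GPU scheduling, and assuming the AMD device plugin is deployed and advertises the usual amd.com/gpu resource (that setup is outside the scope of this task):

# Illustrative only, from a host with kubectl access to the dse-k8s cluster:
kubectl describe node dse-k8s-worker1002.eqiad.wmnet | grep -A 2 'amd.com/gpu'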

Resolving this ticket.

Change 887807 merged by Btullis:

[operations/puppet@production] Remove the GPU configuration from an-worker109[6-9]

https://gerrit.wikimedia.org/r/887807