Move two GPUs from Hadoop to Lift Wing
Closed, Resolved (Public)

Assigned To

Authored By

	elukey
	Apr 19 2023, 11:36 AM

Description

Hi folks!

The ML team would be really happy to start testing GPUs on the Lift Wing eqiad cluster (ml-serve100[1-8]), so if you have time we'd like to ask for the same work that was done in T318696.

The idea is to move the two remaining GPUs from an-worker110[0-1] to a Lift Wing host (no strong preference, any of ml-serve100[1-8] would be fine).
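For readers unfamiliar with the bracket notation used for host ranges above, here is a minimal Python sketch of how a pattern such as ml-serve100[1-8] expands into individual hostnames. The function name `expand_hosts` is hypothetical and only for illustration; it is not part of any Wikimedia tooling, and it assumes a single numeric range per pattern.

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one [start-end] numeric range in a host pattern.

    Hypothetical helper for illustration; handles a single range only.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        # No range in the pattern: it already names a single host.
        return [pattern]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # keep zero padding if the range uses it
    return [
        f"{prefix}{n:0{width}d}{suffix}"
        for n in range(int(start), int(end) + 1)
    ]


if __name__ == "__main__":
    print(expand_hosts("ml-serve100[1-8]"))   # candidate destination hosts
    print(expand_hosts("an-worker110[0-1]"))  # originally proposed source hosts
```

Under these assumptions, ml-serve100[1-8] covers ml-serve1001 through ml-serve1008, and an-worker110[0-1] covers an-worker1100 and an-worker1101.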

Details

	Subject	Repo	Branch	Lines +/-
	Update kubernetes nodes with GPU settings	operations/puppet	production	+4 -4

Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T333462 Experiment with GPUs in the Machine Learning infrastructure
		Resolved		Jclark-ctr	T335031 Move two GPUs from Hadoop to Lift Wing

Event Timeline

elukey created this task. Apr 19 2023, 11:36 AM

elukey added a parent task: T333462: Experiment with GPUs in the Machine Learning infrastructure.

Maintenance_bot added a project: SRE. Apr 19 2023, 11:45 AM

wiki_willy assigned this task to Jclark-ctr. Apr 20 2023, 6:01 PM

@Jclark-ctr Hi! Lemme know if you have some time in the next few days (even next week, not urgent) to move one GPU over to ml-serve :)

@elukey I would like to try to address this next week. Are you available Tuesday?

@Jclark-ctr sorryyyy didn't see the ping :(

Lemme know if you have time in the next few days or next week, thanks a lot! The caveat is that we'd need to move 2 GPUs from a dse-k8s-worker node, not from Hadoop:

origin node: dse-k8s-worker1002
destination node: ml-serve1001

I know that dse-k8s-worker1002 was one of the target nodes in T318696, apologies for making you re-open the same node again, but we (ML+DE+Research) decided not to move the last two GPUs from Hadoop. Thanks a lot for your patience :)

@elukey I am available any day this week except Thursday, if you are available.

@Jclark-ctr Thanks! I have time today and tomorrow during my afternoon, lemme know what time works best for you!

RhinosF1 subscribed. Jun 5 2023, 1:40 PM

Icinga downtime and Alertmanager silence (ID=b4799674-ad70-4117-a653-cdeaad02c246) set by elukey@cumin1001 for 2:00:00 on 1 host(s) and their services with reason: Host under maintenance

dse-k8s-worker1002.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=43b4a369-edbc-4df6-b931-f35757b38bf1) set by elukey@cumin1001 for 2:00:00 on 1 host(s) and their services with reason: Host under maintenance

ml-serve1001.eqiad.wmnet

Change 927197 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Update kubernetes nodes with GPU settings

https://gerrit.wikimedia.org/r/927197

gerritbot added a project: Patch-For-Review. Jun 5 2023, 1:54 PM

Removed GPU from dse-k8s-worker1002
Installed GPU into ml-serve1001

Change 927197 merged by Elukey:

[operations/puppet@production] Update kubernetes nodes with GPU settings

https://gerrit.wikimedia.org/r/927197

Maintenance_bot removed a project: Patch-For-Review. Jun 5 2023, 2:30 PM

Icinga downtime and Alertmanager silence (ID=2ef51d27-4384-414f-9fdf-8fe7b4c93b00) set by elukey@cumin1001 for 1:00:00 on 1 host(s) and their services with reason: Host under maintenance

ml-serve1001.eqiad.wmnet

I can confirm that the GPUs are working on ml-serve1001, thanks!
