Move two GPUs from Hadoop to Lift Wing
Closed, Resolved · Public

Description

Hi folks!

The ML team would be really happy to start testing GPUs on the Lift Wing eqiad cluster (ml-serve100[1-8]), so if you have time we'd like to ask for the same work that was done in T318696.

The idea is to move the two remaining GPUs from an-worker110[0-1] to a Lift Wing host (no strong preference, any of ml-serve100[1-8] would be fine).
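
For anyone unfamiliar with the bracket notation used for the hosts above, here is a minimal sketch of how a range such as ml-serve100[1-8] expands into individual hostnames. It assumes a single numeric `[a-b]` range per expression; `expand_range` is a hypothetical helper written for illustration, not a function from Cumin or any other existing tool.

```python
import re


def expand_range(expr: str) -> list[str]:
    """Expand 'prefix[a-b]suffix' into individual names (hypothetical helper)."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", expr)
    if match is None:
        # No bracket range present: treat the expression as a single hostname.
        return [expr]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # keep any zero padding used in the range bounds
    return [
        f"{prefix}{n:0{width}d}{suffix}"
        for n in range(int(start), int(end) + 1)
    ]


if __name__ == "__main__":
    for host in expand_range("ml-serve100[1-8]"):
        print(host)
    for host in expand_range("an-worker110[0-1]"):
        print(host)
```

Under that assumption, the first expression covers the eight candidate destination hosts and the second covers the two Hadoop workers named as the origin.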

Event Timeline

@Jclark-ctr Hi! Lemme know if you have some time in the next few days (even next week, not urgent) to move one GPU over to ml-serve :)

@elukey I would like to try to address this next week. Are you available Tuesday?

@Jclark-ctr sorryyyy didn't see the ping :(

Lemme know if you have time in the coming days or next week, thanks a lot! The caveat is that we'd need to move 2 GPUs from a dse-k8s-worker node, not from Hadoop:

  • origin node: dse-k8s-worker1002
  • destination node: ml-serve1001

I know that dse-k8s-worker1002 was one of the target nodes in T318696, apologies for making you reopen the same node again, but we (ML+DE+Research) decided not to move the last two GPUs from Hadoop. Thanks a lot for your patience :)

@elukey I am available any day this week except Thursday, if that works for you

@Jclark-ctr Thanks! I have time today and tomorrow in my afternoon, lemme know what time works best for you!

Icinga downtime and Alertmanager silence (ID=b4799674-ad70-4117-a653-cdeaad02c246) set by elukey@cumin1001 for 2:00:00 on 1 host(s) and their services with reason: Host under maintenance

dse-k8s-worker1002.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=43b4a369-edbc-4df6-b931-f35757b38bf1) set by elukey@cumin1001 for 2:00:00 on 1 host(s) and their services with reason: Host under maintenance

ml-serve1001.eqiad.wmnet

Change 927197 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Update kubernetes nodes with GPU settings

https://gerrit.wikimedia.org/r/927197

Removed GPU from dse-k8s-worker1002
Installed GPU into ml-serve1001

Change 927197 merged by Elukey:

[operations/puppet@production] Update kubernetes nodes with GPU settings

https://gerrit.wikimedia.org/r/927197

Icinga downtime and Alertmanager silence (ID=2ef51d27-4384-414f-9fdf-8fe7b4c93b00) set by elukey@cumin1001 for 1:00:00 on 1 host(s) and their services with reason: Host under maintenance

ml-serve1001.eqiad.wmnet

I can confirm that the GPUs are working on ml-serve1001, thanks!