Check if a GPU fits in any of the remaining stat or notebook hosts
Closed, Resolved, Public

Description

Hi everybody,

the GPU deployment to stat1005 seems to have been a success: we had positive feedback from users, and we are thinking about adding a couple more GPUs (if possible) to other hosts. I opened this task to see whether any of the following hosts has enough space to host a GPU like the one installed on stat1005:

  • stat1004
  • stat1006
  • stat1007
  • notebook1003
  • notebook1004

Basically similar to what we have done in T216528.

Event Timeline

elukey triaged this task as Medium priority. Apr 11 2019, 2:00 PM
elukey created this task.
fdans moved this task from Incoming to Radar on the Analytics board. Apr 11 2019, 4:46 PM
elukey assigned this task to Cmjohnson. Apr 26 2019, 11:27 AM
Cmjohnson moved this task from Backlog to Not urgent on the ops-eqiad board. May 28 2019, 2:54 PM
elukey added a comment. Nov 7 2019, 9:15 AM

@Cmjohnson we might need to add a new GPU next quarter (I need to triple-check with the Research team). In your opinion, can any of the above hosts take a GPU, or should we open each one of them and measure to find one?

elukey mentioned this in Unknown Object (Task). Nov 19 2019, 10:14 AM
RobH added a subscriber: RobH. Nov 25 2019, 6:25 PM

So, I advise against any guesswork on this. If you want to know if these 5 servers will hold a GPU, each purchase group needs to be popped open. I'll break it down:

hostname | chassis | purchase task
stat1004 | R430 | T120248
stat1006 | DL360Gen9 | T161315
stat1007 | R440 | T196345
notebook1003 | R430 | T175603
notebook1004 | R430 | T175603

So, notebook100[34] share the same internal configuration, which leaves four configurations to inspect. The following has to be checked for each of them:

stat1004:

  • is there a PCI riser with spare PCIe slot for use by a GPU and not in use by anything else
  • is there a spare power dongle to connect to the GPU, as they all seem to require extra power
  • will the card ordered via T216528 for stat1005 fit in this chassis

stat1006:

  • is there a PCI riser with spare PCIe slot for use by a GPU and not in use by anything else
  • is there a spare power dongle to connect to the GPU, as they all seem to require extra power
  • will the card ordered via T216528 for stat1005 fit in this chassis

stat1007:

  • is there a PCI riser with spare PCIe slot for use by a GPU and not in use by anything else
  • is there a spare power dongle to connect to the GPU, as they all seem to require extra power
  • will the card ordered via T216528 for stat1005 fit in this chassis

notebook100[34] (identical, can check either):

  • is there a PCI riser with spare PCIe slot for use by a GPU and not in use by anything else
  • is there a spare power dongle to connect to the GPU, as they all seem to require extra power
  • will the card ordered via T216528 for stat1005 fit in this chassis

If we do that for the above, we can determine whether they can accommodate the GPU already put into stat1005.
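The per-chassis checklist above could be tracked with something like the following minimal sketch. The class and field names are hypothetical and only mirror the three questions listed; every answer starts as unknown (`None`) because nothing has been inspected at this point.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChassisCheck:
    """One inspection record per distinct chassis configuration (hypothetical helper)."""
    chassis: str
    free_pcie_riser_slot: Optional[bool] = None   # spare PCIe slot on a riser?
    spare_power_connector: Optional[bool] = None  # spare power dongle for the GPU?
    card_fits: Optional[bool] = None              # does the stat1005 card (T216528) fit?

    def can_host_gpu(self) -> Optional[bool]:
        answers = (self.free_pcie_riser_slot,
                   self.spare_power_connector,
                   self.card_fits)
        if any(a is False for a in answers):
            return False   # a single failed check rules the chassis out
        if any(a is None for a in answers):
            return None    # still needs to be opened and inspected
        return True


# Hostnames and chassis models are taken from the table above;
# the answers are left unknown until each chassis has been opened.
CHECKS = {
    "stat1004": ChassisCheck("R430"),
    "stat1006": ChassisCheck("DL360Gen9"),
    "stat1007": ChassisCheck("R440"),
    "notebook100[34]": ChassisCheck("R430"),
}
```

A result of `None` means "not yet inspected", which keeps guesswork out of the answer.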

RobH added a comment. Nov 25 2019, 6:26 PM

Please note we likely need to schedule downtime for each of those 4 hosts to shut them down and check them. @elukey: Can you advise how much notice is needed and what the process is for the above? (I am not sure about the state/health of these services, whether they can be cold shut down, or how to depool them.)

elukey reassigned this task from Cmjohnson to RobH. Nov 25 2019, 6:29 PM
elukey added a subscriber: Cmjohnson.

@RobH if we could go one at a time, I think that a day's notice before the maintenance is sufficient. I'll take care of it when the time comes :)

RobH reassigned this task from RobH to Cmjohnson. Nov 25 2019, 6:30 PM

Please note that Chris will still be performing this, so it needs to stay assigned to him. He will be coordinating with you for these checks. I was just doing the research to reduce the number of hosts to check (since two of them are identical).

Thanks!

@RobH my bad! Thanks a lot for the patience @Cmjohnson, I'll add more pictures to the blog post once it is allowed to be published! :)

@RobH
Inspected stat1004 with @elukey this morning {F31467984}. 2 available slots: it will fit 1 full-height and 1 half-height card.

Announced the shutdown of stat1007 for Thu Dec 12th 15:30 CET (more or less), since it is a more crowded and heavily used node. Since stat1004 seems to be closer to refresh time than stat1007, I'd prefer to add the GPU to stat1007 if there is space.

@RobH is what John reported for stat1004 sufficient to know whether the GPU will fit, or do you think that we need more info?

EBernhardson added a comment. Edited Dec 11 2019, 4:25 PM

IMO the important things to check beyond physical space:

  • Does the PSU have enough overhead to support GPUs? (The current WX 9100 needs 250W, the newer Radeon VII requires ~300W.) stat1005 was intentionally ordered with a beefier power supply, iirc.
  • Does the system have the appropriate power plugs? (The WX 9100 needs one 8-pin and one 6-pin power connection, the Radeon VII requires two 8-pin power connections.) On stat1005 these plugs were part of the PCIe riser card.
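As a rough illustration of the two power checks above, here is a minimal sketch. The GPU wattages and connector counts come from this comment; the function name, the PSU capacity, the host's peak load, and the 20% safety margin are all hypothetical inputs that would have to be read from the actual hardware. Adapters between connector types are ignored.

```python
from collections import Counter

# Figures quoted in this comment (the Radeon VII wattage is approximate).
GPU_REQUIREMENTS = {
    "wx 9100": {"watts": 250, "connectors": Counter({"8-pin": 1, "6-pin": 1})},
    "radeon vii": {"watts": 300, "connectors": Counter({"8-pin": 2})},
}


def gpu_power_ok(gpu, psu_watts, peak_load_watts, available_connectors, margin=0.2):
    """Return True if the PSU has headroom and the right plugs for `gpu`.

    psu_watts, peak_load_watts, and margin are assumptions supplied by the
    caller; none of them are known for the hosts discussed in this task.
    """
    req = GPU_REQUIREMENTS[gpu]
    # Keep `margin` of the PSU capacity in reserve.
    budget = psu_watts * (1 - margin)
    has_headroom = peak_load_watts + req["watts"] <= budget
    # Every required connector type must be available in sufficient number.
    available = Counter(available_connectors)
    has_plugs = all(available[kind] >= count
                    for kind, count in req["connectors"].items())
    return has_headroom and has_plugs
```

For example, `gpu_power_ok("radeon vii", 1100, 400, ["8-pin", "6-pin"])` fails on the plug check alone, since the Radeon VII needs two 8-pin connections.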


@EBernhardson stat1004 and stat1007 are 1U hosts, so they will not fit a dual-slot card. Most hosts on this list are 1U and have the same configuration.

@RobH I guess that the only way forward would be to order a new host like stat1005? If so, we could use part of the GPU budget for it. It shouldn't be a big amount of money, but I'm not sure if that is possible, of course.

elukey closed this task as Resolved. Jan 3 2020, 10:56 AM