GPU upgrade for stat1005
Closed, ResolvedPublic

Description

Hi!

The Analytics team is looking for a new (better supported) GPU for stat1005. I have been discussing this with Rob and Faidon over the past few days, and I am creating this task to collect all the info and decide how to proceed.

Before starting: stat1005 already has a GPU, but it was bought (IIUC) as part of the host, as a pre-packaged solution from Dell. I have no idea if the card is integrated or attached to a PCI slot. We have been working in T148843 to see if the GPU could work with the recent AMD drivers, but so far no luck, and the GPU is not well supported anymore.

Ideally this should be a first test use case to start thinking a bit more about adding a GPU to all the computing nodes of Analytics (stat machines first, then possibly hadoop worker nodes in the bright future).

In T148843 @EBernhardson added some comments about GPUs that could be a good fit for us:

Radeon VII 16GB: This is a brand new card released this month, 16GB on a consumer card is almost unheard of (and google is still laughing with their 64GB tpu's). We might almost be too early, retail is $700 but I'm not seeing the 16GB version anywhere yet. Random internet blog claims "the stocks for AMD’s new 7nm Radeon VII graphics card were very limited with less than 5,000 units available globally at launch". There might also be more driver problems, it looks like the drivers for Radeon VII similarly only landed in the linux kernel a couple weeks ago: https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/?id=8bf607ca9bb42bc26bd4c8d574d88b6b6935c47d

RX Vega 64 8GB: This is amd's previous top of the line consumer card, these should be available almost anywhere. This is a perfectly acceptable card, but double the memory on the next step up is huge.

Looks like one more option, a workstation card from AMD, the Vega Frontier has 16GB of memory with very similar compute to the Vega 64. Unfortunately this seems to have been discontinued, the current choice for 16GB memory is AMD Radeon Pro WX 9100. This is about double the price of a vega 64, with the primary difference being double the memory. If that is important for this test or not is hard to say. My limited understanding is that many of the models publicly available for fine tuning are built to fit a training batch in either 8, 12 or 16gb of memory. 8gb might be restrictive in some cases, but I've not enough experience to say when exactly.

We'd need to find something that is compatible with this hardware list: https://rocm.github.io/hardware.html. Ideally the GFX 9 series would be the right target:

“Vega 10” chips, including the following GPUs:
AMD Radeon RX Vega 56
AMD Radeon RX Vega 64
AMD Radeon Vega Frontier Edition
AMD Radeon Pro WX 8200
AMD Radeon Pro WX 9100
AMD Radeon Pro V340
AMD Radeon Pro V340 MxGPU
AMD Radeon Instinct MI25
Note that ROCm does not support the Radeon Pro SSG

Now the hard part. @RobH mentioned to me on IRC that we'd need to find a good compromise between the space the card occupies within the chassis and extra power consumption (ideally a small card and no extra power consumption). Dell offers some GPU models that are compatible with their PowerEdge servers, but the list seems to mention only models from the GFX7 series, which is enabled but not supported by the ROCm drivers (the same series as the GPU card on stat1005, which indeed does not seem to work with the latest drivers):

https://dell.com/learn/us/en/04/campaigns/poweredge-gpu#campaignTabs-1

Let's try to find a good compromise :)

Event Timeline

I've been lurking on this issue, and just wanted to chime in with one bit of information I learned through experience. I'm not sure what model of Dell server you are using, but there are some subtle issues with fitting GPUs in a 2U server (and maybe others). In my case, an Nvidia GeForce card (which I know you aren't considering) technically fit in my 2U server, but the location of the card's power connector meant it protruded beyond the enclosure.

It looks to me like some of the GFX 9 cards have a similar problem. To be specific, compare the power locations on a card like the Vega 64 (https://static.techspot.com/images2/news/bigimage/2017/07/2017-07-30-image-6.jpg) to a card like the MI25 (https://www.avadirect.com/Radeon-Instinct-MI25-16GB-HBM2-Passive-Cooling-GPU-Computing-Accelerator/Product/12422324). The vertical orientation of the power connector on a GPU similar to the Vega 64 caused problems in my 2U chassis.

I'm sure this varies from chassis to chassis and card to card, but I just wanted to raise it as a possible concern. I'm looking forward to having a GPU available!

Thanks for the input @Shilad, it's much appreciated! That info is EXACTLY the kind of info we need (and why this task exists!)

To echo what some of us discussed about this in irc yesterday, we're going to have a few issues/considerations to take into account for this:

  • The existing R730 has a GPU that is added via a Dell specific daughterboard. We cannot swap it out so much as disable it and use the PCIe slot if needed.
  • PowerEdge servers (and indeed nearly all rack mount servers) tend to NOT have the extra power cable needed to power third party GPU cards.
  • Third party GPU cards often take up more than 1 slot (due to the heatsink size), and they need full size slots much of the time.
  • Cooling for the GPU card (and the power supplies to run it) are not going to be good enough in most 1U systems. This will likely require a 2U system simply for the airflow and card form factor requirements.

Some of the Dell GPU offerings built in are listed here:
https://www.dell.com/learn/us/en/04/campaigns/poweredge-gpu#campaignTabs-1

As for adding third party cards (rather than using dell provided GPUs), ideally we would find one that only requires a single PCIe slot and is 100% powered from the PCIe bus (no added power connectors.)

colewhite triaged this task as Medium priority.Feb 15 2019, 7:35 PM

Thanks for the input @Shilad, it's much appreciated! That info is EXACTLY the kind of info we need (and why this task exists!)

Prompted by Shilad I looked around, and it seems like a power connector on the rear of the card is standard for datacenter GPUs. Consumer GPUs, like the Vega 64, seem to vary and typically have it on the top. The current S9150, afaict, should have its power connector on the rear, and likely the existing power cabling we have only fits that location. The video linked below, showing the insides of the R730 case with a GPU, seems to confirm this.

To echo what some of us discussed about this in irc yesterday, we're going to have a few issues/considerations to take into account for this:

  • The existing R730 has a GPU that is added via a Dell specific daughterboard. We cannot swap it out so much as disable it and use the PCIe slot if needed.

Daughterboards are very common for GPUs. Typically they are very standard PCIe risers, but in a particular form factor for the case. I have to claim ignorance here as well, but I would be surprised if that daughterboard was any more than a PCIe riser. Unfortunately Dell only supporting ancient AMD GPUs is fairly standard. Everyone buys Nvidia, and that is reflected here with Dell offering the full suite of top end Nvidia cards and a selection of ancient AMD cards.

Referring back to the packing slip in T162700#3248287 it mentions three pcie riser cards for right, center and left alternate. Perhaps worrying is it lists heatsink for GPU's as a separate item, so there might be a custom cooling solution going on?

EDIT: I tracked down the item labeled "374-BBHN Heatsink for GPU's PowerEdge R7". This is a "ASSEMBLY, HEATSINK, CENTRAL PROCESSOR UNIT". Somewhere else confirms this is basically a low profile cpu cooler that allows the server to fit the full size gpu cards. Basically that should mean there is no custom gpu cooling to worry about.

  • PowerEdge servers (and indeed nearly all rack mount servers) tend to NOT have the extra power cable needed to power third party GPU cards.

I have to claim ignorance, but there must be some hookups for the existing S9150 to have extra power? The S9150 already in there is a 235W card that requires extra power over an 8-pin connector. The packing slip referenced above includes "490-bdcp r730 gpu installation kit". I can't find that exact model number unfortunately, but dell's current product with the same name is basically the extra power cables.

EDIT: "490-BDCP GPU Installation Kit", included in stat1005, is these power cables. They attach to 8-pin power connectors that are part of the PCIe riser cards that were also included in stat1005.

  • Third party GPU cards often take up more than 1 slot (due to the heatsink size), and they need full size slots much of the time.
  • Cooling for the GPU card (and the power supplies to run it) are not going to be good enough in most 1U systems. This will likely require a 2U system simply for the airflow and card form factor requirements.

The existing S9150 should also be dual-slot full-height full-width? It's not clear from dell, but either the R730 or the R730xd is a 2U case with room for 2 dual-slot full-height full-width cards.

A video showing the insides of a poweredge r730 along with the pcie risers and the gpu power cables. I believe stat1005 is the r730xd though, and not sure the difference: https://www.youtube.com/watch?v=lO4FmOh404k

Some of the Dell GPU offerings built in are listed here:
https://www.dell.com/learn/us/en/04/campaigns/poweredge-gpu#campaignTabs-1

As for adding third party cards (rather than using dell provided GPUs), ideally we would find one that only requires a single PCIe slot and is 100% powered from the PCIe bus (no added power connectors.)

Unfortunately any GPU that has a single slot and is powered by the PCIe bus is going to be underpowered. The GPU is generally more power hungry than the main processor, and a PCIe x16 slot can only deliver 75W. These cards are typically in the 200-300W range. Additionally, while the packing slip doesn't seem to mention it, the approved quote lists an 1100W power supply, which is only necessary for these power hungry GPUs. (Nvidia does make a datacenter GPU that doesn't require external power, the T4 at 70W, but I can't find a similar product from AMD.)
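To put rough numbers on that power gap, here is a minimal Python sketch. It only uses the 75W slot limit and the board power figures quoted in this task; the helper name `aux_power_needed` is made up for illustration.

```python
# Maximum power a card may draw from a PCIe x16 slot, in watts.
PCIE_SLOT_LIMIT_W = 75

# Board power figures as quoted in this task (watts).
CARDS_W = {"Nvidia T4": 70, "AMD WX 9100": 230, "AMD S9150": 235}


def aux_power_needed(board_power_w: int) -> int:
    """Watts that must come from auxiliary power cables (0 if slot-only)."""
    return max(0, board_power_w - PCIE_SLOT_LIMIT_W)


if __name__ == "__main__":
    for name, watts in CARDS_W.items():
        extra = aux_power_needed(watts)
        if extra == 0:
            status = "slot power only"
        else:
            status = f"needs {extra}W of auxiliary power"
        print(f"{name}: {status}")
```

Only the T4 stays within the slot budget; both AMD cards need roughly 155-160W from the extra power cables.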

The summary seems to be that any GPU compute nodes will have to be specifically bought for that purpose, with 2U cases like the r730xd that are built to fit and power these cards. stat1005 was purchased with these intentions and seems to include everything necessary to plug in a new card as long as it has the same dimensions and power connector location.

In terms of exact cards that meet these requirements, the options seem to be limited. While they probably exist, I'm not having much luck finding an RX vega 64 with the rear power connector. Similarly the Radeon VII seems to have the power connector on top as well. Being so new there aren't even other options to consider, only 1 variant of this has been released afaict.

That doesn't actually leave us with too many options. The datacenter cards listed by ROCm as supported are the MI25 (>$4k) and the V340 (two GPUs in one, >$10k). Stepping away from the datacenter specific cards, the WX 9100 (>$1k) for workstations appears to be a good fit. The power connector is on the rear like the datacenter cards. The AMD datasheet for the WX 9100 has dimensions of 4.4" x 10.5", double slot. The S9150 datasheet doesn't say, but elsewhere I'm finding 4.4" x 10.5", double slot (not sure how trustworthy that is). The WX 9100 power usage at 230W is a close match to the S9150 it's replacing at 235W.
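The selection logic above can be sketched as a small filter. This is only an illustration: the price floors and connector positions are the rough figures mentioned in this task (the rear connector on the MI25 and V340 is assumed from the observation that datacenter cards place it there), and the budget is a hypothetical value, not something anyone here has agreed on.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Card:
    name: str
    rear_power: bool       # power connector on the rear edge of the card
    price_floor_usd: int   # rough lower bound on price, as quoted in this task


# Figures taken from the comments in this task; none are authoritative.
CANDIDATES = [
    Card("Radeon Instinct MI25", rear_power=True, price_floor_usd=4000),
    Card("Radeon Pro V340", rear_power=True, price_floor_usd=10000),
    Card("Radeon Pro WX 9100", rear_power=True, price_floor_usd=1000),
    Card("Radeon VII", rear_power=False, price_floor_usd=700),
]


def shortlist(cards: List[Card], budget_usd: Optional[int] = None) -> List[str]:
    """Names of cards with a rear power connector that may fit the budget."""
    return [
        c.name
        for c in cards
        if c.rear_power and (budget_usd is None or c.price_floor_usd <= budget_usd)
    ]


if __name__ == "__main__":
    # Hypothetical budget, an assumption made only for this example.
    HYPOTHETICAL_BUDGET_USD = 2000
    print(shortlist(CANDIDATES, HYPOTHETICAL_BUDGET_USD))
```

With that assumed budget only the WX 9100 survives, which matches the reasoning in the comment.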

Thanks all for all the detailed info!

One thought: I found this interesting use case https://www.amd.com/en/case-studies/school-42 among the case studies in the AMD website, that seems to be a non-profit organization to help learning how to code. There is a link about contacting an expert from AMD, that in my opinion would be a good way to know more information about what card to buy for our use case.

fdans raised the priority of this task from Medium to High.Feb 18 2019, 4:52 PM
fdans added a project: Analytics-Kanban.
fdans moved this task from Incoming to Operational Excellence on the Analytics board.

So a quick grep of that site's GPU offerings and form factors show most of them to be very large, but the smaller ones that may fit are:

https://www.amd.com/en/products/graphics/radeon-rx-550 - most likely
https://www.amd.com/en/products/graphics/radeon-rx-560 - maybe one of the models, but power draw is high

Unfortunately the rx 550 and 560 mentioned have 4GB of memory, which is basically a show stopper. We would be able to train some of our own custom models, but more common is to download a large pre-trained model and perform some fine tuning of that model. Those models almost always need more than 4GB of memory for an image training batch and all the necessary buffers.

In https://github.com/RadeonOpenCompute/ROCm/issues/714#issuecomment-465666946 the upstream developers of the AMD drivers told me that our GPU on stat1005 is basically a dead end, so we should concentrate in buying a new one and then restart testing :(

Another thing to take away from the upstream response is that Debian is unsupported. I can't imagine deploying Ubuntu to a single machine will be an accepted way forward, so we will have to make it work. Maybe worst case we can get the kernel module working in Debian and use a containerized Ubuntu, but I would like to try and make it work in Debian first.
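For context on why the "Debian kernel module plus Ubuntu container" fallback is plausible: the ROCm user space talks to the kernel driver through the `/dev/kfd` and `/dev/dri` device nodes, so a container mostly needs those nodes passed in from the host. A minimal sketch of a host-side check follows; the function name and the `dev_root` parameter are made up for illustration, and finding the nodes does not by itself prove the driver works with a given card.

```python
from pathlib import Path
from typing import Dict


def rocm_device_nodes(dev_root: str = "/dev") -> Dict[str, bool]:
    """Report whether the device nodes ROCm relies on are present."""
    root = Path(dev_root)
    return {
        "kfd": (root / "kfd").exists(),  # compute interface of the amdgpu driver
        "dri": (root / "dri").is_dir(),  # directory holding the render nodes
    }


if __name__ == "__main__":
    print(rocm_device_nodes())
```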

@RobH I'd like to establish the next steps for this task, so we can order a new GPU and test it as soon as possible, since a lot of the new Fiscal Year planning will involve hardware for Machine-Learning etc..

Now that we have an idea of what stat1005 looks like (space/power available, etc.), can we evaluate other cards like the ones brought up previously by Erik? (I am not sure whether they definitely won't fit in stat1005, so I am asking to make it clear in this task :)).

I have submitted a support request to an "AMD Expert" and will see what they have to say about GPU's that are physically compatible with our servers.

Nuria raised the priority of this task from High to Needs Triage.Feb 28 2019, 6:19 PM
Milimetric triaged this task as High priority.
Milimetric removed a project: Analytics-Kanban.

@RobH, @EBernhardson - while we wait for a response from AMD, I'd also like to understand if T216528 gave us more info about the possibility of ordering a GPU like RX vega 64 via regular vendors (I tried to understand it by myself reading tasks/infos/etc.. but I didn't come up with a solid answer :)

At this point I'm not really expecting a response from AMD anymore, generally i would expect under a week but the only response I've gotten so far is the automated response saying someone will contact me shortly.

Based on the pictures Chris provided I can note:

  • The dimensions are standard full-height double slot.
  • There is most likely enough space for the power plugs to be connected to the top of the card, as in consumer gpus.
  • The power cable itself has sufficient length to connect to the top of the card as well
  • Connecting the power to the top of the card may make some changes to airflow through the case, probably fine.
  • I'm not entirely clear on the additional L bracket that extends past the end of the PCB, it appears to help hold the GPU in place. It may be more of a shipping concern than a sitting in the rack concern, but I'm not certain.

Overall, at this point I'd be willing to suggest we order an RX Vega 64 and see how it goes. If we can find one with the power connector on the rear like the datacenter cards that would be ideal, but I think we can use a standard consumer card with the power on the top as well.

The differences between the RX Vega 64 and the WX9100 look to be double the memory on the WX9100, and slightly more compute on the RX Vega 64. With the WX9100 marketed as a workstation card it is clocked slightly lower and will run a smidgen cooler. 8GB vs 16GB of memory is a significant difference, but the WX9100 is also ~2.5x the price of an RX Vega 64.

Thanks @EBernhardson!

@RobH What do you think? Would it be feasible for you to check from our vendors if we can get a RX Vega 64?

So the only vendor we have terms with that may have this is https://neweggbusiness.com

https://www.neweggbusiness.com/product/productlist.aspx?Submit=ENE&DEPA=0&Order=BESTMATCH&N=-1&isNodeId=1&Description=RX+Vega+64

Asus seems to have one available for either $750 or $785 depending on model. I have SERIOUS doubts it will fit, and there is a good chance this is an $800 test. While we could try to return it if it doesn't work, there will be a minimum 15% restocking fee, if it's accepted for return at all (if we don't end up using it). I'd look at this as an $800 proof of concept with no way to recoup if it doesn't work (much like the GPU before it! ;)
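As a quick worked example of the best case for a return (assuming the vendor accepts it at exactly the minimum 15% fee; the function name is made up for illustration):

```python
def return_outcome(price_usd: int, restocking_percent: int = 15):
    """Return (restocking_fee, refund) in USD if the vendor accepts the return."""
    fee = price_usd * restocking_percent / 100
    return fee, price_usd - fee


if __name__ == "__main__":
    fee, refund = return_outcome(800)
    print(f"fee: ${fee:.2f}, refund: ${refund:.2f}")  # fee: $120.00, refund: $680.00
```

So even a successful return would cost at least $120, and an unsuccessful one the full $800.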

I am ok with the $800 cost of this proof of concept, benefit in terms of engineering hours if this is to work is pretty significant

RobH mentioned this in Unknown Object (Task).Mar 5 2019, 4:06 PM

Those cards afaics are only 8G, meanwhile we'd need 16G (if possible). The only model that would suit us that I found is:

https://www.neweggbusiness.com/product/product.aspx?item=9b-14-105-087

But it costs more, ~1300 dollars

Newegg looks to only have models with aftermarket cooling in stock, and I would have some size concerns with these as well. Ideally we want the stock form factor, like the (out of stock) Sapphire card listed. If this is our only choice of supplier we could perhaps try, but I would also be less certain.

Right, the WX will have double the memory. If we can't get the RX Vega 64 in the stock form factor I think this is more likely to work for us.

Can you guys (as a team) decide on a GPU card and provide me the exact model and a url for purchase?

All pricing should technically take place in the S4 space, so I created a sub task, T217668 for tracking this order.

(As it stands, I've already changed the model/URL of the card twice on the sub-task and now it seems both times were wrong.)

So we aren't 100% certain from photos on T216528 if the left hand side PCIe riser has one or two slots on the back of the chassis.

Chris will update T216528 with a photo to show the back of the chassis in its entirety.

If only one slot is confirmed, I'm not sure there is a way forward. As far as i can tell even the datacenter gpu's (MI10, MI25) require the second slot.

If dual slot is confirmed we have two options:

WX9100 from neweggbusiness where we have existing terms. 16GB memory, just shy of $1500 shipped: https://www.neweggbusiness.com/product/product.aspx?item=9b-14-105-087

Vega RX 64 from newegg (not sure why business site shows it sold out). 8GB memory, Just shy of $750 shipped: https://www.newegg.com/Product/Product.aspx?Item=N82E16814930007&ignorebbr=1
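A rough cost-per-gigabyte comparison of the two options, using the approximate shipped prices quoted above ("just shy of" each figure, so treat the results as upper bounds); the helper name is made up for illustration:

```python
# Approximate shipped prices and memory sizes from this task.
OPTIONS = {
    "WX 9100": {"price_usd": 1500, "memory_gb": 16},
    "RX Vega 64": {"price_usd": 750, "memory_gb": 8},
}


def usd_per_gb(price_usd: float, memory_gb: int) -> float:
    """Price divided by on-card memory."""
    return price_usd / memory_gb


if __name__ == "__main__":
    for name, spec in OPTIONS.items():
        print(f"{name}: ${usd_per_gb(spec['price_usd'], spec['memory_gb']):.2f} per GB")
```

At these prices both cards come out at about $93.75 per GB, so the choice is mostly about how much to spend in absolute terms and whether the form factor fits.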

Poked a couple people in research who would be using the card about the difference between 8GB and 16GB memory:

09:00 <+dsaez> ebernhadson, in this case bigger is always better ... 
09:01 <+miriam_> ebernhardson hi! I don't think we *need* 16GB, it would be better, but 8 GB is acceptable, we will just use smaller batch size.

This says to me that the difference between 8GB and 16GB is primarily how much we want to spend on this test case, either will work.
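To illustrate the "smaller batch size" trade-off, here is a minimal sketch. The memory figures for the model (fixed overhead and per-sample cost) are purely hypothetical placeholders; real values depend on the model and framework and would have to be measured.

```python
def max_batch_size(gpu_mib: int, fixed_mib: int, per_sample_mib: int) -> int:
    """Largest batch that fits, given fixed overhead and per-sample memory."""
    usable = gpu_mib - fixed_mib
    return max(0, usable // per_sample_mib)


if __name__ == "__main__":
    # Hypothetical model footprint, for illustration only.
    FIXED_MIB = 2048        # weights, optimizer state, buffers
    PER_SAMPLE_MIB = 256    # activations for one training sample

    for label, gpu_mib in (("8GB", 8 * 1024), ("16GB", 16 * 1024)):
        print(label, max_batch_size(gpu_mib, FIXED_MIB, PER_SAMPLE_MIB))
```

Under these assumed numbers the 16GB card fits a batch a bit more than twice as large, because the fixed overhead is paid only once.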

https://phabricator.wikimedia.org/T216528#5027798 includes a picture of the rear of the case. This confirms a dual-slot card will fit into one side of the chassis. There will be no room for a second GPU in this configuration.

I recommend moving forward with the 16GB WX9100.

elukey changed the task status from Open to Stalled.Mar 28 2019, 9:21 AM

Pending hardware procurement in https://phabricator.wikimedia.org/T217668

As the hardware order is pending, and T219522 is set up for the installation, I'm reassigning this to @elukey as there is nothing more pending for me to do at this point in time.

I didn't want to just resolve this, since it's been acting as a tracking task.

Basically this can be resolved as soon as @elukey is happy with it. It can stay open until after the new hardware is installed if preferred.

Cmjohnson closed subtask Unknown Object (Task) as Resolved.Apr 1 2019, 6:02 PM