EQIAD: stat1002 replacement
Closed, Resolved (Public)

Assigned To

Authored By

	Ottomata
	Mar 7 2017, 3:33 PM

Description

Site/Location: EQIAD
Number of systems: 1
Service: stat*
Networking Requirements: internal IP, in Analytics VLAN
Processor Requirements: 16+ cores
Memory: 64G
Disks: 4 x 3+TB software RAID 5
Partitioning Scheme: 30G /, 100G /tmp, rest of space for /srv (I plan on putting /home in the /srv partition (via symlink?))
Other Requirements: GPU: See T148843

This ticket tracks replacement of the OOW stat1002.

GPU

In T148843, researchers asked for access to a box with a powerful GPU in the Analytics cluster. We explored the option of installing a GPU in an existing box, but decided to wait until we replaced stat1002 and stat1003, so that we could order a chassis that would fit the GPU they want.

@Halfak and @DarTar, please work with @RobH to select the GPU you want, and to find a chassis that will fit it.

Disks

stat1002 currently has a 10TB /dev/sda. I'm not sure of the exact layout here (hardware RAID?), but I think we don't actually need so much space anymore. Previously, stat1002 was the main log storage host. This functionality has mostly been moved to Hadoop. Folks now use stat1002 (and stat1003) compute resources for number crunching on local files, but they shouldn't be doing crunching on huge historical log data there anymore. We still need a large working space, but we don't need 10TB. Let's provision this in such a way that we have the most space we can, but with a fault tolerance of at least one disk. RAID 5?

4 x 3+TB disks in software RAID 5 should be fine (this is the layout we have in stat1003 now).
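As a rough check of the proposed layout, here is a minimal sketch of the arithmetic. It assumes exactly 3 TB per disk, decimal units (1 TB = 1000 GB), and no filesystem or md metadata overhead; the function name is illustrative and not from this ticket. RAID 5 spends one disk's worth of space on parity, so n disks give (n - 1) disks of usable space and tolerate one disk failure.

```python
def raid5_usable_tb(num_disks: int, disk_tb: float) -> float:
    """Usable capacity of a RAID 5 array: one disk's worth goes to parity."""
    if num_disks < 3:
        raise ValueError("RAID 5 needs at least 3 disks")
    return (num_disks - 1) * disk_tb


usable_tb = raid5_usable_tb(4, 3.0)

# Partitioning scheme from the description: 30G /, 100G /tmp, rest for /srv.
# Decimal units are an assumption made purely for illustration.
root_gb, tmp_gb = 30, 100
srv_gb = usable_tb * 1000 - root_gb - tmp_gb

print(f"usable: {usable_tb} TB, /srv: about {srv_gb / 1000:.2f} TB")
```

Under these assumptions the array gives roughly 9 TB usable, leaving a little under 9 TB for /srv after / and /tmp.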

Related Objects

Status	Assigned	Task
Open	Miriam	T215413 Image Classification Research and Development
Resolved	elukey	T148843 Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models
Resolved	None	T157807 Reinstall Analytics Hadoop Cluster with Debian Jessie
Declined	Ottomata	T159392 Update jq to v1.4.0 or higher
Resolved	Ottomata	T152712 Replacement of stat1002 and stat1003
Resolved	RobH	T159838 EQIAD: stat1002 replacement
		Unknown Object (Task)
Resolved	Ottomata	T165368 rack/setup/install replacement to stat1005 (stat1002 replacement)
Declined	None	T151904 User limits for stat machines. Limit space on /home dir and possibly /tmp

Event Timeline

Ottomata created this task. Mar 7 2017, 3:33 PM

Ottomata added a parent task: T148843: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models.

Ottomata mentioned this in T151080: check stat1004 (or another identical R430) for PCIe expansion space.

Ottomata updated the task description. Mar 7 2017, 3:39 PM

Ottomata updated the task description.

Ottomata triaged this task as Medium priority. Mar 7 2017, 3:41 PM

Ottomata mentioned this in T159392: Update jq to v1.4.0 or higher. Mar 7 2017, 3:49 PM

Ottomata added a parent task: T159392: Update jq to v1.4.0 or higher.

Ottomata moved this task from Backlog to Q4 2019/2020 on the Analytics-Clusters board. Mar 7 2017, 3:52 PM

Nuria raised the priority of this task from Medium to High. Mar 13 2017, 3:51 PM

Nuria moved this task from Incoming to Wikistats on the Analytics board.

Nuria added a parent task: T152712: Replacement of stat1002 and stat1003. Mar 13 2017, 7:37 PM

Bump @RobH

There are not going to be a lot of server chassis that can accommodate an off-the-shelf GPU card. Additionally, an off-the-shelf card would not have any kind of warranty support included with the system chassis.

I'd recommend we buy the GPU included with the system, which limits our availability in GPU models. We presently purchase Dell and HP systems.

The Dell offerings only have the Dell PowerEdge R730 as a system we would typically want to run in operations. We already use the R730xd for other analytics servers; the R730 is a modified version of that chassis, with fewer disk bays being the most significant change. The reduction in disk bays allows for longer expansion card bays, so they can accommodate these cards:

http://www.dell.com/learn/us/en/04/campaigns/poweredge-gpu

All of the available Dell GPU options are there. I'd like to get some feedback from @Halfak & @DarTar on whether any of those offerings are preferred (I'd assume Nvidia).

I've sent an email off to Dasher to have them provide me a detailed breakdown of the GPU offerings in the DL system line.

Edit Addition: I should include some background info for why I've put in the above criteria:

We want to stick with Dell or HP, since those are the system vendors we presently deploy in our datacenters.
We want to stick with the same server line within a vendor if possible. That means the PowerEdge R line with Dell, and the DL line with HP. Shifting lines often causes some growing pains (Dell C2100 fiasco), so it's best to avoid it when possible.

Once we get the GPU options hammered down, all the rest of the specs are easy in comparison.

Please also note a concern was raised about the driver support of these GPU options:

In T148843#3075519, @Ladsgroup wrote:

Regarding GPU options. I just want to note that their drivers are proprietary software and not open source (or partially open source with very limited options for HPC) except Intel® Xeon Phi™ coprocessor 3120P, but the number of cores in that GPU is smaller than in the other ones. Wondering if we can buy two and use both in one machine for extreme computing power.

This is a relevant concern.

Note that the Nvidia OpenCL drivers are closed-source (as are the other parts of the Nvidia drivers). Not sure about AMD, but they've become far more FOSS-friendly in recent years.

RobH mentioned this in T159839: EQIAD: stat1003 replacement. Mar 16 2017, 12:08 AM

What's the status of this (and stat1003 too)? Who is this waiting on? Is it @Ottomata or @RobH?

stat1003 ticket is T159839. This stat1002 replacement ticket is waiting on feedback from @DarTar and @Halfak about acceptable GPU specs.

OK, I did some digging. My assessment is that the Nvidia Tesla K80 is most desirable, but the closed source drivers will be a problem. The AMD FirePro S9150 is a possibly good alternative, but support for AMD GPUs is "experimental" and I can't find any clear evidence that it actually *works* with TensorFlow.

Notes:

I see the Nvidia Tesla K80 listed at http://www.dell.com/learn/us/en/04/campaigns/poweredge-gpu. This is one of the recommended GPUs for TensorFlow -- one of the ML technologies that we are targeting.

We're going to have issues with Nvidia's drivers because they are closed source. Not sure if I could say if that's a blocker or just something we're going to feel bad about. It seems that if Nvidia's drivers were not a problem, we'd definitely want to go Nvidia.

Nvidia seems to be the top recommendation.

http://timdettmers.com/2017/03/19/which-gpu-for-deep-learning/ "The thing is that the GPU computing or GPGPU community is very large for CUDA and rather small for OpenCL."
http://stackoverflow.com/questions/40000518/is-intel-based-graphic-card-compatible-with-tensorflow-gpu "No"

Still, there's an experimental port of TensorFlow that supports OpenCL and therefore AMD drivers.

https://github.com/benoitsteiner/tensorflow-opencl Purports to allow basic support in an opencl environment.
- "TensorFlow can only take advantage of accelerators that support OpenCL 1.2. Supported accelerators include but are not limited to: AMD Fiji, AMD Hawaii"
Note that AMD FirePro S9150 supports "OpenCL 1.2 conformance expected" and uses the "Hawaii" chip architecture.

I'd say the closed source driver is a blocker. We've blocked the use of some PCIe flash memory cards due to closed source driver usage in years past, so I'd assume it's still a blocker.

I'd ask @MoritzMuehlenhoff or @faidon to comment perhaps? I'd work with the assumption that closed source is not allowed.

I'll plan on getting some quotes generated with the Dell and HP options, with the more open source friendly GPU options to start. If the closed source is approved, then I'll get the quotes updated/appended to include those as needed.

Leaving our FLOSS policy aside, the proprietary Nvidia driver is going to be a significant problem in terms of both:

- maintainability (we tend to run more recent kernels on top of our base distributions, and while the Nvidia drivers are provided in Debian non-free, they are packaged for the kernel in Debian stable). Plus, Debian stretch will provide signed kernel images by default, and I have no idea whether that's compatible with external driver modules.
- security (it's a multi-megabyte blob of closed source kernel code, over which we don't have any control)

How about Beignet, the open source drivers for Intel cards, would that work for your use case?
https://www.freedesktop.org/wiki/Software/Beignet/
https://packages.qa.debian.org/b/beignet.html
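On the maintainability point about external driver modules: Linux records the loading of proprietary, out-of-tree, or unsigned modules as taint bits in `/proc/sys/kernel/tainted`. The following is a minimal sketch of decoding the module-related bits, assuming the bit positions from the kernel's tainted-kernels documentation; the helper names are illustrative, and only three of the documented flags are covered.

```python
# Selected kernel taint bits (module-related only). Bit positions follow the
# Linux kernel's tainted-kernels documentation; other flags are omitted.
TAINT_FLAGS = {
    0: "P: proprietary module was loaded",
    12: "O: externally built (out-of-tree) module was loaded",
    13: "E: unsigned module was loaded",
}


def decode_taint(value: int) -> list[str]:
    """Return descriptions for the module-related taint bits set in value."""
    return [desc for bit, desc in TAINT_FLAGS.items() if value & (1 << bit)]


def read_taint(path: str = "/proc/sys/kernel/tainted") -> int:
    """Read the current taint bitmask (Linux only; not called here)."""
    with open(path) as f:
        return int(f.read().strip())


# Example with a hypothetical bitmask: bits 0 and 12 set.
print(decode_taint((1 << 0) | (1 << 12)))
```

On a Linux host, `decode_taint(read_taint())` would report the flags for the running kernel; an empty list means none of these three bits is set.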

revi subscribed. Mar 22 2017, 3:55 PM

leila subscribed. Mar 22 2017, 4:52 PM

Halfak added a subscriber: dr0ptp4kt. Mar 22 2017, 4:54 PM

Yeah, we had a similar conversation over email with Adam (@dr0ptp4kt) who was also inquiring about TensorFlow. I had the same considerations that were mentioned above about NVidia binary drivers and CUDA: basically that it's both against our FL/OSS policy and also has issues with maintenance/tainting the kernel and security.

Since there is an alternative (the AMD GPU), and it doesn't sound like it would be impossible to use or a ton of extra effort, and it would probably save us quite a bit of extra effort and risk on the infrastructure side, I think a) we owe it to ourselves to stick to our principles and try using hardware that has free libre drivers and b) we owe it to the FL/OSS community at large to report back and support the TensorFlow OpenCL efforts (if not us, then who?). If it turns out that it's impossible to use AMD for this, we can always step back, reconsider, and replace it with the NVidia card. It's a single GPU AIUI and the server could remain the same, so the extra cost wouldn't be impossible to bear and at that point, we would know that we gave the better option our best.

@Halfak/@dr0ptp4kt, what do you think? Do you feel comfortable with the plan above and with committing to the potential extra effort of using the AMD GPU?

I've emailed with a contact re: OpenCL support for the Nvidia Tesla P100 (presumably facilitated by mainstreaming of OpenCL), and also shared this ticket with the mention of AMD FirePro S9150 with OpenCL (or absent it if that's a possibility without encumbered drivers).

To be clear, it's likely that using AMD/OpenCL could involve "a ton of extra effort", but I don't see a good alternative given what @MoritzMuehlenhoff and @faidon have said. And hey, maybe we'll get lucky! I think we should go in that direction. It seems that support for OpenCL in TensorFlow is growing fast, if it's not already going to be nice and easy for us.

@faidon, Aaron pointed out that I should have a look at this conversation, given that some of the work we may do with TensorFlow in the coming year may be affected by the decision on this ticket. Your proposal makes sense to me and I'm happy to give it a try.

OK, thank you all for the quick response. That sounds like a plan. @RobH, could you work with our vendors to get quotes with the AMD FirePro S9150 that @Halfak mentioned? Anything else that you need?

Thanks all! :)

RobH moved this task from Backlog to In Discussion / Review on the hardware-requests board. Mar 24 2017, 4:05 PM

Oh, I failed to see the entire RAID 5 comment in the initial request until now. RAID 5 is horrible for writes, and I'm pretty sure only analytics uses it in our cluster. We had this discussion about RAID on T159839, so I'd like to also attempt RAID 10 on this.

@Ottomata: what is the base requirement for storage? You state you don't need 10TB, and that 4 x 3TB in RAID 5 is fine, but you never state an actual minimum space requirement.

Assigning back to you for feedback. Once provided, please assign back to me for quotation.

Please advise.

@RobH, see the section under Disks in the task description. The storage requirements aren't so strict, but as usual, more is better. stat1002 is using ~5T of space. We should give ourselves a little room to grow. I'd say, 8T of available space would be plenty.
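To make the RAID 5 versus RAID 10 trade-off concrete against the 8T figure, here is a minimal sketch. It assumes classic two-way-mirrored RAID 10 with an even disk count and ignores filesystem and metadata overhead; the 4 TB disk size and the six-disk layout are hypothetical comparison points, not options quoted in this ticket.

```python
def usable_tb(level: str, num_disks: int, disk_tb: float) -> float:
    """Usable capacity for the two RAID levels discussed in this ticket."""
    if level == "raid5":
        if num_disks < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        return (num_disks - 1) * disk_tb  # one disk's worth of parity
    if level == "raid10":
        # Simplifying assumption: two-way mirrors, even number of disks.
        if num_disks < 4 or num_disks % 2:
            raise ValueError("RAID 10 needs an even number of disks, at least 4")
        return (num_disks // 2) * disk_tb  # half the disks hold mirror copies
    raise ValueError(f"unsupported level: {level}")


TARGET_TB = 8.0  # "8T of available space would be plenty"

# (level, disk count, disk size in TB); the last two rows are hypothetical.
layouts = [("raid5", 4, 3.0), ("raid10", 4, 3.0), ("raid10", 4, 4.0), ("raid10", 6, 3.0)]
results = {layout: usable_tb(*layout) >= TARGET_TB for layout in layouts}

for (level, n, size), ok in results.items():
    print(f"{level} {n} x {size} TB: {usable_tb(level, n, size)} TB usable, meets target: {ok}")
```

Under these assumptions, 4 x 3 TB meets the 8T target in RAID 5 but not in RAID 10, so a RAID 10 layout would need larger or more disks.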

faidon reassigned this task from Ottomata to RobH. Mar 27 2017, 1:03 PM

@Ottomata @RobH this seems to have stalled somewhere between you two. Could you guys figure this and T159839 out this week? Thanks!

In T159838#3171078, @faidon wrote:

@Ottomata @RobH this seems to have stalled somewhere between you two. Could you guys figure this and T159839 out this week? Thanks!

It stalled out with me, I'll get the quote requests off for this right now.

We've already gotten quotes back for T159839 and I've just escalated it up for reviews.

RobH mentioned this in Unknown Object (Task). Apr 11 2017, 3:15 PM

RobH created subtask Unknown Object (Task).

RobH moved this task from In Discussion / Review to Allocation/Ordering/Implementation on the hardware-requests board. Apr 11 2017, 3:19 PM

Halfak mentioned this in T148843: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models. Apr 17 2017, 6:02 PM

@Halfak, @leila: we've got some quotes back from vendors, and have a choice between the AMD FirePro S9150 and the AMD FirePro S7150. The way the numbers and specs line up, I'm inclined to go with a quote that includes the S9150. I just want to check with y'all to be sure that, given these 2 options, you don't actually prefer the S7150 over the S9150. Does the S9150 sound good to you?

@Ottomata Bob and I reviewed and we are happy with your choice.

Great, thanks!

faidon closed subtask Unknown Object (Task) as Resolved. May 9 2017, 3:54 PM

This has been ordered and is being received on the linked procurement task, with setup tracked on T165368. As such, this hardware request is resolved.

Smalyshev subscribed. May 16 2017, 3:35 PM

Aklapper removed a project: Analytics. Jul 4 2020, 7:59 AM
