
New WDQS clusters eqiad + codfw
Closed, Resolved · Public

Description

As discussed in T178492, we want to create isolated WDQS clusters for internal/trusted traffic. These should replicate the existing WDQS clusters.

Specs (can be adapted to fit a more standard configuration):

  • 3 nodes in codfw + 3 nodes in eqiad = 6 nodes
  • CPU: 16 cores, hyperthreaded (currently 2x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz, 8 cores each)
  • 128 GB RAM
  • 800 GB of usable RAIDed SSD (2x 800 GB SSD, mirrored)

@EBjune can you confirm we have a budget for this?

Details for hw-request/procurement

This task lives in the public space, in #hw-requests, so quotes and totals cannot be posted directly in it. All price discussion has to take place in the procurement sub-tasks.

Since this is for two sites, there will be two procurement tasks (one per order). I (@RobH) will create them and link them from this task. I'll process only one of the two right away and get a working quote/specification from the vendor.

The last order of WDQS systems was in T166780; however, the quote from that task CANNOT be linked in this public task. Instead, I'll list the hardware of the previous systems here, and the sub-task for a codfw quote will follow with full pricing details:

wdqs100[56] are the most recently purchased WDQS systems.

  • Dell Poweredge R430
    • Note that the R430 has only 4 DIMM slots for CPU2 (12 slots in total, 8 of which serve CPU1 only). This config has 4 DIMMs in total, 2 of them on CPU2.
  • Dual Intel Xeon E5-2620 v4 (2.10 GHz, 8 cores each)
  • 128GB RAM
  • no hardware RAID, onboard SATA controller only
  • Dual Intel S3520 800 GB SSD
  • 1 Gbit (onboard) network card (4 ports)
  • DRAC Enterprise
  • dual PSU (3' C13/C14 power cables)
  • no bezel, no optical
  • rails without cable management
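As a quick cross-check of the figures above, the sketch below derives per-node and fleet totals from the spec. It assumes all DIMMs are the same size and that the two SSDs are mirrored (RAID 1; with no hardware RAID controller this would be software RAID). `NodeSpec`, `WDQS_NODE` and `fleet_totals` are illustrative names, not part of any existing tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Hypothetical model of one WDQS server, using the figures from this task."""
    sockets: int
    cores_per_socket: int
    threads_per_core: int  # 2 when Hyper-Threading is enabled
    ram_gb: int
    dimm_count: int
    ssd_size_gb: int  # size of each SSD in the mirror

    @property
    def physical_cores(self) -> int:
        return self.sockets * self.cores_per_socket

    @property
    def hardware_threads(self) -> int:
        return self.physical_cores * self.threads_per_core

    @property
    def dimm_size_gb(self) -> int:
        # Assumption: every DIMM has the same capacity.
        return self.ram_gb // self.dimm_count

    @property
    def usable_ssd_gb(self) -> int:
        # Assumption: RAID 1 mirror, so usable space equals one disk.
        return self.ssd_size_gb


# Dual E5-2620 v4 (8 cores each), 128 GB in 4 DIMMs, 2x 800 GB SSD.
WDQS_NODE = NodeSpec(sockets=2, cores_per_socket=8, threads_per_core=2,
                     ram_gb=128, dimm_count=4, ssd_size_gb=800)

# 3 nodes per site, as requested in the description.
NODES_PER_SITE = {"codfw": 3, "eqiad": 3}


def fleet_totals(node: NodeSpec, nodes_per_site: dict[str, int]) -> dict[str, int]:
    """Sum node count, physical cores and RAM across all sites."""
    nodes = sum(nodes_per_site.values())
    return {
        "nodes": nodes,
        "physical_cores": nodes * node.physical_cores,
        "ram_gb": nodes * node.ram_gb,
    }


if __name__ == "__main__":
    print(WDQS_NODE.physical_cores, WDQS_NODE.hardware_threads)  # 16 32
    print(WDQS_NODE.dimm_size_gb, WDQS_NODE.usable_ssd_gb)       # 32 800
    print(fleet_totals(WDQS_NODE, NODES_PER_SITE))
```

Under the equal-size assumption, the 4 DIMMs would be 32 GB modules, so the 2+2 split described above puts 64 GB behind each socket.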

All quotations will be on the sub-task for the codfw system purchase, T183201, NOT on this task. Please see sub-task T183201 for all follow-up. Once the codfw quotation is agreed upon, @RobH will create another sub-task for the eqiad purchase.

Event Timeline

RobH changed the task status from Open to Stalled. Dec 18 2017, 10:21 PM
RobH triaged this task as Medium priority.
RobH updated the task description.
RobH mentioned this in Unknown Object (Task). Dec 18 2017, 10:24 PM
RobH updated the task description.
RobH moved this task from Backlog to Stalled on the hardware-requests board.
RobH moved this task from Stalled to In Discussion / Review on the hardware-requests board.

Can you perhaps briefly explain how the specs compare to the existing WDQS clusters? Because I would assume that the internal clusters will see much less traffic than the external ones.

Sizing has a fairly large crystal ball component :) Still, here are a few of the rationales:

  • We have the same amount of data, so same SSD size (obvious)
  • We still want 2 CPUs because that's our standard, and wasting a socket usually does not make much sense (for the details of that reasoning, ask Rob; I barely understand what I'm talking about here)
  • We might be able to get away with less RAM, but given the current workload, where a single query can start allocating like crazy, it seems unlikely that we could allocate much less to the JVM (we need to size for peak consumption, not average). Also, less RAM => less cache, so probably higher per-query latency.
  • We are trying to standardize our server builds (easier management, spares, ...)

Okay, all of that sounds reasonable enough :) thank you!

Gehel added a subtask: Unknown Object (Task). Feb 20 2018, 9:37 AM
Gehel closed subtask Unknown Object (Task) as Resolved. Apr 6 2018, 9:25 AM
Gehel closed subtask Unknown Object (Task) as Resolved.
Smalyshev subscribed.

I think this is done now.