
Hadoop Hardware Orders FY2019-2020
Closed, Resolved (Public)

Description

This is a parent ticket meant to group together the hardware orders for the Hadoop cluster for FY2019-2020 and FY2020-2021.

  • 4 x Hadoop test cluster: {T242148}
  • 6 x Hadoop GPU workers: {T242147} (Blocked until T242149 is resolved, since we need to buy/test a host with the new chassis with extra space)
  • 16 x Hadoop worker expired refresh analytics1042-1057 {T246784} *

(*) We already have these hosts, but there are problems with their disks; we need to replace/add disks.

See also the Analytics Hardware Spreadsheet

Event Timeline

Cross posting in here too:

Just to confirm the number of 10G vs 1G NICs:

elukey@cumin1001:~$ sudo cumin 'A:hadoop-worker' 'cat /sys/class/net/*/speed 2>&1 | grep -v "Invalid"'
54 hosts will be targeted:
an-worker[1078-1095].eqiad.wmnet,analytics[1042-1077].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(34) analytics[1042-1075].eqiad.wmnet
----- OUTPUT of 'cat /sys/class/n...rep -v "Invalid"' -----
1000
===== NODE GROUP =====
(20) an-worker[1078-1095].eqiad.wmnet,analytics[1076-1077].eqiad.wmnet
----- OUTPUT of 'cat /sys/class/n...rep -v "Invalid"' -----
10000

analytics[1042-1075].eqiad.wmnet are the hosts running at 1G, i.e. the old ones that we haven't replaced yet. What about analytics[1069-1075]? They seem to be a later expansion that should not be refreshed before 2022, which means that eventually those hosts will run at 1G in a cluster that is otherwise all 10G. Not a big deal, but let's keep it in mind.

In any case, it seems that we'd need to move a lot of hosts to 10G, and we might not have space in the racks for this move. The grand total is 34 nodes (16 + 12 + 6), plus the test cluster.
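
For reference, a quick sanity check of those group sizes with a throwaway Python snippet (the expand() helper below is illustrative and only handles the simple bracket ranges shown above; it is not cumin's own node-set parser):

import re

def expand(nodeset):
    """Expand simple cumin-style ranges like 'analytics[1042-1075].eqiad.wmnet'."""
    hosts = []
    for part in nodeset.split(","):
        m = re.match(r"(\D+)\[(\d+)-(\d+)\](.*)", part)
        if m:
            prefix, start, end, suffix = m.groups()
            hosts += [f"{prefix}{n}{suffix}" for n in range(int(start), int(end) + 1)]
        else:
            hosts.append(part)
    return hosts

one_gig = expand("analytics[1042-1075].eqiad.wmnet")
ten_gig = expand("an-worker[1078-1095].eqiad.wmnet,analytics[1076-1077].eqiad.wmnet")
print(len(one_gig), len(ten_gig))  # 34 hosts at 1G, 20 hosts at 10G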

fdans moved this task from Incoming to Operational Excellence on the Analytics board.

To recap some of the figures I just posted in IRC about this:

In eqiad 10G racks, we have the following port totals using SFP-T (and thus using 1G in a 10G rack): row A: 64, row B: 33, row C: 71, row D: 0. We have a total of 168 ports using SFP-T out of 576 10G ports available overall (48 ports per switch * 3 switches per rack * 4 racks).
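
(For what it's worth, re-doing that arithmetic in a couple of lines of Python:)

sfp_t_per_row = {"A": 64, "B": 33, "C": 71, "D": 0}
total_10g_ports = 48 * 3 * 4           # 48 ports per switch * 3 switches per rack * 4 racks, as above
print(sum(sfp_t_per_row.values()))     # 168 ports currently on SFP-T
print(total_10g_ports)                 # 576 10G ports available overall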

Is the question whether the new 36 hosts can be 10G in eqiad? If so, it seems we have the space overall, but perhaps not within certain rows. Can the new nodes' redundancy layout be determined in advance of the decision, to give us more info to work with?

I.e.: row B in eqiad is very full due to WMCS, so adding more 10G items there is painful.

What matters most for us in terms of row placement is an even-ish spread. Hm, 10 of the nodes we are replacing are in Row B. We also currently only have 9 hosts in row C anyway, so perhaps we can deal with not adding any of these new nodes to row B, as long as we can evenly spread them elsewhere (with a slight preference to add a few extra to row C).

@elukey ...apparently we are not refreshing the leased hardware this year? SRE is just buying the leased servers, and then we will refresh them in two years!

FYI @RobH, we will only need 22 10G ports in eqiad now, since we are not refreshing the leased nodes this year :)

@wiki_willy We'd like to place these orders soon. I know there are issues with rackspace, etc. Can we plan out exactly what we need to do and get quotes and place orders?

We'll need 22 10G ports, spread as evenly as possible across rows, though the spread doesn't have to be perfectly even.

Hi @Ottomata - I could be mistaken, but I thought things were currently pending on getting the GPUs tested out first via T242149 and T238587, before proceeding with the remaining servers (let me know though if I'm off). And to your point, there are definitely some rack space issues on 10g, which I've been looking into alleviating by getting budget to add more 10g switches and/or rack space inside our data center cage. By the time the GPU is tested out in its new chassis, I'm hopeful the 10g rack space issues will be in better shape. Hope this helps, but let me know if there are additional questions/concerns. Thanks, Willy

The 6 x Hadoop GPU workers HW is blocked on the GPU testing, but the other HW is not!

We'd like to order these soon and begin a Presto + Hadoop Worker colocation project on the new nodes, before we do it on the existing ones.

So I guess, can we order the '16 x Hadoop worker expired refresh analytics1042-1057' soon?

I know @elukey also wants to decom the current hadoop test cluster nodes to make some room in racks, but I think we also need the 4 x Hadoop test cluster nodes before that can happen? Not sure what the timeline is on that.

I can make an official procurement request for the 16 x Hadoop workers if so.

@Ottomata - gotcha, we should be able to proceed forward then with the 16x nodes for the refresh. Go ahead and open a procurement request for them. Thanks, Willy

Ottomata added a subtask: Unknown Object (Task).Mar 3 2020, 4:00 PM
Ottomata updated the task description. (Show Details)
RobH mentioned this in Unknown Object (Task).Mar 5 2020, 5:23 PM

I had a chat with Willy about racking requirements of the new hosts (16 refreshed + GPU ones). We currently have this configuration:

ROW A: (16) an-worker[1078-1082].eqiad.wmnet,analytics[1052-1060,1070-1071].eqiad.wmnet

ROW B: (16) an-worker[1083-1087].eqiad.wmnet,analytics[1046-1051,1061-1063,1072-1073].eqiad.wmnet

ROW C: (9) an-worker[1088-1091].eqiad.wmnet,analytics[1064-1066,1074-1075].eqiad.wmnet

ROW D: (13) an-worker[1092-1095].eqiad.wmnet,analytics[1042-1045,1067-1069,1076-1077].eqiad.wmnet

The ones to refresh are:

ROW A: (6) analytics[1052-1057].eqiad.wmnet

ROW B: (6) analytics[1046-1051].eqiad.wmnet

ROW D: (4) analytics[1042-1045].eqiad.wmnet

I think that the quickest way to visualize how to place the new hosts is to remove the ones to be refreshed from the picture:

ROW A: (10) an-worker[1078-1082].eqiad.wmnet,analytics[1058-1060,1070-1071].eqiad.wmnet

ROW B: (10) an-worker[1083-1087].eqiad.wmnet,analytics[1061-1063,1072-1073].eqiad.wmnet

ROW C: (9) an-worker[1088-1091].eqiad.wmnet,analytics[1064-1066,1074-1075].eqiad.wmnet

ROW D: (9) an-worker[1092-1095].eqiad.wmnet,analytics[1067-1069,1076-1077].eqiad.wmnet

The problem now is where to put the 16 + 6 nodes while keeping everything as balanced as possible (all of them will need 10G ports).
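
As a rough illustration of the balancing (not an actual plan), here is a minimal greedy sketch in Python that spreads 22 new 10G nodes across rows A, C and D only, starting from the post-refresh counts above; the figure of 22 and the exclusion of row B come from the earlier comments, everything else is just for illustration:

# Worker counts per row once the to-be-refreshed hosts are removed (see above);
# row B is left out because of the 10G port shortage there.
current = {"A": 10, "C": 9, "D": 9}

def place(new_nodes, counts):
    """Greedily assign each new node to the eligible row with the fewest workers."""
    counts = dict(counts)
    placement = {row: 0 for row in counts}
    for _ in range(new_nodes):
        row = min(counts, key=counts.get)
        counts[row] += 1
        placement[row] += 1
    return placement, counts

placement, totals = place(22, current)
print(placement)  # {'A': 7, 'C': 8, 'D': 7} new nodes per row
print(totals)     # rows end up at {'A': 17, 'C': 17, 'D': 16} workers

This also happens to put the one extra node in row C, which matches the preference expressed earlier in the thread.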

Thanks for the sync up today @elukey. Just to recap here for later reference - we're running short on 10g ports in row B, so these will need to be balanced out primarily across rows A, C, and D, with the following rack space and 10g ports currently available for new hardware (this could fluctuate though between now and the time of install):

Rack - free space - free 10G ports
A4 - 10U - 15 ports
A7 - 4U - 4 ports
C2 - 0U - 8 ports
C4 - 11U - 18 ports
C7 - 8U - 13 ports
D2 - 3U - 3 ports
D7 - 11U - 9 ports

Thanks,
Willy
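
As a quick cross-check that the 22 hosts fit into the racks listed above, here is a small Python sketch; it assumes one rack unit and one 10G port per worker, which is my assumption and not something stated in the thread:

# rack: (free rack units, free 10G ports), from the list above
racks = {
    "A4": (10, 15), "A7": (4, 4),
    "C2": (0, 8),   "C4": (11, 18), "C7": (8, 13),
    "D2": (3, 3),   "D7": (11, 9),
}

per_row = {}
for rack, (units, ports) in racks.items():
    # a host needs both a free unit and a free port (1U per host assumed)
    per_row[rack[0]] = per_row.get(rack[0], 0) + min(units, ports)

print(per_row)                # {'A': 14, 'C': 19, 'D': 12}
print(sum(per_row.values()))  # 45 usable slots for the 22 hosts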

Jclark-ctr closed subtask Unknown Object (Task) as Resolved.Jul 21 2020, 10:20 PM
elukey added a subtask: Unknown Object (Task).Jul 29 2020, 6:35 AM

14 x Hadoop workers - expansion for extra retention

FYI, these are part of the FY2020-2021 orders, not FY2019-2020.

elukey renamed this task from Hadoop Hardware Orders FY2019-2020 to Hadoop Hardware Orders FY2019-2020 / FY2020-2021.Aug 4 2020, 3:05 PM
elukey updated the task description. (Show Details)
elukey renamed this task from Hadoop Hardware Orders FY2019-2020 / FY2020-2021 to Hadoop Hardware Orders FY2019-2020.Aug 4 2020, 3:11 PM
elukey updated the task description. (Show Details)
Ottomata renamed this task from Hadoop Hardware Orders FY2019-2020 to Hadoop Hardware Orders FY2019-2020 / FY2020-2021.Aug 4 2020, 3:11 PM
Ottomata updated the task description. (Show Details)
elukey renamed this task from Hadoop Hardware Orders FY2019-2020 / FY2020-2021 to Hadoop Hardware Orders FY2019-2020.Aug 4 2020, 3:13 PM
elukey updated the task description. (Show Details)