
Hadoop Hardware Orders FY2019-2020
Closed, Resolved (Public)

Description

This is a parent ticket meant to group together the hardware orders for the Hadoop cluster for FY2019-2020 and FY2020-2021.

  • 4 x Hadoop test cluster: {T242148}
  • 6 x Hadoop GPU workers: {T242147} (Blocked until T242149 is resolved, since we need to buy/test a host with the new chassis with extra space)
  • 16 x Hadoop worker expired refresh analytics1042-1057 {T246784} *

(*) We already have these hosts, but there are problems with their disks; we need to replace/add disks.

See also the Analytics Hardware Spreadsheet

Event Timeline

Cross posting in here too:

Just to confirm the number of 10G vs 1G NICs:

elukey@cumin1001:~$ sudo cumin 'A:hadoop-worker' 'cat /sys/class/net/*/speed 2>&1 | grep -v "Invalid"'
54 hosts will be targeted:
an-worker[1078-1095].eqiad.wmnet,analytics[1042-1077].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(34) analytics[1042-1075].eqiad.wmnet
----- OUTPUT of 'cat /sys/class/n...rep -v "Invalid"' -----
1000
===== NODE GROUP =====
(20) an-worker[1078-1095].eqiad.wmnet,analytics[1076-1077].eqiad.wmnet
----- OUTPUT of 'cat /sys/class/n...rep -v "Invalid"' -----
10000

analytics[1042-1075].eqiad.wmnet are the hosts running at 1G, i.e. the old ones that we haven't replaced yet. What about analytics[1069-1075]? They seem to be a later expansion that should not be refreshed before 2022, which means that eventually those hosts will run at 1G in a cluster that is otherwise all 10G. Not a big deal, but let's keep it in mind.

In any case, it seems that we'd need to move a lot of hosts to 10G, and we might not have space in the racks for this move. The grand total is 34 nodes (16 + 12 + 6), plus the test cluster.
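
For reference, a quick sanity check of those group sizes with a throwaway Python snippet (the expand() helper below is illustrative and only handles the simple bracket ranges shown above; it is not cumin's own node-set parser):

import re

def expand(nodeset):
    """Expand simple cumin-style ranges like 'analytics[1042-1075].eqiad.wmnet'."""
    hosts = []
    for part in nodeset.split(","):
        m = re.match(r"(\D+)\[(\d+)-(\d+)\](.*)", part)
        if m:
            prefix, start, end, suffix = m.groups()
            hosts += [f"{prefix}{n}{suffix}" for n in range(int(start), int(end) + 1)]
        else:
            hosts.append(part)
    return hosts

one_gig = expand("analytics[1042-1075].eqiad.wmnet")
ten_gig = expand("an-worker[1078-1095].eqiad.wmnet,analytics[1076-1077].eqiad.wmnet")
print(len(one_gig), len(ten_gig))  # 34 hosts at 1G, 20 hosts at 10G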

fdans moved this task from Incoming to Operational Excellence on the Analytics board.

To recap some of the figures I just posted in IRC about this:

In eqiad 10G racks, we have the following port totals using SFP-T (and thus using 1G in a 10G rack): row A: 64, row B: 33, row C: 71, row D: 0. We have a total of 168 ports using SFP-T out of 576 10G ports available overall (48 ports per switch * 3 switches per rack * 4 racks).
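
(For what it's worth, re-doing that arithmetic in a couple of lines of Python:)

sfp_t_per_row = {"A": 64, "B": 33, "C": 71, "D": 0}
total_10g_ports = 48 * 3 * 4           # 48 ports per switch * 3 switches per rack * 4 racks, as above
print(sum(sfp_t_per_row.values()))     # 168 ports currently on SFP-T
print(total_10g_ports)                 # 576 10G ports available overall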

Is the question whether the new 36 hosts can be 10G in eqiad? If so, it seems we have the space overall, but perhaps not within certain rows. Can the new nodes' redundancy layout be determined in advance of the decision, to give us more info to work with?

I.e.: row B in eqiad is very full due to WMCS, so adding more 10G items there is painful.

What matters most for us in terms of row placement is an even-ish spread. Hm, 10 of the nodes we are replacing are in Row B. We also currently only have 9 hosts in row C anyway, so perhaps we can deal with not adding any of these new nodes to row B, as long as we can evenly spread them elsewhere (with a slight preference to add a few extra to row C).

@elukey ...apparently we are not refreshing the leased hardware this year? SRE is just buying the leased servers, and then we will refresh them in two years!

FYI @RobH, we will only need 22 10G ports in eqiad now, since we are not refreshing the leased nodes this year :)

@wiki_willy We'd like to place these orders soon. I know there are issues with rackspace, etc. Can we plan out exactly what we need to do and get quotes and place orders?

We'll need 22 10G ports, spread as evenly as possible across rows, though the spread doesn't have to be perfectly even.

Hi @Ottomata - I could be mistaken, but I thought things were currently pending on getting the GPUs tested out first via T242149 and T238587, before proceeding with the remaining servers (let me know though if I'm off). And to your point, there are definitely some rack space issues on 10g, which I've been looking into alleviating by getting budget to add more 10g switches and/or rack space inside our data center cage. By the time the GPU is tested out in its new chassis, I'm hopeful the 10g rack space issues will be in better shape. Hope this helps, but let me know if there are additional questions/concerns. Thanks, Willy

The 6 x Hadoop GPU workers HW is blocked on the GPU testing, but the other HW is not!

We'd like to order these soon and begin a Presto + Hadoop Worker colocation project on the new nodes, before we do it on the existing ones.

So I guess, can we order the '16 x Hadoop worker expired refresh analytics1042-1057' soon?

I know @elukey also wants to decom the current hadoop test cluster nodes to make some room in racks, but I think we also need the 4 x Hadoop test cluster nodes before that can happen? Not sure what the timeline is on that.

I can make an official procurement request for the 16 x Hadoop workers if so.

@Ottomata - gotcha, we should be able to proceed forward then with the 16x nodes for the refresh. Go ahead and open a procurement request for them. Thanks, Willy

Ottomata added a subtask: Unknown Object (Task).Mar 3 2020, 4:00 PM
Ottomata updated the task description. (Show Details)
RobH mentioned this in Unknown Object (Task).Mar 5 2020, 5:23 PM

I had a chat with Willy about racking requirements of the new hosts (16 refreshed + GPU ones). We currently have this configuration:

ROW A: (16) an-worker[1078-1082].eqiad.wmnet,analytics[1052-1060,1070-1071].eqiad.wmnet

ROW B: (16) an-worker[1083-1087].eqiad.wmnet,analytics[1046-1051,1061-1063,1072-1073].eqiad.wmnet

ROW C: (9) an-worker[1088-1091].eqiad.wmnet,analytics[1064-1066,1074-1075].eqiad.wmnet

ROW D: (13) an-worker[1092-1095].eqiad.wmnet,analytics[1042-1045,1067-1069,1076-1077].eqiad.wmnet

The ones to refresh are:

ROW A: (6) analytics[1052-1057].eqiad.wmnet

ROW B: (6) analytics[1046-1051].eqiad.wmnet

ROW D: (4) analytics[1042-1045].eqiad.wmnet

I think that the quickest way to visualize how to place the new hosts is to remove the ones to be refreshed from the picture:

ROW A: (10) an-worker[1078-1082].eqiad.wmnet,analytics[1058-1060,1070-1071].eqiad.wmnet

ROW B: (10) an-worker[1083-1087].eqiad.wmnet,analytics[1061-1063,1072-1073].eqiad.wmnet

ROW C: (9) an-worker[1088-1091].eqiad.wmnet,analytics[1064-1066,1074-1075].eqiad.wmnet

ROW D: (9) an-worker[1092-1095].eqiad.wmnet,analytics[1067-1069,1076-1077].eqiad.wmnet

The problem now is where to put the 16 + 6 nodes while keeping everything as balanced as possible (all of them will need 10G ports).
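
As a rough illustration of the balancing (not an actual plan), here is a minimal greedy sketch in Python that spreads 22 new 10G nodes across rows A, C and D only, starting from the post-refresh counts above; the figure of 22 and the exclusion of row B come from the earlier comments, everything else is just for illustration:

# Worker counts per row once the to-be-refreshed hosts are removed (see above);
# row B is left out because of the 10G port shortage there.
current = {"A": 10, "C": 9, "D": 9}

def place(new_nodes, counts):
    """Greedily assign each new node to the eligible row with the fewest workers."""
    counts = dict(counts)
    placement = {row: 0 for row in counts}
    for _ in range(new_nodes):
        row = min(counts, key=counts.get)
        counts[row] += 1
        placement[row] += 1
    return placement, counts

placement, totals = place(22, current)
print(placement)  # {'A': 7, 'C': 8, 'D': 7} new nodes per row
print(totals)     # rows end up at {'A': 17, 'C': 17, 'D': 16} workers

This also happens to put the one extra node in row C, which matches the preference expressed earlier in the thread.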

Thanks for the sync up today @elukey. Just to recap here for later reference - we're running short on 10g ports in row B, so these will need to be balanced out primarily across rows A, C, and D, with the following rack space and 10g ports currently available for new hardware (this could fluctuate though between now and the time of install):

Rack - free space - free 10G ports
A4 - 10U - 15 ports
A7 - 4U - 4 ports
C2 - 0U - 8 ports
C4 - 11U - 18 ports
C7 - 8U - 13 ports
D2 - 3U - 3 ports
D7 - 11U - 9 ports

Thanks,
Willy
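
As a quick cross-check that the 22 hosts fit into the racks listed above, here is a small Python sketch; it assumes one rack unit and one 10G port per worker, which is my assumption and not something stated in the thread:

# rack: (free rack units, free 10G ports), from the list above
racks = {
    "A4": (10, 15), "A7": (4, 4),
    "C2": (0, 8),   "C4": (11, 18), "C7": (8, 13),
    "D2": (3, 3),   "D7": (11, 9),
}

per_row = {}
for rack, (units, ports) in racks.items():
    # a host needs both a free unit and a free port (1U per host assumed)
    per_row[rack[0]] = per_row.get(rack[0], 0) + min(units, ports)

print(per_row)                # {'A': 14, 'C': 19, 'D': 12}
print(sum(per_row.values()))  # 45 usable slots for the 22 hosts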

Jclark-ctr closed subtask Unknown Object (Task) as Resolved.Jul 21 2020, 10:20 PM
elukey added a subtask: Unknown Object (Task).Jul 29 2020, 6:35 AM

14 x Hadoop workers - expansion for extra retention

FYI, these are part of the FY2020-2021 orders, not FY2019-2020.

elukey renamed this task from Hadoop Hardware Orders FY2019-2020 to Hadoop Hardware Orders FY2019-2020 / FY2020-2021.Aug 4 2020, 3:05 PM
elukey updated the task description. (Show Details)
elukey renamed this task from Hadoop Hardware Orders FY2019-2020 / FY2020-2021 to Hadoop Hardware Orders FY2019-2020.Aug 4 2020, 3:11 PM
elukey updated the task description. (Show Details)
Ottomata renamed this task from Hadoop Hardware Orders FY2019-2020 to Hadoop Hardware Orders FY2019-2020 / FY2020-2021.Aug 4 2020, 3:11 PM
Ottomata updated the task description. (Show Details)
elukey renamed this task from Hadoop Hardware Orders FY2019-2020 / FY2020-2021 to Hadoop Hardware Orders FY2019-2020.Aug 4 2020, 3:13 PM
elukey updated the task description. (Show Details)