Maniphest T220700

Upgrade kafka-jumbo100[1-6] to 10G NICs (if possible)
Closed, Resolved (Public)

Assigned To

None

Authored By

	elukey
	Apr 11 2019, 2:10 PM

Description

Hi everybody,

as stated in T219842 and https://wikitech.wikimedia.org/wiki/Incident_documentation/20190402-0401KafkaJumbo, the Kafka Jumbo cluster nodes have high tx bandwidth usage that has been growing over time and is getting closer to the 1Gbps limit.

I have been running ifstat on all the nodes for a while to get 1s datapoints. As described in T219842#5087704, it is easy to reach 60-70% tx bandwidth usage, and in some cases even more:

elukey@kafka-jumbo1004:~$  cat ifstat.log | awk '{print $1" "$3}' | egrep '[0-9][0-9]:[0-9][0-9]:[0-9][0-9]\ [89][0-9]{5}'
  Time           eno1
HH:MM:SS   Kbps out
17:21:51 805023.7
19:22:01 810537.6
20:21:49 885008.4
17:22:00 879897.9
15:21:53 827064.7
17:21:55 827993.5
14:21:59 826401.5
16:22:01 801652.9
18:21:57 840396.7
16:21:51 810916.9
16:21:52 809055.7
16:21:48 805050.3
18:21:49 802069.4
20:21:49 804860.6
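
For readers who want the same check as a percentage rather than a regex match: the egrep pattern above keeps six-digit values starting with 8 or 9, i.e. samples between 800000 and 999999 Kbps, which is 80-99% of a 1Gbps link. Below is a minimal Python sketch of that arithmetic. It assumes the input is the two-column "time, Kbps out" text produced by the awk step; the names `busy_samples` and `tx_utilization` and the 0.8 threshold are illustrative, not part of the original investigation. Unlike the regex, it has no upper bound, so it would also report samples at or above line rate.

```python
LINK_KBPS = 1_000_000  # assumption: 1 Gbps link, expressed in Kbps like the ifstat column


def tx_utilization(kbps_out, link_kbps=LINK_KBPS):
    """Return tx usage as a fraction of the link capacity."""
    return kbps_out / link_kbps


def busy_samples(lines, threshold=0.8):
    """Return (timestamp, kbps_out) pairs at or above the utilization threshold.

    Header lines and anything whose second field is not a number are skipped.
    """
    hits = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            kbps = float(fields[1])
        except ValueError:
            continue  # e.g. the "Time eno1" / "HH:MM:SS Kbps out" header lines
        if tx_utilization(kbps) >= threshold:
            hits.append((fields[0], kbps))
    return hits


sample = [
    "  Time           eno1",
    "HH:MM:SS   Kbps out",
    "17:21:51 805023.7",
    "20:21:49 885008.4",
    "12:00:00 412345.0",  # hypothetical quiet sample, not from the task
]

for timestamp, kbps in busy_samples(sample):
    print(f"{timestamp} {tx_utilization(kbps):.1%}")
```

With the sample above, the two datapoints taken from the task are reported at roughly 80.5% and 88.5% of the link, and the hypothetical quiet sample is dropped.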

Moreover, future usage of the hosts (like Event Gate) will add more consumers to the cluster. I know it is a real pain, but it would be great if those hosts could be moved to 10G racks. Ideally this work could be done in Q1 of next FY; we are not in a real rush.

We are also going to request two brokers to spread bandwidth usage, but long term I believe we'd need 10G anyway.

Thanks in advance!

Related Objects

Status	Assigned	Task
Resolved	None	T220700 Upgrade kafka-jumbo100[1-6] to 10G NICs (if possible)
		Unknown Object (Task)
Resolved	• Cmjohnson	T236327 replace onboard NIC in kafka-jumbo100[1-6]

Event Timeline

elukey triaged this task as Medium priority. Apr 11 2019, 2:10 PM

elukey created this task.

• fdans moved this task from Incoming to Radar on the Analytics board. Apr 11 2019, 4:46 PM

From: https://netbox.wikimedia.org/dcim/devices/?q=kafka-jumbo&status=1
kafka-jumbo1002
kafka-jumbo1004
kafka-jumbo1005

are already in 10G rows, so they would not need to be moved. They would only need a 10G card and their switch ports reconfigured.

elukey moved this task from Backlog to Waiting for others on the User-Elukey board. Apr 16 2019, 10:58 AM

• Cmjohnson moved this task from Backlog to Stalled on the ops-eqiad board. Apr 16 2019, 6:24 PM

Summary of what would be needed:

kafka-jumbo1001 (A1) -> 10G card + relocation to a 10G rack
kafka-jumbo1002 (A2) -> 10G card + network configuration to use it (already in a 10G rack)
kafka-jumbo1003 (B1) -> 10G card + relocation to a 10G rack. IIRC row B was problematic on this front, so we could have the host migrated to row D if needed
kafka-jumbo1004 (C2) -> 10G card + network configuration to use it (already in a 10G rack)
kafka-jumbo1005 (C4) -> 10G card + network configuration to use it (already in a 10G rack)
kafka-jumbo1006 (D1) -> 10G card + relocation to a 10G rack

@RobH/@Cmjohnson is the above something doable?

@elukey - just wanted to follow up on this... @RobH will dig around for some quotes and recommendations.

So these are all in warranty until 2020-05-31, so we will want to add in 10G NICs that are covered by Dell's system warranty. I'll link in a procurement sub-task requesting pricing for this.

As @elukey notes, we would want to move the row B server out of row B (which means new dns/ip/reimage) and into another row, since row B's 10G is highly limited due to cloudvirts.

RobH mentioned this in Unknown Object (Task). Jul 2 2019, 9:26 PM

RobH added a subtask: Unknown Object (Task). Jul 2 2019, 9:29 PM

RobH moved this task from Backlog to In Discussion / Review on the hardware-requests board. Jul 8 2019, 5:26 PM

• Cmjohnson moved this task from Stalled to Blocked on the ops-eqiad board. Jul 12 2019, 12:35 AM

wiki_willy assigned this task to • Cmjohnson. Jul 15 2019, 7:26 PM

• ayounsi removed a project: netops. Sep 24 2019, 6:00 PM

Please note T227148 (the 10G NICs) has been escalated to ordering. These will replace the onboard NICs, so a re-image is typically best (otherwise you have to deal with the move to a 10G rack and the hardware swap without a reimage, which can cause issues).

Task T236327 has been filed to install the cards being ordered on T227148. Resolving this task.

Jclark-ctr closed subtask Unknown Object (Task) as Resolved. Dec 3 2019, 1:08 AM

Aklapper edited projects, added Analytics-Radar; removed Analytics. Jun 10 2020, 6:44 AM

• Cmjohnson closed subtask T236327: replace onboard NIC in kafka-jumbo100[1-6] as Resolved. Nov 5 2020, 8:21 PM
