
Upgrade kafka-jumbo100[1-6] to 10G NICs (if possible)
Closed, Resolved (Public)

Description

Hi everybody,

as stated in T219842 and https://wikitech.wikimedia.org/wiki/Incident_documentation/20190402-0401KafkaJumbo, the Kafka Jumbo cluster nodes have high tx bandwidth usage that has been growing over time and is getting closer to the 1Gbps limit.

I have been running ifstat on all the nodes for a while to get 1s datapoints, and as described in T219842#5087704 you can see how it is easy to reach 60/70% tx bandwidth usage, and in some cases even more:

elukey@kafka-jumbo1004:~$  cat ifstat.log | awk '{print $1" "$3}' | egrep '[0-9][0-9]:[0-9][0-9]:[0-9][0-9]\ [89][0-9]{5}'
  Time           eno1
HH:MM:SS   Kbps out
17:21:51 805023.7
19:22:01 810537.6
20:21:49 885008.4
17:22:00 879897.9
15:21:53 827064.7
17:21:55 827993.5
14:21:59 826401.5
16:22:01 801652.9
18:21:57 840396.7
16:21:51 810916.9
16:21:52 809055.7
16:21:48 805050.3
18:21:49 802069.4
20:21:49 804860.6
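The same filter can be expressed as a small script that also reports utilisation as a fraction of the link speed. This is only a sketch, under assumptions not stated in the task: each log line is assumed to hold three whitespace-separated fields (timestamp, Kbps in, Kbps out), which matches the `$1` and `$3` fields used in the awk command above, and the link capacity is taken to be 1,000,000 Kbps. The names `busy_samples` and `LINK_CAPACITY_KBPS` are hypothetical. Unlike the egrep pattern, this version also keeps samples at or above 1,000,000 Kbps.

```python
import re

# Assumption: 1 Gbps NIC, with ifstat reporting rates in Kbps.
LINK_CAPACITY_KBPS = 1_000_000

# Matches lines such as "17:21:51  1000.0  805023.7" (time, Kbps in, Kbps out).
# Header lines ("Time eno1", "HH:MM:SS Kbps in Kbps out") do not match.
SAMPLE_RE = re.compile(r"^\s*(\d{2}:\d{2}:\d{2})\s+([\d.]+)\s+([\d.]+)\s*$")


def busy_samples(lines, threshold=0.8, capacity_kbps=LINK_CAPACITY_KBPS):
    """Yield (timestamp, tx_kbps, utilisation) for samples at or above threshold."""
    for line in lines:
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip headers and malformed lines
        timestamp = match.group(1)
        tx_kbps = float(match.group(3))
        utilisation = tx_kbps / capacity_kbps
        if utilisation >= threshold:
            yield timestamp, tx_kbps, utilisation


if __name__ == "__main__":
    # The tx values come from the output above; the rx values are placeholders.
    example = [
        "  Time           eno1",
        "HH:MM:SS   Kbps in  Kbps out",
        "17:21:51 1000.0 805023.7",
        "17:21:52 1000.0 512000.0",
        "20:21:49 1000.0 885008.4",
    ]
    for timestamp, tx_kbps, utilisation in busy_samples(example):
        print(f"{timestamp} {tx_kbps:.1f} Kbps out ({utilisation:.1%} of link)")
```

To run it against a real log, pass an open file object (for example `open("ifstat.log")`) in place of the example list.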

Moreover, future uses of the hosts (like Event Gate) will add more consumers to the cluster. I know it is a real pain, but it would be great if those hosts could be moved to 10G racks. Ideally this work could be done in Q1 of the next FY; we are not in a real rush.

We are also going to request two brokers to spread bandwidth usage, but long term I believe we'd need 10G anyway.

Thanks in advance!

Event Timeline

elukey triaged this task as Medium priority. Apr 11 2019, 2:10 PM
elukey created this task.

From: https://netbox.wikimedia.org/dcim/devices/?q=kafka-jumbo&status=1
kafka-jumbo1002
kafka-jumbo1004
kafka-jumbo1005

are already in 10G rows, so they would not need to be moved. They would only need a 10G card and their switch port reconfigured.

Summary of what would be needed:

  • kafka-jumbo1001 (A1) -> 10G card + relocation to a 10G rack
  • kafka-jumbo1002 (A2) -> 10G card + network configuration to use it (already in a 10G rack)
  • kafka-jumbo1003 (B1) -> 10G card + relocation to a 10G rack. IIRC row B was problematic on this front, so we could have the host migrated to row D if needed
  • kafka-jumbo1004 (C2) -> 10G card + network configuration to use it (already in a 10G rack)
  • kafka-jumbo1005 (C4) -> 10G card + network configuration to use it (already in a 10G rack)
  • kafka-jumbo1006 (D1) -> 10G card + relocation to a 10G rack
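The summary above can also be written as a small lookup that derives the steps per host. This is a sketch only: the rack positions are copied from the list, while the set of 10G-capable racks (`TEN_G_RACKS`) is inferred from the same list rather than from Netbox, and the names `HOSTS`, `TEN_G_RACKS` and `upgrade_plan` are hypothetical.

```python
# Rack positions as given in the summary above.
HOSTS = {
    "kafka-jumbo1001": "A1",
    "kafka-jumbo1002": "A2",
    "kafka-jumbo1003": "B1",
    "kafka-jumbo1004": "C2",
    "kafka-jumbo1005": "C4",
    "kafka-jumbo1006": "D1",
}

# Assumption: inferred from the summary, not checked against Netbox.
TEN_G_RACKS = {"A2", "C2", "C4"}


def upgrade_plan(hosts=HOSTS, ten_g_racks=TEN_G_RACKS):
    """Return the steps each host needs, keyed by hostname."""
    plan = {}
    for host, rack in sorted(hosts.items()):
        steps = ["install 10G card"]
        if rack in ten_g_racks:
            steps.append("reconfigure network to use it")
        else:
            steps.append("relocate to a 10G rack")
        plan[host] = steps
    return plan


if __name__ == "__main__":
    for host, steps in upgrade_plan().items():
        print(f"{host}: {' + '.join(steps)}")
```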

@RobH/@Cmjohnson is the above something doable?

@elukey - just wanted to follow up on this. @RobH will dig around for some quotes and recommendations.

These are all in warranty until 2020-05-31, so we will want to add 10G NICs that are covered by Dell's system warranty. I'll link in a procurement sub-task requesting pricing for this.

As @elukey notes, we would want to move the row B server out of row B (which means new dns/ip/reimage) and into another row, since row B's 10G is highly limited due to cloudvirts.

RobH mentioned this in Unknown Object (Task). Jul 2 2019, 9:26 PM
RobH added a subtask: Unknown Object (Task). Jul 2 2019, 9:29 PM

Please note that T227148 (the 10G NICs) has been escalated to ordering. These will replace the onboard NICs, so a re-image is typically best. Otherwise you have to deal with the move to a 10G rack and the hardware swap without a re-image, which can cause issues.

RobH removed Cmjohnson as the assignee of this task.

Task T236327 has been filed to install the cards being ordered in T227148. Resolving this task.

Jclark-ctr closed subtask Unknown Object (Task) as Resolved. Dec 3 2019, 1:08 AM