
Upgrade kafka-jumbo100[1-6] to 10G NICs (if possible)
Closed, Resolved, Public

Description

Hi everybody,

as stated in T219842 and https://wikitech.wikimedia.org/wiki/Incident_documentation/20190402-0401KafkaJumbo, the Kafka Jumbo cluster nodes have high tx bandwidth usage that has been growing over time and is getting close to the 1Gbps limit.

I have been running ifstat on all the nodes for a while to collect 1s datapoints. As described in T219842#5087704, tx bandwidth usage easily reaches 60-70%, and in some cases more than 80% of the 1Gbps link:

elukey@kafka-jumbo1004:~$  cat ifstat.log | awk '{print $1" "$3}' | egrep '[0-9][0-9]:[0-9][0-9]:[0-9][0-9]\ [89][0-9]{5}'
  Time           eno1
HH:MM:SS   Kbps out
17:21:51 805023.7
19:22:01 810537.6
20:21:49 885008.4
17:22:00 879897.9
15:21:53 827064.7
17:21:55 827993.5
14:21:59 826401.5
16:22:01 801652.9
18:21:57 840396.7
16:21:51 810916.9
16:21:52 809055.7
16:21:48 805050.3
18:21:49 802069.4
20:21:49 804860.6
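
The same filter can be expressed as an explicit utilization threshold instead of a digit-matching regex. The following is a minimal sketch, not what was run on the hosts: it assumes each data row of ifstat.log has three columns (timestamp, Kbps in, Kbps out for a single interface), which is inferred from the awk command printing $1 and $3. The function name `busy_samples` and the 0.8 default threshold are illustrative choices. Unlike the egrep pattern, it would also report samples at or above 1,000,000 Kbps.

```python
import re

# 1 Gbps link capacity expressed in Kbps (the limit mentioned in the task).
LINK_CAPACITY_KBPS = 1_000_000

# Assumed row layout: HH:MM:SS, then "Kbps in" and "Kbps out" for one interface.
ROW = re.compile(
    r"^\s*(\d{2}:\d{2}:\d{2})\s+(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)\s*$"
)


def busy_samples(lines, threshold=0.8, capacity_kbps=LINK_CAPACITY_KBPS):
    """Yield (timestamp, tx_kbps, utilization) for rows at or above threshold."""
    for line in lines:
        match = ROW.match(line)
        if match is None:
            continue  # header rows and malformed rows are skipped
        timestamp, _rx_kbps, tx = match.groups()
        tx_kbps = float(tx)
        utilization = tx_kbps / capacity_kbps
        if utilization >= threshold:
            yield timestamp, tx_kbps, utilization
```

Passing an open file object for ifstat.log as `lines` yields one tuple per 1s sample whose tx rate is at least 80% of the link, so the utilization figure is available directly rather than read off the raw Kbps value.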

Moreover, future usage of the hosts (like Event Gate) will add more consumers to the cluster. I know it is a real pain, but it would be great if those hosts could be moved to 10G racks. Ideally this work could be done in Q1 of next FY; we are not in a real rush.

We are also going to request two brokers to spread bandwidth usage, but long term I believe we'd need 10G anyway.

Thanks in advance!

Event Timeline

elukey triaged this task as Medium priority. Apr 11 2019, 2:10 PM
elukey created this task.
fdans moved this task from Incoming to Radar on the Analytics board. Apr 11 2019, 4:46 PM

According to https://netbox.wikimedia.org/dcim/devices/?q=kafka-jumbo&status=1, the following hosts are already in 10G rows:
kafka-jumbo1002
kafka-jumbo1004
kafka-jumbo1005

They would not need to be moved; they would only need a 10G card and their switch port reconfigured.

Cmjohnson moved this task from Backlog to Stalled on the ops-eqiad board. Apr 16 2019, 6:24 PM

Summary of what would be needed:

  • kafka-jumbo1001 (A1) -> 10G card + relocation to a 10G rack
  • kafka-jumbo1002 (A2) -> 10G card + network configuration to use it (already in a 10G rack)
  • kafka-jumbo1003 (B1) -> 10G card + relocation to a 10G rack. IIRC row B was problematic on this front, so we could have the host migrated to row D if needed
  • kafka-jumbo1004 (C2) -> 10G card + network configuration to use it (already in a 10G rack)
  • kafka-jumbo1005 (C4) -> 10G card + network configuration to use it (already in a 10G rack)
  • kafka-jumbo1006 (D1) -> 10G card + relocation to a 10G rack

@RobH/@Cmjohnson is the above something doable?

@elukey - just wanted to follow up on this. @RobH will dig around for some quotes and recommendations.

RobH added a comment. Jul 2 2019, 9:23 PM

These are all in warranty until 2020-05-31, so we will want to add 10G NICs that are covered by Dell's system warranty. I'll link a procurement sub-task requesting pricing for this.

As @elukey notes, we would want to move the row B server out of row B (which means new dns/ip/reimage) and into another row, since row B's 10G is highly limited due to cloudvirts.

RobH mentioned this in Unknown Object (Task). Jul 2 2019, 9:26 PM
RobH added a subtask: Unknown Object (Task). Jul 2 2019, 9:29 PM
Cmjohnson moved this task from Stalled to Blocked on the ops-eqiad board. Jul 12 2019, 12:35 AM
RobH added a comment. Oct 23 2019, 9:08 PM

Please note that T227148, which covers the 10G NICs, has been escalated to ordering. These will replace the onboard NICs, so a reimage is typically best; otherwise you have to handle the move to a 10G rack and the hardware swap without a reimage, which can cause issues.

RobH closed this task as Resolved. Oct 23 2019, 9:16 PM
RobH removed Cmjohnson as the assignee of this task.

Task T236327 has been filed to install the cards being ordered on T227148. Resolving this task.

Jclark-ctr closed subtask Unknown Object (Task) as Resolved. Dec 3 2019, 1:08 AM