Hi everybody,
as stated in T219842 and https://wikitech.wikimedia.org/wiki/Incident_documentation/20190402-0401KafkaJumbo the Kafka Jumbo cluster nodes have high tx bandwidth usage, which has been growing over time and is getting close to the 1Gbps limit.
I have been running ifstat on all the nodes for a while to get 1s datapoints. As described in T219842#5087704, it is easy to reach 60-70% tx bandwidth usage, and in some cases even more:
elukey@kafka-jumbo1004:~$ cat ifstat.log | awk '{print $1" "$3}' | egrep '[0-9][0-9]:[0-9][0-9]:[0-9][0-9]\ [89][0-9]{5}'
Time eno1
HH:MM:SS Kbps out
17:21:51 805023.7
19:22:01 810537.6
20:21:49 885008.4
17:22:00 879897.9
15:21:53 827064.7
17:21:55 827993.5
14:21:59 826401.5
16:22:01 801652.9
18:21:57 840396.7
16:21:51 810916.9
16:21:52 809055.7
16:21:48 805050.3
18:21:49 802069.4
20:21:49 804860.6
Moreover, future usage of the hosts (like Event Gate) will add more consumers to the cluster. I know it is a real pain, but it would be great if those hosts could be moved to 10G racks. Ideally this work could be done in Q1 of next FY; we are not in a real rush.
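For anyone who wants to repeat the check on another node, here is a rough sketch of the same filter done with a numeric comparison instead of the digit-count regex. It assumes the log has the timestamp in column 1 and "Kbps out" in column 3, as in the command above. The function name tx_saturation, the 1000000 Kbps link capacity and the 80% default threshold are my own illustrative choices, not anything ifstat provides.

```shell
# Sketch: print samples whose tx rate is at or above a given
# percentage of the link capacity.
# Assumptions (not from ifstat itself): column 1 is HH:MM:SS,
# column 3 is "Kbps out", and 1 Gbps is taken as 1000000 Kbps.
tx_saturation() {
    log_file="$1"
    link_kbps="${2:-1000000}"    # hypothetical default: 1 Gbps in Kbps
    threshold_pct="${3:-80}"     # hypothetical default threshold
    awk -v cap="$link_kbps" -v pct="$threshold_pct" '
        # Skip header lines: require a timestamp and a numeric rate.
        $1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ && $3 ~ /^[0-9.]+$/ {
            usage = 100 * $3 / cap
            if (usage >= pct) printf "%s %.1f%%\n", $1, usage
        }
    ' "$log_file"
}
```

Running `tx_saturation ifstat.log` would list the same kind of samples as above, with the rate shown as a percentage of the link (885008.4 Kbps is about 88.5% of 1Gbps). Passing a third argument, e.g. `tx_saturation ifstat.log 1000000 60`, lowers the threshold to 60%.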
We are also going to request two brokers to spread bandwidth usage, but long term I believe we'd need 10G anyway.
Thanks in advance!