Hi everybody,
As stated in T219842 and https://wikitech.wikimedia.org/wiki/Incident_documentation/20190402-0401KafkaJumbo, the Kafka Jumbo cluster nodes have high tx bandwidth usage, which has been growing over time and is getting close to the 1Gbps limit.
I have been running ifstat on all the nodes for a while to get 1s datapoints. As described in T219842#5087704, it is easy to reach 60-70% tx bandwidth usage, and in some cases even more:
elukey@kafka-jumbo1004:~$ cat ifstat.log | awk '{print $1" "$3}' | egrep '[0-9][0-9]:[0-9][0-9]:[0-9][0-9]\ [89][0-9]{5}'
Time eno1
HH:MM:SS Kbps out
17:21:51 805023.7
19:22:01 810537.6
20:21:49 885008.4
17:22:00 879897.9
15:21:53 827064.7
17:21:55 827993.5
14:21:59 826401.5
16:22:01 801652.9
18:21:57 840396.7
16:21:51 810916.9
16:21:52 809055.7
16:21:48 805050.3
18:21:49 802069.4
20:21:49 804860.6
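To put those numbers in terms of link capacity, here is a minimal Python sketch (not something I ran on the brokers) that turns "HH:MM:SS Kbps-out" pairs like the ones above into a percentage of a 1Gbps link. The function names (tx_utilization, peak) and the assumption that 1Gbps equals 1,000,000 Kbps are mine, for illustration only.

```python
import re

# Assumption: the NIC limit is 1 Gbps, taken here as 1,000,000 Kbps.
LINK_CAPACITY_KBPS = 1_000_000

# Matches "HH:MM:SS <Kbps out>" pairs, the shape printed by the awk command.
SAMPLE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2})\s+(\d+(?:\.\d+)?)$")


def tx_utilization(lines, capacity_kbps=LINK_CAPACITY_KBPS):
    """Return (time, percent of link capacity) for every parsable sample."""
    samples = []
    for line in lines:
        match = SAMPLE_RE.match(line.strip())
        if match is None:
            # Skip header lines such as "Time eno1" or "HH:MM:SS Kbps out".
            continue
        timestamp = match.group(1)
        kbps = float(match.group(2))
        samples.append((timestamp, round(100.0 * kbps / capacity_kbps, 1)))
    return samples


def peak(samples):
    """Return the sample with the highest utilization, or None if empty."""
    return max(samples, key=lambda sample: sample[1]) if samples else None
```

For example, feeding it the line "20:21:49 885008.4" gives 88.5% of the link, which is the highest datapoint in the list above.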
Moreover, future use cases for these hosts (like Event Gate) will add more consumers to the cluster. I know it is a real pain, but it would be great if the Kafka Jumbo hosts could be moved to 10G racks. Ideally this work could be done in Q1 of next FY; we are not in a real rush.
We are also going to request two more brokers to spread the bandwidth usage, but in the long term I believe we will need 10G anyway.
Thanks in advance!