
Improve kafka main's partition usage and leaders using topicmappr's rebalance
Open, LowPublic

Description

In T341558 the topicmappr command was used to selectively move kafka topic partitions from one broker to others, on both main clusters. We got positive results, but topicmappr also offers a rebalance command (as opposed to the rebuild command that we used) which can:

  • Fetch metrics from Prometheus about free storage and kafka log partition sizes.
  • Optimize leadership distribution, moving as few partitions around as possible.

Two tools are needed for this: metrics-fetcher (to populate the metrics) and topicmappr itself.

I'd like to test rebalance on kafka-main codfw (and possibly eqiad afterwards) to see if they can be useful tools for more targeted use cases.

Event Timeline

I have built both binaries needed (metrics-fetcher and topicmappr version 4.2.1) in a pristine Debian bullseye container on my laptop, and uploaded them to kafka-main2001 (if they work well I'll package them properly, of course).

The first step is to fetch and generate the metrics (they will be stored in Zookeeper, with a very low footprint):

elukey@kafka-main2001:~/T345077$ ./metrics-fetcher \
    --prometheus-url http://prometheus.svc.codfw.wmnet/ops/ \
    --zk-addr conf2004.codfw.wmnet:2181 \
    --partition-size-query "max(kafka_log_Size{cluster=\"kafka_main\"}) by (topic,partition)" \
    --broker-id-map "kafka-main2001:9100=2001,kafka-main2002:9100=2002,kafka-main2003:9100=2003,kafka-main2004:9100=2004,kafka-main2005:9100=2005" \
    --broker-id-label instance \
    --broker-storage-query "node_filesystem_avail_bytes{cluster=\"kafka_main\", site=\"codfw\",device=\"/dev/mapper/vg0-srv\"}"

time="2023-08-28T13:32:42.432Z" level=info msg="Getting broker storage stats from Prometheus"
time="2023-08-28T13:32:42.433Z" level=info msg="Connected to [2620:0:860:102:10:192:16:45]:2181" logger=zk
time="2023-08-28T13:32:42.434Z" level=info msg="authenticated: id=-3385087981172555648, timeout=20000" logger=zk
time="2023-08-28T13:32:42.435Z" level=info msg="re-submitting `0` credentials after reconnect" logger=zk
time="2023-08-28T13:32:42.456Z" level=info msg="Broker ID Map: map[kafka-main2001:9100:2001 kafka-main2002:9100:2002 kafka-main2003:9100:2003 kafka-main2004:9100:2004 kafka-main2005:9100:2005]"
time="2023-08-28T13:32:42.456Z" level=info msg="Getting partition sizes from Prometheus"
time="2023-08-28T13:32:42.587Z" level=info msg="writing data to /topicmappr/partitionmeta"
time="2023-08-28T13:32:42.595Z" level=info msg="writing data to /topicmappr/brokermetrics"
time="2023-08-28T13:32:42.598Z" level=info msg="recv loop terminated: err=EOF" logger=zk
time="2023-08-28T13:32:42.598Z" level=info msg="send loop terminated: err=<nil>" logger=zk

Then:

elukey@kafka-main2001:~/T345077$ ./topicmappr rebalance \
    --zk-addr "conf2005.codfw.wmnet:2181" \
    --brokers -2 \
    --topics '.*' \
    --optimize-leadership \
    --partition-size-threshold 10 \
    --storage-threshold 0.01

Snippet of the results:

Brokers targeted for partition offloading (>= 1.00% threshold below hmean):

Reassignment parameters:
  Ignoring partitions smaller than 10MB
  Free storage mean, harmonic mean: 2327.74GB, 2324.50GB
  Broker free storage limits (with a 1.00% tolerance from mean):
    Sources limited to <= 2351.02GB
    Destinations limited to >= 2304.46GB
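
As a sanity check, both means and the source/destination limits can be reproduced from the per-broker free-storage figures reported further down in the plan. Note that the printed limits line up with a ±1% band around the arithmetic mean, even though the header mentions hmean; topicmappr's internal formula may differ slightly, this is just a quick numeric check:

```python
# Reproduce the free-storage thresholds above from the per-broker
# free-storage figures (GB) reported later in the plan output.
free_gb = [2469.68, 2388.19, 2258.49, 2238.48, 2283.88]  # brokers 2001..2005

mean = sum(free_gb) / len(free_gb)                   # arithmetic mean
hmean = len(free_gb) / sum(1 / x for x in free_gb)   # harmonic mean

tolerance = 0.01  # --storage-threshold 0.01 -> "1.00% tolerance"
source_limit = mean * (1 + tolerance)  # offload candidates sit at or below this
dest_limit = mean * (1 - tolerance)    # destinations must have at least this free

print(f"mean={mean:.2f}GB hmean={hmean:.2f}GB")
print(f"sources <= {source_limit:.2f}GB, destinations >= {dest_limit:.2f}GB")
```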

Broker 2004 relocations planned:
  [33.07GB] eqiad.resource-purge p2 -> 2001
  [22.11GB] eqiad.cpjobqueue.partitioned.mediawiki.job.htmlCacheUpdate p5 -> 2001
  [21.76GB] eqiad.mediawiki.job.refreshLinks p2 -> 2001
  [17.78GB] eqiad.mediawiki.job.cirrusSearchElasticaWrite p4 -> 2002

Broker 2003 relocations planned:
  [32.54GB] eqiad.resource-purge p0 -> 2002
  [32.33GB] eqiad.resource-purge p3 -> 2002
  [22.25GB] eqiad.cpjobqueue.partitioned.mediawiki.job.htmlCacheUpdate p7 -> 2001

Broker 2005 relocations planned:
  [32.46GB] eqiad.resource-purge p1 -> 2001
  [31.95GB] eqiad.resource-purge p4 -> 2001
  -
  Total relocation volume: 246.24GB

[..]
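Another quick sanity check: the per-partition relocation sizes listed in the plan add up to the reported total relocation volume (the small drift comes from the printed sizes being rounded to 0.01GB):

```python
# Per-partition relocation sizes (GB) copied from the plan above.
relocations_gb = [
    33.07, 22.11, 21.76, 17.78,  # from broker 2004
    32.54, 32.33, 22.25,         # from broker 2003
    32.46, 31.95,                # from broker 2005
]
total = sum(relocations_gb)
print(f"total relocation volume: {total:.2f}GB")  # ~246.24GB as reported
```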

Broker distribution:
  degree [min/max/avg]: 4/4/4.00 -> 4/4/4.00
  -
  Broker 2001 - leader: 282, follower: 564, total: 846
  Broker 2002 - leader: 280, follower: 560, total: 840
  Broker 2003 - leader: 273, follower: 546, total: 819
  Broker 2004 - leader: 89, follower: 178, total: 267
  Broker 2005 - leader: 63, follower: 126, total: 189

Storage free change estimations:
  range: 231.20GB -> 42.75GB
  range spread: 10.33% -> 1.85%
  std. deviation: 87.74GB -> 18.62GB
  min-max: 2238.48GB, 2469.68GB -> 2305.54GB, 2348.29GB
  -
  Broker 2001: 2469.68 -> 2306.08 (-163.60GB, -6.62%) 
  Broker 2002: 2388.19 -> 2305.54 (-82.64GB, -3.46%) 
  Broker 2003: 2258.49 -> 2345.60 (+87.12GB, 3.86%) 
  Broker 2004: 2238.48 -> 2333.20 (+94.72GB, 4.23%) 
  Broker 2005: 2283.88 -> 2348.29 (+64.41GB, 2.82%)
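
These summary statistics can be reproduced from the per-broker before/after figures (a quick Python check; the std. deviation matches the population variant, and "range spread" is the range as a fraction of the minimum):

```python
import statistics

# Per-broker free storage (GB) before/after, copied from the plan above.
before = [2469.68, 2388.19, 2258.49, 2238.48, 2283.88]  # brokers 2001..2005
after = [2306.08, 2305.54, 2345.60, 2333.20, 2348.29]

def summarize(values):
    rng = max(values) - min(values)
    spread = rng / min(values)         # "range spread" relative to the min
    stdev = statistics.pstdev(values)  # population standard deviation
    return rng, spread, stdev

for label, vals in (("before", before), ("after", after)):
    rng, spread, stdev = summarize(vals)
    print(f"{label}: range={rng:.2f}GB spread={spread:.2%} stdev={stdev:.2f}GB")
```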

Checking this graph, it seems that topicmappr suggests moving ~250GB to kafka-main200[12], which are indeed the two brokers in main-codfw with the most free space.

I am trying to follow up on this:

Broker distribution:
  degree [min/max/avg]: 4/4/4.00 -> 4/4/4.00
  -
  Broker 2001 - leader: 282, follower: 564, total: 846
  Broker 2002 - leader: 280, follower: 560, total: 840
  Broker 2003 - leader: 273, follower: 546, total: 819
  Broker 2004 - leader: 89, follower: 178, total: 267
  Broker 2005 - leader: 63, follower: 126, total: 189

The goal I have in mind is to make the partition leader counts more even across brokers, but I am not 100% sure this is really needed.

The above plan is generated with the --optimize-leadership flag, which IIUC tries to balance the leader/follower ratio on each broker (not quite what I wanted to do). I am not 100% sure what's best, so I'll ask upstream whether it is more desirable to have an even number of partition leaders per broker or just a good leader/follower ratio.
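
For context, leadership can be rebalanced purely by reordering each partition's existing replica list (the first replica is the preferred leader), so no data has to move between brokers. This is a toy sketch of that idea, NOT topicmappr's actual algorithm:

```python
from collections import Counter

def optimize_leadership(assignments):
    """assignments: {partition: [replica broker ids]}. Returns the same
    replica sets, reordered so leader counts are spread more evenly.
    Only the ordering changes, so no partition data is copied."""
    leader_count = Counter()
    out = {}
    for partition, replicas in sorted(assignments.items()):
        # Pick the replica that currently leads the fewest partitions.
        leader = min(replicas, key=lambda b: leader_count[b])
        leader_count[leader] += 1
        out[partition] = [leader] + [b for b in replicas if b != leader]
    return out

# Toy cluster where broker 2001 leads everything (hypothetical data).
skewed = {
    "t-p0": [2001, 2002, 2003],
    "t-p1": [2001, 2003, 2004],
    "t-p2": [2001, 2002, 2004],
    "t-p3": [2001, 2003, 2002],
}
balanced = optimize_leadership(skewed)
print(Counter(r[0] for r in balanced.values()))
```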

I tried https://github.com/DataDog/kafka-kit/wiki/Rebuild-command#partition-count-rebalancing and this is the plan:

Broker distribution:
  degree [min/max/avg]: 4/4/4.00 -> 4/4/4.00
  -
  Broker 2001 - leader: 198, follower: 394, total: 592
  Broker 2002 - leader: 197, follower: 395, total: 592
  Broker 2003 - leader: 198, follower: 394, total: 592
  Broker 2004 - leader: 197, follower: 395, total: 592
  Broker 2005 - leader: 197, follower: 396, total: 593

The above does what I want, but it uses rebuild, which is more aggressive (it forces more partitions to move across brokers, whereas rebalance is more conservative and less impactful). The rebuild result also doesn't take storage usage into account, since it doesn't consult any metrics, so we may end up with an even leader distribution but uneven storage usage across brokers, which is not ideal.
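
To quantify the trade-off between the two plans, comparing the spread of the leader counts reported above makes the difference obvious:

```python
import statistics

# Leader counts per broker (2001..2005) from the two plans above.
rebalance_leaders = [282, 280, 273, 89, 63]  # rebalance + --optimize-leadership
rebuild_leaders = [198, 197, 198, 197, 197]  # rebuild, count placement

# The rebuild plan is almost perfectly even; the rebalance plan still
# leaves brokers 2004/2005 with far fewer leaders.
print(f"rebalance leader stdev: {statistics.pstdev(rebalance_leaders):.1f}")
print(f"rebuild leader stdev:   {statistics.pstdev(rebuild_leaders):.1f}")
```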