Rebalance kafka partitions in main-{eqiad,codfw} clusters - 2023 edition
Closed, ResolvedPublic

Description

In the parent task we discovered that Kafka main eqiad brokers are currently suffering from traffic handling imbalance.

We should redo T288825 and rebalance partitions in both Kafka main eqiad and codfw to make things better.

Goals:

  • The idle-time percentage should be more evenly distributed among brokers. The minimal safe value according to the upstream docs is 20%; anything below that likely indicates performance issues.
  • Better distribution of partition leaders.
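
A minimal sketch of how the first goal could be checked, assuming the per-broker idle percentages have already been read from a dashboard or a metrics query. The broker values below are made up for illustration, and `MIN_IDLE_PERCENT` simply encodes the 20% threshold mentioned above.

```python
from statistics import mean, pstdev

# 20% is the minimal safe value quoted in the goals above.
MIN_IDLE_PERCENT = 20.0

# Hypothetical idle percentages per broker (illustrative values, not measurements).
idle_percent = {1001: 18.0, 1002: 22.0, 1003: 25.0, 1004: 61.0, 1005: 58.0}


def brokers_below_threshold(idle, threshold=MIN_IDLE_PERCENT):
    """Return the broker IDs whose idle percentage is under the threshold."""
    return sorted(broker for broker, pct in idle.items() if pct < threshold)


def imbalance(idle):
    """Return (average, population std. deviation) of the idle percentages."""
    values = list(idle.values())
    return mean(values), pstdev(values)


at_risk = brokers_below_threshold(idle_percent)
avg, spread = imbalance(idle_percent)
print(f"brokers below {MIN_IDLE_PERCENT}%: {at_risk}")
print(f"average idle: {avg:.1f}%, std. deviation: {spread:.1f}")
```

A lower standard deviation after the rebalance would suggest that the idle time is more evenly spread.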

Nice to have goals:

Event Timeline

From this query I found some manual steps that we could easily take:

  • eqiad.mediawiki.job.cirrusSearchLinksUpdate has only one partition, and it counts 300 events/s (leader 1001)
  • eqiad.mediawiki.job.parsoidCachePrewarm has one partition, and it counts ~120 events/s (leader 1002)
  • codfw.mediawiki.job.parsoidCachePrewarm has one partition, and it counts ~80 events/s (leader 1002)
  • eqiad.mediawiki.job.wikibase-addUsagesForPage has one partition, and it counts ~80 events/s (leader 1001)
  • eqiad.mediawiki.job.htmlCacheUpdate has one partition, and it counts ~80 events/s (leader 1005)
  • eqiad.mediawiki.job.recentChangesUpdate has one partition, and it counts ~40 events/s (leader 1001)
  • eqiad.mediawiki.recentchange has one partition, and it counts ~40 events/s (leader 1002)
  • eqiad.mediawiki.job.EntityChangeNotification has one partition, and it counts ~35 events/s (leader 1001)
  • eqiad.mediawiki.page-links-change has one partition, and it counts ~35 events/s (leader 1005)

I stopped at 30 events/s. We could probably start with the topics above, increasing each of them to 3 partitions and (if necessary) moving those partitions to brokers 1004/1005 to remove load from 100[123]. Thoughts?
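
To make the imbalance easier to see, the list above can be summed per leader. This is only a sketch: the rates are the approximate events/s figures quoted in the list, and the helper name `rate_per_leader` is made up for illustration.

```python
from collections import defaultdict

# (topic, approximate events/s, leader broker), copied from the list above.
single_partition_topics = [
    ("eqiad.mediawiki.job.cirrusSearchLinksUpdate", 300, 1001),
    ("eqiad.mediawiki.job.parsoidCachePrewarm", 120, 1002),
    ("codfw.mediawiki.job.parsoidCachePrewarm", 80, 1002),
    ("eqiad.mediawiki.job.wikibase-addUsagesForPage", 80, 1001),
    ("eqiad.mediawiki.job.htmlCacheUpdate", 80, 1005),
    ("eqiad.mediawiki.job.recentChangesUpdate", 40, 1001),
    ("eqiad.mediawiki.recentchange", 40, 1002),
    ("eqiad.mediawiki.job.EntityChangeNotification", 35, 1001),
    ("eqiad.mediawiki.page-links-change", 35, 1005),
]


def rate_per_leader(topics):
    """Sum the approximate events/s handled by each leader broker."""
    totals = defaultdict(int)
    for _topic, rate, leader in topics:
        totals[leader] += rate
    return dict(totals)


totals = rate_per_leader(single_partition_topics)
for leader in sorted(totals):
    print(f"broker {leader}: ~{totals[leader]} events/s from single-partition topics")
```

With these figures, brokers 1001 and 1002 lead most of the single-partition traffic, which matches the idea of spreading those topics over more partitions.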

Proposed changes to both main eqiad and codfw:

kafka topics --topic eqiad.mediawiki.job.cirrusSearchLinksUpdate --alter --partitions 3
kafka topics --topic codfw.mediawiki.job.cirrusSearchLinksUpdate --alter --partitions 3
kafka topics --topic eqiad.mediawiki.job.parsoidCachePrewarm --alter --partitions 3
kafka topics --topic codfw.mediawiki.job.parsoidCachePrewarm --alter --partitions 3
kafka topics --topic eqiad.mediawiki.job.wikibase-addUsagesForPage --alter --partitions 3
kafka topics --topic codfw.mediawiki.job.wikibase-addUsagesForPage --alter --partitions 3
kafka topics --topic eqiad.mediawiki.job.htmlCacheUpdate --alter --partitions 3
kafka topics --topic codfw.mediawiki.job.htmlCacheUpdate --alter --partitions 3
kafka topics --topic eqiad.mediawiki.job.recentChangesUpdate --alter --partitions 3
kafka topics --topic codfw.mediawiki.job.recentChangesUpdate --alter --partitions 3
kafka topics --topic eqiad.mediawiki.recentchange --alter --partitions 3
kafka topics --topic codfw.mediawiki.recentchange --alter --partitions 3
kafka topics --topic eqiad.mediawiki.job.EntityChangeNotification --alter --partitions 3
kafka topics --topic codfw.mediawiki.job.EntityChangeNotification --alter --partitions 3
kafka topics --topic eqiad.mediawiki.page-links-change --alter --partitions 3
kafka topics --topic codfw.mediawiki.page-links-change --alter --partitions 3
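
The command list above follows a regular pattern (every topic, once per datacenter prefix), so it could also be generated. A small sketch, assuming the `kafka topics` wrapper syntax shown above; it only prints the commands and does not run them.

```python
# Topic names without the datacenter prefix, taken from the commands above.
TOPICS = [
    "mediawiki.job.cirrusSearchLinksUpdate",
    "mediawiki.job.parsoidCachePrewarm",
    "mediawiki.job.wikibase-addUsagesForPage",
    "mediawiki.job.htmlCacheUpdate",
    "mediawiki.job.recentChangesUpdate",
    "mediawiki.recentchange",
    "mediawiki.job.EntityChangeNotification",
    "mediawiki.page-links-change",
]
DATACENTER_PREFIXES = ("eqiad", "codfw")


def alter_commands(topics, partitions=3, prefixes=DATACENTER_PREFIXES):
    """Build one 'kafka topics --alter' command per (topic, prefix) pair."""
    return [
        f"kafka topics --topic {prefix}.{topic} --alter --partitions {partitions}"
        for topic in topics
        for prefix in prefixes
    ]


commands = alter_commands(TOPICS)
for command in commands:
    print(command)
```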

Mentioned in SAL (#wikimedia-operations) [2023-07-13T09:11:37Z] <elukey> increased kafka partitions for mediawiki.job.cirrusSearchLinksUpdate and mediawiki.job.cirrusSearchLinksUpdate (eqiad/codfw) - T341558

I tested topicmappr's rebalance command on kafka-test to see how it works (in the previous task we used rebuild, since we had two new brokers to add). The rebalance command needs metrics (in our case from Prometheus) stored in Zookeeper, but sadly Datadog's version of the fetcher can only query their API, not a generic Prometheus instance.

So I built and tested a Prometheus version of metricsfetcher (linked as a third-party variation in the upstream docs), which worked nicely afaics.

elukey@kafka-test1006:~$ ./metrics-mapper --prometheus-url http://prometheus.svc.eqiad.wmnet/ops/ --zk-addr zookeeper-test1002.eqiad.wmnet:2181 --partition-size-query 'max(kafka_log_Size{cluster="kafka_test"}) by (topic,partition)' --broker-id-map "kafka-test1006:9100=1006,kafka-test1007:9100=1007,kafka-test1008:9100=1008,kafka-test1009:9100=1009,kafka-test1010:9100=1010" --broker-id-label instance --broker-storage-query 'node_filesystem_avail_bytes{cluster="kafka_test", mountpoint="/"}'
time="2023-07-13T13:06:28.338Z" level=info msg="Getting broker storage stats from Prometheus"
time="2023-07-13T13:06:28.339Z" level=info msg="Connected to [2620:0:861:102:10:64:16:145]:2181" logger=zk
time="2023-07-13T13:06:28.349Z" level=info msg="Broker ID Map: map[kafka-test1006:9100:1006 kafka-test1007:9100:1007 kafka-test1008:9100:1008 kafka-test1009:9100:1009 kafka-test1010:9100:1010]"
time="2023-07-13T13:06:28.349Z" level=info msg="Getting partition sizes from Prometheus"
time="2023-07-13T13:06:28.360Z" level=info msg="authenticated: id=72057606501630058, timeout=20000" logger=zk
time="2023-07-13T13:06:28.361Z" level=info msg="re-submitting `0` credentials after reconnect" logger=zk
time="2023-07-13T13:06:28.368Z" level=info msg="writing data to /topicmappr/partitionmeta"
time="2023-07-13T13:06:28.379Z" level=info msg="writing data to /topicmappr/brokermetrics"
time="2023-07-13T13:06:28.386Z" level=info msg="recv loop terminated: err=EOF" logger=zk
time="2023-07-13T13:06:28.386Z" level=info msg="send loop terminated: err=<nil>" logger=zk

The above retrieved two kinds of metrics:

  • Partition sizes for each topic
  • Free storage on each Kafka broker's (disk) partition

They all end up in Zookeeper, see:

elukey@zookeeper-test1002:~$ sudo -u zookeeper /usr/share/zookeeper/bin/zkCli.sh 
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Connecting to localhost:2181
Welcome to ZooKeeper!
JLine support is disabled
SLF4J: Failed to load class "org.slf4j.impl.StaticMDCBinder".
SLF4J: Defaulting to no-operation MDCAdapter implementation.
SLF4J: See http://www.slf4j.org/codes.html#no_static_mdc_binder for further details.

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
ls /
[kafka, topicmappr, zookeeper]
ls /topicmappr
[brokermetrics, partitionmeta]

Then I ran topicmappr rebalance with all the test topics:

elukey@kafka-test1006:~$ ./topicmappr rebalance --topics DataHubUpgradeHistory_v1,DataHubUsageEvent_v1,FailedMetadataChangeEvent_v4,FailedMetadataChangeProposal_v1,MetadataAuditEvent_v4,MetadataChangeEvent_v4,MetadataChangeLog_Timeseries_v1,MetadataChangeLog_Versioned_v1,MetadataChangeProposal_v1,PlatformEvent_v1,__consumer_offsets,_schemas,codfw.mediawiki.page_change.v1,eqiad.mediawiki.page_change.v1,eqiad.mediawiki.revision-create,eventlogging_SearchSatisfaction --zk-addr zookeeper-test1002.eqiad.wmnet:2181 --optimize-leadership --brokers -2

Topics:
  DataHubUpgradeHistory_v1
  DataHubUsageEvent_v1
  FailedMetadataChangeEvent_v4
  FailedMetadataChangeProposal_v1
  MetadataAuditEvent_v4
  MetadataChangeEvent_v4
  MetadataChangeLog_Timeseries_v1
  MetadataChangeLog_Versioned_v1
  MetadataChangeProposal_v1
  PlatformEvent_v1
  __consumer_offsets
  _schemas
  codfw.mediawiki.page_change.v1
  eqiad.mediawiki.page_change.v1
  eqiad.mediawiki.revision-create
  eventlogging_SearchSatisfaction

Validating broker list:
  Broker 1010 does not have a rack.id defined
  Broker 1008 does not have a rack.id defined
  Broker 1009 does not have a rack.id defined
  Broker 1006 does not have a rack.id defined
  Broker 1007 does not have a rack.id defined
  -

Brokers targeted for partition offloading (>= 20.00% threshold below hmean):

Reassignment parameters:
  Ignoring partitions smaller than 512MB
  Free storage mean, harmonic mean: 75.96GB, 75.65GB
  Broker free storage limits (with a 3.00% tolerance from mean):
    Sources limited to <= 78.24GB
    Destinations limited to >= 73.68GB
  -
  Total relocation volume: 0.00GB

Partition map changes:
  DataHubUpgradeHistory_v1 p0: [1010 1008 1009] -> [1010 1008 1009] no-op
  DataHubUsageEvent_v1 p0: [1008] -> [1008] no-op
  FailedMetadataChangeEvent_v4 p0: [1006] -> [1006] no-op
  FailedMetadataChangeProposal_v1 p0: [1010] -> [1010] no-op
  MetadataAuditEvent_v4 p0: [1006] -> [1006] no-op
  MetadataChangeEvent_v4 p0: [1009 1007 1008] -> [1008 1009 1007] preferred leader
  MetadataChangeLog_Timeseries_v1 p0: [1009 1006 1008] -> [1008 1009 1006] preferred leader
  MetadataChangeLog_Versioned_v1 p0: [1009 1006 1008] -> [1008 1009 1006] preferred leader
  MetadataChangeProposal_v1 p0: [1009 1006 1008] -> [1006 1009 1008] preferred leader
  PlatformEvent_v1 p0: [1009] -> [1009] no-op
  __consumer_offsets p0: [1009 1006 1007] -> [1009 1006 1007] no-op
  __consumer_offsets p1: [1006 1007 1008] -> [1008 1006 1007] preferred leader
  __consumer_offsets p2: [1007 1008 1009] -> [1007 1009 1008] preferred leader
  __consumer_offsets p3: [1008 1009 1006] -> [1009 1006 1008] preferred leader
  __consumer_offsets p4: [1009 1007 1008] -> [1007 1009 1008] preferred leader
  __consumer_offsets p5: [1006 1008 1009] -> [1009 1006 1008] preferred leader
  __consumer_offsets p6: [1007 1009 1006] -> [1006 1007 1009] preferred leader
  __consumer_offsets p7: [1008 1006 1007] -> [1007 1006 1008] preferred leader
  __consumer_offsets p8: [1009 1008 1006] -> [1006 1009 1008] preferred leader
  __consumer_offsets p9: [1006 1009 1007] -> [1009 1006 1007] preferred leader
  __consumer_offsets p10: [1007 1006 1008] -> [1008 1007 1006] preferred leader
  __consumer_offsets p11: [1008 1007 1009] -> [1008 1007 1009] no-op
  __consumer_offsets p12: [1009 1006 1007] -> [1007 1009 1006] preferred leader
  __consumer_offsets p13: [1006 1007 1008] -> [1006 1008 1007] preferred leader
  __consumer_offsets p14: [1007 1008 1009] -> [1007 1009 1008] preferred leader
  __consumer_offsets p15: [1008 1009 1006] -> [1008 1009 1006] no-op
  __consumer_offsets p16: [1009 1007 1008] -> [1009 1008 1007] preferred leader
  __consumer_offsets p17: [1006 1008 1009] -> [1009 1006 1008] preferred leader
  __consumer_offsets p18: [1007 1009 1006] -> [1006 1007 1009] preferred leader
  __consumer_offsets p19: [1008 1006 1007] -> [1007 1006 1008] preferred leader
  __consumer_offsets p20: [1009 1008 1006] -> [1006 1009 1008] preferred leader
  __consumer_offsets p21: [1006 1009 1007] -> [1009 1006 1007] preferred leader
  __consumer_offsets p22: [1007 1006 1008] -> [1008 1007 1006] preferred leader
  __consumer_offsets p23: [1008 1007 1009] -> [1008 1007 1009] no-op
  __consumer_offsets p24: [1009 1006 1007] -> [1007 1009 1006] preferred leader
  __consumer_offsets p25: [1006 1007 1008] -> [1006 1008 1007] preferred leader
  __consumer_offsets p26: [1007 1008 1009] -> [1007 1009 1008] preferred leader
  __consumer_offsets p27: [1008 1009 1006] -> [1008 1009 1006] no-op
  __consumer_offsets p28: [1009 1007 1008] -> [1009 1008 1007] preferred leader
  __consumer_offsets p29: [1006 1008 1009] -> [1009 1006 1008] preferred leader
  __consumer_offsets p30: [1007 1009 1006] -> [1006 1007 1009] preferred leader
  __consumer_offsets p31: [1008 1006 1007] -> [1007 1006 1008] preferred leader
  __consumer_offsets p32: [1009 1008 1006] -> [1006 1009 1008] preferred leader
  __consumer_offsets p33: [1006 1009 1007] -> [1009 1006 1007] preferred leader
  __consumer_offsets p34: [1007 1006 1008] -> [1008 1007 1006] preferred leader
  __consumer_offsets p35: [1008 1007 1009] -> [1008 1007 1009] no-op
  __consumer_offsets p36: [1009 1006 1007] -> [1007 1009 1006] preferred leader
  __consumer_offsets p37: [1006 1007 1008] -> [1006 1008 1007] preferred leader
  __consumer_offsets p38: [1007 1008 1009] -> [1007 1009 1008] preferred leader
  __consumer_offsets p39: [1008 1009 1006] -> [1008 1009 1006] no-op
  __consumer_offsets p40: [1009 1007 1008] -> [1009 1008 1007] preferred leader
  __consumer_offsets p41: [1006 1008 1009] -> [1009 1006 1008] preferred leader
  __consumer_offsets p42: [1007 1009 1006] -> [1006 1007 1009] preferred leader
  __consumer_offsets p43: [1008 1006 1007] -> [1007 1006 1008] preferred leader
  __consumer_offsets p44: [1009 1008 1006] -> [1006 1009 1008] preferred leader
  __consumer_offsets p45: [1006 1009 1007] -> [1009 1006 1007] preferred leader
  __consumer_offsets p46: [1007 1006 1008] -> [1008 1007 1006] preferred leader
  __consumer_offsets p47: [1008 1007 1009] -> [1008 1007 1009] no-op
  __consumer_offsets p48: [1009 1006 1007] -> [1007 1009 1006] preferred leader
  __consumer_offsets p49: [1006 1007 1008] -> [1006 1008 1007] preferred leader
  _schemas p0: [1009 1008 1006] -> [1009 1006 1008] preferred leader
  codfw.mediawiki.page_change.v1 p0: [1008 1009 1010] -> [1010 1008 1009] preferred leader
  eqiad.mediawiki.page_change.v1 p0: [1010 1007 1008] -> [1008 1007 1010] preferred leader
  eqiad.mediawiki.revision-create p0: [1010 1006 1007] -> [1007 1006 1010] preferred leader
  eventlogging_SearchSatisfaction p0: [1008 1009 1010] -> [1009 1008 1010] preferred leader

Broker distribution:
  degree [min/max/avg]: 4/4/4.00 -> 4/4/4.00
  -
  Broker 1006 - leader: 15, follower: 30, total: 45
  Broker 1007 - leader: 14, follower: 27, total: 41
  Broker 1008 - leader: 17, follower: 30, total: 47
  Broker 1009 - leader: 16, follower: 30, total: 46
  Broker 1010 - leader: 3, follower: 3, total: 6

Storage free change estimations:
  range: 10.76GB -> 10.76GB
  range spread: 15.09% -> 15.09%
  std. deviation: 4.92GB -> 4.92GB
  min-max: 71.30GB, 82.06GB -> 71.30GB, 82.06GB
  -
  Broker 1006: 81.89 -> 81.89 (+0.00GB, 0.00%) 
  Broker 1007: 82.06 -> 82.06 (+0.00GB, 0.00%) 
  Broker 1008: 72.37 -> 72.37 (+0.00GB, 0.00%) 
  Broker 1009: 72.19 -> 72.19 (+0.00GB, 0.00%) 
  Broker 1010: 71.30 -> 71.30 (+0.00GB, 0.00%) 

WARN:
  [none]

New partition maps:
  MetadataChangeLog_Timeseries_v1.json
  MetadataChangeProposal_v1.json
  __consumer_offsets.json
  eqiad.mediawiki.page_change.v1.json
  eqiad.mediawiki.revision-create.json
  eventlogging_SearchSatisfaction.json
  MetadataChangeEvent_v4.json
  MetadataChangeLog_Versioned_v1.json
  _schemas.json
  codfw.mediawiki.page_change.v1.json
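
The storage figures in the "Reassignment parameters" section can be cross-checked from the per-broker free storage values printed at the end of the output. This is a sketch of the arithmetic only; that topicmappr derives the limits exactly this way is an assumption, although the numbers agree with the output above.

```python
from statistics import harmonic_mean, mean

# Free storage per broker in GB, from "Storage free change estimations" above.
free_gb = {1006: 81.89, 1007: 82.06, 1008: 72.37, 1009: 72.19, 1010: 71.30}

# 3.00% tolerance from the mean, as printed in the output.
TOLERANCE = 0.03

values = list(free_gb.values())
avg = mean(values)
hmean = harmonic_mean(values)

# Assumed derivation: limits are the mean widened by the tolerance.
source_limit = avg * (1 + TOLERANCE)
destination_limit = avg * (1 - TOLERANCE)

print(f"mean: {avg:.2f}GB, harmonic mean: {hmean:.2f}GB")
print(f"sources limited to <= {source_limit:.2f}GB")
print(f"destinations limited to >= {destination_limit:.2f}GB")
```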

The end result, as with rebuild, is a list of JSON files that we'll use to move Kafka partitions:

elukey@kafka-test1006:~$ ls
codfw.mediawiki.page_change.v1.json  eqiad.mediawiki.revision-create.json  MetadataChangeLog_Timeseries_v1.json  metrics-mapper
__consumer_offsets.json		     eventlogging_SearchSatisfaction.json  MetadataChangeLog_Versioned_v1.json	 _schemas.json
eqiad.mediawiki.page_change.v1.json  MetadataChangeEvent_v4.json	   MetadataChangeProposal_v1.json	 topicmappr

elukey@kafka-test1006:~$ cat codfw.mediawiki.page_change.v1.json 
{"version":1,"partitions":[{"topic":"codfw.mediawiki.page_change.v1","partition":0,"replicas":[1010,1008,1009]}]}
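
In these files the first broker in each `replicas` list is the preferred leader of the partition. A minimal sketch for summarizing a plan file by preferred leader; the helper name `preferred_leaders` is made up for illustration, and the input is the file content shown above.

```python
import json
from collections import Counter

# Content of codfw.mediawiki.page_change.v1.json as shown above.
plan_text = (
    '{"version":1,"partitions":[{"topic":"codfw.mediawiki.page_change.v1",'
    '"partition":0,"replicas":[1010,1008,1009]}]}'
)


def preferred_leaders(plan_json):
    """Count how many partitions each broker would lead (first replica)."""
    plan = json.loads(plan_json)
    return Counter(p["replicas"][0] for p in plan["partitions"])


for broker, count in sorted(preferred_leaders(plan_text).items()):
    print(f"broker {broker}: preferred leader of {count} partition(s)")
```

Running the same function over all generated files and adding up the counters gives a quick view of the leader distribution before any partition is moved.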

Proposal: we could use the topics listed in this query as targets for the metrics retrieval and partition rebalance, then generate the plan and execute it (at a slow pace) for kafka-main codfw and eqiad.

Bonus points to possibly package the two binaries into a WMF deb, but we can probably use the ones that I've built for the moment.

Generation of the plan for main-codfw:

elukey@kafka-main2001:~/T341558$ ./metrics-fetcher --prometheus-url http://prometheus.svc.codfw.wmnet/ops/ --zk-addr conf2004.codfw.wmnet:2181 --partition-size-query "max(kafka_log_Size{cluster=\"kafka_main\"}) by (topic,partition)" --broker-id-map "kafka-main2001:9100=2001,kafka-main2002:9100=2002,kafka-main2003:9100=2003,kafka-main2004:9100=2004,kafka-main2005:9100=2005" --broker-id-label instance --broker-storage-query "node_filesystem_avail_bytes{cluster=\"kafka_main\", site=\"codfw\",device=\"/dev/mapper/vg0-srv\"}"
time="2023-07-13T14:28:51.499Z" level=info msg="Getting broker storage stats from Prometheus"
time="2023-07-13T14:28:51.499Z" level=info msg="Connected to [2620:0:860:102:10:192:16:45]:2181" logger=zk
time="2023-07-13T14:28:51.502Z" level=info msg="authenticated: id=-3385087981172555752, timeout=20000" logger=zk
time="2023-07-13T14:28:51.502Z" level=info msg="re-submitting `0` credentials after reconnect" logger=zk
time="2023-07-13T14:28:51.516Z" level=info msg="Broker ID Map: map[kafka-main2001:9100:2001 kafka-main2002:9100:2002 kafka-main2003:9100:2003 kafka-main2004:9100:2004 kafka-main2005:9100:2005]"
time="2023-07-13T14:28:51.516Z" level=info msg="Getting partition sizes from Prometheus"
time="2023-07-13T14:28:51.599Z" level=info msg="writing data to /topicmappr/partitionmeta"
time="2023-07-13T14:28:51.606Z" level=info msg="writing data to /topicmappr/brokermetrics"
time="2023-07-13T14:28:51.608Z" level=info msg="recv loop terminated: err=EOF" logger=zk
[zk: localhost:2181(CONNECTED) 13] get /topicmappr/partitionmeta
{"__consumer_offsets":{"0":{"Size":3576733},"1":{"Size":34084745},"10":{"Size":17291428},"11":{"Size":70760121},"12":{"Size":75762067},"13":{"Size":544386},"14":{"Size":24395},"15":{"Size":45039},"16":{"Size":22296172},"17":{"Size":83196484},"18":{"Size":33453246},"19":{"Size":5541023},"2":{"Size":38091},"20":{"Size":28236208},"21" ......[cut]"

[zk: localhost:2181(CONNECTED) 14] get /topicmappr/brokermetrics
{"2001":{"StorageFree":2520825438208},"2002":{"StorageFree":2318926491648},"2003":{"StorageFree":2414767144960},"2004":{"StorageFree":2528093798400},"2005":{"StorageFree":2838692470784}}
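
As a sanity check on the broker metrics, here is a sketch of the "20% below harmonic mean" offloading threshold that the rebalance output mentions. The exact comparison used by topicmappr is an assumption; the input is the `/topicmappr/brokermetrics` content shown above, and `offload_candidates` is a hypothetical helper name.

```python
import json
from statistics import harmonic_mean

# Content of /topicmappr/brokermetrics as shown above (free bytes per broker).
broker_metrics_json = (
    '{"2001":{"StorageFree":2520825438208},"2002":{"StorageFree":2318926491648},'
    '"2003":{"StorageFree":2414767144960},"2004":{"StorageFree":2528093798400},'
    '"2005":{"StorageFree":2838692470784}}'
)

# Threshold printed by topicmappr rebalance: 20% below the harmonic mean.
OFFLOAD_THRESHOLD = 0.20


def offload_candidates(metrics_json, threshold=OFFLOAD_THRESHOLD):
    """Return brokers whose free storage is at least `threshold` below the hmean."""
    metrics = json.loads(metrics_json)
    free = {int(broker): m["StorageFree"] for broker, m in metrics.items()}
    cutoff = harmonic_mean(list(free.values())) * (1 - threshold)
    return sorted(broker for broker, value in free.items() if value <= cutoff)


print("offload candidates:", offload_candidates(broker_metrics_json))
```

With the values above no broker falls under the cutoff, so a storage-driven rebalance would likely have little to move.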

The above were the metrics retrieved from Prometheus and stored in Zookeeper. Now the real plan:

elukey@kafka-main2001:~/T341558/main-codfw$ ../topicmappr rebalance --topics __consumer_offsets,__transaction_state,codfw.cpjobqueue.error,codfw.cpjobqueue.retry.mediawiki.job.parsoidCachePrewarm,codfw.eventgate-main.test.event,codfw.maps.tiles_change,codfw.mediainfo-streaming-updater.mutation,codfw.mediawiki.image_suggestions_feedback,codfw.mediawiki.job.RecordLintJob,codfw.mediawiki.job.activityUpdateJob,codfw.mediawiki.job.cdnPurge,codfw.mediawiki.job.flaggedrevs_CacheUpdate,codfw.mediawiki.job.ipinfoLogIPInfoAccess,codfw.mediawiki.job.newUserMessageJob,codfw.mediawiki.job.newcomerTasksCacheRefreshJob,codfw.mediawiki.job.parsoidCachePrewarm,codfw.mediawiki.job.refreshLinksPrioritized,codfw.mediawiki.job.setUserMentorDatabaseJob,codfw.mediawiki.job.updateBetaFeaturesUserCounts,codfw.mediawiki.job.userOptionsUpdate,codfw.mediawiki.job.wikibase-addUsagesForPage,codfw.mediawiki.page_change.v1,codfw.mediawiki.page_outlink_topic_prediction_change.v1,codfw.mediawiki.recentchange,codfw.mediawiki.revision-recommendation-create,codfw.mediawiki.revision-score,codfw.mediawiki.revision-score-test,codfw.mediawiki.revision_score_articlequality,codfw.mediawiki.revision_score_articletopic,codfw.mediawiki.revision_score_damaging,codfw.mediawiki.revision_score_draftquality,codfw.mediawiki.revision_score_drafttopic,codfw.mediawiki.revision_score_goodfaith,codfw.mediawiki.revision_score_reverted,codfw.rdf-streaming-updater.fetch-failure,codfw.rdf-streaming-updater.lapsed-action,codfw.rdf-streaming-updater.mutation,codfw.rdf-streaming-updater.reconcile,codfw.rdf-streaming-updater.state-inconsistency,codfw.resource-purge,codfw.resource_change,eqiad.change-prop.backlinks.resource-change,eqiad.change-prop.transcludes.resource-change,eqiad.change-prop.wikidata.resource-change,eqiad.changeprop.error,eqiad.changeprop.retry.change-prop.backlinks.resource-change,eqiad.changeprop.retry.change-prop.transcludes.resource-change,eqiad.changeprop.retry.change-prop.wikidata.resource-change,eqiad.changeprop.retry.mediawiki.page-create,eqiad.changeprop.retry.mediawiki.revision-create,eqiad.changeprop.retry.resource_change,eqiad.cpjobqueue.error,eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite,eqiad.cpjobqueue.partitioned.mediawiki.job.htmlCacheUpdate,eqiad.cpjobqueue.partitioned.mediawiki.job.refreshLinks,eqiad.cpjobqueue.retry.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite,eqiad.cpjobqueue.retry.cpjobqueue.partitioned.mediawiki.job.htmlCacheUpdate,eqiad.cpjobqueue.retry.cpjobqueue.partitioned.mediawiki.job.refreshLinks,eqiad.cpjobqueue.retry.mediawiki.job.ORESFetchScoreJob,eqiad.cpjobqueue.retry.mediawiki.job.RecordLintJob,eqiad.cpjobqueue.retry.mediawiki.job.categoryMembershipChange,eqiad.cpjobqueue.retry.mediawiki.job.checkuserPruneCheckUserDataJob,eqiad.cpjobqueue.retry.mediawiki.job.cirrusSearchLinksUpdate,eqiad.cpjobqueue.retry.mediawiki.job.parsoidCachePrewarm,eqiad.cpjobqueue.retry.mediawiki.job.recentChangesUpdate,eqiad.cpjobqueue.retry.mediawiki.job.webVideoTranscode,eqiad.cpjobqueue.retry.mediawiki.job.wikibase-addUsagesForPage,eqiad.eventgate-main.error.validation,eqiad.eventgate-main.test.event,eqiad.maps.tiles_change,eqiad.mediainfo-streaming-updater.mutation,eqiad.mediawiki.image_suggestions_feedback,eqiad.mediawiki.job.ArticleChangedJob,eqiad.mediawiki.job.AssembleUploadChunks,eqiad.mediawiki.job.CentralAuthCreateLocalAccountJob,eqiad.mediawiki.job.CleanTermsIfUnused,eqiad.mediawiki.job.CognateCacheUpdateJob,eqiad.mediawiki.job.CognateLocalJobSubmitJob,eqiad.mediawiki.job.DispatchChanges,eqiad.mediawiki.job.EchoNotificationDeleteJob,eqiad.mediawiki.job.EchoPushNotificationRequest,eqiad.mediawiki.job.EntityChangeNotification,eqiad.mediawiki.job.GlobalUserPageLocalJobSubmitJob,eqiad.mediawiki.job.LocalGlobalUserPageCacheUpdateJob,eqiad.mediawiki.job.LocalPageMoveJob,eqiad.mediawiki.job.LocalRenameUserJob,eqiad.mediawiki.job.LoginNotifyChecks,eqiad.mediawiki.job.MessageGroupStatesUpdaterJob,eqiad.mediawiki.job.MessageGroupStatsRebuildJob,eqiad.mediawiki.job.MessageIndexRebuildJob,eqiad.mediawiki.job.ORESFetchScoreJob,eqiad.mediawiki.job.PublishStashedFile,eqiad.mediawiki.job.PurgeEntityData,eqiad.mediawiki.job.RecordLintJob,eqiad.mediawiki.job.RenderTranslationPageJob,eqiad.mediawiki.job.TTMServerMessageUpdateJob,eqiad.mediawiki.job.ThumbnailRender,eqiad.mediawiki.job.UpdateRepoOnDelete,eqiad.mediawiki.job.UpdateRepoOnMove,eqiad.mediawiki.job.UpdateTranslatablePageJob,eqiad.mediawiki.job.activityUpdateJob,eqiad.mediawiki.job.categoryMembershipChange,eqiad.mediawiki.job.cdnPurge,eqiad.mediawiki.job.checkuserPruneCheckUserDataJob,eqiad.mediawiki.job.cirrusSearchDeleteArchive,eqiad.mediawiki.job.cirrusSearchDeletePages,eqiad.mediawiki.job.cirrusSearchElasticaWrite,eqiad.mediawiki.job.cirrusSearchLinksUpdate,eqiad.mediawiki.job.cirrusSearchLinksUpdatePrioritized,eqiad.mediawiki.job.cirrusSearchOtherIndex,eqiad.mediawiki.job.constraintsRunCheck,eqiad.mediawiki.job.constraintsTableUpdate,eqiad.mediawiki.job.deletePage,eqiad.mediawiki.job.enotifNotify,eqiad.mediawiki.job.fetchGoogleCloudVisionAnnotations,eqiad.mediawiki.job.flaggedrevs_CacheUpdate,eqiad.mediawiki.job.globalUsageCachePurge,eqiad.mediawiki.job.htmlCacheUpdate,eqiad.mediawiki.job.ipinfoLogIPInfoAccess,eqiad.mediawiki.job.newUserMessageJob,eqiad.mediawiki.job.newcomerTasksCacheRefreshJob,eqiad.mediawiki.job.notificationGetStartedJob,eqiad.mediawiki.job.notificationKeepGoingJob,eqiad.mediawiki.job.parsoidCachePrewarm,eqiad.mediawiki.job.recentChangesUpdate,eqiad.mediawiki.job.refreshLinks,eqiad.mediawiki.job.refreshLinksPrioritized,eqiad.mediawiki.job.refreshUserImpactJob,eqiad.mediawiki.job.revertedTagUpdate,eqiad.mediawiki.job.setUserMentorDatabaseJob,eqiad.mediawiki.job.updateBetaFeaturesUserCounts,eqiad.mediawiki.job.userOptionsUpdate,eqiad.mediawiki.job.watchlistExpiry,eqiad.mediawiki.job.webVideoTranscode,eqiad.mediawiki.job.webVideoTranscodePrioritized,eqiad.mediawiki.job.wikibase-InjectRCRecords,eqiad.mediawiki.job.wikibase-addUsagesForPage,eqiad.mediawiki.page-create,eqiad.mediawiki.page-delete,eqiad.mediawiki.page-links-change,eqiad.mediawiki.page-move,eqiad.mediawiki.page-properties-change,eqiad.mediawiki.page-restrictions-change,eqiad.mediawiki.page-undelete,eqiad.mediawiki.page_change.v1,eqiad.mediawiki.page_outlink_topic_prediction_change.v1,eqiad.mediawiki.recentchange,eqiad.mediawiki.revision-create,eqiad.mediawiki.revision-recommendation-create,eqiad.mediawiki.revision-score,eqiad.mediawiki.revision-score-test,eqiad.mediawiki.revision-tags-change,eqiad.mediawiki.revision-visibility-change,eqiad.mediawiki.revision_score_articlequality,eqiad.mediawiki.revision_score_articletopic,eqiad.mediawiki.revision_score_damaging,eqiad.mediawiki.revision_score_draftquality,eqiad.mediawiki.revision_score_drafttopic,eqiad.mediawiki.revision_score_goodfaith,eqiad.mediawiki.revision_score_reverted,eqiad.mediawiki.user-blocks-change,eqiad.rdf-streaming-updater.fetch-failure,eqiad.rdf-streaming-updater.lapsed-action,eqiad.rdf-streaming-updater.mutation,eqiad.rdf-streaming-updater.reconcile,eqiad.rdf-streaming-updater.state-inconsistency,eqiad.resource-purge,eqiad.resource_change --zk-addr conf2005.codfw.wmnet:2181 --optimize-leadership --brokers -2

Going to paste the final result when I find a good configuration.

Instead of using rebalance, I tried rebuild (the same command used for the last batch of partition moves) in this way:

./topicmappr rebuild --force-rebuild --topics __consumer_offsets,__transaction_state,codfw.cpjobqueue.error,codfw.cpjobqueue.retry.mediawiki.job.parsoidCachePrewarm,codfw.eventgate-main.test.event,codfw.maps.tiles_change,codfw.mediainfo-streaming-updater.mutation,codfw.mediawiki.image_suggestions_feedback,codfw.mediawiki.job.RecordLintJob,codfw.mediawiki.job.activityUpdateJob,codfw.mediawiki.job.cdnPurge,codfw.mediawiki.job.flaggedrevs_CacheUpdate,codfw.mediawiki.job.ipinfoLogIPInfoAccess,codfw.mediawiki.job.newUserMessageJob,codfw.mediawiki.job.newcomerTasksCacheRefreshJob,codfw.mediawiki.job.parsoidCachePrewarm,codfw.mediawiki.job.refreshLinksPrioritized,codfw.mediawiki.job.setUserMentorDatabaseJob,codfw.mediawiki.job.updateBetaFeaturesUserCounts,codfw.mediawiki.job.userOptionsUpdate,codfw.mediawiki.job.wikibase-addUsagesForPage,codfw.mediawiki.page_change.v1,codfw.mediawiki.page_outlink_topic_prediction_change.v1,codfw.mediawiki.recentchange,codfw.mediawiki.revision-recommendation-create,codfw.mediawiki.revision-score,codfw.mediawiki.revision-score-test,codfw.mediawiki.revision_score_articlequality,codfw.mediawiki.revision_score_articletopic,codfw.mediawiki.revision_score_damaging,codfw.mediawiki.revision_score_draftquality,codfw.mediawiki.revision_score_drafttopic,codfw.mediawiki.revision_score_goodfaith,codfw.mediawiki.revision_score_reverted,codfw.rdf-streaming-updater.fetch-failure,codfw.rdf-streaming-updater.lapsed-action,codfw.rdf-streaming-updater.mutation,codfw.rdf-streaming-updater.reconcile,codfw.rdf-streaming-updater.state-inconsistency,codfw.resource-purge,codfw.resource_change,eqiad.change-prop.backlinks.resource-change,eqiad.change-prop.transcludes.resource-change,eqiad.change-prop.wikidata.resource-change,eqiad.changeprop.error,eqiad.changeprop.retry.change-prop.backlinks.resource-change,eqiad.changeprop.retry.change-prop.transcludes.resource-change,eqiad.changeprop.retry.change-prop.wikidata.resource-change,eqiad.changeprop.retry.mediawiki.page-create,eqiad.changeprop.retry.mediawiki.revision-create,eqiad.changeprop.retry.resource_change,eqiad.cpjobqueue.error,eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite,eqiad.cpjobqueue.partitioned.mediawiki.job.htmlCacheUpdate,eqiad.cpjobqueue.partitioned.mediawiki.job.refreshLinks,eqiad.cpjobqueue.retry.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite,eqiad.cpjobqueue.retry.cpjobqueue.partitioned.mediawiki.job.htmlCacheUpdate,eqiad.cpjobqueue.retry.cpjobqueue.partitioned.mediawiki.job.refreshLinks,eqiad.cpjobqueue.retry.mediawiki.job.ORESFetchScoreJob,eqiad.cpjobqueue.retry.mediawiki.job.RecordLintJob,eqiad.cpjobqueue.retry.mediawiki.job.categoryMembershipChange,eqiad.cpjobqueue.retry.mediawiki.job.checkuserPruneCheckUserDataJob,eqiad.cpjobqueue.retry.mediawiki.job.cirrusSearchLinksUpdate,eqiad.cpjobqueue.retry.mediawiki.job.parsoidCachePrewarm,eqiad.cpjobqueue.retry.mediawiki.job.recentChangesUpdate,eqiad.cpjobqueue.retry.mediawiki.job.webVideoTranscode,eqiad.cpjobqueue.retry.mediawiki.job.wikibase-addUsagesForPage,eqiad.eventgate-main.error.validation,eqiad.eventgate-main.test.event,eqiad.maps.tiles_change,eqiad.mediainfo-streaming-updater.mutation,eqiad.mediawiki.image_suggestions_feedback,eqiad.mediawiki.job.ArticleChangedJob,eqiad.mediawiki.job.AssembleUploadChunks,eqiad.mediawiki.job.CentralAuthCreateLocalAccountJob,eqiad.mediawiki.job.CleanTermsIfUnused,eqiad.mediawiki.job.CognateCacheUpdateJob,eqiad.mediawiki.job.CognateLocalJobSubmitJob,eqiad.mediawiki.job.DispatchChanges,eqiad.mediawiki.job.EchoNotificationDeleteJob,eqiad.mediawiki.job.EchoPushNotificationRequest,eqiad.mediawiki.job.EntityChangeNotification,eqiad.mediawiki.job.GlobalUserPageLocalJobSubmitJob,eqiad.mediawiki.job.LocalGlobalUserPageCacheUpdateJob,eqiad.mediawiki.job.LocalPageMoveJob,eqiad.mediawiki.job.LocalRenameUserJob,eqiad.mediawiki.job.LoginNotifyChecks,eqiad.mediawiki.job.MessageGroupStatesUpdaterJob,eqiad.mediawiki.job.MessageGroupStatsRebuildJob,eqiad.mediawiki.job.MessageIndexRebuildJob,eqiad.mediawiki.job.ORESFetchScoreJob,eqiad.mediawiki.job.PublishStashedFile,eqiad.mediawiki.job.PurgeEntityData,eqiad.mediawiki.job.RecordLintJob,eqiad.mediawiki.job.RenderTranslationPageJob,eqiad.mediawiki.job.TTMServerMessageUpdateJob,eqiad.mediawiki.job.ThumbnailRender,eqiad.mediawiki.job.UpdateRepoOnDelete,eqiad.mediawiki.job.UpdateRepoOnMove,eqiad.mediawiki.job.UpdateTranslatablePageJob,eqiad.mediawiki.job.activityUpdateJob,eqiad.mediawiki.job.categoryMembershipChange,eqiad.mediawiki.job.cdnPurge,eqiad.mediawiki.job.checkuserPruneCheckUserDataJob,eqiad.mediawiki.job.cirrusSearchDeleteArchive,eqiad.mediawiki.job.cirrusSearchDeletePages,eqiad.mediawiki.job.cirrusSearchElasticaWrite,eqiad.mediawiki.job.cirrusSearchLinksUpdate,eqiad.mediawiki.job.cirrusSearchLinksUpdatePrioritized,eqiad.mediawiki.job.cirrusSearchOtherIndex,eqiad.mediawiki.job.constraintsRunCheck,eqiad.mediawiki.job.constraintsTableUpdate,eqiad.mediawiki.job.deletePage,eqiad.mediawiki.job.enotifNotify,eqiad.mediawiki.job.fetchGoogleCloudVisionAnnotations,eqiad.mediawiki.job.flaggedrevs_CacheUpdate,eqiad.mediawiki.job.globalUsageCachePurge,eqiad.mediawiki.job.htmlCacheUpdate,eqiad.mediawiki.job.ipinfoLogIPInfoAccess,eqiad.mediawiki.job.newUserMessageJob,eqiad.mediawiki.job.newcomerTasksCacheRefreshJob,eqiad.mediawiki.job.notificationGetStartedJob,eqiad.mediawiki.job.notificationKeepGoingJob,eqiad.mediawiki.job.parsoidCachePrewarm,eqiad.mediawiki.job.recentChangesUpdate,eqiad.mediawiki.job.refreshLinks,eqiad.mediawiki.job.refreshLinksPrioritized,eqiad.mediawiki.job.refreshUserImpactJob,eqiad.mediawiki.job.revertedTagUpdate,eqiad.mediawiki.job.setUserMentorDatabaseJob,eqiad.mediawiki.job.updateBetaFeaturesUserCounts,eqiad.mediawiki.job.userOptionsUpdate,eqiad.mediawiki.job.watchlistExpiry,eqiad.mediawiki.job.webVideoTranscode,eqiad.mediawiki.job.webVideoTranscodePrioritized,eqiad.mediawiki.job.wikibase-InjectRCRecords,eqiad.mediawiki.job.wikibase-addUsagesForPage,eqiad.mediawiki.page-create,eqiad.mediawiki.page-delete,eqiad.mediawiki.page-links-change,eqiad.mediawiki.page-move,eqiad.mediawiki.page-properties-change,eqiad.mediawiki.page-restrictions-change,eqiad.mediawiki.page-undelete,eqiad.mediawiki.page_change.v1,eqiad.mediawiki.page_outlink_topic_prediction_change.v1,eqiad.mediawiki.recentchange,eqiad.mediawiki.revision-create,eqiad.mediawiki.revision-recommendation-create,eqiad.mediawiki.revision-score,eqiad.mediawiki.revision-score-test,eqiad.mediawiki.revision-tags-change,eqiad.mediawiki.revision-visibility-change,eqiad.mediawiki.revision_score_articlequality,eqiad.mediawiki.revision_score_articletopic,eqiad.mediawiki.revision_score_damaging,eqiad.mediawiki.revision_score_draftquality,eqiad.mediawiki.revision_score_drafttopic,eqiad.mediawiki.revision_score_goodfaith,eqiad.mediawiki.revision_score_reverted,eqiad.mediawiki.user-blocks-change,eqiad.rdf-streaming-updater.fetch-failure,eqiad.rdf-streaming-updater.lapsed-action,eqiad.rdf-streaming-updater.mutation,eqiad.rdf-streaming-updater.reconcile,eqiad.rdf-streaming-updater.state-inconsistency,eqiad.resource-purge,eqiad.resource_change --zk-addr conf2005.codfw.wmnet:2181 --optimize-leadership --brokers -2

Among all the moves, I have isolated the ones having 2004 or 2005 as partition leaders:

{"version":1,"partitions":[{"topic":"codfw.cpjobqueue.retry.mediawiki.job.parsoidCachePrewarm","partition":0,"replicas":[2005,2001,2003]}]}
{"version":1,"partitions":[{"topic":"codfw.eventgate-main.test.event","partition":0,"replicas":[2004,2001,2002]}]}
{"version":1,"partitions":[{"topic":"codfw.maps.tiles_change","partition":0,"replicas":[2004,2003,2005]},{"topic":"codfw.maps.tiles_change","partition":1,"replicas":[2001,2005,2004]},{"topic":"codfw.maps.tiles_change","partition":2,"replicas":[2002,2003,2001]},{"topic":"codfw.maps.tiles_change","partition":3,"replicas":[2003,2001,2002]},{"topic":"codfw.maps.tiles_change","partition":4,"replicas":[2001,2002,2004]},{"topic":"codfw.maps.tiles_change","partition":5,"replicas":[2003,2004,2001]}]}
{"version":1,"partitions":[{"topic":"codfw.mediawiki.image_suggestions_feedback","partition":0,"replicas":[2005,2002,2003]}]}
{"version":1,"partitions":[{"topic":"codfw.mediawiki.job.cdnPurge","partition":0,"replicas":[2005,2002,2003]}]}
{"version":1,"partitions":[{"topic":"codfw.mediawiki.job.flaggedrevs_CacheUpdate","partition":0,"replicas":[2004,2003,2005]}]}
{"version":1,"partitions":[{"topic":"codfw.mediawiki.job.newcomerTasksCacheRefreshJob","partition":0,"replicas":[2005,2001,2002]}]}
{"version":1,"partitions":[{"topic":"codfw.mediawiki.job.RecordLintJob","partition":0,"replicas":[2001,2004,2005]},{"topic":"codfw.mediawiki.job.RecordLintJob","partition":1,"replicas":[2002,2005,2003]},{"topic":"codfw.mediawiki.job.RecordLintJob","partition":2,"replicas":[2004,2001,2002]},{"topic":"codfw.mediawiki.job.RecordLintJob","partition":3,"replicas":[2005,2001,2002]},{"topic":"codfw.mediawiki.job.RecordLintJob","partition":4,"replicas":[2004,2005,2003]}]}
{"version":1,"partitions":[{"topic":"codfw.mediawiki.job.refreshLinksPrioritized","partition":0,"replicas":[2005,2004,2003]}]}
{"version":1,"partitions":[{"topic":"codfw.mediawiki.job.userOptionsUpdate","partition":0,"replicas":[2004,2005,2002]}]}
{"version":1,"partitions":[{"topic":"codfw.mediawiki.recentchange","partition":0,"replicas":[2005,2003,2002]}]}
{"version":1,"partitions":[{"topic":"codfw.mediawiki.revision_score_damaging","partition":0,"replicas":[2005,2004,2001]}]}
{"version":1,"partitions":[{"topic":"codfw.mediawiki.revision_score_draftquality","partition":0,"replicas":[2004,2005,2001]}]}
{"version":1,"partitions":[{"topic":"codfw.mediawiki.revision_score_goodfaith","partition":0,"replicas":[2005,2002,2001]}]}
{"version":1,"partitions":[{"topic":"codfw.mediawiki.revision-score-test","partition":0,"replicas":[2004,2005,2003]}]}
{"version":1,"partitions":[{"topic":"codfw.rdf-streaming-updater.lapsed-action","partition":0,"replicas":[2004,2005,2003]}]}
{"version":1,"partitions":[{"topic":"codfw.rdf-streaming-updater.mutation","partition":0,"replicas":[2004,2003,2002]}]}
{"version":1,"partitions":[{"topic":"codfw.rdf-streaming-updater.reconcile","partition":0,"replicas":[2005,2004,2001]}]}
{"version":1,"partitions":[{"topic":"codfw.resource-purge","partition":0,"replicas":[2003,2004,2005]},{"topic":"codfw.resource-purge","partition":1,"replicas":[2005,2001,2003]},{"topic":"codfw.resource-purge","partition":2,"replicas":[2004,2002,2003]},{"topic":"codfw.resource-purge","partition":3,"replicas":[2002,2001,2004]},{"topic":"codfw.resource-purge","partition":4,"replicas":[2004,2003,2005]}]}
{"version":1,"partitions":[{"topic":"__consumer_offsets","partition":0,"replicas":[2005,2002,2003]},{"topic":"__consumer_offsets","partition":1,"replicas":[2001,2004,2005]},{"topic":"__consumer_offsets","partition":2,"replicas":[2002,2004,2001]},{"topic":"__consumer_offsets","partition":3,"replicas":[2004,2003,2002]},{"topic":"__consumer_offsets","partition":4,"replicas":[2005,2003,2002]},{"topic":"__consumer_offsets","partition":5,"replicas":[2004,2003,2005]},{"topic":"__consumer_offsets","partition":6,"replicas":[2003,2005,2001]},{"topic":"__consumer_offsets","partition":7,"replicas":[2001,2004,2002]},{"topic":"__consumer_offsets","partition":8,"replicas":[2005,2004,2001]},{"topic":"__consumer_offsets","partition":9,"replicas":[2002,2001,2005]},{"topic":"__consumer_offsets","partition":10,"replicas":[2004,2003,2002]},{"topic":"__consumer_offsets","partition":11,"replicas":[2003,2001,2005]},{"topic":"__consumer_offsets","partition":12,"replicas":[2003,2002,2004]},{"topic":"__consumer_offsets","partition":13,"replicas":[2001,2004,2002]},{"topic":"__consumer_offsets","partition":14,"replicas":[2005,2004,2003]},{"topic":"__consumer_offsets","partition":15,"replicas":[2002,2005,2003]},{"topic":"__consumer_offsets","partition":16,"replicas":[2005,2004,2001]},{"topic":"__consumer_offsets","partition":17,"replicas":[2001,2002,2003]},{"topic":"__consumer_offsets","partition":18,"replicas":[2004,2005,2001]},{"topic":"__consumer_offsets","partition":19,"replicas":[2005,2004,2001]},{"topic":"__consumer_offsets","partition":20,"replicas":[2002,2003,2005]},{"topic":"__consumer_offsets","partition":21,"replicas":[2003,2001,2002]},{"topic":"__consumer_offsets","partition":22,"replicas":[2002,2004,2003]},{"topic":"__consumer_offsets","partition":23,"replicas":[2001,2003,2004]},{"topic":"__consumer_offsets","partition":24,"replicas":[2001,2002,2005]},{"topic":"__consumer_offsets","partition":25,"replicas":[2004,2003,2005]},{"topic":"__consumer_offsets","partition":26,"replicas":[2003,2002,2001]},{"topic":"__consumer_offsets","partition":27,"replicas":[2002,2005,2003]},{"topic":"__consumer_offsets","partition":28,"replicas":[2004,2002,2001]},{"topic":"__consumer_offsets","partition":29,"replicas":[2005,2002,2001]},{"topic":"__consumer_offsets","partition":30,"replicas":[2003,2005,2004]},{"topic":"__consumer_offsets","partition":31,"replicas":[2005,2004,2001]},{"topic":"__consumer_offsets","partition":32,"replicas":[2004,2002,2003]},{"topic":"__consumer_offsets","partition":33,"replicas":[2001,2004,2003]},{"topic":"__consumer_offsets","partition":34,"replicas":[2005,2002,2001]},{"topic":"__consumer_offsets","partition":35,"replicas":[2002,2003,2005]},{"topic":"__consumer_offsets","partition":36,"replicas":[2003,2004,2001]},{"topic":"__consumer_offsets","partition":37,"replicas":[2002,2004,2003]},{"topic":"__consumer_offsets","partition":38,"replicas":[2001,2004,2002]},{"topic":"__consumer_offsets","partition":39,"replicas":[2004,2002,2005]},{"topic":"__consumer_offsets","partition":40,"replicas":[2003,2005,2001]},{"topic":"__consumer_offsets","partition":41,"replicas":[2001,2005,2003]},{"topic":"__consumer_offsets","partition":42,"replicas":[2005,2004,2002]},{"topic":"__consumer_offsets","partition":43,"replicas":[2002,2004,2001]},{"topic":"__consumer_offsets","partition":44,"replicas":[2005,2001,2002]},{"topic":"__consumer_offsets","partition":45,"replicas":[2004,2003,2005]},{"topic":"__consumer_offsets","partition":46,"replicas":[2003,2001,2004]},{"topic":"__consumer_offsets","partition":47,"replicas":[2004,2002,2003]},{"topic":"__consumer_offsets","partition":48,"replicas":[2001,2002,2004]},{"topic":"__consumer_offsets","partition":49,"replicas":[2002,2003,2005]}]}
{"version":1,"partitions":[{"topic":"eqiad.changeprop.retry.change-prop.wikidata.resource-change","partition":0,"replicas":[2004,2001,2003]}]}
{"version":1,"partitions":[{"topic":"eqiad.changeprop.retry.mediawiki.page-create","partition":0,"replicas":[2005,2003,2002]}]}
{"version":1,"partitions":[{"topic":"eqiad.change-prop.transcludes.resource-change","partition":0,"replicas":[2005,2004,2002]},{"topic":"eqiad.change-prop.transcludes.resource-change","partition":1,"replicas":[2003,2001,2005]},{"topic":"eqiad.change-prop.transcludes.resource-change","partition":2,"replicas":[2001,2004,2005]},{"topic":"eqiad.change-prop.transcludes.resource-change","partition":3,"replicas":[2002,2004,2001]},{"topic":"eqiad.change-prop.transcludes.resource-change","partition":4,"replicas":[2004,2005,2003]}]}
{"version":1,"partitions":[{"topic":"eqiad.change-prop.wikidata.resource-change","partition":0,"replicas":[2005,2004,2003]}]}
{"version":1,"partitions":[{"topic":"eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite","partition":0,"replicas":[2004,2005,2001]},{"topic":"eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite","partition":1,"replicas":[2005,2003,2002]},{"topic":"eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite","partition":2,"replicas":[2003,2004,2002]},{"topic":"eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite","partition":3,"replicas":[2001,2003,2005]},{"topic":"eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite","partition":4,"replicas":[2002,2004,2001]},{"topic":"eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite","partition":5,"replicas":[2004,2002,2003]}]}
{"version":1,"partitions":[{"topic":"eqiad.cpjobqueue.partitioned.mediawiki.job.htmlCacheUpdate","partition":0,"replicas":[2004,2005,2002]},{"topic":"eqiad.cpjobqueue.partitioned.mediawiki.job.htmlCacheUpdate","partition":1,"replicas":[2005,2001,2003]},{"topic":"eqiad.cpjobqueue.partitioned.mediawiki.job.htmlCacheUpdate","partition":2,"replicas":[2005,2004,2001]},{"topic":"eqiad.cpjobqueue.partitioned.mediawiki.job.htmlCacheUpdate","partition":3,"replicas":[2002,2004,2005]},{"topic":"eqiad.cpjobqueue.partitioned.mediawiki.job.htmlCacheUpdate","partition":4,"replicas":[2001,2004,2003]},{"topic":"eqiad.cpjobqueue.partitioned.mediawiki.job.htmlCacheUpdate","partition":5,"replicas":[2003,2005,2002]},{"topic":"eqiad.cpjobqueue.partitioned.mediawiki.job.htmlCacheUpdate","partition":6,"replicas":[2003,2002,2001]},{"topic":"eqiad.cpjobqueue.partitioned.mediawiki.job.htmlCacheUpdate","partition":7,"replicas":[2004,2005,2001]}]}
{"version":1,"partitions":[{"topic":"eqiad.cpjobqueue.partitioned.mediawiki.job.refreshLinks","partition":0,"replicas":[2001,2002,2004]},{"topic":"eqiad.cpjobqueue.partitioned.mediawiki.job.refreshLinks","partition":1,"replicas":[2005,2004,2003]},{"topic":"eqiad.cpjobqueue.partitioned.mediawiki.job.refreshLinks","partition":2,"replicas":[2002,2005,2001]},{"topic":"eqiad.cpjobqueue.partitioned.mediawiki.job.refreshLinks","partition":3,"replicas":[2001,2005,2003]},{"topic":"eqiad.cpjobqueue.partitioned.mediawiki.job.refreshLinks","partition":4,"replicas":[2003,2002,2001]},{"topic":"eqiad.cpjobqueue.partitioned.mediawiki.job.refreshLinks","partition":5,"replicas":[2002,2005,2001]},{"topic":"eqiad.cpjobqueue.partitioned.mediawiki.job.refreshLinks","partition":6,"replicas":[2004,2003,2002]},{"topic":"eqiad.cpjobqueue.partitioned.mediawiki.job.refreshLinks","partition":7,"replicas":[2005,2002,2004]}]}
{"version":1,"partitions":[{"topic":"eqiad.cpjobqueue.retry.mediawiki.job.cirrusSearchLinksUpdate","partition":0,"replicas":[2004,2002,2001]}]}
{"version":1,"partitions":[{"topic":"eqiad.cpjobqueue.retry.mediawiki.job.ORESFetchScoreJob","partition":0,"replicas":[2004,2005,2003]}]}
{"version":1,"partitions":[{"topic":"eqiad.cpjobqueue.retry.mediawiki.job.parsoidCachePrewarm","partition":0,"replicas":[2005,2003,2004]}]}
{"version":1,"partitions":[{"topic":"eqiad.cpjobqueue.retry.mediawiki.job.RecordLintJob","partition":0,"replicas":[2005,2004,2001]}]}
{"version":1,"partitions":[{"topic":"eqiad.cpjobqueue.retry.mediawiki.job.wikibase-addUsagesForPage","partition":0,"replicas":[2005,2004,2001]}]}
{"version":1,"partitions":[{"topic":"eqiad.eventgate-main.error.validation","partition":0,"replicas":[2004,2005,2002]}]}
{"version":1,"partitions":[{"topic":"eqiad.maps.tiles_change","partition":0,"replicas":[2004,2001,2003]},{"topic":"eqiad.maps.tiles_change","partition":1,"replicas":[2005,2002,2003]},{"topic":"eqiad.maps.tiles_change","partition":2,"replicas":[2001,2004,2003]},{"topic":"eqiad.maps.tiles_change","partition":3,"replicas":[2002,2004,2005]},{"topic":"eqiad.maps.tiles_change","partition":4,"replicas":[2003,2005,2002]},{"topic":"eqiad.maps.tiles_change","partition":5,"replicas":[2004,2001,2002]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediainfo-streaming-updater.mutation","partition":0,"replicas":[2005,2001,2002]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.job.activityUpdateJob","partition":0,"replicas":[2005,2002,2001]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.job.AssembleUploadChunks","partition":0,"replicas":[2005,2003,2001]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.job.categoryMembershipChange","partition":0,"replicas":[2004,2002,2001]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.job.cdnPurge","partition":0,"replicas":[2004,2003,2005]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.job.cirrusSearchElasticaWrite","partition":0,"replicas":[2002,2004,2003]},{"topic":"eqiad.mediawiki.job.cirrusSearchElasticaWrite","partition":1,"replicas":[2005,2004,2001]},{"topic":"eqiad.mediawiki.job.cirrusSearchElasticaWrite","partition":2,"replicas":[2003,2002,2005]},{"topic":"eqiad.mediawiki.job.cirrusSearchElasticaWrite","partition":3,"replicas":[2001,2002,2005]},{"topic":"eqiad.mediawiki.job.cirrusSearchElasticaWrite","partition":4,"replicas":[2002,2001,2003]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.job.cirrusSearchLinksUpdate","partition":0,"replicas":[2001,2005,2004]},{"topic":"eqiad.mediawiki.job.cirrusSearchLinksUpdate","partition":1,"replicas":[2005,2004,2003]},{"topic":"eqiad.mediawiki.job.cirrusSearchLinksUpdate","partition":2,"replicas":[2004,2005,2003]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.job.cirrusSearchLinksUpdatePrioritized","partition":0,"replicas":[2005,2002,2001]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.job.cirrusSearchOtherIndex","partition":0,"replicas":[2004,2005,2002]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.job.CleanTermsIfUnused","partition":0,"replicas":[2004,2001,2005]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.job.CognateLocalJobSubmitJob","partition":0,"replicas":[2004,2003,2005]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.job.constraintsTableUpdate","partition":0,"replicas":[2005,2003,2001]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.job.EntityChangeNotification","partition":0,"replicas":[2005,2003,2001]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.job.flaggedrevs_CacheUpdate","partition":0,"replicas":[2004,2001,2005]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.job.htmlCacheUpdate","partition":0,"replicas":[2004,2005,2003]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.job.LocalGlobalUserPageCacheUpdateJob","partition":0,"replicas":[2004,2001,2002]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.job.MessageGroupStatesUpdaterJob","partition":0,"replicas":[2005,2004,2003]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.job.notificationGetStartedJob","partition":0,"replicas":[2005,2004,2003]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.job.notificationKeepGoingJob","partition":0,"replicas":[2004,2003,2002]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.job.ORESFetchScoreJob","partition":0,"replicas":[2004,2005,2002]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.job.parsoidCachePrewarm","partition":0,"replicas":[2001,2005,2002]},{"topic":"eqiad.mediawiki.job.parsoidCachePrewarm","partition":1,"replicas":[2003,2002,2005]},{"topic":"eqiad.mediawiki.job.parsoidCachePrewarm","partition":2,"replicas":[2005,2001,2004]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.job.PurgeEntityData","partition":0,"replicas":[2005,2004,2003]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.job.RecordLintJob","partition":0,"replicas":[2004,2005,2001]},{"topic":"eqiad.mediawiki.job.RecordLintJob","partition":1,"replicas":[2001,2004,2002]},{"topic":"eqiad.mediawiki.job.RecordLintJob","partition":2,"replicas":[2005,2004,2001]},{"topic":"eqiad.mediawiki.job.RecordLintJob","partition":3,"replicas":[2002,2001,2005]},{"topic":"eqiad.mediawiki.job.RecordLintJob","partition":4,"replicas":[2005,2002,2003]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.job.refreshLinks","partition":0,"replicas":[2004,2003,2001]},{"topic":"eqiad.mediawiki.job.refreshLinks","partition":1,"replicas":[2003,2001,2002]},{"topic":"eqiad.mediawiki.job.refreshLinks","partition":2,"replicas":[2005,2003,2002]},{"topic":"eqiad.mediawiki.job.refreshLinks","partition":3,"replicas":[2001,2002,2004]},{"topic":"eqiad.mediawiki.job.refreshLinks","partition":4,"replicas":[2004,2005,2001]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.job.revertedTagUpdate","partition":0,"replicas":[2005,2002,2003]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.job.UpdateRepoOnDelete","partition":0,"replicas":[2004,2003,2005]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.job.webVideoTranscodePrioritized","partition":0,"replicas":[2004,2001,2002]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.job.wikibase-InjectRCRecords","partition":0,"replicas":[2005,2004,2002]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.page_change.v1","partition":0,"replicas":[2004,2002,2003]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.page-delete","partition":0,"replicas":[2005,2002,2003]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.page-links-change","partition":0,"replicas":[2004,2005,2001]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.page-move","partition":0,"replicas":[2005,2004,2001]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.page_outlink_topic_prediction_change.v1","partition":0,"replicas":[2004,2001,2005]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.revision-create","partition":0,"replicas":[2005,2001,2004]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.revision_score_articletopic","partition":0,"replicas":[2004,2005,2003]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.revision_score_damaging","partition":0,"replicas":[2005,2001,2004]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.revision-score","partition":0,"replicas":[2005,2002,2004]}]}
{"version":1,"partitions":[{"topic":"eqiad.mediawiki.user-blocks-change","partition":0,"replicas":[2004,2001,2005]}]}
{"version":1,"partitions":[{"topic":"eqiad.rdf-streaming-updater.lapsed-action","partition":0,"replicas":[2004,2001,2003]}]}
{"version":1,"partitions":[{"topic":"eqiad.rdf-streaming-updater.mutation","partition":0,"replicas":[2005,2002,2003]}]}
{"version":1,"partitions":[{"topic":"eqiad.rdf-streaming-updater.state-inconsistency","partition":0,"replicas":[2004,2003,2001]}]}
{"version":1,"partitions":[{"topic":"eqiad.resource-purge","partition":0,"replicas":[2003,2001,2004]},{"topic":"eqiad.resource-purge","partition":1,"replicas":[2005,2004,2002]},{"topic":"eqiad.resource-purge","partition":2,"replicas":[2004,2005,2002]},{"topic":"eqiad.resource-purge","partition":3,"replicas":[2005,2001,2003]},{"topic":"eqiad.resource-purge","partition":4,"replicas":[2002,2003,2005]}]}
{"version":1,"partitions":[{"topic":"__transaction_state","partition":0,"replicas":[2003,2001,2005]},{"topic":"__transaction_state","partition":1,"replicas":[2001,2002,2005]},{"topic":"__transaction_state","partition":2,"replicas":[2005,2002,2001]},{"topic":"__transaction_state","partition":3,"replicas":[2004,2002,2003]},{"topic":"__transaction_state","partition":4,"replicas":[2005,2001,2003]},{"topic":"__transaction_state","partition":5,"replicas":[2004,2003,2005]},{"topic":"__transaction_state","partition":6,"replicas":[2001,2002,2004]},{"topic":"__transaction_state","partition":7,"replicas":[2002,2003,2004]},{"topic":"__transaction_state","partition":8,"replicas":[2001,2004,2005]},{"topic":"__transaction_state","partition":9,"replicas":[2003,2004,2005]},{"topic":"__transaction_state","partition":10,"replicas":[2002,2003,2005]},{"topic":"__transaction_state","partition":11,"replicas":[2003,2002,2001]},{"topic":"__transaction_state","partition":12,"replicas":[2002,2005,2001]},{"topic":"__transaction_state","partition":13,"replicas":[2004,2001,2002]},{"topic":"__transaction_state","partition":14,"replicas":[2005,2001,2004]},{"topic":"__transaction_state","partition":15,"replicas":[2004,2005,2003]},{"topic":"__transaction_state","partition":16,"replicas":[2003,2001,2004]},{"topic":"__transaction_state","partition":17,"replicas":[2001,2002,2003]},{"topic":"__transaction_state","partition":18,"replicas":[2002,2004,2001]},{"topic":"__transaction_state","partition":19,"replicas":[2005,2002,2003]},{"topic":"__transaction_state","partition":20,"replicas":[2004,2003,2005]},{"topic":"__transaction_state","partition":21,"replicas":[2001,2002,2005]},{"topic":"__transaction_state","partition":22,"replicas":[2005,2002,2003]},{"topic":"__transaction_state","partition":23,"replicas":[2002,2005,2004]},{"topic":"__transaction_state","partition":24,"replicas":[2003,2005,2001]},{"topic":"__transaction_state","partition":25,"replicas":[2003,2004,2001]},{"topic":"__transaction_state","partition":26,"replicas":[2001,2004,2003]},{"topic":"__transaction_state","partition":27,"replicas":[2004,2002,2001]},{"topic":"__transaction_state","partition":28,"replicas":[2002,2005,2004]},{"topic":"__transaction_state","partition":29,"replicas":[2005,2004,2001]},{"topic":"__transaction_state","partition":30,"replicas":[2005,2003,2002]},{"topic":"__transaction_state","partition":31,"replicas":[2003,2001,2002]},{"topic":"__transaction_state","partition":32,"replicas":[2002,2004,2003]},{"topic":"__transaction_state","partition":33,"replicas":[2001,2004,2005]},{"topic":"__transaction_state","partition":34,"replicas":[2003,2002,2005]},{"topic":"__transaction_state","partition":35,"replicas":[2004,2005,2003]},{"topic":"__transaction_state","partition":36,"replicas":[2005,2001,2002]},{"topic":"__transaction_state","partition":37,"replicas":[2001,2002,2003]},{"topic":"__transaction_state","partition":38,"replicas":[2004,2001,2005]},{"topic":"__transaction_state","partition":39,"replicas":[2005,2004,2003]},{"topic":"__transaction_state","partition":40,"replicas":[2002,2003,2001]},{"topic":"__transaction_state","partition":41,"replicas":[2001,2004,2005]},{"topic":"__transaction_state","partition":42,"replicas":[2003,2004,2002]},{"topic":"__transaction_state","partition":43,"replicas":[2004,2005,2001]},{"topic":"__transaction_state","partition":44,"replicas":[2002,2005,2003]},{"topic":"__transaction_state","partition":45,"replicas":[2003,2004,2001]},{"topic":"__transaction_state","partition":46,"replicas":[2001,2005,2002]},{"topic":"__transaction_state","partition":47,"replicas":[2002,2005,2004]},{"topic":"__transaction_state","partition":48,"replicas":[2004,2003,2002]},{"topic":"__transaction_state","partition":49,"replicas":[2005,2003,2001]}]}

All of the above is split into multiple files, which we'll apply one at a time. The idea is to move more partitions to 2004 and 2005 to improve the balance between brokers.
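As an illustration of the "multiple files" split, here is a minimal Python sketch that groups one reassignment plan into one plan per topic. It is not the tooling actually used for this task; the function name `split_plan_by_topic` is hypothetical, and the sample entries are copied from the plan above.

```python
import json
from collections import defaultdict


def split_plan_by_topic(plan: dict) -> dict:
    """Group the partitions of a reassignment plan by topic.

    Returns a mapping topic -> plan containing only that topic's partitions.
    (Hypothetical helper, for illustration only.)
    """
    by_topic = defaultdict(list)
    for entry in plan["partitions"]:
        by_topic[entry["topic"]].append(entry)
    return {
        topic: {"version": plan.get("version", 1), "partitions": parts}
        for topic, parts in by_topic.items()
    }


# Sample entries copied from the codfw plan above.
plan = {
    "version": 1,
    "partitions": [
        {"topic": "codfw.mediawiki.job.cdnPurge", "partition": 0, "replicas": [2005, 2002, 2003]},
        {"topic": "codfw.resource-purge", "partition": 0, "replicas": [2003, 2004, 2005]},
        {"topic": "codfw.resource-purge", "partition": 1, "replicas": [2005, 2001, 2003]},
    ],
}

per_topic = split_plan_by_topic(plan)
for topic, topic_plan in sorted(per_topic.items()):
    # One compact JSON document per topic, same layout as the lines above.
    print(json.dumps(topic_plan, separators=(",", ":")))
```

Each printed line could then be saved to its own file and applied separately, which keeps every step small and easy to stop or revert.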

As I did last time, I created https://gitlab.wikimedia.org/elukey/kafka_main_rebalance/-/tree/main/main-codfw. I'll create and commit "completed" and "rollback" command JSON files to use in case we need to restore the previous state.
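A rollback file is simply a plan that lists the replicas each partition had before the move. The sketch below shows one way to derive it, assuming the pre-move assignment has already been recorded; `build_rollback` and the "current" replica list are made up for illustration and do not reflect the real cluster state or the files in the repository.

```python
import json


def build_rollback(plan: dict, current: dict) -> dict:
    """Build a plan restoring the pre-move replicas of every partition in `plan`.

    `current` maps (topic, partition) -> replica list observed before the move.
    (Hypothetical helper, for illustration only.)
    """
    partitions = []
    for entry in plan["partitions"]:
        key = (entry["topic"], entry["partition"])
        if key not in current:
            # Refuse to guess: a rollback with missing partitions is unsafe.
            raise KeyError(f"no recorded assignment for {key}")
        partitions.append(
            {"topic": key[0], "partition": key[1], "replicas": list(current[key])}
        )
    return {"version": 1, "partitions": partitions}


# Target assignment copied from the plan above.
plan = {
    "version": 1,
    "partitions": [
        {"topic": "codfw.mediawiki.recentchange", "partition": 0, "replicas": [2005, 2003, 2002]},
    ],
}
# Assumed pre-move state (made-up values, NOT the real assignment).
current = {("codfw.mediawiki.recentchange", 0): [2002, 2001, 2003]}

rollback = build_rollback(plan, current)
print(json.dumps(rollback, separators=(",", ":")))
```

Raising on a missing partition is a deliberate choice here: it is better to fail while preparing the files than to discover an incomplete rollback during an incident.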

Mentioned in SAL (#wikimedia-operations) [2023-07-17T14:10:16Z] <elukey> start kafka partitions rebalance for main-codfw (long running maintenance, see https://phabricator.wikimedia.org/T341558)

Mentioned in SAL (#wikimedia-operations) [2023-07-17T16:12:48Z] <elukey> stop kafka-main codfw maintenance - T341558

Mentioned in SAL (#wikimedia-operations) [2023-07-18T07:16:46Z] <elukey> restart kafka main-codfw rebalances (long maintenance) - T341558

I haven't done all the moves, since the current status seems ok to me. Relevant highlights:

  • The data stored on each broker is more balanced now; of course, it may vary if we push more traffic to the cluster.

Screenshot from 2023-07-19 10-38-09.png (2×1 px, 146 KB)

  • There are now noticeably more partition leaders on 2004/2005 (roughly +20 on each broker). The gap with the other brokers is still large, but we also have a lot of topics/partitions with zero traffic, which count toward the imbalance. We could extend the rebalance to all topics: it would take longer, but we'd definitely improve this metric as well.

Screenshot from 2023-07-19 10-42-26.png (2×1 px, 51 KB)
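For a rough check of leader distribution without the dashboard, the preferred leader of a partition is the first broker in its replica list, so it can be counted straight from the plan files. This is a minimal sketch, not the query behind the graphs above; `preferred_leader_counts` is a hypothetical name and the sample data is the codfw.maps.tiles_change entry from the plan.

```python
from collections import Counter


def preferred_leader_counts(plans, brokers):
    """Count preferred leaders (first replica of each partition) per broker.

    Brokers that lead nothing are reported with a count of 0.
    (Hypothetical helper, for illustration only.)
    """
    counts = Counter()
    for plan in plans:
        for entry in plan["partitions"]:
            counts[entry["replicas"][0]] += 1
    return {broker: counts.get(broker, 0) for broker in brokers}


# codfw.maps.tiles_change, copied from the plan above.
tiles_change = {
    "version": 1,
    "partitions": [
        {"topic": "codfw.maps.tiles_change", "partition": 0, "replicas": [2004, 2003, 2005]},
        {"topic": "codfw.maps.tiles_change", "partition": 1, "replicas": [2001, 2005, 2004]},
        {"topic": "codfw.maps.tiles_change", "partition": 2, "replicas": [2002, 2003, 2001]},
        {"topic": "codfw.maps.tiles_change", "partition": 3, "replicas": [2003, 2001, 2002]},
        {"topic": "codfw.maps.tiles_change", "partition": 4, "replicas": [2001, 2002, 2004]},
        {"topic": "codfw.maps.tiles_change", "partition": 5, "replicas": [2003, 2004, 2001]},
    ],
}

leaders = preferred_leader_counts([tiles_change], brokers=[2001, 2002, 2003, 2004, 2005])
print(leaders)  # {2001: 2, 2002: 1, 2003: 2, 2004: 1, 2005: 0}
```

Note that this counts every partition equally, so zero-traffic topics weigh as much as busy ones; it says nothing about the actual load each leader carries.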

Created the main-eqiad plan with:

./topicmappr rebuild --force-rebuild --topics  __consumer_offsets,__transaction_state,codfw.changeprop.error,codfw.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite,codfw.cpjobqueue.retry.mediawiki.job.parsoidCachePrewarm,codfw.eventgate-main.test.event,codfw.maps.tiles_change,codfw.mediainfo-streaming-updater.mutation,codfw.mediawiki.image_suggestions_feedback,codfw.mediawiki.job.RecordLintJob,codfw.mediawiki.job.activityUpdateJob,codfw.mediawiki.job.cdnPurge,codfw.mediawiki.job.cirrusSearchElasticaWrite,codfw.mediawiki.job.flaggedrevs_CacheUpdate,codfw.mediawiki.job.ipinfoLogIPInfoAccess,codfw.mediawiki.job.newUserMessageJob,codfw.mediawiki.job.newcomerTasksCacheRefreshJob,codfw.mediawiki.job.parsoidCachePrewarm,codfw.mediawiki.job.refreshLinksPrioritized,codfw.mediawiki.job.setUserMentorDatabaseJob,codfw.mediawiki.job.updateBetaFeaturesUserCounts,codfw.mediawiki.job.userOptionsUpdate,codfw.mediawiki.job.wikibase-addUsagesForPage,codfw.mediawiki.page_change.v1,codfw.mediawiki.page_outlink_topic_prediction_change.v1,codfw.mediawiki.recentchange,codfw.mediawiki.revision-recommendation-create,codfw.mediawiki.revision-score,codfw.mediawiki.revision-score-test,codfw.mediawiki.revision_score_articlequality,codfw.mediawiki.revision_score_articletopic,codfw.mediawiki.revision_score_damaging,codfw.mediawiki.revision_score_draftquality,codfw.mediawiki.revision_score_drafttopic,codfw.mediawiki.revision_score_goodfaith,codfw.mediawiki.revision_score_reverted,codfw.rdf-streaming-updater.fetch-failure,codfw.rdf-streaming-updater.lapsed-action,codfw.rdf-streaming-updater.mutation,codfw.rdf-streaming-updater.reconcile,codfw.rdf-streaming-updater.state-inconsistency,codfw.resource-purge,codfw.resource_change,eqiad.change-prop.backlinks.resource-change,eqiad.change-prop.transcludes.resource-change,eqiad.change-prop.wikidata.resource-change,eqiad.changeprop.error,eqiad.changeprop.retry.change-prop.backlinks.resource-change,eqiad.changeprop.retry.change-prop.transcludes.resource-change,eqiad.changeprop.retry.change-prop.wikidata.resource-change,eqiad.changeprop.retry.mediawiki.page-create,eqiad.changeprop.retry.mediawiki.page-delete,eqiad.changeprop.retry.mediawiki.revision-create,eqiad.changeprop.retry.resource_change,eqiad.cpjobqueue.error,eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite,eqiad.cpjobqueue.partitioned.mediawiki.job.htmlCacheUpdate,eqiad.cpjobqueue.partitioned.mediawiki.job.refreshLinks,eqiad.cpjobqueue.retry.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite,eqiad.cpjobqueue.retry.cpjobqueue.partitioned.mediawiki.job.refreshLinks,eqiad.cpjobqueue.retry.mediawiki.job.ORESFetchScoreJob,eqiad.cpjobqueue.retry.mediawiki.job.RecordLintJob,eqiad.cpjobqueue.retry.mediawiki.job.RenderTranslationPageJob,eqiad.cpjobqueue.retry.mediawiki.job.categoryMembershipChange,eqiad.cpjobqueue.retry.mediawiki.job.cirrusSearchCheckerJob,eqiad.cpjobqueue.retry.mediawiki.job.cirrusSearchLinksUpdate,eqiad.cpjobqueue.retry.mediawiki.job.parsoidCachePrewarm,eqiad.cpjobqueue.retry.mediawiki.job.refreshLinksPrioritized,eqiad.cpjobqueue.retry.mediawiki.job.wikibase-addUsagesForPage,eqiad.eventgate-main.error.validation,eqiad.eventgate-main.test.event,eqiad.maps.tiles_change,eqiad.mediainfo-streaming-updater.mutation,eqiad.mediawiki.image_suggestions_feedback,eqiad.mediawiki.job.ArticleChangedJob,eqiad.mediawiki.job.AssembleUploadChunks,eqiad.mediawiki.job.CentralAuthCreateLocalAccountJob,eqiad.mediawiki.job.ChangeDeletionNotification,eqiad.mediawiki.job.CleanTermsIfUnused,eqiad.mediawiki.job.CognateCacheUpdateJob,eqiad.mediawiki.job.CognateLocalJobSubmitJob,eqiad.mediawiki.job.DispatchChangeDeletionNotification,eqiad.mediawiki.job.DispatchChanges,eqiad.mediawiki.job.EchoNotificationDeleteJob,eqiad.mediawiki.job.EchoPushNotificationRequest,eqiad.mediawiki.job.EntityChangeNotification,eqiad.mediawiki.job.GlobalUserPageLocalJobSubmitJob,eqiad.mediawiki.job.LocalGlobalUserPageCacheUpdateJob,eqiad.mediawiki.job.LoginNotifyChecks,eqiad.mediawiki.job.MessageGroupStatesUpdaterJob,eqiad.mediawiki.job.MessageGroupStatsRebuildJob,eqiad.mediawiki.job.MessageIndexRebuildJob,eqiad.mediawiki.job.ORESFetchScoreJob,eqiad.mediawiki.job.PublishStashedFile,eqiad.mediawiki.job.PurgeEntityData,eqiad.mediawiki.job.RecordLintJob,eqiad.mediawiki.job.RenderTranslationPageJob,eqiad.mediawiki.job.TTMServerMessageUpdateJob,eqiad.mediawiki.job.ThumbnailRender,eqiad.mediawiki.job.UpdateRepoOnDelete,eqiad.mediawiki.job.UpdateRepoOnMove,eqiad.mediawiki.job.UpdateTranslatablePageJob,eqiad.mediawiki.job.activityUpdateJob,eqiad.mediawiki.job.categoryMembershipChange,eqiad.mediawiki.job.cdnPurge,eqiad.mediawiki.job.checkuserPruneCheckUserDataJob,eqiad.mediawiki.job.cirrusSearchDeleteArchive,eqiad.mediawiki.job.cirrusSearchDeletePages,eqiad.mediawiki.job.cirrusSearchElasticaWrite,eqiad.mediawiki.job.cirrusSearchLinksUpdate,eqiad.mediawiki.job.cirrusSearchLinksUpdatePrioritized,eqiad.mediawiki.job.cirrusSearchOtherIndex,eqiad.mediawiki.job.compileArticleMetadata,eqiad.mediawiki.job.constraintsRunCheck,eqiad.mediawiki.job.constraintsTableUpdate,eqiad.mediawiki.job.deletePage,eqiad.mediawiki.job.enotifNotify,eqiad.mediawiki.job.fetchGoogleCloudVisionAnnotations,eqiad.mediawiki.job.flaggedrevs_CacheUpdate,eqiad.mediawiki.job.globalUsageCachePurge,eqiad.mediawiki.job.htmlCacheUpdate,eqiad.mediawiki.job.ipinfoLogIPInfoAccess,eqiad.mediawiki.job.newUserMessageJob,eqiad.mediawiki.job.newcomerTasksCacheRefreshJob,eqiad.mediawiki.job.notificationGetStartedJob,eqiad.mediawiki.job.notificationKeepGoingJob,eqiad.mediawiki.job.parsoidCachePrewarm,eqiad.mediawiki.job.recentChangesUpdate,eqiad.mediawiki.job.refreshLinks,eqiad.mediawiki.job.refreshLinksPrioritized,eqiad.mediawiki.job.refreshUserImpactJob,eqiad.mediawiki.job.revertedTagUpdate,eqiad.mediawiki.job.setUserMentorDatabaseJob,eqiad.mediawiki.job.updateBetaFeaturesUserCounts,eqiad.mediawiki.job.userOptionsUpdate,eqiad.mediawiki.job.watchlistExpiry,eqiad.mediawiki.job.wikibase-InjectRCRecords,eqiad.mediawiki.job.wikibase-addUsagesForPage,eqiad.mediawiki.page-create,eqiad.mediawiki.page-delete,eqiad.mediawiki.page-links-change,eqiad.mediawiki.page-move,eqiad.mediawiki.page-properties-change,eqiad.mediawiki.page-restrictions-change,eqiad.mediawiki.page-undelete,eqiad.mediawiki.page_change.v1,eqiad.mediawiki.page_outlink_topic_prediction_change.v1,eqiad.mediawiki.recentchange,eqiad.mediawiki.revision-create,eqiad.mediawiki.revision-recommendation-create,eqiad.mediawiki.revision-score,eqiad.mediawiki.revision-score-test,eqiad.mediawiki.revision-tags-change,eqiad.mediawiki.revision-visibility-change,eqiad.mediawiki.revision_score_articlequality,eqiad.mediawiki.revision_score_articletopic,eqiad.mediawiki.revision_score_damaging,eqiad.mediawiki.revision_score_draftquality,eqiad.mediawiki.revision_score_drafttopic,eqiad.mediawiki.revision_score_goodfaith,eqiad.mediawiki.revision_score_reverted,eqiad.mediawiki.user-blocks-change,eqiad.rdf-streaming-updater.fetch-failure,eqiad.rdf-streaming-updater.lapsed-action,eqiad.rdf-streaming-updater.mutation,eqiad.rdf-streaming-updater.reconcile,eqiad.rdf-streaming-updater.state-inconsistency,eqiad.resource-purge,eqiad.resource_change,staging.eventgate-main.test.event,statsv --zk-addr conf1007.eqiad.wmnet:2181 --optimize-leadership --brokers -2

Committed to https://gitlab.wikimedia.org/elukey/kafka_main_rebalance/-/tree/main/main-eqiad

Mentioned in SAL (#wikimedia-operations) [2023-07-20T06:37:16Z] <elukey> start kafka main eqiad maintenance (partitions rebalancing) - T341558

Mentioned in SAL (#wikimedia-operations) [2023-07-20T15:31:14Z] <elukey> stop kafka main eqiad maintenance - T341558

Change 941315 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::kafka::main: increase worker threads for kafka-main1001

https://gerrit.wikimedia.org/r/941315

Change 941315 merged by Elukey:

[operations/puppet@production] role::kafka::main: increase worker threads for kafka-main1001

https://gerrit.wikimedia.org/r/941315

Change 941362 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::kafka::broker: fix settings passed to the confluent class

https://gerrit.wikimedia.org/r/941362

Change 941362 merged by Elukey:

[operations/puppet@production] profile::kafka::broker: fix settings passed to the confluent class

https://gerrit.wikimedia.org/r/941362

Mentioned in SAL (#wikimedia-operations) [2023-07-25T09:50:55Z] <elukey> restart kafka on kafka-main1001 to pick up the new changes - T341558

Change 941396 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::kafka::main: apply new threads settings to all brokers

https://gerrit.wikimedia.org/r/941396

Change 941396 merged by Elukey:

[operations/puppet@production] role::kafka::main: apply new threads settings to all brokers

https://gerrit.wikimedia.org/r/941396

Screenshot from 2023-07-26 09-07-43.png (1×2 px, 102 KB)

After the change in threads, the brokers are out of the "danger" zone (below 30% idle time, as suggested by upstream), but this task should proceed nonetheless.

Next steps:

  • Use rebuild or rebalance in topicmappr with all topics in main-codfw (as opposed to only the highest-traffic ones, as I have done so far) to reach a truly even distribution of partitions among brokers. It will take more time, but it is a long-term stability step that we should take.
  • Do the same in eqiad.
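
To judge whether a proposed plan actually evens things out, one option is to count replicas and preferred leaders per broker before applying it. The sketch below is only an illustration: it assumes the plan is in the standard Kafka partition reassignment JSON format (`{"version": 1, "partitions": [...]}`) and treats the first replica of each partition as the preferred leader. The helper names (`broker_load`, `spread`) and the example assignment are hypothetical, not output from topicmappr or from the real cluster.

```python
import json
from collections import Counter

# Broker ids of the main-eqiad cluster as mentioned in this task.
BROKERS = [1001, 1002, 1003, 1004, 1005]


def broker_load(plan_json):
    """Count replicas and preferred leaders per broker in a reassignment plan."""
    plan = json.loads(plan_json)
    replicas = Counter()
    leaders = Counter()
    for partition in plan["partitions"]:
        ids = partition["replicas"]
        replicas.update(ids)
        if ids:
            # The first replica in the list is the preferred leader.
            leaders[ids[0]] += 1
    return replicas, leaders


def spread(counts, brokers):
    """Difference between the most and least loaded broker (0 means even)."""
    values = [counts.get(b, 0) for b in brokers]
    return max(values) - min(values)


# Hypothetical assignment for one 3-partition topic, for illustration only.
EXAMPLE_PLAN = json.dumps({
    "version": 1,
    "partitions": [
        {"topic": "eqiad.mediawiki.job.cirrusSearchLinksUpdate",
         "partition": 0, "replicas": [1001, 1002, 1003]},
        {"topic": "eqiad.mediawiki.job.cirrusSearchLinksUpdate",
         "partition": 1, "replicas": [1004, 1005, 1001]},
        {"topic": "eqiad.mediawiki.job.cirrusSearchLinksUpdate",
         "partition": 2, "replicas": [1002, 1003, 1004]},
    ],
})

if __name__ == "__main__":
    replicas, leaders = broker_load(EXAMPLE_PLAN)
    for broker in BROKERS:
        print(broker, "replicas:", replicas.get(broker, 0),
              "leaders:", leaders.get(broker, 0))
    print("replica spread:", spread(replicas, BROKERS))
    print("leader spread:", spread(leaders, BROKERS))
```

Raw partition counts ignore per-partition traffic, so a low spread is a necessary check rather than proof that the idle percentages will even out.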

Given the graph above, this work is not super urgent, so it can be done at a slower pace.

Change 942061 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::kafka::main: raise num.io.threads to 8

https://gerrit.wikimedia.org/r/942061

Change 942061 merged by Elukey:

[operations/puppet@production] role::kafka::main: raise num.io.threads to 8

https://gerrit.wikimedia.org/r/942061

elukey claimed this task.

Going to close this task since the bulk of the work is done, and I'll open new ones to fine-tune kafka main's status.

If someone lands on this ticket in the future, please note that usage of the tooling has been streamlined and documented: https://wikitech.wikimedia.org/wiki/Kafka/Administration#Rebalance_topic_partitions_to_equalize_broker_storage