
kafka-main2001 seems out of sync with the rest of the cluster
Closed, Resolved, Public

Description

At around 11:50 UTC on Jul 18 something happened on kafka-main2001:

[2024-07-18 11:48:38,778] WARN [ReplicaFetcher replicaId=2001, leaderId=2004, fetcherId=0] Error when sending leader epoch request for Map(eqiad.change-prop.transcludes.resource-change-3 -> 58, codfw.rdf-streaming-updater.reconcile-0 -> 156, eqiad.mediawiki.job.cirrusSearchDeletePages-0 -> 329) (kafka.server.ReplicaFetcherThread)
java.net.SocketTimeoutException: Failed to connect within 30000 ms
        at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:92)
        at kafka.server.ReplicaFetcherThread.fetchEpochsFromLeader(ReplicaFetcherThread.scala:349)
        at kafka.server.AbstractFetcherThread.maybeTruncate(AbstractFetcherThread.scala:128)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:100)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
[2024-07-18 11:48:38,778] WARN [ReplicaFetcher replicaId=2001, leaderId=2003, fetcherId=1] Error when sending leader epoch request for Map(codfw.changeprop.retry.mediawiki.page-delete-0 -> -1, eqiad.mediawiki.job.EntityChangeNotification-0 -> 161, codfw.cpjobqueue.retry.mediawiki.job.MassMessageJob-0 -> -1, eqiad.mediawiki.job.TranslateDeleteJob-0 -> -1, codfw.deleteLinks-0 -> -1, codfw.cpjobqueue.retry.mediawiki.job.GlobalUserPageLocalJobSubmitJob-0 -> -1, eqiad.mediawiki.job.gwtoolsetUploadMediafileJob-0 -> -1, codfw.change-prop.retry.mediawiki.job.MessageIndexRebuildJob-0 -> -1, codfw.cpjobqueue.retry.mediawiki.job.webVideoTranscode-0 -> -1, eqiad.change-prop.retry.change-prop.backlinks.resource-change-0 -> -1, codfw.change-prop.retry.mediawiki.job.LoginNotifyChecks-0 -> -1, eqiad.cpjobqueue.retry.mediawiki.job.DispatchChangeVisibilityNotification-0 -> -1, codfw.eventgate-main.error.validation-0 -> 216, codfw.change-prop.retry.mediawiki.page-properties-change-0 -> -1, codfw.change-prop.partitioned.mediawiki.job.refreshLinks-6 -> -1, eqiad.cpjobqueue.retry.mediawiki.job.clearUserWatchlist-0 -> 257, codfw.mediawiki.centralnotice.campaign-create-0 -> 223, eqiad.mediawiki.job.crosswikiSuppressUser-0 -> -1, eqiad.cpjobqueue.retry.mediawiki.job.AssembleUploadChunks-0 -> 268, codfw.change-prop.retry.mediawiki.job.BounceHandlerJob-0 -> -1, eqiad.change-prop.retry.mediawiki.revision-visibility-change-0 -> -1, eqiad.cirrussearch.page-index-update-1 -> -1, codfw.cpjobqueue.retry.mediawiki.job.TranslationsUpdateJob-0 -> -1) (kafka.server.ReplicaFetcherThread)
java.net.SocketTimeoutException: Failed to connect within 30000 ms
        at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:92)
        at kafka.server.ReplicaFetcherThread.fetchEpochsFromLeader(ReplicaFetcherThread.scala:349)
        at kafka.server.AbstractFetcherThread.maybeTruncate(AbstractFetcherThread.scala:128)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:100)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)

Eventually ending up in:

[2024-07-18 12:03:03,175] ERROR [ReplicaFetcher replicaId=2001, leaderId=2005, fetcherId=3] Error due to (kafka.server.ReplicaFetcherThread)
kafka.common.KafkaException: Error processing data for partition eqiad.resource-purge-3 offset 25177513453
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:204)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:169)
        at scala.Option.foreach(Option.scala:257)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:169)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:166)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply$mcV$sp(AbstractFetcherThread.scala:166)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:166)
        at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:166)
        at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:250)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:164)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
Caused by: java.lang.IllegalArgumentException: Out of order offsets found in List(25177513359, 25177513360, 25177513361, 25177513362, 25177513363, 25177513364, 25177513365, 25177513366, 25177513367, 25177513368, 25177513369, 25177513370, 25177513371, 25177513372, 25177513373, 25177513374
[...]
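
To make the "Out of order offsets" error easier to reason about, here is a minimal sketch of the kind of check a follower applies before appending fetched records. It is a simplified model and an assumption on my part, not Kafka's actual code: the function name and the exact conditions are hypothetical. It only encodes the idea that a fetched batch must have strictly increasing offsets and must not start before the follower's own log end offset.

```python
from typing import Sequence


def offsets_acceptable(batch_offsets: Sequence[int], log_end_offset: int) -> bool:
    """Simplified, hypothetical model of a follower-side append check.

    Returns True when the batch could be appended as-is, False when it
    would be rejected as "out of order".
    """
    if not batch_offsets:
        return True
    # Offsets inside the batch must be strictly increasing.
    monotonic = all(a < b for a, b in zip(batch_offsets, batch_offsets[1:]))
    # The batch must not start before what the follower already holds.
    starts_at_or_after_end = batch_offsets[0] >= log_end_offset
    return monotonic and starts_at_or_after_end


# Values taken from the log excerpt above: the fetch was for offset
# 25177513453, while the reported list starts at 25177513359.
print(offsets_acceptable([25177513359, 25177513360, 25177513361], 25177513453))
```

Under this model the batch in the log would be rejected because it starts before the offset the follower asked for. That is one reading consistent with the excerpt, not a confirmed diagnosis.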

Grafana shows metrics changing around that time: https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-kafka_cluster=main-codfw&from=1721297649156&to=1721313933973

So far the impact is limited: an alert was raised for out-of-sync/reduced replicas, but we should investigate what's happening. I don't see any action logged in the SAL that clearly matches the time of the errors.

Event Timeline

The current list of under-replicated partitions confirms that 2001 is the issue (it is missing from the Isr column):

elukey@kafka-main2001:~$ kafka topics --describe | grep Isr | egrep 'Isr: 200[0-9],200[0-9]$'
	Topic: __consumer_offsets	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: __consumer_offsets	Partition: 8	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: __consumer_offsets	Partition: 12	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: __consumer_offsets	Partition: 20	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: __consumer_offsets	Partition: 24	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: __consumer_offsets	Partition: 32	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: __consumer_offsets	Partition: 36	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: __consumer_offsets	Partition: 44	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: __consumer_offsets	Partition: 48	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: __transaction_state	Partition: 3	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: __transaction_state	Partition: 7	Leader: 2005	Replicas: 2005,2001,2002	Isr: 2002,2005
	Topic: __transaction_state	Partition: 15	Leader: 2003	Replicas: 2003,2004,2001	Isr: 2003,2004
	Topic: __transaction_state	Partition: 23	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: __transaction_state	Partition: 27	Leader: 2005	Replicas: 2005,2001,2002	Isr: 2002,2005
	Topic: __transaction_state	Partition: 35	Leader: 2003	Replicas: 2003,2004,2001	Isr: 2003,2004
	Topic: __transaction_state	Partition: 43	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: __transaction_state	Partition: 47	Leader: 2005	Replicas: 2005,2001,2002	Isr: 2002,2005
	Topic: codfw.TranslateDeleteJob	Partition: 0	Leader: 2003	Replicas: 2001,2003,2002	Isr: 2003,2002
	Topic: codfw.change-prop.partitioned.mediawiki.job.refreshLinks	Partition: 4	Leader: 2003	Replicas: 2003,2002,2001	Isr: 2003,2002
	Topic: codfw.change-prop.retry.LocalGlobalUserPageCacheUpdateJob	Partition: 0	Leader: 2003	Replicas: 2001,2003,2002	Isr: 2003,2002
	Topic: codfw.change-prop.retry.TranslateDeleteJob	Partition: 0	Leader: 2003	Replicas: 2001,2003,2002	Isr: 2003,2002
	Topic: codfw.change-prop.retry.TranslationsUpdateJob	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: codfw.change-prop.retry.change-prop.partitioned.mediawiki.job.refreshLinks	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: codfw.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.mediawiki.job.updateBetaFeaturesUserCounts	Partition: 0	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: codfw.change-prop.retry.mediawiki.job.AssembleUploadChunks	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: codfw.change-prop.retry.mediawiki.job.ChangeNotification	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: codfw.change-prop.retry.mediawiki.job.LocalRenameUserJob	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: codfw.change-prop.retry.mediawiki.job.UpdateRepoOnDelete	Partition: 0	Leader: 2003	Replicas: 2003,2002,2001	Isr: 2003,2002
	Topic: codfw.change-prop.retry.mediawiki.job.cirrusSearchDeleteArchive	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: codfw.change-prop.retry.mediawiki.page-create	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: codfw.changeprop.retry.resource_change	Partition: 0	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: codfw.cirrussearch.update_pipeline.fetch_error.rc0	Partition: 0	Leader: 2005	Replicas: 2005,2001,2002	Isr: 2002,2005
	Topic: codfw.cirrussearch.update_pipeline.lapsed_action	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: codfw.cirrussearch.update_pipeline.update.rc0	Partition: 4	Leader: 2003	Replicas: 2003,2004,2001	Isr: 2003,2004
	Topic: codfw.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite	Partition: 4	Leader: 2005	Replicas: 2005,2001,2002	Isr: 2002,2005
	Topic: codfw.cpjobqueue.partitioned.mediawiki.job.htmlCacheUpdate	Partition: 1	Leader: 2003	Replicas: 2003,2005,2001	Isr: 2003,2005
	Topic: codfw.cpjobqueue.partitioned.mediawiki.job.htmlCacheUpdate	Partition: 5	Leader: 2005	Replicas: 2005,2003,2001	Isr: 2003,2005
	Topic: codfw.cpjobqueue.retry.mediawiki.job.BounceHandlerNotificationJob	Partition: 0	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: codfw.cpjobqueue.retry.mediawiki.job.CognateLocalJobSubmitJob	Partition: 0	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: codfw.cpjobqueue.retry.mediawiki.job.PurgeEntityData	Partition: 0	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: codfw.cpjobqueue.retry.mediawiki.job.ThumbnailRender	Partition: 0	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: codfw.cpjobqueue.retry.mediawiki.job.TranslatablePageMoveJob	Partition: 0	Leader: 2003	Replicas: 2001,2003,2002	Isr: 2003,2002
	Topic: codfw.cpjobqueue.retry.mediawiki.job.TranslateDeleteJob	Partition: 0	Leader: 2003	Replicas: 2003,2002,2001	Isr: 2003,2002
	Topic: codfw.cpjobqueue.retry.mediawiki.job.TranslationNotificationsSubmitJob	Partition: 0	Leader: 2003	Replicas: 2001,2003,2002	Isr: 2003,2002
	Topic: codfw.cpjobqueue.retry.mediawiki.job.UpdateTranslatablePageJob	Partition: 0	Leader: 2003	Replicas: 2003,2004,2001	Isr: 2003,2004
	Topic: codfw.cpjobqueue.retry.mediawiki.job.UpdateTranslatorActivity	Partition: 0	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: codfw.cpjobqueue.retry.mediawiki.job.cirrusSearchCheckerJob	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: codfw.cpjobqueue.retry.mediawiki.job.cirrusSearchLinksUpdate	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: codfw.cpjobqueue.retry.mediawiki.job.cirrusSearchOtherIndex	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: codfw.cpjobqueue.retry.mediawiki.job.enotifNotify	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: codfw.cpjobqueue.retry.mediawiki.job.enqueue	Partition: 0	Leader: 2003	Replicas: 2001,2003,2002	Isr: 2003,2002
	Topic: codfw.cpjobqueue.retry.mediawiki.job.fixDoubleRedirect	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: codfw.cpjobqueue.retry.mediawiki.job.htmlCacheUpdate	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: codfw.cpjobqueue.retry.mediawiki.job.refresLinks	Partition: 0	Leader: 2005	Replicas: 2005,2001,2002	Isr: 2002,2005
	Topic: codfw.cpjobqueue.retry.mediawiki.job.wikibase-InjectRCRecords	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: codfw.flaggedrevs_CacheUpdate	Partition: 0	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: codfw.maps.tiles_change	Partition: 3	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: codfw.mediawiki-page-move	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: codfw.mediawiki.cirrussearch.page_rerender.v1	Partition: 2	Leader: 2005	Replicas: 2005,2001,2002	Isr: 2002,2005
	Topic: codfw.mediawiki.job.BounceHandlerJob	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: codfw.mediawiki.job.ChangeDeletionNotification	Partition: 0	Leader: 2003	Replicas: 2001,2005,2003	Isr: 2003,2005
	Topic: codfw.mediawiki.job.DispatchChangeDeletionNotification	Partition: 0	Leader: 2003	Replicas: 2001,2003,2002	Isr: 2003,2002
	Topic: codfw.mediawiki.job.DispatchChangeVisibilityNotification	Partition: 0	Leader: 2003	Replicas: 2001,2003,2002	Isr: 2003,2002
	Topic: codfw.mediawiki.job.MessageGroupStatsRebuildJob	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: codfw.mediawiki.job.RefreshLinksJob	Partition: 0	Leader: 2005	Replicas: 2005,2001,2002	Isr: 2002,2005
	Topic: codfw.mediawiki.job.RenderTranslationPageJob	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: codfw.mediawiki.job.checkuserPruneCheckUserDataJob	Partition: 0	Leader: 2005	Replicas: 2005,2001,2002	Isr: 2002,2005
	Topic: codfw.mediawiki.job.cirrusSearchElasticaWrite	Partition: 4	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: codfw.mediawiki.job.cirrusSearchIncomingLinkCount	Partition: 0	Leader: 2003	Replicas: 2003,2004,2001	Isr: 2003,2004
	Topic: codfw.mediawiki.job.cirrusSearchLinksUpdatePrioritizde	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: codfw.mediawiki.job.gwtoolsetUploadMetadataJob	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: codfw.mediawiki.job.notificationKeepGoingJob	Partition: 0	Leader: 2003	Replicas: 2003,2004,2001	Isr: 2003,2004
	Topic: codfw.mediawiki.job.processMediaModeration	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: codfw.mediawiki.job.refreshLinksDynamic	Partition: 0	Leader: 2003	Replicas: 2001,2003,2002	Isr: 2003,2002
	Topic: codfw.mediawiki.job.securePollArchiveElection	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: codfw.mediawiki.job.sendMail	Partition: 0	Leader: 2003	Replicas: 2003,2002,2001	Isr: 2003,2002
	Topic: codfw.mediawiki.job.translationNotificationJob	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: codfw.mediawiki.job.webVideoTranscodePrioritized	Partition: 0	Leader: 2005	Replicas: 2005,2001,2002	Isr: 2002,2005
	Topic: codfw.mediawiki.refreshLinks	Partition: 0	Leader: 2003	Replicas: 2001,2003,2002	Isr: 2003,2002
	Topic: codfw.mediawiki.revision_score_articlequality	Partition: 0	Leader: 2003	Replicas: 2003,2004,2001	Isr: 2003,2004
	Topic: codfw.mediawiki.revision_score_reverted	Partition: 0	Leader: 2003	Replicas: 2003,2004,2001	Isr: 2003,2004
	Topic: codfw.rdf-streaming-updater.mutation-scholarly	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: eqiad.change-prop.partitioned.mediawiki.job.refreshLinks	Partition: 3	Leader: 2003	Replicas: 2003,2002,2001	Isr: 2003,2002
	Topic: eqiad.change-prop.partitioned.mediawiki.job.refreshLinks	Partition: 7	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: eqiad.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.change-prop.retry.mediawiki.job.constraintsTableUpdate-0	Partition: 0	Leader: 2003	Replicas: 2001,2003,2002	Isr: 2003,2002
	Topic: eqiad.change-prop.retry.change-prop.transcludes.resource-change	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: eqiad.change-prop.retry.change-prop.wikidata.resource-change	Partition: 0	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: eqiad.change-prop.retry.mediawiki.job.cdnPurge	Partition: 0	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: eqiad.change-prop.retry.mediawiki.job.cirrusSearchCheckerJob	Partition: 0	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: eqiad.change-prop.retry.mediawiki.job.deleteLinks	Partition: 0	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: eqiad.change-prop.retry.mediawiki.job.flaggedrevs_CacheUpdate	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: eqiad.change-prop.retry.mediawiki.job.refreshLinks	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: eqiad.change-prop.retry.mediawiki.page-delete	Partition: 0	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: eqiad.change-prop.retry.mediawiki.page-move	Partition: 0	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: eqiad.change-prop.retry.mediawiki.page_delete	Partition: 0	Leader: 2003	Replicas: 2003,2002,2001	Isr: 2003,2002
	Topic: eqiad.change-prop.retry.mediawiki.page_move	Partition: 0	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: eqiad.cirrussearch.update_pipeline.update.rc0	Partition: 1	Leader: 2003	Replicas: 2003,2004,2001	Isr: 2003,2004
	Topic: eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite	Partition: 3	Leader: 2005	Replicas: 2005,2001,2002	Isr: 2002,2005
	Topic: eqiad.cpjobqueue.partitioned.mediawiki.job.htmlCacheUpdate	Partition: 6	Leader: 2003	Replicas: 2001,2005,2003	Isr: 2003,2005
	Topic: eqiad.cpjobqueue.retry.change-prop.partitioned.mediawiki.job.refreshLinks	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: eqiad.cpjobqueue.retry.mediawiki.job.AutoModeratorSendRevertTalkPageMsgJob	Partition: 0	Leader: 2005	Replicas: 2005,2001,2002	Isr: 2002,2005
	Topic: eqiad.cpjobqueue.retry.mediawiki.job.BounceHandlerJob	Partition: 0	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: eqiad.cpjobqueue.retry.mediawiki.job.ChangeVisibilityNotification	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: eqiad.cpjobqueue.retry.mediawiki.job.CognateCacheUpdateJob	Partition: 0	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: eqiad.cpjobqueue.retry.mediawiki.job.DeleteTranslatableBundleJob	Partition: 0	Leader: 2003	Replicas: 2003,2004,2001	Isr: 2003,2004
	Topic: eqiad.cpjobqueue.retry.mediawiki.job.DispatchChangeDeletionNotification	Partition: 0	Leader: 2003	Replicas: 2003,2002,2001	Isr: 2003,2002
	Topic: eqiad.cpjobqueue.retry.mediawiki.job.DispatchChanges	Partition: 0	Leader: 2003	Replicas: 2003,2004,2001	Isr: 2003,2004
	Topic: eqiad.cpjobqueue.retry.mediawiki.job.LoginNotifyChecks	Partition: 0	Leader: 2003	Replicas: 2003,2002,2001	Isr: 2003,2002
	Topic: eqiad.cpjobqueue.retry.mediawiki.job.MessageIndexRebuildJob	Partition: 0	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: eqiad.cpjobqueue.retry.mediawiki.job.MoveTranslatableBundleJob	Partition: 0	Leader: 2005	Replicas: 2005,2001,2002	Isr: 2002,2005
	Topic: eqiad.cpjobqueue.retry.mediawiki.job.PublishStashedFile	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: eqiad.cpjobqueue.retry.mediawiki.job.RecordLintJob	Partition: 0	Leader: 2005	Replicas: 2005,2004,2001	Isr: 2004,2005
	Topic: eqiad.cpjobqueue.retry.mediawiki.job.RenderTranslationPageJob	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: eqiad.cpjobqueue.retry.mediawiki.job.checkuserPruneCheckUserDataJob	Partition: 0	Leader: 2003	Replicas: 2003,2004,2001	Isr: 2003,2004
	Topic: eqiad.cpjobqueue.retry.mediawiki.job.cirrusSearchDeleteArchive	Partition: 0	Leader: 2003	Replicas: 2003,2002,2001	Isr: 2003,2002
	Topic: eqiad.cpjobqueue.retry.mediawiki.job.constraintsTableUpdate	Partition: 0	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: eqiad.cpjobqueue.retry.mediawiki.job.notificationKeepGoingJob	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: eqiad.cpjobqueue.retry.mediawiki.job.processMediaModeration	Partition: 0	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: eqiad.cpjobqueue.retry.mediawiki.job.refreshUserImpactJob	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: eqiad.cpjobqueue.retry.mediawiki.job.renameUser	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: eqiad.cpjobqueue.retry.mediawiki.job.sendMail	Partition: 0	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: eqiad.cpjobqueue.retry.mediawiki.job.userGroupExpiry	Partition: 0	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: eqiad.eventgate-main.test.event	Partition: 0	Leader: 2005	Replicas: 2005,2001,2003	Isr: 2003,2005
	Topic: eqiad.mediawiki.job.LocalPageMoveJob	Partition: 0	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: eqiad.mediawiki.job.MassMessageJob	Partition: 0	Leader: 2003	Replicas: 2001,2003,2002	Isr: 2003,2002
	Topic: eqiad.mediawiki.job.MassMessageSubmitJob	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: eqiad.mediawiki.job.UpdateMessageBundle	Partition: 0	Leader: 2003	Replicas: 2003,2004,2001	Isr: 2003,2004
	Topic: eqiad.mediawiki.job.UpdateTranslatablePageJob	Partition: 0	Leader: 2003	Replicas: 2003,2004,2001	Isr: 2003,2004
	Topic: eqiad.mediawiki.job.UploadFromUrl	Partition: 0	Leader: 2003	Replicas: 2003,2004,2001	Isr: 2003,2004
	Topic: eqiad.mediawiki.job.cirrusSearchElasticaWrite	Partition: 1	Leader: 2005	Replicas: 2005,2001,2003	Isr: 2003,2005
	Topic: eqiad.mediawiki.job.cirrusSearchIndexArchive	Partition: 0	Leader: 2003	Replicas: 2003,2004,2001	Isr: 2003,2004
	Topic: eqiad.mediawiki.job.cirrusSearchMassIndex	Partition: 0	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: eqiad.mediawiki.job.deleteLinks	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: eqiad.mediawiki.job.enotifNotify	Partition: 0	Leader: 2003	Replicas: 2001,2005,2003	Isr: 2003,2005
	Topic: eqiad.mediawiki.job.enqueue	Partition: 0	Leader: 2003	Replicas: 2003,2002,2001	Isr: 2003,2002
	Topic: eqiad.mediawiki.job.parsoidCachePrewarm	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: eqiad.mediawiki.job.securePollUnarchiveElection	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
	Topic: eqiad.mediawiki.job.updateImplementations	Partition: 0	Leader: 2003	Replicas: 2003,2004,2001	Isr: 2003,2004
	Topic: eqiad.mediawiki.job.webVideoTranscode	Partition: 0	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: eqiad.mediawiki.page-move	Partition: 0	Leader: 2005	Replicas: 2005,2004,2001	Isr: 2004,2005
	Topic: eqiad.rdf-streaming-updater.reconcile	Partition: 0	Leader: 2003	Replicas: 2003,2004,2001	Isr: 2003,2004
	Topic: eqiad.rdf_streaming_updater.reconcile	Partition: 0	Leader: 2005	Replicas: 2005,2001,2002	Isr: 2002,2005
	Topic: eqiad.resource-purge	Partition: 3	Leader: 2005	Replicas: 2005,2001,2003	Isr: 2003,2005
	Topic: eqiad.swift.search_glent.upload-complete	Partition: 0	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: eqiad.swift.search_popularity_score.upload-complete	Partition: 0	Leader: 2003	Replicas: 2003,2001,2002	Isr: 2003,2002
	Topic: massmessage	Partition: 0	Leader: 2003	Replicas: 2001,2003,2002	Isr: 2003,2002
	Topic: webrequest_text	Partition: 0	Leader: 2003	Replicas: 2001,2002,2003	Isr: 2003,2002
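
The egrep above only matches partitions with exactly two in-sync replicas. As a more general sketch, the same output can be parsed to list, per broker, the partitions where that broker is an assigned replica but absent from the ISR. The function and class names below are hypothetical, and the regular expression assumes the line layout shown in the output above.

```python
import re
from typing import Dict, List, NamedTuple


class PartitionState(NamedTuple):
    topic: str
    partition: int
    leader: int
    replicas: List[int]
    isr: List[int]


# Matches per-partition lines such as:
#   Topic: foo  Partition: 0  Leader: 2003  Replicas: 2001,2002,2003  Isr: 2003,2002
LINE_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def _ids(csv: str) -> List[int]:
    return [int(x) for x in csv.split(",") if x]


def parse_describe(output: str) -> List[PartitionState]:
    """Parse per-partition lines; any other line is skipped."""
    states = []
    for line in output.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue
        states.append(PartitionState(
            m["topic"], int(m["partition"]), int(m["leader"]),
            _ids(m["replicas"]), _ids(m["isr"]),
        ))
    return states


def missing_from_isr(states: List[PartitionState]) -> Dict[int, List[str]]:
    """Map broker id -> partitions where it is a replica but not in sync."""
    missing: Dict[int, List[str]] = {}
    for s in states:
        for broker in s.replicas:
            if broker not in s.isr:
                missing.setdefault(broker, []).append(f"{s.topic}-{s.partition}")
    return missing


# Two lines copied from the output above.
SAMPLE = (
    "\tTopic: __consumer_offsets\tPartition: 0\tLeader: 2003\tReplicas: 2001,2002,2003\tIsr: 2003,2002\n"
    "\tTopic: eqiad.resource-purge\tPartition: 3\tLeader: 2005\tReplicas: 2005,2001,2003\tIsr: 2003,2005\n"
)
print(missing_from_isr(parse_describe(SAMPLE)))
```

If a single broker id accounts for every entry in the result, as 2001 does in the listing above, the problem is local to that broker rather than cluster-wide.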

I checked the IRC logs: I had forgotten that around the time of the errors we had a codfw network outage affecting some rows.

This may be a case of some partition data getting corrupted. In most of the similar reports found on the Internet, Kafka seems to recover "naturally" once the affected partition segments are compacted. I would personally try a broker restart first, though, to see whether the issue clears.

Mentioned in SAL (#wikimedia-operations) [2024-07-22T08:07:09Z] <elukey> restart kafka on kafka-main2001 - T370574

Mentioned in SAL (#wikimedia-operations) [2024-07-22T08:32:26Z] <elukey> restart kafka on kafka-main2005 - T370574

Some recovery happened, but I still see this on kafka-main2001 (after the restart):

[2024-07-22 08:06:44,113] ERROR [ReplicaFetcher replicaId=2001, leaderId=2005, fetcherId=3] Error due to (kafka.server.ReplicaFetcherThread)
kafka.common.KafkaException: Error processing data for partition eqiad.resource-purge-3 offset 25177513453

So I think that eqiad.resource-purge-3 is corrupted.

-rw-r--r-- 1 kafka kafka 1073735478 Jul 18 03:28 00000000025151747941.log
-rw-r--r-- 1 kafka kafka    1489260 Jul 18 03:28 00000000025151747941.timeindex
-rw-r--r-- 1 kafka kafka    1042824 Jul 18 08:44 00000000025162380298.index
-rw-r--r-- 1 kafka kafka 1073741264 Jul 18 08:44 00000000025162380298.log
-rw-r--r-- 1 kafka kafka         10 Jul 18 03:28 00000000025162380298.snapshot  <============== size looks weird, plus the timestamp matches with the codfw outage
-rw-r--r-- 1 kafka kafka    1403772 Jul 18 08:44 00000000025162380298.timeindex
-rw-r--r-- 1 kafka kafka   10485760 Jul 22 08:06 00000000025173341555.index
-rw-r--r-- 1 kafka kafka  472949648 Jul 18 11:51 00000000025173341555.log
-rw-r--r-- 1 kafka kafka         10 Jul 18 08:44 00000000025173341555.snapshot        <============== size looks weird, plus the timestamp matches with the codfw outage
-rw-r--r-- 1 kafka kafka   10485756 Jul 22 08:06 00000000025173341555.timeindex
-rw-r--r-- 1 kafka kafka         10 Jul 22 08:06 00000000025177513453.snapshot
-rw-r--r-- 1 kafka kafka         36 Jul 22 04:51 leader-epoch-checkpoint
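
Kafka names each segment file after its base offset (the first offset stored in that segment), so the listing above is enough to work out which segment should hold the failing offset 25177513453. A small sketch, with hypothetical helper names:

```python
from bisect import bisect_right
from typing import Iterable, Optional

SEGMENT_EXTENSIONS = {"log", "index", "timeindex", "snapshot"}


def base_offset(filename: str) -> Optional[int]:
    """Return the offset encoded in a segment file name, or None."""
    stem, _, ext = filename.partition(".")
    if ext not in SEGMENT_EXTENSIONS or not stem.isdigit():
        return None
    return int(stem)


def segment_for_offset(filenames: Iterable[str], offset: int) -> Optional[int]:
    """Base offset of the .log segment that would contain `offset`."""
    bases = sorted(
        b for f in filenames
        if f.endswith(".log") and (b := base_offset(f)) is not None
    )
    i = bisect_right(bases, offset)
    return bases[i - 1] if i else None


# File names copied from the listing above.
FILES = [
    "00000000025151747941.log",
    "00000000025162380298.log",
    "00000000025173341555.log",
    "00000000025177513453.snapshot",
    "leader-epoch-checkpoint",
]
print(segment_for_offset(FILES, 25177513453))
```

For these names the result is 25173341555, i.e. the segment 00000000025173341555.log, which was last modified on Jul 18 at 11:51, close to the time of the first errors.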

Mentioned in SAL (#wikimedia-operations) [2024-07-22T10:24:47Z] <elukey> kafka preferred-replica-election on kafka-main - T370574

My proposal:

  • stop kafka on kafka-main2001
  • mv /srv/kafka/data/eqiad.resource-purge-3 /srv/kafka/backup/
  • cleanup zookeeper (not needed afaics)
  • start kafka on 2001

In theory the replica fetcher should realize that something is missing and start pulling from other brokers. The total size is ~17G, so not a ton of data to stream.
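
The steps above can be sketched as a dry-run plan that only builds the commands and prints them, without executing anything. The systemd unit name `kafka` and the use of `systemctl` are assumptions for illustration; the paths are the ones from the proposal.

```python
from pathlib import PurePosixPath
from typing import List


def plan_partition_refetch(
    partition_dir: str,
    data_dir: str = "/srv/kafka/data",
    backup_dir: str = "/srv/kafka/backup",
    unit: str = "kafka",  # assumed systemd unit name
) -> List[List[str]]:
    """Build (but do not run) the commands to force a partition refetch."""
    # Refuse anything that is not a plain directory name.
    if "/" in partition_dir or partition_dir in {"", ".", ".."}:
        raise ValueError(f"unexpected partition directory: {partition_dir!r}")
    src = PurePosixPath(data_dir) / partition_dir
    return [
        ["systemctl", "stop", unit],             # stop the broker first
        ["mv", str(src), f"{backup_dir}/"],      # keep a copy instead of deleting
        ["systemctl", "start", unit],            # broker refetches on startup
    ]


for cmd in plan_partition_refetch("eqiad.resource-purge-3"):
    print(" ".join(cmd))
```

Moving the directory rather than deleting it keeps the old data around in case the refetch does not behave as expected.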

The alternative is to wait for the segment log compaction happening this Thursday, which should hopefully clean up the problem (but we'd have to live with the misbehaving node until then).

elukey renamed this task from kafka2001 seems out of sync with the rest of the cluster to kafka-main2001 seems out of sync with the rest of the cluster. Jul 22 2024, 3:42 PM

cleanup zookeeper (not needed afaics)

If you like, you could verify your proposal on the kafka-test cluster before you do it on kafka-main.

This is a great suggestion! I tried removing eqiad.mediawiki.revision-create-0 (~1.7G) on kafka-test and it worked nicely:

[2024-07-22 15:50:18,018] INFO [ProducerStateManager partition=eqiad.mediawiki.revision-create-0] Writing producer snapshot at offset 325142264 (kafka.log.ProducerStateManager)
[2024-07-22 15:50:52,370] INFO Replica loaded for partition eqiad.mediawiki.revision-create-0 with initial high watermark 0 (kafka.cluster.Replica)
[2024-07-22 15:50:52,375] INFO [Log partition=eqiad.mediawiki.revision-create-0, dir=/srv/kafka/data] Loading producer state from offset 0 with message format version 2 (kafka.log.Log)
[2024-07-22 15:50:52,377] INFO [Log partition=eqiad.mediawiki.revision-create-0, dir=/srv/kafka/data] Completed load of log with 1 segments, log start offset 0 and log end offset 0 in 2 ms (kafka.log.Log)

Mentioned in SAL (#wikimedia-operations) [2024-07-22T16:02:33Z] <elukey> remove /srv/kafka/data/eqiad.resource-purge-3 on kafka-main2001 to force a refetch of data from good replicas and circumvent data corruption - T370574

elukey claimed this task.

All recovered!