At around 11:50 UTC on Jul 18, the replica fetchers on kafka-main2001 started timing out while connecting to partition leaders on other brokers:
[2024-07-18 11:48:38,778] WARN [ReplicaFetcher replicaId=2001, leaderId=2004, fetcherId=0] Error when sending leader epoch request for Map(eqiad.change-prop.transcludes.resource-change-3 -> 58, codfw.rdf-streaming-updater.reconcile-0 -> 156, eqiad.mediawiki.job.cirrusSearchDeletePages-0 -> 329) (kafka.server.ReplicaFetcherThread)
java.net.SocketTimeoutException: Failed to connect within 30000 ms
    at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:92)
    at kafka.server.ReplicaFetcherThread.fetchEpochsFromLeader(ReplicaFetcherThread.scala:349)
    at kafka.server.AbstractFetcherThread.maybeTruncate(AbstractFetcherThread.scala:128)
    at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:100)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
[2024-07-18 11:48:38,778] WARN [ReplicaFetcher replicaId=2001, leaderId=2003, fetcherId=1] Error when sending leader epoch request for Map(codfw.changeprop.retry.mediawiki.page-delete-0 -> -1, eqiad.mediawiki.job.EntityChangeNotification-0 -> 161, codfw.cpjobqueue.retry.mediawiki.job.MassMessageJob-0 -> -1, eqiad.mediawiki.job.TranslateDeleteJob-0 -> -1, codfw.deleteLinks-0 -> -1, codfw.cpjobqueue.retry.mediawiki.job.GlobalUserPageLocalJobSubmitJob-0 -> -1, eqiad.mediawiki.job.gwtoolsetUploadMediafileJob-0 -> -1, codfw.change-prop.retry.mediawiki.job.MessageIndexRebuildJob-0 -> -1, codfw.cpjobqueue.retry.mediawiki.job.webVideoTranscode-0 -> -1, eqiad.change-prop.retry.change-prop.backlinks.resource-change-0 -> -1, codfw.change-prop.retry.mediawiki.job.LoginNotifyChecks-0 -> -1, eqiad.cpjobqueue.retry.mediawiki.job.DispatchChangeVisibilityNotification-0 -> -1, codfw.eventgate-main.error.validation-0 -> 216, codfw.change-prop.retry.mediawiki.page-properties-change-0 -> -1, codfw.change-prop.partitioned.mediawiki.job.refreshLinks-6 -> -1, eqiad.cpjobqueue.retry.mediawiki.job.clearUserWatchlist-0 -> 257, codfw.mediawiki.centralnotice.campaign-create-0 -> 223, eqiad.mediawiki.job.crosswikiSuppressUser-0 -> -1, eqiad.cpjobqueue.retry.mediawiki.job.AssembleUploadChunks-0 -> 268, codfw.change-prop.retry.mediawiki.job.BounceHandlerJob-0 -> -1, eqiad.change-prop.retry.mediawiki.revision-visibility-change-0 -> -1, eqiad.cirrussearch.page-index-update-1 -> -1, codfw.cpjobqueue.retry.mediawiki.job.TranslationsUpdateJob-0 -> -1) (kafka.server.ReplicaFetcherThread)
java.net.SocketTimeoutException: Failed to connect within 30000 ms
    at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:92)
    at kafka.server.ReplicaFetcherThread.fetchEpochsFromLeader(ReplicaFetcherThread.scala:349)
    at kafka.server.AbstractFetcherThread.maybeTruncate(AbstractFetcherThread.scala:128)
    at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:100)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
This eventually ended, at 12:03 UTC, with one of the fetcher threads failing with:
[2024-07-18 12:03:03,175] ERROR [ReplicaFetcher replicaId=2001, leaderId=2005, fetcherId=3] Error due to (kafka.server.ReplicaFetcherThread)
kafka.common.KafkaException: Error processing data for partition eqiad.resource-purge-3 offset 25177513453
    at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:204)
    at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1$$anonfun$apply$2.apply(AbstractFetcherThread.scala:169)
    at scala.Option.foreach(Option.scala:257)
    at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:169)
    at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2$$anonfun$apply$mcV$sp$1.apply(AbstractFetcherThread.scala:166)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply$mcV$sp(AbstractFetcherThread.scala:166)
    at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:166)
    at kafka.server.AbstractFetcherThread$$anonfun$processFetchRequest$2.apply(AbstractFetcherThread.scala:166)
    at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:250)
    at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:164)
    at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
Caused by: java.lang.IllegalArgumentException: Out of order offsets found in List(25177513359, 25177513360, 25177513361, 25177513362, 25177513363, 25177513364, 25177513365, 25177513366, 25177513367, 25177513368, 25177513369, 25177513370, 25177513371, 25177513372, 25177513373, 25177513374 [...]
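For anyone reading the trace above, here is a minimal sketch of the kind of check that could produce "Out of order offsets found". It is not Kafka's actual code and rests on two assumptions that the log does not confirm: that a follower append is rejected either when the offsets inside the batch are not strictly increasing or when the batch starts below the follower's log end offset, and that the log end offset equals the fetch offset in the error message (25177513453). `append_problems` and its parameters are hypothetical names.

```python
from typing import List, Sequence


def append_problems(offsets: Sequence[int], log_end_offset: int) -> List[str]:
    """Return the reasons a follower-style append of `offsets` would be rejected.

    Hypothetical helper: it models two assumed conditions, not Kafka's real code.
    """
    problems = []
    # Assumed condition 1: offsets inside the batch must be strictly increasing.
    for prev, cur in zip(offsets, offsets[1:]):
        if cur <= prev:
            problems.append(f"non-monotonic: {cur} follows {prev}")
            break
    # Assumed condition 2: the batch must not start before the local log end offset.
    if offsets and offsets[0] < log_end_offset:
        problems.append(
            f"first offset {offsets[0]} < log end offset {log_end_offset}"
        )
    return problems


# Visible prefix of the offsets in the exception (the rest is truncated in the log).
visible = list(range(25177513359, 25177513375))
print(append_problems(visible, log_end_offset=25177513453))
```

Under those assumptions the visible offsets are themselves in order, and only the second condition is reported, since the batch starts at 25177513359 while the fetch was for offset 25177513453. The truncated part of the list could still contain non-monotonic offsets, so this is only an illustration of what to look for.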
Grafana shows metrics changing around that time: https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-kafka_cluster=main-codfw&from=1721297649156&to=1721313933973
So far the impact is limited: an alert was raised for out-of-sync/reduced replicas, but we should investigate what's happening. From the SAL I don't see any clear match with actions that occurred around that time.