During AWS MSK patching or scheduled maintenance, brokers are patched and rebooted one at a time. While a broker is down, the Kafka consumers can encounter the error below, which bubbles up to the gRPC channel that the DAGs (for example, the batches) run over:
```
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNKNOWN
details = "sasl_ssl://xxxx.kafka.us-east-1.amazonaws.com:9096/bootstrap: Connect to ipv4#xx.xxx.xxx.xxx:9096 failed: Connection refused (after 1ms in state CONNECT)"
debug_error_string = "UNKNOWN:Error received from peer {created_time:"2025-06-12T10:16:26.064943775+00:00", grpc_status:2, grpc_message:"sasl_ssl://xxxx.kafka.us-east-1.amazonaws.com:9096/bootstrap: Connect to ipv4#xx.xxx.xx.xxx:9096 failed: Connection refused (after 1ms in state CONNECT)"}"
```
Whenever a new consumer pool is created via confluent-kafka-go, it loads the default settings from the librdkafka [[ https://github.com/confluentinc/librdkafka/blob/master/src/rdkafka_conf.c | config ]] in addition to the custom settings.
During broker failures, the Kafka client retries reconnecting indefinitely, backing off within these default bounds:
```
retry.backoff.ms 100 ms
reconnect.backoff.ms 100 ms
reconnect.backoff.max.ms 10000 ms
```
`socket.keepalive.enable` is set to `false` by default; enabling it helps detect dead TCP connections sooner.
The recommended options are as follows:
1. Enable `"socket.keepalive.enable": true` in the Kafka configmap settings.
2. All Kafka client code (for example, the snapshots code) should handle `kafka.Error` in the export handler. The errors listed below should not be surfaced immediately but instead handled in the API, so that the downstream gRPC channel does not receive them [[ https://github.com/confluentinc/librdkafka/blob/master/src/rdkafka.h | rdkafka.h ]]:
```
kafka.ErrTransport RD_KAFKA_RESP_ERR__TRANSPORT -195
kafka.ErrAllBrokersDown RD_KAFKA_RESP_ERR__ALL_BROKERS_DOWN -187
kafka.ErrTimedOut RD_KAFKA_RESP_ERR__TIMED_OUT -192
```