During the replacement of a kafka-main broker in T363210 (kafka-main200[6789] and kafka-main2010 implementation tracking) we experienced a severe eventgate-main outage, even though the Kafka cluster kept working and at least 4 brokers were reachable by eventgate-main the whole time.
We believe that the replacement of the broker more or less saturated its link when it came back up (without data), increasing the time to ACK a client's event submission. eventgate-main readiness probes then started failing for all of its replicas (without a visible log message, apart from a timeout on the caller side). As all replicas were considered not ready, they were removed from load balancing, leaving the service unavailable for its clients.
Increasing the number of replicas from 10 to 20 and increasing the readiness probe timeout from 1 to 10 seconds made eventgate-main available again.
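As a sketch, the mitigation corresponds to a change along these lines in the deployment's probe configuration. The values below mirror the numbers above, but the field layout, port, and thresholds are illustrative, not the actual eventgate-main chart:

```yaml
# Illustrative only, not the actual eventgate-main chart values.
replicas: 20                   # was 10
readinessProbe:
  httpGet:
    path: /v1/_test/events     # probe endpoint discussed below
    port: 8192                 # hypothetical port
  timeoutSeconds: 10           # was 1; allows for the Kafka round trip
  periodSeconds: 10
  failureThreshold: 3
```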
There are multiple things to consider here:
- The endpoint used as readiness probe, /v1/_test/events, does send events to Kafka. This means it tests not only the readiness of the service itself (eventgate) but also the readiness of its backing store, which allows for cascading failures like the one we experienced.
- A 1s timeout is pretty short for a probe that performs an RPC to another service.
- eventgate requires ACKs from all brokers (or is it just /v1/_test/events that does?); this could/should be lowered to 2.
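To illustrate why requiring ACKs from all in-sync replicas couples produce latency to the slowest broker, here is a minimal Python sketch. The latency model and numbers are hypothetical, not measurements from this incident:

```python
def time_to_ack(replica_latencies_ms, acks):
    """Rough model of Kafka producer ack latency.

    replica_latencies_ms[0] is the partition leader; the rest are
    followers. acks=0 returns immediately, acks=1 waits for the
    leader only, acks='all' waits for every in-sync replica.
    """
    if acks == 0:
        return 0
    if acks == 1:
        return replica_latencies_ms[0]
    # acks='all': the produce request is only acknowledged once the
    # slowest in-sync replica has replicated the batch.
    return max(replica_latencies_ms)

# A freshly replaced broker that is catching up can be very slow:
latencies = [5, 8, 1200]  # ms; third replica is the rebuilt broker
print(time_to_ack(latencies, 1))      # 5
print(time_to_ack(latencies, "all"))  # 1200, exceeding a 1s probe timeout
```

With a hypothetical acks=2 the produce would be acknowledged by the two fast replicas without waiting for the rebuilt broker.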
Regarding the first point, a discussion was started in the incident doc about whether it is the right approach to remove the dependency on Kafka from the readiness checks:
- Eventgate’s healthchecking does not test eventgate, it tests kafka - this needs to be changed and may have been a culprit in this outage
AO: Perhaps, but I’m not sure if this is a culprit. If eventgate cannot produce to Kafka, it is broken. In this case, Kafka being overloaded was maybe the problem, but in other cases it could be misconfigured pods. In that case, it would be better to fail a deployment and roll back.
JM: That could be checked during startup maybe, rather than regularly in the probes. In general I think it’s not super helpful when eventgate goes into a failure state when there are issues with Kafka. It would produce errors to clients (50x), which is fine. But marking all eventgate instances as not ready will lead to connection timeouts (from the clients).
AO: +1, I’d like to keep this for the readinessProbe, but it is not necessary for the livenessProbe, does that sound right?
JM: It is only in the readinessProbe for now. The problem is that if there is a problem with Kafka, we actively remove all instances of eventgate from load balancing (because of the readinessProbe failing), making it inaccessible for clients, which will potentially make them fail late because they have to wait for a timeout to kick in.
AO: Won’t that still be true if Kafka cannot accept produce requests due to overload? The client request to eventgate will wait for the Kafka produce request to fail, causing it to fail late?
JM: Depends. Eventgate could decide on a “proper” timeout in that case and would not have to leave that to the clients (making it non-obvious)
AO: okay! Let’s work it out in a phab task. I’m fine with whatever yall think is best.
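The approach discussed above (test Kafka once at startup, keep the periodic readiness probe local, and let eventgate fail fast with a 5xx on Kafka trouble instead of leaving the timeout to the clients) can be sketched as follows. This is a hypothetical illustration of the idea, with made-up names, not eventgate's actual code (eventgate is a Node.js service):

```python
# Hypothetical sketch: decouple readiness from Kafka. Not eventgate's
# actual implementation.

class Service:
    def __init__(self, kafka_check):
        self.kafka_check = kafka_check  # callable: True if a produce works
        self.started = False

    def startup_probe(self):
        # Run the expensive Kafka produce test once, at startup: a
        # misconfigured pod fails here and the deployment rolls back.
        self.started = self.kafka_check()
        return self.started

    def readiness_probe(self):
        # The periodic probe stays local, so a Kafka incident no longer
        # pulls every replica out of load balancing.
        return self.started

    def handle_event(self, produce, timeout_s=2.0):
        # On Kafka trouble, fail fast with a 5xx instead of letting the
        # client's own timeout decide.
        try:
            produce(timeout=timeout_s)
            return 201
        except TimeoutError:
            return 503

svc = Service(kafka_check=lambda: True)
print(svc.startup_probe())    # True
print(svc.readiness_probe())  # True; stays ready even if Kafka degrades later
```

The trade-off AO raises still applies: individual requests can fail slowly while Kafka is overloaded, but eventgate controls the timeout and returns an explicit error instead of disappearing from load balancing entirely.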