
Improve eventgates health check/readiness probe
Closed, ResolvedPublic

Description

During replacement of a kafka-main broker in T363210: kafka-main200[6789] and kafka-main2010 implementation tracking we experienced a severe eventgate-main outage even though the kafka cluster was still working, with at least 4 brokers reachable by eventgate-main at all times.

We believe that the replacement broker more or less saturated its link when it came back up (without data), leading to increased time to ACK a client's event submission. eventgate-main readiness probes then started failing for all of its replicas (without a visible log message, apart from a timeout on the caller side), making the service unavailable to its clients, as all replicas were removed from load balancing (they were considered not ready).
Increasing the number of replicas from 10 to 20 as well as increasing the timeout for the readiness probe from 1 to 10 seconds made eventgate-main available again.

There are multiple things to consider here:

  1. The endpoint used as readiness probe, /v1/_test/events, sends events to kafka. This means it tests not only the readiness of the service itself (eventgate) but also the readiness of its backing store, which allows for cascading errors like the one we experienced.
  2. A 1s timeout is pretty short for a probe that does an RPC to another service.
  3. eventgate requires ACKs from all brokers (or is that just /v1/_test/events?); this could/should be lowered to 2.
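For context, a readiness probe along these lines shows where points 1 and 2 interact; the port and threshold values here are illustrative, not the actual eventgate chart configuration:

```yaml
# Illustrative Kubernetes readiness probe, not the actual eventgate chart values.
# A short timeoutSeconds combined with an endpoint that round-trips to Kafka
# means any broker slowness beyond 1s marks the pod NotReady.
readinessProbe:
  httpGet:
    path: /v1/_test/events   # produces a test event to Kafka end-to-end
    port: 8192               # hypothetical service port
  timeoutSeconds: 1          # was raised to 10 during the incident
  periodSeconds: 10
  failureThreshold: 3        # pod removed from endpoints after repeated failures
```

Because every replica runs the same probe against the same Kafka cluster, a latency spike on the Kafka side can flip all replicas to NotReady at once.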

Regarding 1), a discussion was started in the incident doc about whether it is the right approach to remove the dependency on kafka for readiness checks:

  • Eventgate’s healthchecking does not test eventgate, it tests kafka - this needs to be changed and may have been a culprit in this outage

AO: Perhaps, but I’m not sure if this is a culprit. If eventgate cannot produce to Kafka, it is broken. In this case, Kafka overloaded was maybe the problem, but in other cases, it could be misconfigured pods. In that case, it would be better to fail a deployment and rollback.
JM: That could maybe be checked during startup, rather than regularly in the probes. In general I think it's not super helpful when eventgate goes into a failure state when there are issues with kafka. It would produce errors to clients (50x), which is fine. But marking all eventgate instances as not ready will lead to connection timeouts (from the clients).
AO: +1, I’d like to keep this for the readinessProbe, but it is not necessary for the livenessProbe, does that sound right?
JM: It is only in the readinessProbe for now. The problem is that if there is a problem with kafka, we actively remove all instances of eventgate from loadbalancing (because of the readinessProbe failing), making it inaccessible for clients, which will potentially make them fail late because they have to wait for a timeout to kick in.
AO: won’t that still be true if Kafka is not producible due to overload? The client request to eventgate will wait for the kafka produce request to fail, causing them to fail late?
JM: Depends. Eventgate could decide on a “proper” timeout in that case and would not have to leave that to the clients (making it non-obvious)
AO: okay! Let’s work it out in a phab task. I’m fine with whatever yall think is best.

Event Timeline

@Ottomata wrote:

If eventgate cannot produce to Kafka, it is broken. In this case, Kafka overloaded was maybe the problem, but in other cases, it could be misconfigured pods. In that case, it would be better to fail a deployment and rollback.

Andrew, I've been down this road a few times before and I totally agree — but if and only if you run a loadbalancer with protective logic for ignoring health checks when "too many" (~50%-100%) of the backing pods or realservers are failing health checks.

This prevents cascading outages when applications cannot reach one of their dependencies, while still allowing for weird network issues or client library issues or misconfiguration of new instances to influence the readiness probe locally.

PyBal is an example of such a system, as is its upcoming replacement Liberica. AWS Elastic LB is another such example. Both Envoy ("panic threshold") and Istio ("minHealthPercent") have this feature as well.
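As a sketch of what that protective logic looks like in practice, here is how Envoy's panic threshold is configured per cluster; the cluster name and values below are examples, not our production settings:

```yaml
# Illustrative Envoy cluster config (values are examples, not production settings).
# If fewer than healthy_panic_threshold percent of hosts pass health checks,
# Envoy enters "panic mode" and balances traffic across ALL hosts regardless of
# health status, rather than concentrating traffic on the few healthy ones.
clusters:
  - name: eventgate-main
    connect_timeout: 1s
    type: STRICT_DNS
    common_lb_config:
      healthy_panic_threshold:
        value: 50.0   # panic when fewer than 50% of hosts are healthy
```

The effect is that a dependency failure which takes out most of the fleet's health checks degrades into "errors are served" rather than "nothing is routable".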

Unfortunately, the same is not true of standard Kubernetes Services (backed by EndpointSlices). If 0 of your pods report Ready, then 0 of them receive traffic. If 10% of your pods report Ready, then that 10% will receive traffic for your whole service, probably very quickly overwhelming them and making them report not-ready in turn, and then this just oscillates wildly forever until you turn readiness checks off.

So with things as they are now in production, I think it's absolutely correct and necessary to not consider Kafka health in the readiness probe.

To add to that, I understand @Ottomata's sentiment to have a readiness probe that reflects whether the service is working correctly end-to-end. However, readiness probes serve a slightly different purpose, IMO. They indicate whether an individual pod is ready to accept traffic. If it is, then its IP will be included in the Service name resolution, by coredns.

I insist on "individual" here, as basing our readiness check on Kafka links the Kafka latency to the readiness of all pods. When latency went up, all pods suddenly went Not Ready, which caused eventgate-main service name resolution to return an empty IP set. This effectively allowed a pretty minimal incident (increased produce/consume latency on a single broker) to escalate into a widespread outage (wikimedia can't talk to eventgate-main anymore).

I think there's value in instrumenting the produce failure rate from eventgate to Kafka and alerting if it goes up, but IMHO this shouldn't be tied into the pod readiness probe.

Change #1066718 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] eventgate: Offer readinessProbe that does not test kafka

https://gerrit.wikimedia.org/r/1066718

Change #1066719 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] eventgate-main: Disable end-to-end readinessProbe

https://gerrit.wikimedia.org/r/1066719

To unblock the kafka hardware replacements, I've added an alternative readiness probe to the chart which will be used when .Values.main_app.conf.test_events evaluates to false. It just does a GET /_info, which is what we do for other servicerunner services as well, IMHO. AIUI schema loading happens async anyway, so the service should be ready as soon as the routing is set up, right?
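The chart logic can be sketched roughly like this; this is a simplification, not the actual deployment-charts template, and the port value reference is invented for illustration:

```yaml
# Simplified sketch of the chart behavior, not the actual Helm template.
# When test_events is disabled, fall back to a probe that only checks the
# service itself via /_info instead of producing a test event to Kafka.
{{- if .Values.main_app.conf.test_events }}
readinessProbe:
  httpGet:
    path: /v1/_test/events   # end-to-end: produces a test event to Kafka
    port: {{ .Values.main_app.port }}
{{- else }}
readinessProbe:
  httpGet:
    path: /_info             # local-only: service is up and routing works
    port: {{ .Values.main_app.port }}
{{- end }}
```

With the local-only probe, a Kafka slowdown surfaces as produce errors from individual requests rather than the whole fleet being pulled from load balancing.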

@Ottomata could you please verify the above/patches to unblock the hardware refresh of kafka-main?

Change #1066718 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate: Offer readinessProbe that does not test kafka

https://gerrit.wikimedia.org/r/1066718

Change #1066719 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate-main: Disable end-to-end readinessProbe

https://gerrit.wikimedia.org/r/1066719

Change #1070257 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] eventgate: Disable end-to-end readinessProbe by default

https://gerrit.wikimedia.org/r/1070257

Deployed eventgate-main with the simple readinessProbe and uploaded a change to make it the default in the chart and all current deployments.

Change #1070257 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate: Replace end-to-end readinessProbe

https://gerrit.wikimedia.org/r/1070257

JMeybohm claimed this task.

Updated and deployed all eventgates to use the simple readinessProbe while keeping the test endpoint around for manual testing.