Have PyBal monitor Istio-Ingressgateway health
Closed, Declined (Public)

Description

Right now we run LVS services for istio-ingressgateway with:

monitors:
  IdleConnection:
    max-delay: 300
    timeout-clean-reconnect: 3

This has the downside that PyBal shows all nodes of a cluster as down when no ingress route/backend is configured, because the ingressgateway's Envoy will not accept connections on its traffic port (tcp/30443) in that case.

In addition, this might not catch errors reported by the ingressgateway via its internal health check (tcp/30021), although it is currently unclear whether there are errors that cause health checks to fail while connections are still possible.

The ingressgateway only serves health checks on a port (tcp/30021) that differs from the traffic port. To check those as well, PyBal's ProxyFetch monitor would need to be extended to allow checking a different port. A proposed CR exists at https://gerrit.wikimedia.org/r/c/operations/debs/pybal/+/759749
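
To illustrate what checking a separate health port amounts to, here is a minimal standalone sketch. It is not PyBal's ProxyFetch code and not the proposed patch; the function name `health_check_ok` is hypothetical, and the default path `/healthz/ready` is an assumption about the ingressgateway's readiness endpoint that should be verified against the actual deployment.

```python
import http.client
import urllib.error
import urllib.request


def health_check_ok(host, health_port, path="/healthz/ready", timeout=3.0):
    """Return True if an HTTP GET against the health port answers with 2xx.

    The health port is passed explicitly because it differs from the
    traffic port (tcp/30021 vs. tcp/30443 in the setup described above).
    The default path is an assumption, not taken from the task.
    """
    url = f"http://{host}:{health_port}{path}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Refused connections, timeouts, and non-2xx answers (HTTPError)
        # are all treated as "unhealthy".
        return False
```

A call such as `health_check_ok("node.example", 30021)` would then probe the health port rather than the traffic port; the hostname here is a placeholder.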

However, extending ProxyFetch in this way would not help in this particular case.
Kubernetes internally performs health checking on the dedicated health check port (tcp/30021). If that check fails, Kubernetes no longer serves traffic to that ingressgateway instance. In our setup (one ingressgateway per node), this means connections to the ingressgateway traffic port (tcp/30443) as well as to the health check port (tcp/30021) will be dropped by the node, as both are handled the same way.
Because of that, plain TCP connection monitoring seems to be sufficient (for PyBal as well as for monitoring/probes).
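
The conclusion above can be sketched as a one-shot TCP connect probe. This is only an illustration under stated assumptions: PyBal's IdleConnection monitor keeps a connection open rather than probing once, and the helper names `tcp_port_open` and `ingress_reachable` are hypothetical.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat the port as down.
        return False


def ingress_reachable(host, traffic_port=30443, timeout=3.0):
    """Probe only the traffic port.

    Per the description, the node drops connections to the traffic port
    and the health check port alike once the Kubernetes health check
    fails, so a connect to the traffic port alone should reflect both.
    """
    return tcp_port_open(host, traffic_port, timeout=timeout)
```

The default port is the traffic port named in the description; the timeout of 3 seconds is an arbitrary choice for the sketch.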

Event Timeline

Change 759749 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/debs/pybal@master] Allow to configure a different port for ProxyFetch monitor

https://gerrit.wikimedia.org/r/759749

JMeybohm updated the task description.
JMeybohm updated the task description.

Change 759749 abandoned by JMeybohm:

[operations/debs/pybal@master] Allow to configure a different port for ProxyFetch monitor

Reason:

Not needed right now

https://gerrit.wikimedia.org/r/759749