
Ensure kube2proxy handles apiserver failure gracefully
Closed, Invalid · Public

Description

When the Toolforge Kubernetes control plane server failed (see https://wikitech.wikimedia.org/wiki/Incident_documentation/20190910-toolforge-kubernetes), containers continued running, but everything died at the proxy level. That seemed like an unnecessary amount of collateral damage.

See if kube2proxy can learn to keep existing proxies up while it waits for the apiserver to come back online.
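One way to sketch the desired behavior: when the apiserver is unreachable, fall back to the last known-good routing table instead of tearing anything down. The function and names below are hypothetical and not kube2proxy's actual API; this is just a minimal illustration of the "keep existing proxies up" idea.

```python
def resolve_routes(fetch_services, last_good):
    """Return the routing table to apply.

    fetch_services: callable that queries the apiserver and may raise
    ConnectionError when the control plane is down (assumed interface).
    last_good: the previously applied routing table.

    On apiserver failure, keep serving the existing proxies rather
    than clearing them.
    """
    try:
        return fetch_services()
    except ConnectionError:
        # apiserver unreachable: leave the current proxies untouched
        return last_good


def sync_loop(fetch_services, apply_routes, sleep, iterations):
    """Poll-and-apply loop (bounded here for illustration)."""
    last_good = {}
    for _ in range(iterations):
        routes = resolve_routes(fetch_services, last_good)
        if routes != last_good:
            apply_routes(routes)  # only rewrite on actual change
            last_good = routes
        sleep()
```

The key property is that an apiserver outage degrades to "no updates" rather than "no routes".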

Event Timeline

Bstorm triaged this task as Medium priority.

So on investigation, it doesn't look like it actually removed them all, but at the same time, there's some odd behavior I see: it adds every service to the Redis backend on every loop, which is wrong.
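A loop that is idempotent with respect to Redis would instead diff the desired routes against what is already stored and write only the changes. This is a hypothetical sketch of that diff step, not kube2proxy's actual code or key layout:

```python
def diff_routes(desired, current):
    """Compute the minimal set of Redis writes.

    desired: routes derived from the apiserver (name -> backend).
    current: routes already stored in Redis.

    Returns (to_set, to_delete): only new/changed entries are written,
    only stale entries are removed, and an unchanged state produces
    no writes at all.
    """
    to_set = {name: backend for name, backend in desired.items()
              if current.get(name) != backend}
    to_delete = set(current) - set(desired)
    return to_set, to_delete
```

With this shape, a steady-state loop issues zero Redis commands instead of re-adding every service each pass.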

I'd have expected it to delete them all before dying, which would explain the outages we were seeing, but it didn't. That makes me wonder what's happening between kube2proxy and Redis.

The reason this failed is that kube-proxy requires the apiserver to correctly route the services. So it's not kube2proxy, per se. Closing this.