
Ensure kube2proxy handles apiserver failure gracefully
Open, Normal, Public

Description

When the Toolforge Kubernetes control plane server failed (see https://wikitech.wikimedia.org/wiki/Incident_documentation/20190910-toolforge-kubernetes), containers continued running, but everything died at the proxy level. That seems like an unnecessary amount of collateral damage.

See if kube2proxy can learn to keep existing proxies up while it waits for the apiserver to come back online.
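
A minimal sketch of what that could look like, assuming kube2proxy's main loop polls the apiserver and then writes routes into redis: treat a failed poll as "no new information" and leave the existing routes alone until the apiserver answers again. The apiserver URL, the redis key layout, and the `fetch_services`/`update_routes` helpers below are hypothetical stand-ins, not kube2proxy's actual code.

```python
import logging
import time

import redis
import requests

REDIS = redis.Redis(host="localhost")
# Hypothetical endpoint; the real service-list URL and auth differ.
APISERVER = "https://k8s-master.example:6443/api/v1/services"


def fetch_services(session):
    """Fetch the current service list from the apiserver."""
    resp = session.get(APISERVER, timeout=10)
    resp.raise_for_status()
    return resp.json()["items"]


def update_routes(services):
    """Write one route per service (hypothetical key layout)."""
    for svc in services:
        name = svc["metadata"]["name"]
        ip = svc["spec"]["clusterIP"]
        port = svc["spec"]["ports"][0]["port"]
        REDIS.hset("route:" + name, "backend", f"http://{ip}:{port}")


def sync_loop(session):
    while True:
        try:
            services = fetch_services(session)
        except requests.RequestException as err:
            # apiserver unreachable: log it and keep serving the
            # routes already in redis rather than tearing them down.
            logging.warning("apiserver unavailable, keeping routes: %s", err)
            time.sleep(10)
            continue
        update_routes(services)
        time.sleep(10)
```

The key property is that route deletion never happens on an error path: existing proxies keep working off the last good state, and reconciliation only resumes once the apiserver responds.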

Event Timeline

Bstorm triaged this task as Normal priority. · Sep 12 2019, 6:58 PM
Bstorm created this task.
Restricted Application added a subscriber: Aklapper. · Sep 12 2019, 6:58 PM

On investigation, it doesn't look like kube2proxy actually removed all the routes, but there is still some odd behavior: it adds every service to the redis backend on every loop, which is wrong.

For it to have caused the outages we were seeing, I'd expect it to have deleted all the routes before dying, but it didn't. That makes me wonder what's happening with redis inside kube2proxy.
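
If the loop really does re-add everything unconditionally, a diff-based reconcile would both cut the redundant writes and make the failure mode safer, since deletions only ever happen against a successfully fetched desired state. A sketch, reusing the hypothetical "route:*" key layout from above:

```python
def reconcile(redis_conn, desired):
    """Write only the routes that changed, rather than re-adding
    every service on each loop. `desired` maps route key -> backend
    URL; the "route:*" pattern is a hypothetical key layout.
    """
    current = {
        key.decode(): (redis_conn.hget(key, "backend") or b"").decode()
        for key in redis_conn.scan_iter("route:*")
    }
    for key, backend in desired.items():
        if current.get(key) != backend:
            redis_conn.hset(key, "backend", backend)  # new or changed route
    for key in current.keys() - desired.keys():
        redis_conn.delete(key)  # service disappeared from the apiserver
```

Because `desired` only exists after a successful apiserver fetch, a dead apiserver can never reach the deletion branch; combined with the error handling sketched in the description, existing proxies would survive a control plane outage.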
