Page MenuHomePhabricator

Calico 3.17.1 kube-controllers fail to reach apiserver at startup
Closed, InvalidPublic

Description

The kube-controllers of calico 3.17.1 fail to reach the apiserver for quite some time after being started

2021-01-07 13:29:49.792 [INFO][1] main.go 88: Loaded configuration from environment config=&config.Config{LogLevel:"info", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:"", DatastoreType:"kubernetes"}                   
W0107 13:29:49.796530       1 client_config.go:543] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.                           2021-01-07 13:29:49.800 [INFO][1] main.go 109: Ensuring Calico datastore is initialized
2021-01-07 13:29:49.839 [INFO][1] main.go 149: Getting initial config snapshot from datastore                                                                                    2021-01-07 13:29:49.863 [INFO][1] main.go 152: Got initial config snapshot
2021-01-07 13:29:49.864 [INFO][1] watchersyncer.go 89: Start called                                                                                                              2021-01-07 13:29:49.864 [INFO][1] main.go 169: Starting status report routine       
2021-01-07 13:29:49.865 [INFO][1] main.go 402: Starting controller ControllerType="Node"                                                                                         2021-01-07 13:29:49.865 [INFO][1] node_controller.go 138: Starting Node controller
2021-01-07 13:29:49.865 [INFO][1] watchersyncer.go 127: Sending status update Status=wait-for-ready                                                                              2021-01-07 13:29:49.865 [INFO][1] node_syncer.go 40: Node controller syncer status updated: wait-for-ready
2021-01-07 13:29:49.865 [INFO][1] watchersyncer.go 147: Starting main event processing loop                                                                                      2021-01-07 13:29:49.865 [INFO][1] watchercache.go 174: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2021-01-07 13:29:49.866 [INFO][1] resources.go 349: Main client watcher loop                                                                                                     2021-01-07 13:29:49.878 [INFO][1] watchercache.go 271: Sending synced update ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2021-01-07 13:29:49.878 [INFO][1] watchersyncer.go 127: Sending status update Status=resync                                                                                      2021-01-07 13:29:49.878 [INFO][1] node_syncer.go 40: Node controller syncer status updated: resync
2021-01-07 13:29:49.878 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches                                                                     2021-01-07 13:29:49.878 [INFO][1] watchersyncer.go 221: All watchers have sync'd data - sending data and final sync
2021-01-07 13:29:49.878 [INFO][1] watchersyncer.go 127: Sending status update Status=in-sync                                                                                     2021-01-07 13:29:49.878 [INFO][1] node_syncer.go 40: Node controller syncer status updated: in-sync
2021-01-07 13:29:49.889 [INFO][1] hostendpoints.go 90: successfully synced all hostendpoints                                                                                     2021-01-07 13:29:49.965 [INFO][1] node_controller.go 151: Node controller is now running
2021-01-07 13:29:49.965 [INFO][1] ipam.go 45: Synchronizing IPAM data                                                                                                            2021-01-07 13:30:09.885 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://kubestagemaster.svc.codfw.wmnet:6443/a
pis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded                                                                                             2021-01-07 13:30:09.885 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://kubestagemaster.svc.codfw.wmnet:6443/apis/crd.projectcalico.org/v1/clusterinformati
ons/default": context deadline exceeded
2021-01-07 13:30:41.885 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://kubestagemaster.svc.codfw.wmnet:6443/apis/crd.projectcalico.org/v1/clusterinformatio
ns/default": context deadline exceeded

... last 3 lines repeating every ~20sec ...

2021-01-07 13:44:53.919 [ERROR][1] client.go 261: Error getting cluster information config ClusterInformation="default" error=Get "https://kubestagemaster.svc.codfw.wmnet:6443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-01-07 13:44:53.920 [ERROR][1] main.go 207: Failed to verify datastore error=Get "https://kubestagemaster.svc.codfw.wmnet:6443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2021-01-07 13:45:25.920 [ERROR][1] main.go 238: Failed to reach apiserver error=Get "https://kubestagemaster.svc.codfw.wmnet:6443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
W0107 13:45:29.196341       1 reflector.go:326] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: watch of *v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: read tcp [2620:0:860:babe:9763:6d0c:9ec5:a6]:38386->[2620:0:860:104:10:192:48:10]:6443: read: connection timed out") has prevented the request from succeeding
2021-01-07 13:45:29.196 [INFO][1] ipam.go 45: Synchronizing IPAM data
2021-01-07 13:45:29.196 [INFO][1] resources.go 377: Terminating main client watcher loop
2021-01-07 13:45:29.196 [ERROR][1] runconfig.go 212: error watching KubeControllersConfiguration error=<nil>
2021-01-07 13:45:29.196 [INFO][1] watchercache.go 121: Watch error received from Upstream ListRoot="/calico/resources/v3/projectcalico.org/nodes" error=an error on the server ("unable to decode an event from the watch stream: read tcp [2620:0:860:babe:9763:6d0c:9ec5:a6]:38386->[2620:0:860:104:10:192:48:10]:6443: read: connection timed out") has prevented the request from succeeding
2021-01-07 13:45:29.196 [INFO][1] watchercache.go 174: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2021-01-07 13:45:29.546 [INFO][1] ipam.go 191: Node and IPAM data is in sync
2021-01-07 13:45:30.222 [INFO][1] resources.go 349: Main client watcher loop

After that, everything is fine and the kube-controllers work as expected.

Logs are from docker-registry.wikimedia.org/calico/kube-controllers:v3.17.1 running on kubestage2002.codfw.wmnet

This looks very similar to:

But I don't get which change may have caused this (as there are only two between 3.17.0 and 3.17.1:
https://github.com/projectcalico/kube-controllers/compare/v3.17.0...v3.17.1
With 3.17.0 I'm unable to reproduce this.