
kubernetes1005 BGP down for 3 weeks
Closed, ResolvedPublic

Description

cr1-eqiad is alerting as BGP has been down for 3 weeks towards kubernetes1005:

10.64.0.145           64601          0          0       0      10 3w6d 17:40:24 Active
2620:0:861:101:10:64:0:121       64601     949312     897439       0       3 14w0d 21:27:07 Establ

And indeed it looks like there is nothing listening on port 179 for that host.
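
A quick way to double-check that from the host side (a sketch; the router-side command is only noted in a comment since cr1-eqiad is Junos, not a shell):

# On kubernetes1005: calico's BIRD is what listens on the BGP port,
# so this should return nothing while calico is down
sudo ss -tlnp | grep ':179'
# (On cr1-eqiad, "show bgp neighbor 10.64.0.145" shows the same session from the
# router side; "Active" means it is still retrying and never established.)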

Event Timeline

ayounsi triaged this task as High priority.Aug 18 2021, 8:22 AM
ayounsi created this task.
Restricted Application added a subscriber: Aklapper.

Change 713611 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] BGP Icinga check, critical for k8s clusters

https://gerrit.wikimedia.org/r/713611

This happened while I was running docker pull tests on 2021-07-21 at ~15:04Z, and kubernetes1005 is one of the dedicated sessionstore nodes running on ganeti. The host's current state is that it is not running calico (which is why the BGP session is down), but Kubernetes has not detected that calico is no longer running there. That, to my surprise, seems to be something calico does not support: https://github.com/projectcalico/node/issues/519
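
A quick way to confirm calico-node is indeed gone from the node (a sketch; it assumes calico-node runs as a DaemonSet in the kube-system namespace, as in the upstream calico manifests):

# Every pod scheduled on kubernetes1005; a healthy node would include a calico-node pod
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=kubernetes1005.eqiad.wmnet
# Desired vs. current/ready counts for the DaemonSet itself
kubectl -n kube-system get daemonset calico-node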

kubectl describe node kubernetes1005.eqiad.wmnet
Name:               kubernetes1005.eqiad.wmnet
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    dedicated=kask
                    failure-domain.beta.kubernetes.io/region=eqiad
                    failure-domain.beta.kubernetes.io/zone=row-a
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=kubernetes1005.eqiad.wmnet
                    kubernetes.io/os=linux
                    node.kubernetes.io/disk-type=kvm
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 10.64.0.145/22
                    projectcalico.org/IPv6Address: 2620:0:861:101:10:64:0:145/64
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 23 Mar 2021 09:44:36 +0000
Taints:             dedicated=kask:NoExecute
                    dedicated=kask:NoSchedule
Unschedulable:      false
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Wed, 21 Jul 2021 14:34:27 +0000   Wed, 21 Jul 2021 14:34:27 +0000   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Wed, 18 Aug 2021 09:13:37 +0000   Tue, 23 Mar 2021 09:44:36 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Wed, 18 Aug 2021 09:13:37 +0000   Wed, 21 Jul 2021 14:50:28 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Wed, 18 Aug 2021 09:13:37 +0000   Tue, 23 Mar 2021 09:44:36 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Wed, 18 Aug 2021 09:13:37 +0000   Mon, 09 Aug 2021 13:34:30 +0000   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  10.64.0.145
  Hostname:    kubernetes1005.eqiad.wmnet
Capacity:
 cpu:                15
 ephemeral-storage:  9771516Ki
 hugepages-2Mi:      0
 memory:             4039148Ki
 pods:               110
Allocatable:
 cpu:                15
 ephemeral-storage:  9005429131
 hugepages-2Mi:      0
 memory:             3936748Ki
 pods:               110
System Info:
 Machine ID:                 6ec1715e151f40418f77b85a6977fe6e
 System UUID:                73d02206-87b6-4071-afdc-1df24f562d05
 Boot ID:                    5d20f931-8bc4-4c7d-94cc-0557d2a5e239
 Kernel Version:             4.19.0-0.bpo.14-amd64
 OS Image:                   Debian GNU/Linux 9 (stretch)
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://18.6.3
 Kubelet Version:            v1.16.15
 Kube-Proxy Version:         v1.16.15
Non-terminated Pods:         (6 in total)
  Namespace                  Name                                CPU Requests  CPU Limits   Memory Requests  Memory Limits  AGE
  ---------                  ----                                ------------  ----------   ---------------  -------------  ---
  sessionstore               kask-production-6895c57d44-87qkq    2500m (16%)   2500m (16%)  400Mi (10%)      400Mi (10%)    27d
  sessionstore               kask-production-6895c57d44-jdcm6    2500m (16%)   2500m (16%)  400Mi (10%)      400Mi (10%)    27d
  sessionstore               kask-production-6895c57d44-jnrmg    2500m (16%)   2500m (16%)  400Mi (10%)      400Mi (10%)    27d
  sessionstore               kask-production-6895c57d44-lm2qt    2500m (16%)   2500m (16%)  400Mi (10%)      400Mi (10%)    27d
  sessionstore               kask-production-6895c57d44-n5xsr    2500m (16%)   2500m (16%)  400Mi (10%)      400Mi (10%)    27d
  sessionstore               kask-production-6895c57d44-v8mlq    2500m (16%)   2500m (16%)  400Mi (10%)      400Mi (10%)    27d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                15 (100%)     15 (100%)
  memory             2400Mi (62%)  2400Mi (62%)
  ephemeral-storage  0 (0%)        0 (0%)
Events:              <none>

The sessionstore pods are all in CrashLoopBackOff because readiness probes fail.
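
One hedged way to inspect why they are failing (a sketch; "Unhealthy" is the event reason the kubelet uses for failed probes):

# Probe failures for the namespace
kubectl -n sessionstore get events --field-selector reason=Unhealthy
# Per-pod detail, including restart counts and last state
# (pod name taken from the node listing above)
kubectl -n sessionstore describe pod kask-production-6895c57d44-87qkq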

Change 713616 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/alerts@master] sre/kubernetes: Add alerting for nodes not running calico

https://gerrit.wikimedia.org/r/713616

Change 713611 merged by Ayounsi:

[operations/puppet@production] BGP Icinga check, critical for k8s clusters

https://gerrit.wikimedia.org/r/713611

Change 713616 merged by JMeybohm:

[operations/alerts@master] sre/kubernetes: Add alerting for nodes not running calico

https://gerrit.wikimedia.org/r/713616

The k8s event logs (https://logstash.wikimedia.org/goto/b16700661b703799af5ac188db2d3f5c) make it pretty clear that I created a lot of disk pressure on the ganeti k8s nodes (small disks), which led to evictions. It's still not clear to me why the scheduler gave up on restarting the calico-node DaemonSet.
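
For the record, the evictions are also visible straight from the API (a sketch; evicted pods linger as Failed objects until they are cleaned up):

# Eviction events, cluster-wide
kubectl get events --all-namespaces --field-selector reason=Evicted
# Evicted pods still hanging around on this node
kubectl get pods --all-namespaces -o wide --field-selector status.phase=Failed | grep kubernetes1005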

As for calico-node being evicted at all: I think this is because we did not enable the Priority admission plugin in https://gerrit.wikimedia.org/r/c/operations/puppet/+/677922. AIUI, setting priorityClassName: system-node-critical on calico-node Pods does not have any effect without the admission plugin enabled (created T289131 for that).
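
A rough way to sanity-check that theory (a sketch; it assumes the apiserver's admission plugins are set via the standard --enable-admission-plugins flag and that calico-node carries the upstream k8s-app=calico-node label):

# On a control-plane host: is "Priority" in the apiserver's admission plugin list?
ps -ef | grep -o 'enable-admission-plugins=[^ ]*'
# With the plugin active, priorityClassName gets resolved into .spec.priority;
# an empty value here would be consistent with the "no effect" theory above
kubectl -n kube-system get pods -l k8s-app=calico-node -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.priority}{"\n"}{end}'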

Change 713634 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/alerts@master] sre/kubernetes: Add runbook link for KubernetesCalicoDown

https://gerrit.wikimedia.org/r/713634

Ok, really dumb situation!
A bunch of (failing) sessionstore Pods are clogging all resources on kubernetes1005, leaving no room for the savior calico-node pod to be scheduled:

LAST SEEN   TYPE      REASON             OBJECT                  MESSAGE
27d         Warning   FailedScheduling   pod/calico-node-84vtj   0/17 nodes are available: 1 Insufficient cpu, 16 node(s) didn't match node selector.

What I did to clean this up:

kubectl cordon kubernetes1005.eqiad.wmnet
kubectl get pods -n sessionstore --field-selector spec.nodeName=kubernetes1005.eqiad.wmnet -o name | xargs kubectl -n sessionstore delete
# this was a bunch because lots of Evicted pods still hanging around
kubectl uncordon kubernetes1005.eqiad.wmnet
# kubectl get pods -n sessionstore -o wide | grep -v Evicted
# rebalance by deleting two pods on nodes with > 2 pods running
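
A quick way to verify recovery afterwards (a sketch; it assumes calicoctl is available on the node and calico-node lives in kube-system):

# calico-node should be scheduled on the node again
kubectl -n kube-system get pods -o wide --field-selector spec.nodeName=kubernetes1005.eqiad.wmnet
# and BIRD should report the BGP sessions towards the routers as Established
sudo calicoctl node status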

Change 713634 merged by Filippo Giunchedi:

[operations/alerts@master] sre/kubernetes: Add runbook link for KubernetesCalicoDown

https://gerrit.wikimedia.org/r/713634