
PAWS is down
Closed, Resolved (Public)

Description

kubectl get commands result in:
Error from server (InternalError): an error on the server ("") has prevented the request from succeeding
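
Any read against the API fails the same way; for example (the exact subcommand is not recorded in the task, get nodes is only an illustration):

$ kubectl get nodes
Error from server (InternalError): an error on the server ("") has prevented the request from succeeding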

kubelet does not look happy, judging from the messages on control node 3:

Dec  1 02:51:53 paws-k8s-control-3 kubelet[30184]: E1201 02:51:53.900701   30184 controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://k8s.svc.paws.eqiad1.wikimedia.cloud:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/paws-k8s-control-3?timeout=10s": context deadline exceeded
Dec  1 02:51:56 paws-k8s-control-3 kubelet[30184]: E1201 02:51:56.131023   30184 reflector.go:138] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:66: Failed to watch *v1.Pod: failed to list *v1.Pod: an error on the server ("") has prevented the request from succeeding (get pods)
Dec  1 02:51:58 paws-k8s-control-3 kubelet[30184]: E1201 02:51:58.116085   30184 kubelet_node_status.go:470] "Error updating node status, will retry" err="error getting node \"paws-k8s-control-3\": Get \"https://k8s.svc.paws.eqiad1.wikimedia.cloud:6443/api/v1/nodes/paws-k8s-control-3?timeout=10s\": context deadline exceeded"
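
The messages above are in syslog format; following them live on the node would look something like this (how they were actually collected is not stated, so these commands are an assumption):

root@paws-k8s-control-3:~# journalctl -u kubelet -f
root@paws-k8s-control-3:~# tail -f /var/log/syslog | grep kubelet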

The haproxy nodes do seem to be routing connections:

tcp               TIME-WAIT              0                   0                                                                  172.16.1.171:6443                                                   172.16.1.180:50340                          
tcp               TIME-WAIT              0                   0                                                                  172.16.1.171:6443                                                    172.16.1.99:34992                          
tcp               TIME-WAIT              0                   0                                                                  172.16.1.171:6443                                                   172.16.1.180:54744                          
tcp               TIME-WAIT              0                   0                                                                  172.16.1.171:6443                                                    172.16.1.99:35952                          
tcp               TIME-WAIT              0                   0                                                                  172.16.1.171:6443                                                    172.16.1.34:49474
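
The listing above looks like ss output; a sketch of reproducing it (the exact command and host are not recorded, the hostname below is illustrative):

root@paws-k8s-haproxy-1:~# ss -tan | grep ':6443'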

The cert seems alright:

$ openssl s_client -servername paws.wmcloud.org -connect paws.wmcloud.org:443 2>/dev/null | openssl x509 -noout -dates
notBefore=Nov 18 12:03:13 2022 GMT
notAfter=Feb 16 12:03:12 2023 GMT
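
That checks the web front end on 443; the kube-apiserver's own serving cert on 6443 can be checked the same way (not done in the task, and it would not have covered etcd's internal certs, which turn out to be the real problem below). The endpoint here is taken from the kubelet logs above:

$ openssl s_client -connect k8s.svc.paws.eqiad1.wikimedia.cloud:6443 </dev/null 2>/dev/null | openssl x509 -noout -dates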

Event Timeline

rook changed the task status from Open to In Progress. Dec 1 2022, 2:49 AM
rook updated the task description.

Docker is running on the nodes that I checked (control, worker, ingress), though nothing is responding, so it feels somewhat network-related. The haproxy and ingress nodes seem to be operating, which raises the question of what is actually wrong. It could be an auth problem, though I would expect PAWS to continue serving traffic if so.
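
A quick way to repeat that spot check across node types (the worker and ingress hostnames here are illustrative, not a record of what was actually run):

for host in paws-k8s-control-1 paws-k8s-worker-1 paws-k8s-ingress-1; do
  ssh "$host" 'hostname; systemctl is-active docker; docker ps --format "{{.Names}}" | wc -l'
done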

It seems the API server is failing to connect to etcd:

root@paws-k8s-control-1:~# docker ps -a | grep api
02804609413a   a5a584eef959                                  "kube-apiserver --ad…"   4 minutes ago    Exited (1) 4 minutes ago             k8s_kube-apiserver_kube-apiserver-paws-k8s-control-1_kube-system_0eb38484b8a3ce97e296f3234d5b0161_117
d287e05d8ef6   docker-registry.tools.wmflabs.org/pause:3.1   "/pause"                 5 hours ago      Up 5 hours                           k8s_POD_kube-apiserver-paws-k8s-control-1_kube-system_0eb38484b8a3ce97e296f3234d5b0161_4
root@paws-k8s-control-1:~# docker logs --tail 10 02804609413a
W1201 08:16:07.844938       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W1201 08:16:08.601181       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W1201 08:16:09.617087       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W1201 08:16:11.080041       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W1201 08:16:12.569958       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W1201 08:16:15.779969       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W1201 08:16:17.423037       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W1201 08:16:22.719299       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W1201 08:16:25.193656       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Error: context deadline exceeded
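
Connection refused on 127.0.0.1:2379 points at etcd on the same node, so the obvious next step is to check whether it is listening and what its container is logging (a sketch; the actual commands used are not recorded):

root@paws-k8s-control-1:~# ss -tlnp | grep 2379
root@paws-k8s-control-1:~# docker ps -a | grep etcd
root@paws-k8s-control-1:~# docker logs --tail 20 <etcd-container-id>   # hypothetical id taken from the previous command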

And in turn etcd is failing to connect, due to bad certs both from clients and between the etcd nodes themselves:

2022-12-01 08:21:46.373782 W | rafthttp: health check for peer bb93bab3d9194239 could not connect: x509: certificate has expired or is not yet valid
2022-12-01 08:21:46.374438 W | rafthttp: health check for peer bb93bab3d9194239 could not connect: x509: certificate has expired or is not yet valid
2022-12-01 08:21:46.376611 W | rafthttp: health check for peer d91a97a61e6f1e60 could not connect: dial tcp 172.16.1.178:2380: connect: connection refused
2022-12-01 08:21:46.377071 W | rafthttp: health check for peer d91a97a61e6f1e60 could not connect: dial tcp 172.16.1.178:2380: connect: connection refused
2022-12-01 08:21:46.437048 I | embed: rejected connection from "172.16.1.99:56920" (error "remote error: tls: bad certificate", ServerName "")

Looking...

Yep, the etcd cert is expired; acmechief might be dead again. Will try to trigger a refresh:

root@paws-k8s-control-1:~# openssl x509 -in /etc/kubernetes/pki/etcd/server.crt -text
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 2422556947181677617 (0x219ea6b54ceb2c31)
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN = etcd-ca
        Validity
            Not Before: May 26 18:05:27 2020 GMT
            Not After : Nov 15 12:58:30 2022 GMT
...
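
The peer and healthcheck-client certs live alongside it and can be checked the same way, or all kubeadm-managed certs can be listed at once (a sketch, not recorded in the task; the kubeadm certs subcommand is available here since it is used for the renewal below):

root@paws-k8s-control-1:~# kubeadm certs check-expiration
root@paws-k8s-control-1:~# for c in /etc/kubernetes/pki/etcd/*.crt; do echo -n "$c: "; openssl x509 -in "$c" -noout -enddate; done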

Mentioned in SAL (#wikimedia-cloud) [2022-12-01T08:30:08Z] <taavi> root@paws-k8s-control-1:~# for cert in etcd-server etcd-peer etcd-healthcheck-client; do kubeadm certs renew $cert; done # T324178
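
Renewing the files alone does not make already-running static pods pick them up; etcd and kube-apiserver normally need a restart on each control node to load the new certs (how that happened here is not recorded, and the crash-looping containers would also pick them up on their next restart). A hedged sketch:

root@paws-k8s-control-1:~# docker ps | grep -E 'k8s_etcd|k8s_kube-apiserver'
root@paws-k8s-control-1:~# docker restart <container-id>   # hypothetical; repeat per container and control node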

taavi claimed this task.