Seen following merge of:
- rOPUP1091451c19a6: tools k8s workers: add a mostly-permissive firewall
- rOPUP7785d31e2a95: network::constants: add fake CACHE_MISC for labs
Symptoms:
- Networking failures in Toolforge Kubernetes cluster due to SRC NAT failures
- Networking failures in PAWS Kubernetes cluster due to IP forwarding failures
Mitigation:
- Toolforge: clush -w @k8s-worker 'sudo service docker restart; sudo service flannel restart; sudo service kubelet restart; sudo service kube-proxy restart'
- PAWS: clush -w @paws-worker 'sudo iptables -P FORWARD ACCEPT'
Original report:
While trying to access https://tools.wmflabs.org/guc/?user=193.180.154.229 , I got the following nonstandard messages on the page:
Warning: dns_get_record(): A temporary server error occurred. in /data/project/guc/labs-tools-guc/src/IPInfo.php on line 87
Warning: PDO::__construct(): php_network_getaddresses: getaddrinfo failed: Name or service not known in /data/project/guc/labs-tools-guc/src/App.php on line 32
Error: Database error: Unable to connect to s1.web.db.svc.eqiad.wmflabs
TODO (lessons learned while debugging):
- Build an image with reasonable diag tools (dig, ping, traceroute, mtr, ...)
- Run a DaemonSet that places a diagnostic pod on all worker nodes (see the sketch after this list)
- Have an easy command to list all pods on a node (kubectl get pods --all-namespaces -o wide | grep tools-worker-1002)
- Write a runbook page for flannel debugging
- Have an easy command to start a new pod on a given node (https://kubernetes.io/docs/concepts/configuration/assign-pod-node/; see the sketch after this list)
- Monitor and alert on pod DNS failures - right now we only find out because IRC bots drop off when this happens
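
The first two TODO items could look roughly like the sketch below. Everything in it is an assumption rather than something that exists today: "debug-tools:latest" is a placeholder for an image we would still have to build (with dig, ping, traceroute, mtr, ...), the namespace and labels are arbitrary, and the DaemonSet apiVersion depends on the cluster's Kubernetes release.

```
# Sketch: run a placeholder diagnostic image on every worker node via a DaemonSet.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: diag
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: diag
  template:
    metadata:
      labels:
        app: diag
    spec:
      containers:
      - name: diag
        image: debug-tools:latest      # placeholder image with dig/ping/traceroute/mtr
        command: ["sleep", "infinity"]
EOF

# Once it is running, exec into the copy on the suspect node, e.g.:
# kubectl -n kube-system get pods -o wide -l app=diag | grep tools-worker-1002
# kubectl -n kube-system exec -it <diag-pod-on-that-node> -- dig tools.wmflabs.org
```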
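
For the "list all pods on a node" and "start a new pod on a given node" items, sketches along these lines should work; the node name is just the example from above, the pod and image names are placeholders, and --field-selector needs a reasonably recent kubectl (the grep variant in the list works everywhere).

```
# List every pod scheduled on a given node without grepping:
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=tools-worker-1002

# Start a throwaway pod pinned to that node by setting spec.nodeName directly
# (bypasses the scheduler; "debug-tools:latest" is the same placeholder image):
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: diag-tools-worker-1002
spec:
  nodeName: tools-worker-1002
  restartPolicy: Never
  containers:
  - name: diag
    image: debug-tools:latest
    command: ["sleep", "3600"]
EOF
```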