
Toolforge: new k8s: issues with the initial coredns setup
Closed, Resolved, Public

Description

Something still needs to be configured in CoreDNS.

root@toolsbeta-test-k8s-master-1:~# cat busybox.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: busybox
  namespace: default
spec:
  containers:
  - name: busybox
    image: busybox:1.28
    command:
      - sleep
      - "3600"
    imagePullPolicy: IfNotPresent
  restartPolicy: Always
root@toolsbeta-test-k8s-master-1:~# kubectl apply -f  busybox.yaml
pod/busybox created
root@toolsbeta-test-k8s-master-1:~# kubectl get pods busybox
NAME      READY   STATUS    RESTARTS   AGE
busybox   1/1     Running   0          10s
root@toolsbeta-test-k8s-master-1:~# kubectl exec -ti busybox -- nslookup kubernetes.default
Server:    10.96.0.10
Address 1: 10.96.0.10
nslookup: can't resolve 'kubernetes.default'
command terminated with exit code 1
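
A quick way to narrow this down is to check whether CoreDNS itself is healthy. Assuming the stock kubeadm deployment, where the CoreDNS pods keep the k8s-app=kube-dns label and sit behind the kube-dns service, something like:

kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system logs -l k8s-app=kube-dns
kubectl -n kube-system get svc kube-dns

If the pods are Running and the kube-dns service ClusterIP matches the 10.96.0.10 the pod is querying, the problem is more likely on the path from the pod to the service IP than in CoreDNS itself.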

The kubelet configuration file generated by kubeadm also contains a reference to this bogus 10.x range:

root@toolsbeta-test-k8s-master-1:~# grep -B1 10.96 /var/lib/kubelet/config.yaml 
clusterDNS:
- 10.96.0.10

Right now, if you create a pod that contacts the API server, it tries the 10.x address:

root@toolsbeta-test-k8s-master-1:~# kubectl logs nginx-ingress-59c8769f89-pzkdb -n nginx-ingress
I0722 13:02:39.559078       1 main.go:155] Starting NGINX Ingress controller Version=edge GitCommit=18ab23a3
F0722 13:03:09.564409       1 main.go:261] Error trying to get the default server TLS secret nginx-ingress/default-server-secret: could not get nginx-ingress/default-server-secret: Get https://10.96.0.1:443/api/v1/namespaces/nginx-ingress/secrets/default-server-secret: dial tcp 10.96.0.1:443: i/o timeout
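
10.96.0.1 is the ClusterIP of the kubernetes service in the default namespace (the virtual front end for the API server), so a useful sanity check is whether that service and its endpoints look right. These are generic checks, not output from this cluster:

kubectl -n default get svc kubernetes
kubectl -n default get endpoints kubernetes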

In this new cluster we are using podSubnet: "192.168.0.0/16", so my first idea is that the kube-apiserver pod should have an address in this range.
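
For reference, the service CIDR the API server was actually started with can be checked like this, assuming the standard kubeadm static-pod manifest location:

grep service-cluster-ip-range /etc/kubernetes/manifests/kube-apiserver.yaml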

Event Timeline

Just found this in the kubeadm init help:

--service-cidr string                  Use alternative range of IP address for service VIPs. (default "10.96.0.0/12")

We may be using this default, which would explain the failure.
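
If we want to pick the ranges explicitly, kubeadm accepts them either as flags or in the init configuration, roughly like this (v1beta2 field names; the values are only an illustration, with serviceSubnet and dnsDomain shown at their upstream defaults):

apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
networking:
  podSubnet: "192.168.0.0/16"
  serviceSubnet: "10.96.0.0/12"
  dnsDomain: "cluster.local"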

That 10.x range is not bogus. It is the correct default for the service IPs in kubeadm and should work with everything out of the box. Changing it caused some contention between some services and their definitions, but it can be made consistent. I don't recommend we change it unless we absolutely have to; it makes no real difference to anything outside the cluster.

nslookup: can't resolve 'kubernetes.default' -- that's not our CoreDNS domain. It's toolsbeta.default, I think. It's in our init config.
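
In other words, the fully-qualified name depends on the dnsDomain set in the init config. With the upstream default it would be kubernetes.default.svc.cluster.local; with a custom domain the suffix changes accordingly (the toolsbeta domain below is only the guess from the previous sentence):

kubectl exec -ti busybox -- nslookup kubernetes.default.svc.cluster.local
kubectl exec -ti busybox -- nslookup kubernetes.default.svc.toolsbeta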

If we cannot contact the service IPs, the issue will be there regardless of the range. I had thought it could be because the PSPs only allow things within kube-system, but it might also be something else that needs fixing.

I would not expect anything to successfully run outside of kube-system the way I left it, to be clear.
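
For context, a permissive default PodSecurityPolicy plus the RBAC grant that lets all service accounts use it looks roughly like this. This is only a sketch with made-up names, not the content of the Gerrit patch below:

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: default-allow
spec:
  privileged: false
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
  - '*'
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp-default-allow
rules:
- apiGroups: ['policy']
  resources: ['podsecuritypolicies']
  resourceNames: ['default-allow']
  verbs: ['use']
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: psp-default-allow
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-default-allow
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:serviceaccounts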

Change 524803 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: put up a default psp to unblock other work

https://gerrit.wikimedia.org/r/524803

That's what I left out. I probably shouldn't have.

Change 524803 merged by Bstorm:
[operations/puppet@production] toolforge: put up a default psp to unblock other work

https://gerrit.wikimedia.org/r/524803

aborrero closed this task as Resolved. Jul 22 2019, 5:18 PM

There was an issue with iptables. After rebooting the worker nodes, everything seems to be working more or less as expected.
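
For future reference, checks like these usually pinpoint it (assuming kube-proxy in iptables mode; generic commands, not the exact ones used here). The iptables check runs on the affected worker node, the log check from wherever kubectl is configured:

iptables -t nat -L KUBE-SERVICES -n | grep 10.96.0.1
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=20

If the KUBE-SERVICES entries for the service IPs are missing or stale, restarting kube-proxy (or rebooting the node, as was done here) rebuilds them.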

aborrero renamed this task from Toolforge: new k8s: evaluate DNS setup for coredns to Toolforge: new k8s: issues with the initial coredns setup. Thu, Nov 28, 12:30 PM