
Kubernetes and docker packages for stretch are needed for toolforge bastions
Closed, Declined · Public

Description

The current packages derive from the SRE team's k8s system setup. That setup is incomplete for stretch and very out of step with common practices around k8s, which limits Toolforge flexibility and upgrades.

Since it is currently not possible to stand up a bastion that can talk to both Son of Grid Engine and Kubernetes at the same time (SGE requires stretch), this is a blocker for proceeding with the full Trusty deprecation in Tools.

To unblock the quarterly goal, I suggest we simply get packages for kubernetes-node, kubernetes-client, flannel and docker-ce into the tools repo that are compatible with stretch. Combined with the current puppet code (with some modifications to our profile), this should be sufficient to stand up a bastion that works with the rest of the environment.

Since flannel is not normally packaged on its own (it is usually installed via kubeadm), we will have to make our own package for it.

Ultimately, it may be necessary to use kubeadm due to the structure of upstream packaging.
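For illustration, one quick-and-dirty way to build such a flannel package could look roughly like this (a sketch only: the fpm tool, the upstream release URL and the version are examples rather than decisions, and a real package would also need a systemd unit and config handling before it could go into the tools repo):

# Grab an upstream flanneld release binary (example version)
wget https://github.com/coreos/flannel/releases/download/v0.10.0/flanneld-amd64
chmod +x flanneld-amd64

# Wrap it in a minimal .deb with fpm; enough for experimentation only
fpm -s dir -t deb -n flannel -v 0.10.0 \
    --description 'flannel overlay network daemon (locally packaged)' \
    ./flanneld-amd64=/usr/bin/flanneld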

Event Timeline

Bstorm created this task.

Change 473822 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] sonofgridengine: unbreak the bastion profile

https://gerrit.wikimedia.org/r/473822

Change 473822 merged by Bstorm:
[operations/puppet@production] sonofgridengine: unbreak the bastion profile

https://gerrit.wikimedia.org/r/473822

Let me know if I can be of any help :-)

@aborrero I dare say you can be. We will probably both need to mirror updated k8s stretch packages and docker-ce stretch packages into tools aptly and then hack some puppet around them so that our setup can be maintained. To unblock the grid upgrade, all we need is a kubernetes-client with all it needs to get by. Part of that is likely a flannel package (which we'd have to invent) or flannel installed via kubeadm. I'll have to dig deeper to be sure exactly what is required to just get a bastion talking to both existing k8s and sonofgridengine with minimal tech debt for the next phases of k8s upgrades.
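For the mirroring part, something along these lines might work with aptly (just a sketch: the mirror name, the target repo name and the published distribution below are placeholders, not how our aptly is actually laid out):

# The upstream docker signing key has to be imported into aptly's trustedkeys.gpg first (omitted here)

# Mirror docker-ce for stretch from upstream
aptly mirror create -architectures=amd64 docker-ce-stretch \
    https://download.docker.com/linux/debian stretch stable
aptly mirror update docker-ce-stretch

# Pull just the packages we care about into the local repo and republish
aptly repo import docker-ce-stretch tools docker-ce
aptly publish update stretch-tools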

aborrero moved this task from Soon! to Doing on the cloud-services-team (Kanban) board.

Claiming this task while I work on this.

I would need some additional clarification so we are on the same page. Mind that I'm fairly new to this problem :-)

  • kubernetes-client is available in stretch in version 1.10.6-1 from our internal repo. Also available in Debian is 1.7.16+dfsg-1 (currently only in sid; it could perhaps be rebuilt for stretch, https://tracker.debian.org/pkg/kubernetes)

From your comments I understand our internal repo version doesn't work for us, right? What issue do you see with it? Also, would the Debian version in sid work for us?

aborrero@toolsbeta-sgebastion-03:~$ apt-cache policy kubernetes-client
kubernetes-client:
  Installed: 1.10.6-1
  Candidate: 1.10.6-1
  Version table:
 *** 1.10.6-1 1001
       1001 http://apt.wikimedia.org/wikimedia stretch-wikimedia/main amd64 Packages
        100 /var/lib/dpkg/status
  • docker-ce: my understanding is that the equivalent package is either docker.io (from Debian) or docker-engine (from our internal apt repo).

Which concrete version do we need and why (why do we need to have docker in the bastions)?

aborrero@toolsbeta-sgebastion-03:~$ apt-cache policy docker-engine docker.io
docker-engine:
  Installed: 1.12.6-0~debian-jessie
  Candidate: 1.12.6-0~debian-jessie
  Version table:
 *** 1.12.6-0~debian-jessie 1001
       1001 http://apt.wikimedia.org/wikimedia stretch-wikimedia/thirdparty/k8s amd64 Packages
        100 /var/lib/dpkg/status
docker.io:
  Installed: (none)
  Candidate: 1.6.2~dfsg1-1~bpo8+1
  Version table:
     1.6.2~dfsg1-1~bpo8+1 100
        100 http://mirrors.wikimedia.org/debian jessie-backports/main amd64 Packages

But it was mentioned that we don't really need to run flannel in the bastions, right?

Not answering @aborrero's questions, just providing some datapoints.

Although Kubernetes only supports one version backward/forward, I was able to use kubectl 1.10.10 just fine with webservice shell to connect to our 1.4.6 cluster. The rest of the webservice interactions use pykube which should be fine since we aren't updating that dependency.
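(If anyone wants to double-check the skew from a bastion, something like this works; the output lines below are approximate, but both versions are the ones already mentioned in this task:)

kubectl version --short
# Client Version: v1.10.10
# Server Version: v1.4.6+e569a27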

However, it's possible we could run into issues because we have users running kubectl directly. I've checked the shell histories and we have everything in there (people editing deployments directly, exec'ing, applying manifests, etc.). All of those could break if we go above versions 1.4.x-1.5.x.

Lastly, just reiterating we are not extending the overlay network to the bastions, so flannel/docker wouldn't be necessary. We have kubernetes-node installed on the bastions but that doesn't seem necessary, as I think Brooke already mentioned. It doesn't seem the docker daemon is being used for anything either.

# apt-cache depends  kubernetes-node kubernetes-client python-pykube
kubernetes-node
  Depends: libc6
  Depends: adduser
 |Depends: docker.io
  Depends: <docker-engine>
  Recommends: kubernetes-client
kubernetes-client
python-pykube
  Depends: python
  Depends: python
  Depends: <python:any>
    python
  Depends: python-requests
  Depends: python-yaml
  Depends: python-six

docker.io (from Debian) and docker-engine are both ancient and practically broken at this point. docker-ce is only available from docker upstream. Some of the internal folks use it in some cases, but I think mostly for CI.

All that said, I'm gonna try something silly and see if I can *just* install kubernetes-client and see what happens. If it's that easy, I'll be pleased and annoyed at the same time! :)
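Roughly what I mean, as a sketch (it assumes the worker-only packages can be removed from a bastion without breaking anything else):

# drop the pieces that only matter on k8s workers
apt-get remove --purge kubernetes-node docker-engine
# keep just the client and sanity-check it
apt-get install kubernetes-client
kubectl version --client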

Change 475124 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] sonofgridengine: This may be all that is needed for now on bastions

https://gerrit.wikimedia.org/r/475124

Change 475124 merged by Bstorm:
[operations/puppet@production] sonofgridengine: This may be all that is needed for now on bastions

https://gerrit.wikimedia.org/r/475124

$ kubectl get pods --insecure-skip-tls-verify=true
Error from server: client: etcd cluster is unavailable or misconfigured

This could mean many things. At first glance, it suggests that we need etcd, but that doesn't seem to make sense. It could be something blocked at the network level, or we may need a cert that is buried in the default puppetization and actually still be blocked by that.
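Some quick checks that could narrow it down (hedged; depending on the apiserver's auth settings the healthz endpoint may not answer without credentials):

# is the apiserver port reachable from the bastion at all?
nc -zv toolsbeta-k8s-master-01.toolsbeta.eqiad.wmflabs 6443
# the apiserver's own health check includes an etcd check
curl -k https://toolsbeta-k8s-master-01.toolsbeta.eqiad.wmflabs:6443/healthz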

Note that for cross-region k8s, it was necessary to add a security group to the k8s master in toolsbeta with

ALLOW 6443:6443/tcp from 172.16.0.0/21

That's how I got to this point. This could be etcd being down, version mismatch problems, etc.
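(For the record, that security group rule can be expressed with the OpenStack CLI roughly as follows; the security group name is a placeholder:)

openstack security group rule create --ingress --protocol tcp \
    --dst-port 6443 --remote-ip 172.16.0.0/21 <k8s-master-secgroup>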

I could connect using this:

root@toolsbeta-sgebastion-03:~# kubectl get pods
The connection to the server localhost:8080 was refused - did you specify the right host or port?
root@toolsbeta-sgebastion-03:~# kubectl get pods --insecure-skip-tls-verify=true --kubeconfig=/etc/kubernetes/kubeconfig 
No resources found.

With this config (it was already on the server, I didn't write it; was it you, @Bstorm?):

root@toolsbeta-sgebastion-03:~# cat /etc/kubernetes/kubeconfig 
apiVersion: v1
kind: Config
preferences: {}
clusters:
  - cluster:
      server: https://toolsbeta-k8s-master-01.toolsbeta.eqiad.wmflabs:6443
    name: default
contexts:
  - context:
      cluster: default
      user: client-infrastructure
    name: default
current-context: default
users:
  - name: client-infrastructure
    user:
      token: faketoken1
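(Aside: the --kubeconfig flag can be avoided by exporting the path, since kubectl honours the KUBECONFIG environment variable:)

export KUBECONFIG=/etc/kubernetes/kubeconfig
kubectl get pods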

Not sure why, but I can even run it without the SSL flag:

root@toolsbeta-sgebastion-03:~# kubectl get pods --kubeconfig=/etc/kubernetes/kubeconfig
No resources found.

So, before I keep investigating, I'd like a status update on this :-)

No, I didn't write it. That was part of some puppetization or another. Users have a local config in their project folder that is generally the predominant one; maintain-kubeusers creates it. The way to test is generally to become the test tool and then try (see the sketch below). This makes me think that the more recent version of the client requires a difference in the config. If that's all it is, I'll be quite happy.
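Something like this, assuming the toolsbeta.test tool exists on that bastion:

# switch to the test tool; its local kubeconfig (created by maintain-kubeusers) then takes precedence
become test
kubectl get pods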

For the record:

toolsbeta.test@toolsbeta-sgebastion-03:~$ kubectl get nodes
NAME                                            STATUS    ROLES     AGE       VERSION
toolsbeta-worker-1001.toolsbeta.eqiad.wmflabs   Ready     <none>    203d      v1.4.6+e569a27

Yeah... I see that. Did we change anything? I mean, if it just works suddenly, I'm pretty much ok with that.

I didn't change a single bit.

To be fair, the only thing I remember doing is commenting out this line in /etc/kubernetes/config on toolsbeta-sgebastion-03:

# How the controller-manager, scheduler, and proxy find the apiserver
#KUBE_MASTER="--master=http://toolsbeta-k8s-master-01.toolsbeta.eqiad.wmflabs:6443"

(also, the line was pointing to localhost)

Huh. Ok. So perhaps we need to puppetize that with a simple template or something.

But I really wonder how that config ended up on toolsbeta-sgebastion-03.eqiad.wmflabs if it's not already in puppet.

I used to have some broken puppet stuff on there. I removed it. We may have installed flannel, node, etcd etc. just to get a proper config.

I can try installing a clean sgebastion to see what that gets us :)

A clean bastion (in this case, toolsbeta-sgebastion-04) Just Works. That's it, I'm closing this ticket.