
Deploy upgraded Kubernetes to toolsbeta
Open, High, Public

Description

This is the epic for the first step of putting this up in beta before it goes live.

Related Objects

Event Timeline

Bstorm added a comment. Jul 4 2019, 2:32 PM

Lemme know what you think! I kind of struggled with the whole idea myself, and that's where I ended up. The logic could possibly be turned in the opposite direction and one could say we need a script for rebuilding a node (that copies the certs for us) instead, figuring *that* won't happen often either?

I think we can use this kubeadm option:

sudo kubeadm init phase upload-certs --upload-certs

which just re-uploads the certs. So the workflow I'm proposing is (sketched as commands below):

  • bootstrap the first control plane node (master-1)
  • bootstrap the other control plane nodes (master-2, master-3)
  • days or weeks pass
  • if we later want to rebuild any of the 3, use the command above to re-generate the certs into the secret store
  • bootstrap the new master node the same way as before, because certs are already in the secret store.
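
A minimal sketch of that workflow as commands, with placeholder token/hash/key values (the config path is the one puppet installs, as used later in this task):

# on master-1: bootstrap the first control plane node
sudo kubeadm init --config /etc/kubernetes/kubeadm-init.yaml --upload-certs

# on master-2 and master-3: join as control plane nodes
sudo kubeadm join toolsbeta-k8s-master.toolsbeta.wmflabs.org:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> --control-plane --certificate-key <key>

# days or weeks later, before rebuilding a master: re-upload the certs into the secret store
sudo kubeadm init phase upload-certs --upload-certs

# then bootstrap the rebuilt master with the same join command as above
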
aborrero added a comment. Edited Jul 4 2019, 4:33 PM

I think we can use this kubeadm option:

sudo kubeadm init phase upload-certs --upload-certs

I just tested this. It works!

Since the very basic deployment is already working in toolsbeta, I would suggest we split the remaining work into subtasks, like:

I think I will create those as subtasks.

I detected an issue in the etcd pods:

2019-07-04 17:16:06.697591 I | embed: rejected connection from "172.16.0.104:32810" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"etcd-ca\")", ServerName "")
2019-07-04 17:16:06.719517 I | embed: rejected connection from "172.16.0.104:32812" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"etcd-ca\")", ServerName "")

After a while, etcd pods (and the api-server ones) start to crash and eventually the cluster dies. I wonder if we are overwriting something with puppet.

Bstorm added a comment. Jul 4 2019, 9:12 PM

Is that happening in the one deployed with puppet-distributed certs or using CLI only? If using CLI only, it would be a bug in kubeadm, I think. I wasn't seeing that with the puppet-distributed ones (which is basically manual distribution)--at least not for as long as I watched it; it may have snuck up later.

Or...could it possibly be overwriting/generating some certs when it does sudo kubeadm init phase upload-certs --upload-certs? Like using the kubeapi CA cert still, but regenerating the etcd ones, which would cause that?

Bstorm added a comment. Jul 4 2019, 9:30 PM

My thinking is that maybe it all works except when we do the upload-certs later to try to rebuild? This would suggest we still need a manual cert copy for a rebuild, which isn't the end of the world. Just more docs and/or scripts (or even possibly adding the certs to labs/private later on for that, which would work fine as well--and the way I did it in puppet, it isn't used without a hiera trigger). I don't know for sure without some logs or testing, though, obviously. Just a guess.

I think I know what is going on.

In my tests I did something like this:

  1. bootstrap the cluster
  2. add new members using the --upload-certs trick (testing a later upload)
  3. for further investigation, I then deleted a master (kubeadm reset -f + kubectl delete node <whatever>)
  4. try to join a control plane node again using the --upload-certs trick.

I think the problem here is that in step 3, etcd is not properly cleaned up by kubeadm.
You can see log messages like the following:

2019-07-05 09:48:51.686278 W | rafthttp: health check for peer dc517b72d6f67e5c could not connect: dial tcp 172.16.0.175:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
2019-07-05 09:48:51.719150 W | rafthttp: health check for peer dc517b72d6f67e5c could not connect: dial tcp 172.16.0.175:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")

The etcd cluster is still expecting to contact the deleted node! This means the etcd cluster is unhealthy, and any subsequent join operation for the control plane fails. This seems to me like a bug in kubeadm.

So, I see here 2 possible bugs:

  1. the issue with etcd-ca and ServerName "", which I'm not yet sure what it means.
  2. the etcd member not being properly removed when resetting/deleting a control plane node, preventing any further join operation for masters (see the cleanup sketch after this list).
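
For reference, the stale member can probably be dropped by hand with etcdctl, assuming etcdctl is available somewhere that can reach the etcd peers and read the certs (the cert paths below are the kubeadm defaults for the stacked etcd, so treat them as assumptions); run on a surviving master:

ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/peer.crt \
  --key /etc/kubernetes/pki/etcd/peer.key \
  member list
# then remove the stale member by its id (dc517b72d6f67e5c in the logs above)
ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/peer.crt \
  --key /etc/kubernetes/pki/etcd/peer.key \
  member remove dc517b72d6f67e5c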

I can't manage to do the join step for master nodes using a config file. I can only do it successfully by using the cmdline generated by the first kubeadm init run.

I've been trying with this config file (and some additional random modifications), which is a variant of the one installed by puppet:

apiVersion: kubeadm.k8s.io/v1beta2
kind: JoinConfiguration
discovery:
  bootstrapToken:
    apiServerEndpoint: toolsbeta-k8s-master.toolsbeta.wmflabs.org:6443
    token: "m7uakr.ern5lmlpv7gnkacw"
    unsafeSkipCAVerification: true
  timeout: 5m0s
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: JoinControlPlane
localAPIEndpoint:
  bindPort: 6443
CertificateKey: "test"

Summary of the current issues I see/have in the bootstrapping/lifecycle workflow:

  • we can only add a master if using the auto-generated cmdline for kubeadm join
  • we can not delete a master node, since that will leave etcd in an inconsistent state (even when explicitly running kubeadm reset -f phase remove-etcd-member)

I was able to re-join a master node to the cluster after a reset if:

  • you don't delete /var/lib/etcd after the reset
  • you run kubeadm join skipping a bunch of steps: kubeadm join toolsbeta-k8s-master.toolsbeta.wmflabs.org:6443 --token m7uakr.ern5lmlpv7gnkacw --discovery-token-ca-cert-hash sha256:whatever --experimental-control-plane --certificate-key whatever --ignore-preflight-errors=DirAvailable--var-lib-etcd --skip-phases=check-etcd,control-plane-join/etcd
  • kubeadm will be unable to create the etcd pod, so you need to copy /etc/kubernetes/manifests/etcd.yaml from other master node, update the IP addresses and then:
root@toolsbeta-test-k8s-master-3:~# kubectl apply -f /etc/kubernetes/manifests/etcd.yaml 
pod/etcd created
Bstorm added a comment. Edited Jul 5 2019, 4:10 PM

Yeah, it doesn't seem possible to join a master via the config file (on your earlier comment). In at least one bug (https://github.com/kubernetes/kubeadm/issues/1485), the developers stated that this is "by design" that --control-plane is only available for the CLI. The only use in having the join config on a control plane node seems to be if we want to spin up later without the ca verification option.

Overall, it seems like rebuilding a master node will not be fun and requires that we document all of this very well. Thank you for figuring it out with the CLI. It seems almost like it's actually smoother the way I'd done it using puppet? Manually copying the certs either with puppet or CLI, the cluster seems to join up fast...but etcd will still have a problem if the node is replacing an old one, like you said.

Etcd will not forget a node unless you force it to. I don't think kubeadm has a direct interface into that (yet), so we'd have to use etcdctl https://docs.okd.io/latest/admin_guide/assembly_restore-etcd-quorum.html

Maybe we need to make sure we know how to communicate with the containerized etcd. It's probably no different as long as we have etcdctl installed somewhere. If installed via package, it might try to start etcd on the node? I don't know yet. Have to look into that :-/
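
One way to avoid installing the package at all might be to run etcdctl inside one of the etcd static pods, since the kubeadm etcd image bundles etcdctl (a sketch; whether the image also ships env/sh for setting ETCDCTL_API is worth checking, as are the cert paths):

kubectl -n kube-system exec etcd-toolsbeta-test-k8s-master-1 -- \
  env ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/peer.crt \
  --key /etc/kubernetes/pki/etcd/peer.key \
  member list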

Side comment: is there a way for developers (who are not WMF staff) to use them in beta?


We don't have an actual kubernetes cluster yet. We build and dump it several times a day. I would suggest you wait until we have something ready for the Toolforge service.


We don't have etcdctl anywhere. I wonder if we should try using the external etcd cluster approach before anything else.
Copying certificates once generated by kubeadm is not a very elegant solution I think. I would rather use puppet certs instead, but we already know that could be complex as well.

:-/

@Bstorm and I just had a meeting. We decided the following:

  • try using an external etcd server again. This time, using Debian Buster, which contains etcd 3.2.26+dfsg-3 (higher than the required 3.2.18) (https://tracker.debian.org/pkg/etcd)
  • we will try using puppet certs for the etcd server, and keep k8s using its own CA
  • we will eventually try kubeadm in debian buster as well. Mind iptables changes in Debian buster (kube-proxy, docker, etc)

Mentioned in SAL (#wikimedia-cloud) [2019-07-15T12:27:13Z] <arturo> create toolsbeta-test-k8s-etcd-1 VM T215531

The new etcd server is ready (apparently):

aborrero@toolsbeta-test-k8s-etcd-1:~ $ sudo etcdctl --endpoints https://toolsbeta-test-k8s-etcd-1.toolsbeta.eqiad.wmflabs:2379 cluster-health
member 67a7255628c1f89f is healthy: got healthy result from https://toolsbeta-test-k8s-etcd-1.toolsbeta.eqiad.wmflabs:2379
cluster is healthy

Change 523220 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: k8s: kubeadm: now using external etcd servers

https://gerrit.wikimedia.org/r/523220

Just a comment on the certs issue:
With an external etcd cluster, the external cluster is in control of the server certs. If it requires the certs to be in our config, then this version of etcd does require authentication (or this version of k8s does), which is honestly the right thing to do. I was figuring we could continue to use it without auth and use the puppet certs like we used to.

The only way to ensure the auth works like this without distributing certs to the etcd servers (which I'd rather do via puppet than one more manually copied cert) is if we use puppet to generate the shared client cert. We'd just have to use the puppet CA and openssl to do that (storing the result in the local labs-private). Then we shouldn't need to copy any certs EXCEPT the etcd client cert. That one we'd have to copy to all k8s masters. If the current running k8s master authenticates in toolforge (and I highly doubt it does), it would be using its puppet host cert as a client cert. That won't work with multiple masters.
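
A rough sketch of what that could look like with openssl on the puppetmaster (the CA paths and the CN are assumptions, just to illustrate the shape of it):

# key + CSR for the shared etcd client identity
openssl genrsa -out etcd-client.key 2048
openssl req -new -key etcd-client.key -subj "/CN=toolsbeta-k8s-etcd-client" -out etcd-client.csr
# sign with the puppet CA, then stash the result in the local labs-private
openssl x509 -req -in etcd-client.csr \
  -CA /var/lib/puppet/ssl/ca/ca_crt.pem \
  -CAkey /var/lib/puppet/ssl/ca/ca_key.pem \
  -CAcreateserial -days 365 -out etcd-client.crt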

The only other way I see doing this is if we spin up a separate etcd cluster using kubeadm (which would be containerized) rather than using the packages. https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/setup-ha-etcd-with-kubeadm/ <-- mind you, the method is not much better than creating a client cert using the puppet CA. That's a lot of steps and still requires copying certs.

I'm guessing based on your patch that it doesn't work without specifying the etcd client cert.

Wait, does it work without the client cert matching on every node (reading it more)? I was expecting it to add that client cert to a configuration map in kubernetes, and I was worried it wouldn't work if they were different. Maybe it doesn't matter because it just stores the location and all of them are valid individually :) That'd be awesome!

I went ahead to try and answer my own questions and noticed a problem. We have etcd cert authentication enabled, and that cert is the puppet cert of the etcd server we've spun up.
from /etc/default/etcd

ETCD_PEER_TRUSTED_CA_FILE="/etc/etcd/ssl/ca.pem"
ETCD_PEER_CERT_FILE="/etc/etcd/ssl/toolsbeta-test-k8s-etcd-1.toolsbeta.eqiad.wmflabs.pem"
ETCD_PEER_KEY_FILE="/etc/etcd/ssl/toolsbeta-test-k8s-etcd-1.toolsbeta.eqiad.wmflabs.priv"
ETCD_PEER_CLIENT_CERT_AUTH=true

That won't work since the other peers will have a different cert, if we use puppet certs.
I also see this:
ETCD_CLIENT_CERT_AUTH=false, which means the cert mentioned in the kubeadm config doesn't do a thing.

Wait, no. That'll work for the peer cert file as long as that ca.pem is the puppet one. Never mind :-p

I'm curious about testing how it functions if it doesn't use the client cert from kubernetes, though. Will poke at it a bit.
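
For reference, if we do later decide to turn on client cert verification on the etcd side, the client-facing half in /etc/default/etcd would presumably mirror the peer settings above (a sketch; exact filenames to confirm):

ETCD_TRUSTED_CA_FILE="/etc/etcd/ssl/ca.pem"
ETCD_CERT_FILE="/etc/etcd/ssl/toolsbeta-test-k8s-etcd-1.toolsbeta.eqiad.wmflabs.pem"
ETCD_KEY_FILE="/etc/etcd/ssl/toolsbeta-test-k8s-etcd-1.toolsbeta.eqiad.wmflabs.priv"
ETCD_CLIENT_CERT_AUTH=true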

Finally figured out how to query this version of etcd: ETCDCTL_API=3 etcdctl --endpoints https://toolsbeta-test-k8s-etcd-1.toolsbeta.eqiad.wmflabs:2379 get / --prefix --keys-only

Anyway, since etcd (in this state) isn't even checking the client cert, what I was curious about really doesn't matter, and I think the whole setup will scale fine. Maybe we should test it, though? It'll break using etcdctl with explicit endpoints outside of localhost. That said, I'm going to maybe put up a patch that allows localhost through the firewall. *if* we do decide to validate client certs (which is best practice), then localhost is the only place where etcdctl will work (unless that needs a cert too...which it might with that in place?)

Changed my mind on that last bit because you can specify certs with etcdctl :) No need to skip ssl whether we enable the verification or not.
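
For example, once verification is on, something like this should still work from another host (a sketch; the client cert/key paths are placeholders):

ETCDCTL_API=3 etcdctl --endpoints https://toolsbeta-test-k8s-etcd-1.toolsbeta.eqiad.wmflabs:2379 \
  --cacert /etc/etcd/ssl/ca.pem \
  --cert /path/to/etcd-client.pem \
  --key /path/to/etcd-client.priv \
  get / --prefix --keys-only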

Change 523220 merged by Bstorm:
[operations/puppet@production] toolforge: k8s: kubeadm: now using external etcd servers

https://gerrit.wikimedia.org/r/523220

Going to try to bootstrap the other two cluster nodes using the --upload-certs thing since I haven't tried it myself yet :)

It does *not* work as merged.

root@toolsbeta-test-k8s-master-2:~# kubeadm join toolsbeta-k8s-master.toolsbeta.wmflabs.org:6443  --token m7uakr.ern5lmlpv7gnkacw --control-plane --discovery-token-ca-cert-hash sha256:<hash> --certificate-key <key-from-upload-certs-command>
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[preflight] Running pre-flight checks before initializing the new control plane instance
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[download-certs] Downloading the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
error execution phase control-plane-prepare/download-certs: error downloading certs: the Secret does not include the required certificate or key - name: external-etcd.key, path: /etc/kubernetes/pki/puppet_toolsbeta-test-k8s-master-1.toolsbeta.eqiad.wmflabs.priv

The nodes would all need the same puppet cert. Since the etcd cluster isn't checking client certs anyway, we should attempt to remove the cert portions of the config. Lemme try that.

I added a command to quickly get the ca-cert-hash, btw, in the wiki page of notes.

Without it I see this problem with calico:

# kubectl logs calico-node-qk4kf --namespace=kube-system 
2019-07-15 21:49:12.134 [INFO][9] startup.go 256: Early log level set to info
2019-07-15 21:49:12.134 [INFO][9] startup.go 272: Using NODENAME environment for node name
2019-07-15 21:49:12.134 [INFO][9] startup.go 284: Determined node name: toolsbeta-test-k8s-master-1
2019-07-15 21:49:12.136 [INFO][9] k8s.go 228: Using Calico IPAM
2019-07-15 21:49:12.136 [INFO][9] startup.go 316: Checking datastore connection
2019-07-15 21:49:12.146 [WARNING][9] startup.go 328: Connection to the datastore is unauthorized
2019-07-15 21:49:12.146 [WARNING][9] startup.go 1057: Terminating
Calico node failed to start

That may be due to old config maps. I might try re-initializing etcd.

Reset etcd with ETCDCTL_API=3 etcdctl --endpoints https://toolsbeta-test-k8s-etcd-1.toolsbeta.eqiad.wmflabs:2379 del "" --from-key=true as normal user.

Victory! Now I'll try to join another control plane node.

# kubectl get nodes
NAME                          STATUS   ROLES    AGE     VERSION
toolsbeta-test-k8s-master-1   Ready    master   7m24s   v1.15.0
toolsbeta-test-k8s-master-2   Ready    master   3m46s   v1.15.0
toolsbeta-test-k8s-master-3   Ready    master   2m55s   v1.15.0

Pushing a patch. This works great.

Change 523328 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: kubeadm master nodes shouldn't use client certs for etcd

https://gerrit.wikimedia.org/r/523328

The alternative is, obviously, to use a client cert that is held in common by all the nodes (with each of their names on it) and turn on client cert checking. That cert can be made using the puppet cert generate command with the --dns_alt_names option including the names of all three master nodes. I tested that process in another project. It's kind of weird (puts the resulting files in /var/lib/puppet/ssl/server/ where it keeps the original master certs made during bootstrap, but it didn't seem to break anything where I tested it). I can't say I like it, but it might be good. I mean, with access to the puppetmaster, one can also just use the openssl CLI to make and sign a cert for this that will be trusted by etcd.
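
For the record, that would look roughly like this on the puppetmaster (hypothetical cert name; the alt names would be the three master FQDNs):

puppet cert generate toolsbeta-k8s-etcd-client \
  --dns_alt_names=toolsbeta-test-k8s-master-1.toolsbeta.eqiad.wmflabs,toolsbeta-test-k8s-master-2.toolsbeta.eqiad.wmflabs,toolsbeta-test-k8s-master-3.toolsbeta.eqiad.wmflabs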

NOTE: puppet is disabled on master-1 where I was livehacking--for when you try things in your morning. Feel free to re-enable and mess with things, of course. I didn't un-hack anything on the puppetmaster itself

Change 523328 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: kubeadm master nodes shouldn't use client certs for etcd

https://gerrit.wikimedia.org/r/523328

update, after merging the last patch, trying to check the lifecycle of the control plane:

root@toolsbeta-test-k8s-master-1:~# kubeadm init --config /etc/kubernetes/kubeadm-init.yaml --upload-certs
[...] ok
root@toolsbeta-test-k8s-master-1:~# cp /etc/kubernetes/admin.conf $HOME/.kube/config
root@toolsbeta-test-k8s-master-1:~# kubectl apply -f /etc/kubernetes/calico.yaml 
configmap/calico-config created

root@toolsbeta-test-k8s-master-2:~#   kubeadm join toolsbeta-k8s-master.toolsbeta.wmflabs.org:6443 --token <token from kubeadm-init.yaml> --discovery-token-ca-cert-hash sha256:<sha> --experimental-control-plane --certificate-key <sha>
[..] ok                               ^^^ this cmdline was actually generated to stdout by the first kubeadm init call

root@toolsbeta-test-k8s-master-1:~# kubeadm init phase upload-certs --upload-certs
[..] ok

root@toolsbeta-test-k8s-master-2:~# openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'
a3e44c3029260ea6163596085d7361b062be732b644fcddf4e4294c96c4ac4fc

root@toolsbeta-test-k8s-master-3:~# kubeadm join toolsbeta-k8s-master.toolsbeta.wmflabs.org:6443 --token <token from kubeadm-init.yaml> --discovery-token-ca-cert-hash sha256:<sha from previous step> --experimental-control-plane --certificate-key <sha from upload-certs>
Flag --experimental-control-plane has been deprecated, use --control-plane instead
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[preflight] Running pre-flight checks before initializing the new control plane instance
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[download-certs] Downloading the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
error execution phase control-plane-prepare/download-certs: error downloading certs: the Secret does not include the required certificate or key - name: external-etcd.crt, path: 

So it seems kubeadm is still somehow confused about etcd requiring client certs.

No, that seems more like etcd needs cleanup to me.

Change 523716 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: put the client certs back in for etcd

https://gerrit.wikimedia.org/r/523716

Change 523723 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge-etcd: enable client cert checking

https://gerrit.wikimedia.org/r/523723

Change 523726 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: Switch up using etcd client certs in k8s a little

https://gerrit.wikimedia.org/r/523726

Change 523726 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: Switch up using etcd client certs in k8s a little

https://gerrit.wikimedia.org/r/523726

Change 523746 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge-etcd: tell etcd to check client certs

https://gerrit.wikimedia.org/r/523746

Change 523723 abandoned by Bstorm:
toolforge-etcd: enable client cert checking

Reason:
Superseded by 523746

https://gerrit.wikimedia.org/r/523723

Change 523716 abandoned by Bstorm:
toolforge: put the client certs back in for etcd

Reason:
We found it works great to just use puppet certs with a couple changes

https://gerrit.wikimedia.org/r/523716

Change 523746 merged by Bstorm:
[operations/puppet@production] toolforge-etcd: tell etcd to check client certs

https://gerrit.wikimedia.org/r/523746

aborrero added a comment. Edited Jul 16 2019, 4:30 PM

We just tested the lifecycle again, and it seems to work:

root@toolsbeta-test-k8s-master-1:~# kubeadm init --config /etc/kubernetes/kubeadm-init.yaml --upload-certs
[...]
root@toolsbeta-test-k8s-master-1:~# cp /etc/kubernetes/admin.conf $HOME/.kube/config
root@toolsbeta-test-k8s-master-1:~# kubectl apply -f /etc/kubernetes/calico.yaml
[...]

For other control plane nodes:

root@toolsbeta-test-k8s-master-1:~# kubeadm --config /etc/kubernetes/kubeadm-init.yaml init phase upload-certs --upload-certs
[upload-certs] Storing the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[upload-certs] Using certificate key:
0e323a45a4212c78994e30f8f3b9a6f77a1b475e696e12e7bf5f7cbd72ea5871
root@toolsbeta-test-k8s-master-1:~# openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'
3637ded9d0ac4e45952214e43b3107055d090ea0c13a176c4607f907662034f1

root@toolsbeta-test-k8s-master-2:~# kubeadm join toolsbeta-k8s-master.toolsbeta.wmflabs.org:6443 --token m7uakr.ern5lmlpv7gnkacw --discovery-token-ca-cert-hash sha256:<openssl_output> --experimental-control-plane --certificate-key <upload_certs_output>
[...]

For worker nodes:

aborrero@toolsbeta-test-k8s-worker-1:~ $ sudo kubeadm join toolsbeta-k8s-master.toolsbeta.wmflabs.org:6443 --token m7uakr.ern5lmlpv7gnkacw --discovery-token-ca-cert-hash sha256:<openssl_output>

Note that:

  • deleting a node requires kubectl delete node <nodename> (in the case of VM deletion); adding a node requires the steps outlined above (see the removal sketch after this list).
  • we use puppet certs for the etcd client connection
  • we enforce client certs on etcd server side
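
A sketch of the removal side with standard commands (drain flags may need adjusting for our daemonsets):

# from any master: move workloads off, then remove the node from the API
kubectl drain toolsbeta-test-k8s-worker-1 --ignore-daemonsets --delete-local-data
kubectl delete node toolsbeta-test-k8s-worker-1
# on the node itself, if it still exists: wipe the kubeadm state
kubeadm reset -f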

I went ahead and tried this:

root@toolsbeta-test-k8s-master-1:~# kubeadm upgrade plan
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade] Fetching available versions to upgrade to
[upgrade/versions] Cluster version: v1.15.0
[upgrade/versions] kubeadm version: v1.15.0
[upgrade/versions] Latest stable version: v1.15.0
[upgrade/versions] Latest version in the v1.15 series: v1.15.0

Awesome, you're up-to-date! Enjoy!

So basically, we are still at the latest. The docs say kubeadm can be used to downgrade, but they provide no guidance and the tooling seems...not so good for that. If we want to test upgrading for whatever reason, which seems like a much more straightforward process than most of what we've done, we'd need to deploy a cluster with v1.14.4, then upgrade to v1.15.0. Kubeadm upgrade behaves differently in the 1.15 series, though (it refreshes all node certs as it upgrades), so that test would not necessarily predict how future upgrades will behave. I suspect we may be better off trying out upgrades in beta when a new release happens (1.15.1).

I say that partly because we have a lot of work to do to get this "toolforge ready" now that we've got a handle on a process for kubeadm itself.

Mentioned in SAL (#wikimedia-cloud) [2019-07-17T09:13:42Z] <arturo> create VM toolsbeta-test-k8s-master-4 (Debian Buster) T215531

Mentioned in SAL (#wikimedia-cloud) [2019-07-17T09:51:30Z] <arturo> re-create VM toolsbeta-test-k8s-worker-1 as Debian Buster T215531

Change 524281 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: include the kubeadm_docker_service

https://gerrit.wikimedia.org/r/524281

Change 524281 merged by Bstorm:
[operations/puppet@production] toolforge: include the kubeadm_docker_service

https://gerrit.wikimedia.org/r/524281

Ok, the cluster is now using PSP on init, and it works fine. I have no idea what caused our problem before, but a clean rebuild works great.

Since this works perfectly now (for whatever reason--I have theories that don't ultimately matter much), the final form of the build process looks like this:

We just tested the lifecycle again, and it seems to work:

root@toolsbeta-test-k8s-master-1:~# kubeadm init --config /etc/kubernetes/kubeadm-init.yaml --upload-certs
[...]
root@toolsbeta-test-k8s-master-1:~# cp /etc/kubernetes/admin.conf $HOME/.kube/config

Right here, before calico, you need to run:

kubectl apply -f /etc/kubernetes/kubeadm-system-psp.yaml

That will bring the admin pods online and allow calico to spin up as well. No other pods will be permitted unless they are in kube-system until we add another manifest to handle the toolforge pods. That's the topic of T227290, though.

root@toolsbeta-test-k8s-master-1:~# kubectl apply -f /etc/kubernetes/calico.yaml
[...]

For other control plane nodes:
root@toolsbeta-test-k8s-master-1:~# kubeadm --config /etc/kubernetes/kubeadm-init.yaml init phase upload-certs --upload-certs
[upload-certs] Storing the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[upload-certs] Using certificate key:
0e323a45a4212c78994e30f8f3b9a6f77a1b475e696e12e7bf5f7cbd72ea5871
root@toolsbeta-test-k8s-master-1:~# openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'
3637ded9d0ac4e45952214e43b3107055d090ea0c13a176c4607f907662034f1
root@toolsbeta-test-k8s-master-2:~# kubeadm join toolsbeta-k8s-master.toolsbeta.wmflabs.org:6443 --token m7uakr.ern5lmlpv7gnkacw --discovery-token-ca-cert-hash sha256:<openssl_output> --experimental-control-plane --certificate-key <upload_certs_output>
[...]

For worker nodes:

aborrero@toolsbeta-test-k8s-worker-1:~ $ sudo kubeadm join toolsbeta-k8s-master.toolsbeta.wmflabs.org:6443 --token m7uakr.ern5lmlpv7gnkacw --discovery-token-ca-cert-hash sha256:<openssl_output>

Note that:

  • deleting a node requires kubectl delete node <nodename> (in the case of VM deletion); adding a node requires the steps outlined above.
  • we use puppet certs for the etcd client connection
  • we enforce client certs on etcd server side

Huge progress :)

Change 524310 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: remove class redeclaration

https://gerrit.wikimedia.org/r/524310

Bstorm added a comment. Edited Thu, Jul 18, 7:35 PM

To explain this patch and the one where I changed the docker service class:
The docker service class being left out of master since it was easy to forget. I made it an include at the module level (to make the module functional and internally consistent) instead of declaring it in class context in the profile. Separating it out like that is how we manage roles to keep them flexible (which I get), but doing it at the module level makes modules require unusual quirks and insider knowledge just to make them work. Modules are developed elsewhere with a primary init.pp gateway that accepts all options, with most else configured by that interface. I'm fine not using the init pattern in modules, but I'd rather not make it more confusing as well by splitting it out too much.

I'm open to discussion, but I am changing the node profile so that it will work (what I did broke the node profile...but not the master one because it was forgotten there). That's just so it isn't left in a broken state because of how I changed it. I caught the missing material because of warnings during the init preflight phase about the docker config being missing. --So you don't think I'm just being picky or weird about it @aborrero :)

Change 524310 merged by Bstorm:
[operations/puppet@production] toolforge: remove class redeclaration

https://gerrit.wikimedia.org/r/524310

ok, works for me :-)

Change 525112 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: k8s: kubadm: calico requires ipset

https://gerrit.wikimedia.org/r/525112

Change 525112 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: k8s: kubadm: calico requires ipset

https://gerrit.wikimedia.org/r/525112

Change 525339 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: set kubeadm to use internal registry for pause container

https://gerrit.wikimedia.org/r/525339

Mentioned in SAL (#wikimedia-cloud) [2019-07-24T20:48:19Z] <bstorm_> rebuilt toolsbeta-test cluster with the internal version of the pause container T228887 T215531

Change 525339 merged by Bstorm:
[operations/puppet@production] toolforge: set kubeadm to use internal registry for pause container

https://gerrit.wikimedia.org/r/525339

Change 525434 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: add internal pause container to all the other kubelets

https://gerrit.wikimedia.org/r/525434

Change 525434 merged by Bstorm:
[operations/puppet@production] toolforge: add internal pause container to all the other kubelets

https://gerrit.wikimedia.org/r/525434

Change 525436 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: fix typo kubelet file content

https://gerrit.wikimedia.org/r/525436

In the end this works; however, only the init config (and presumably a join config file) accepts the new pause container gracefully. The other control plane nodes (which cannot use a config) require the setting to be appended to the end of the kubelet arguments. Luckily, later options override earlier ones, so as soon as the node reboots (or docker & kubelet restart), it works regardless of having two conflicting CLI args on the kubelet command. This works, though, and it is consistent. The only design difference we could make in the future might be to use a join config for non-control-plane nodes.
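
For context, the kubelet flag in question is presumably --pod-infra-container-image, so the appended bit would look something like this (the registry name, tag, and file path are placeholders, not the exact values we deploy):

# e.g. appended via KUBELET_EXTRA_ARGS in /etc/default/kubelet
KUBELET_EXTRA_ARGS="--pod-infra-container-image=docker-registry.tools.wmflabs.org/pause:3.1"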

Change 525436 merged by Bstorm:
[operations/puppet@production] toolforge: fix typo kubelet file content

https://gerrit.wikimedia.org/r/525436

Ok, great news, we can try a kubeadm upgrade now.

# kubeadm upgrade plan
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade] Fetching available versions to upgrade to
[upgrade/versions] Cluster version: v1.15.0
[upgrade/versions] kubeadm version: v1.15.0
[upgrade/versions] Latest stable version: v1.15.1
[upgrade/versions] Latest version in the v1.15 series: v1.15.1

External components that should be upgraded manually before you upgrade the control plane with 'kubeadm upgrade apply':
COMPONENT   CURRENT   AVAILABLE
Etcd        3.2.26    3.3.10

Components that must be upgraded manually after you have upgraded the control plane with 'kubeadm upgrade apply':
COMPONENT   CURRENT       AVAILABLE
Kubelet     5 x v1.15.0   v1.15.1

Upgrade to the latest version in the v1.15 series:

COMPONENT            CURRENT   AVAILABLE
API Server           v1.15.0   v1.15.1
Controller Manager   v1.15.0   v1.15.1
Scheduler            v1.15.0   v1.15.1
Kube Proxy           v1.15.0   v1.15.1
CoreDNS              1.3.1     1.3.1

You can now apply the upgrade by executing the following command:

        kubeadm upgrade apply v1.15.1

Note: Before you can perform this upgrade, you have to update kubeadm to v1.15.1.

_____________________________________________________________________

We should not be required to upgrade etcd, but it will probably tell us about it any time we do this. Since this is a great testing opportunity, I'm running it.

Interestingly (but not surprisingly), it asks that first we upgrade kubeadm.

# kubeadm upgrade apply v1.15.1
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/version] You have chosen to change the cluster version to "v1.15.1"
[upgrade/versions] Cluster version: v1.15.0
[upgrade/versions] kubeadm version: v1.15.0
[upgrade/version] FATAL: the --version argument is invalid due to these errors:

        - Specified version to upgrade to "v1.15.1" is higher than the kubeadm version "v1.15.0". Upgrade kubeadm first using the tool you used to install kubeadm

Can be bypassed if you pass the --force flag

To test the upgrade, first we'll have to update that from upstream (though it might work with --force). As is, this will still install kubernetes 1.15.0 on kubeadm init because of our config, even if we update kubeadm.

@aborrero if you are bored with fighting with the ingress for a bit and want to test this, we just have to update our repo from upstream...however that is done :) I presume that isn't terribly hard? It's not a requirement for this whole thing, but it would be very good to know how "bad" upgrades will be.

Mentioned in SAL (#wikimedia-operations) [2019-07-25T11:03:19Z] <arturo> update stretch-wikimedia/thirdparty/kubeadm-k8s on install1002 for T215531 (kubeadm 1.15.1)

@Bstorm here you go:

aborrero@toolsbeta-test-k8s-master-1:~$ apt-cache policy kubeadm
kubeadm:
  Installed: 1.15.0-00
  Candidate: 1.15.1-00
  Version table:
     1.15.1-00 1001
       1001 http://apt.wikimedia.org/wikimedia stretch-wikimedia/thirdparty/kubeadm-k8s amd64 Packages
 *** 1.15.0-00 100
        100 /var/lib/dpkg/status

Just recording the process as I go here:

# apt install kubeadm
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following packages were automatically installed and are no longer required:
  libopts25 libpcsclite1 python3-debconf
Use 'apt autoremove' to remove them.
The following additional packages will be installed:
  cri-tools
The following packages will be upgraded:
  cri-tools kubeadm
2 upgraded, 0 newly installed, 0 to remove and 7 not upgraded.
Need to get 17.0 MB of archives.
After this operation, 2,250 kB disk space will be freed.
Do you want to continue? [Y/n] 
Get:1 http://apt.wikimedia.org/wikimedia stretch-wikimedia/thirdparty/kubeadm-k8s amd64 cri-tools amd64 1.13.0-00 [8,776 kB]
Get:2 http://apt.wikimedia.org/wikimedia stretch-wikimedia/thirdparty/kubeadm-k8s amd64 kubeadm amd64 1.15.1-00 [8,247 kB]
Fetched 17.0 MB in 1s (32.4 MB/s)
(Reading database ... 57148 files and directories currently installed.)
Preparing to unpack .../cri-tools_1.13.0-00_amd64.deb ...
Unpacking cri-tools (1.13.0-00) over (1.12.0-00) ...
Preparing to unpack .../kubeadm_1.15.1-00_amd64.deb ...
Unpacking kubeadm (1.15.1-00) over (1.15.0-00) ...
Setting up cri-tools (1.13.0-00) ...
Setting up kubeadm (1.15.1-00) ...
# kubeadm upgrade apply v1.15.1
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/version] You have chosen to change the cluster version to "v1.15.1"
[upgrade/versions] Cluster version: v1.15.0
[upgrade/versions] kubeadm version: v1.15.1
[upgrade/confirm] Are you sure you want to proceed with the upgrade? [y/N]:

And I confirmed:

[upgrade/confirm] Are you sure you want to proceed with the upgrade? [y/N]: y
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler]
[upgrade/prepull] Prepulling image for component kube-scheduler.
[upgrade/prepull] Prepulling image for component kube-apiserver.
[upgrade/prepull] Prepulling image for component kube-controller-manager.
[apiclient] Found 0 Pods for label selector k8s-app=upgrade-prepull-kube-controller-manager
[apiclient] Found 3 Pods for label selector k8s-app=upgrade-prepull-kube-apiserver
[apiclient] Found 0 Pods for label selector k8s-app=upgrade-prepull-kube-scheduler
[apiclient] Found 3 Pods for label selector k8s-app=upgrade-prepull-kube-controller-manager
[apiclient] Found 3 Pods for label selector k8s-app=upgrade-prepull-kube-scheduler
[upgrade/prepull] Prepulled image for component kube-scheduler.
[upgrade/prepull] Prepulled image for component kube-controller-manager.
[upgrade/prepull] Prepulled image for component kube-apiserver.
[upgrade/prepull] Successfully prepulled the images for all the control plane components
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.15.1"...
Static pod: kube-apiserver-toolsbeta-test-k8s-master-1 hash: e7a689bf231e30af59efcb56690b440d
Static pod: kube-controller-manager-toolsbeta-test-k8s-master-1 hash: 389fff2e2e6c803f828653a4f18c838f
Static pod: kube-scheduler-toolsbeta-test-k8s-master-1 hash: 31d9ee8b7fb12e797dc981a8686f6b2b
[upgrade/staticpods] Writing new Static Pod manifests to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests422342376"
[upgrade/staticpods] Preparing for "kube-apiserver" upgrade
[upgrade/staticpods] Renewing apiserver certificate
[upgrade/staticpods] Renewing apiserver-kubelet-client certificate
[upgrade/staticpods] Renewing front-proxy-client certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-apiserver.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2019-07-25-14-47-08/kube-apiserver.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
Static pod: kube-apiserver-toolsbeta-test-k8s-master-1 hash: e7a689bf231e30af59efcb56690b440d
Static pod: kube-apiserver-toolsbeta-test-k8s-master-1 hash: 81e3015017da0b319ec4e8fce4116aae
[apiclient] Found 3 Pods for label selector component=kube-apiserver
[upgrade/staticpods] Component "kube-apiserver" upgraded successfully!
[upgrade/staticpods] Preparing for "kube-controller-manager" upgrade
[upgrade/staticpods] Renewing controller-manager.conf certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-controller-manager.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2019-07-25-14-47-08/kube-controller-manager.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
Static pod: kube-controller-manager-toolsbeta-test-k8s-master-1 hash: 389fff2e2e6c803f828653a4f18c838f
Static pod: kube-controller-manager-toolsbeta-test-k8s-master-1 hash: 645e7a8519364c082c136bba3c26849b
[apiclient] Found 3 Pods for label selector component=kube-controller-manager
[upgrade/staticpods] Component "kube-controller-manager" upgraded successfully!
[upgrade/staticpods] Preparing for "kube-scheduler" upgrade
[upgrade/staticpods] Renewing scheduler.conf certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-scheduler.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2019-07-25-14-47-08/kube-scheduler.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
Static pod: kube-scheduler-toolsbeta-test-k8s-master-1 hash: 31d9ee8b7fb12e797dc981a8686f6b2b
Static pod: kube-scheduler-toolsbeta-test-k8s-master-1 hash: ecae9d12d3610192347be3d1aa5aa552
[apiclient] Found 3 Pods for label selector component=kube-scheduler
[upgrade/staticpods] Component "kube-scheduler" upgraded successfully!
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[kubelet] Creating a ConfigMap "kubelet-config-1.15" in namespace kube-system with the configuration for the kubelets in the cluster
[kubelet-start] Downloading configuration for the kubelet from the "kubelet-config-1.15" ConfigMap in the kube-system namespace
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[bootstrap-token] configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstrap-token] configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstrap-token] configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[addons] Applied essential addon: CoreDNS
[addons] Applied essential addon: kube-proxy

[upgrade/successful] SUCCESS! Your cluster was upgraded to "v1.15.1". Enjoy!

[upgrade/kubelet] Now that your control plane is upgraded, please proceed with upgrading your kubelets if you haven't already done so.
root@toolsbeta-test-k8s-master-1:~#

After that, the kubelets are, of course, not yet upgraded:

# kubectl get nodes -o wide
NAME                          STATUS   ROLES    AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION   CONTAINER-RUNTIME
toolsbeta-test-k8s-master-1   Ready    master   18h   v1.15.0   172.16.2.223   <none>        Debian GNU/Linux 10 (buster)   4.19.0-5-amd64   docker://18.9.7
toolsbeta-test-k8s-master-2   Ready    master   18h   v1.15.0   172.16.2.225   <none>        Debian GNU/Linux 10 (buster)   4.19.0-5-amd64   docker://18.9.7
toolsbeta-test-k8s-master-3   Ready    master   17h   v1.15.0   172.16.2.233   <none>        Debian GNU/Linux 10 (buster)   4.19.0-5-amd64   docker://18.9.7
toolsbeta-test-k8s-worker-1   Ready    <none>   18h   v1.15.0   172.16.2.227   <none>        Debian GNU/Linux 10 (buster)   4.19.0-5-amd64   docker://18.9.7
toolsbeta-test-k8s-worker-2   Ready    <none>   18h   v1.15.0   172.16.2.231   <none>        Debian GNU/Linux 10 (buster)   4.19.0-5-amd64   docker://18.9.7

And the effect of it:

root@toolsbeta-test-k8s-master-1:~# kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:40:16Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T09:09:21Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
root@toolsbeta-test-k8s-master-1:~# kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:40:16Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:32:14Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
root@toolsbeta-test-k8s-master-1:~# kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:40:16Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:32:14Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}

So we can see that it only updated the control plane node that it ran on.

It is necessary and documented for HA clusters that you must go to the other nodes directly to run the following:

root@toolsbeta-test-k8s-master-2:~# kubeadm upgrade node 
[upgrade] Reading configuration from the cluster...
[upgrade] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade] Upgrading your Static Pod-hosted control plane instance to version "v1.15.1"...
Static pod: kube-apiserver-toolsbeta-test-k8s-master-2 hash: 7c5b672d7da21ab872a88c8feec039ea
Static pod: kube-controller-manager-toolsbeta-test-k8s-master-2 hash: 389fff2e2e6c803f828653a4f18c838f
Static pod: kube-scheduler-toolsbeta-test-k8s-master-2 hash: 31d9ee8b7fb12e797dc981a8686f6b2b
[upgrade/staticpods] Writing new Static Pod manifests to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests191401975"
[upgrade/staticpods] Preparing for "kube-apiserver" upgrade
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-apiserver.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2019-07-25-15-07-48/kube-apiserver.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
Static pod: kube-apiserver-toolsbeta-test-k8s-master-2 hash: 7c5b672d7da21ab872a88c8feec039ea
Static pod: kube-apiserver-toolsbeta-test-k8s-master-2 hash: 17c3be5ae16d141c9a5708dfc1a87b8e
[apiclient] Found 3 Pods for label selector component=kube-apiserver
[upgrade/staticpods] Component "kube-apiserver" upgraded successfully!
[upgrade/staticpods] Preparing for "kube-controller-manager" upgrade
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-controller-manager.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2019-07-25-15-07-48/kube-controller-manager.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
Static pod: kube-controller-manager-toolsbeta-test-k8s-master-2 hash: 389fff2e2e6c803f828653a4f18c838f
Static pod: kube-controller-manager-toolsbeta-test-k8s-master-2 hash: 645e7a8519364c082c136bba3c26849b
[apiclient] Found 3 Pods for label selector component=kube-controller-manager
[upgrade/staticpods] Component "kube-controller-manager" upgraded successfully!
[upgrade/staticpods] Preparing for "kube-scheduler" upgrade
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-scheduler.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2019-07-25-15-07-48/kube-scheduler.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
Static pod: kube-scheduler-toolsbeta-test-k8s-master-2 hash: 31d9ee8b7fb12e797dc981a8686f6b2b
Static pod: kube-scheduler-toolsbeta-test-k8s-master-2 hash: ecae9d12d3610192347be3d1aa5aa552
[apiclient] Found 3 Pods for label selector component=kube-scheduler
[upgrade/staticpods] Component "kube-scheduler" upgraded successfully!
[upgrade] The control plane instance for this node was successfully updated!
[kubelet-start] Downloading configuration for the kubelet from the "kubelet-config-1.15" ConfigMap in the kube-system namespace
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[upgrade] The configuration for this node was successfully updated!
[upgrade] Now you should go ahead and upgrade the kubelet package using your package manager.

Note that you no longer need to specify "control-plane" or "experimental-control-plane" because that is a phase of the command by default in version 1.15+. If there are control plane pods, it upgrades them.

Now upgrading the package side of things in general on the control plane nodes one at a time. This brings up an interesting point. We should pin or hold the packages at a particular version until we are ready to upgrade in the future, possibly keying off the value from our kubeadm config to set those things.

If the specific packages that are in our repo are manually controlled, perhaps there's no need to mess with it in puppet/apt, though 😁
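
If we do want a belt-and-braces hold on the nodes themselves, the simplest form is probably just (a sketch):

# freeze the k8s packages until we deliberately upgrade
apt-mark hold kubelet kubeadm kubectl
# and when we are ready:
apt-mark unhold kubelet kubeadm kubectl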

Change 525569 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: Update the version string to match our software

https://gerrit.wikimedia.org/r/525569

root@toolsbeta-test-k8s-master-1:~# kubectl get nodes
NAME                          STATUS   ROLES    AGE   VERSION
toolsbeta-test-k8s-master-1   Ready    master   19h   v1.15.1
toolsbeta-test-k8s-master-2   Ready    master   19h   v1.15.1
toolsbeta-test-k8s-master-3   Ready    master   18h   v1.15.1
toolsbeta-test-k8s-worker-1   Ready    <none>   18h   v1.15.0
toolsbeta-test-k8s-worker-2   Ready    <none>   18h   v1.15.0
root@toolsbeta-test-k8s-master-1:~# kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T09:18:22Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T09:09:21Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
root@toolsbeta-test-k8s-master-1:~# kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T09:18:22Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T09:09:21Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
root@toolsbeta-test-k8s-master-1:~# kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T09:18:22Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T09:09:21Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}

After that, it's just kubelet upgrades for the worker nodes. That should be done with drains to minimize disruption. Overall, that makes for a procedure we can document. Naturally, the process for upgrading between major versions is more involved, but the documented upgrades in the official docs are remarkably similar to this procedure, which is good to see.
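
A sketch of the per-worker sequence with standard commands (flags to taste):

# from a master: move workloads off the node
kubectl drain toolsbeta-test-k8s-worker-1 --ignore-daemonsets --delete-local-data
# on the worker: upgrade and restart the kubelet
apt-get install kubelet
systemctl daemon-reload
systemctl restart kubelet
# from a master: allow scheduling on it again
kubectl uncordon toolsbeta-test-k8s-worker-1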

Change 525569 merged by Bstorm:
[operations/puppet@production] toolforge: Update the version string to match our software

https://gerrit.wikimedia.org/r/525569

Funny thing: a lot of what is fixed in 1.15.1 is exactly the stuff that annoyed us about etcd and kubeadm for an HA stacked control plane: https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.15.md#changelog-since-v1150

One notable thing about the upgrade process as well: it rotates the certificates so they don't expire. Renewing all the certs is an often cited issue. If we keep up with upgrades, we honestly will never have to worry about it.
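
Related: if I remember right, 1.15 also added a helper to see when the current certs actually expire, which would be a handy way to confirm the rotation; something like:

kubeadm alpha certs check-expiration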


This approach may not work well once we have N clusters (toolsbeta, tools, anything in codfw that we might add for additional testing) and want to practice an upgrade on clusterA without needing to freeze apt upgrades or capacity expansion in clustersB...N. As long as we are using an apt repo with support for multiple versions of the same package (I think aptly has this restriction?), then pinning or explicit versioning in the Puppet manifests should let us run version n+1 in a test cluster without breaking the use of version n in other clusters.

Bstorm added a comment. Edited Thu, Jul 25, 7:58 PM

This is true. We are using reprepro, not aptly for packages. I have no idea if we can support multiple package versions in that. The kubernetes API version will not upgrade until told to via kubeadm, but the kubelet must be upgraded by hand (which is what the pinning affects--and the updates are not done by puppet though a new node build would be affected by package changes). As is, we have the version as a configurable field that can be hiera'd for kubeadm init. After init, it makes no difference unless we then also use it to manage the package versions (and kubelet version isn't managed by kubeadm).

Overall, it boils down to the question: is it possible to have multiple versions in reprepro or not?


Yes! it is possible :-)

We have several ways of doing it, but the easiest, I would say, is to just create versioned repo components.

Currently we have:

  • stretch-wikimedia/thirdparty/kubeadm-k8s

We could move to:

  • stretch-wikimedia/thirdparty/kubeadm-k8s-1.15

Anyway I suggest we create another task to discuss the details.

Change 519375 abandoned by Arturo Borrero Gonzalez:
k8s: kubelet: stop requiring ::k8s::infrastructure_config

Reason:
Not following this approach anymore.

https://gerrit.wikimedia.org/r/519375