Update staging-codfw to k8s 1.23
Closed, Resolved (Public)

Description

Update staging-codfw

sudo cookbook sre.hosts.downtime -r 'Reinitialize staging-codfw with k8s 1.23' -t T326340 -H 24 'A:wikikube-staging-etcd-codfw or A:wikikube-staging-master-codfw or A:wikikube-staging-worker-codfw'

sudo cumin 'A:wikikube-staging-etcd-codfw or A:wikikube-staging-master-codfw or A:wikikube-staging-worker-codfw' "disable-puppet 'Reinitialize staging-codfw with k8s 1.23 - T326340 - ${USER}'"
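Both commands above repeat the same host selection. A minimal sketch, assuming you want to build that query once and reuse it; `cumin_alias_query` and `STAGING_CODFW_QUERY` are hypothetical names, not part of cumin or the cookbooks:

```shell
# Hypothetical helper: join Cumin alias names into a single "A:x or A:y" query.
cumin_alias_query() {
    local query="" name
    for name in "$@"; do
        if [ -n "$query" ]; then
            query="$query or "
        fi
        query="${query}A:${name}"
    done
    printf '%s\n' "$query"
}

# The three aliases used in this task.
STAGING_CODFW_QUERY="$(cumin_alias_query \
    wikikube-staging-etcd-codfw \
    wikikube-staging-master-codfw \
    wikikube-staging-worker-codfw)"
```

The result can then be passed to both commands, for example: sudo cumin "$STAGING_CODFW_QUERY" "disable-puppet '...'"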

Reimage etcd hosts

Change dhcp pxe config to bullseye: https://gerrit.wikimedia.org/r/c/operations/puppet/+/878047

sudo cookbook -c spicerack_config.yaml sre.ganeti.reimage --no-downtime --os bullseye kubestagetcd2001
sudo cookbook -c spicerack_config.yaml sre.ganeti.reimage --no-downtime --os bullseye kubestagetcd2002
sudo cookbook -c spicerack_config.yaml sre.ganeti.reimage --no-downtime --os bullseye kubestagetcd2003

Note: etcd needed a manual restart (at least on kubestagetcd2003) to pick up its certificates.
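The three reimage invocations differ only in the hostname. As a sketch, the list can be generated rather than typed; `print_etcd_reimage_commands` is a hypothetical helper that only prints the commands so they can be reviewed and run one at a time:

```shell
# Hypothetical helper: print (do not run) one ganeti reimage command per host.
print_etcd_reimage_commands() {
    local host
    for host in "$@"; do
        printf 'sudo cookbook -c spicerack_config.yaml sre.ganeti.reimage --no-downtime --os bullseye %s\n' "$host"
    done
}
```

Usage: print_etcd_reimage_commands kubestagetcd2001 kubestagetcd2002 kubestagetcd2003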

Check that etcd is healthy via both the v2 and v3 APIs:
etcdctl -C https://$(hostname -f):2379 cluster-health
ETCDCTL_API=3 etcdctl --endpoints https://$(hostname -f):2379 member list
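To turn the v2 check into a pass/fail result, the output can be piped through a small filter. This is a sketch under the assumption that `etcdctl cluster-health` ends its output with the line "cluster is healthy" when all is well; `etcd_v2_cluster_healthy` is a hypothetical helper name:

```shell
# Hypothetical helper: read `etcdctl cluster-health` output on stdin and
# succeed only if the final line is exactly "cluster is healthy".
# Assumption: that is the summary line the v2 etcdctl prints on success.
etcd_v2_cluster_healthy() {
    tail -n 1 | grep -qx 'cluster is healthy'
}
```

Usage: etcdctl -C https://$(hostname -f):2379 cluster-health | etcd_v2_cluster_healthy && echo OK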

Reimage master & nodes

https://gerrit.wikimedia.org/r/877990

sudo cookbook -c spicerack_config.yaml sre.ganeti.reimage --no-downtime --os bullseye kubestagemaster2001
sudo cookbook sre.hosts.reimage --os bullseye --no-downtime kubestage2001
sudo cookbook sre.hosts.reimage --os bullseye --no-downtime kubestage2002

In-cluster components

https://gerrit.wikimedia.org/r/868389
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/878190
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/878184 (no longer required; included in the new coredns chart version)

# Label master(s)
kube_env admin staging-codfw
kubectl label nodes kubestagemaster2001.codfw.wmnet node-role.kubernetes.io/master=""

helmfile -e staging-codfw -l name=rbac-rules -i apply
helmfile -e staging-codfw -l name=pod-security-policies -i apply
helmfile -e staging-codfw -l name=namespaces -i apply
helmfile -e staging-codfw -l name=calico-crds -i apply
helmfile -e staging-codfw -l name=calico -i apply
kubectl -n kube-system delete svc calico-typha # it had blocked the ip reserved for CoreDNS
helmfile -e staging-codfw -l name=coredns -i apply
helmfile -e staging-codfw -l name=calico -i apply # to get calico-typha service back, coredns should probably go before calico
helmfile -e staging-codfw -l name=istio-gateways-networkpolicies -i apply
istioctl-1.15.3 manifest apply -f /srv/deployment-charts/custom_deploy.d/istio/main/config_k8s_1.23.yaml
helmfile -e staging-codfw -l name=eventrouter -i apply
helmfile -e staging-codfw -l name=cert-manager-networkpolicies -i apply
helmfile -e staging-codfw -l name=cert-manager -i apply
helmfile -e staging-codfw -l name=cfssl-issuer-crds -i apply
helmfile -e staging-codfw -l name=cfssl-issuer -i apply
helmfile -e staging-codfw -i apply
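The comments above suggest that coredns should probably be applied before calico, which would avoid deleting and re-creating the calico-typha service. The following sketch prints the per-release commands in that reordered sequence. The reordering is an untested assumption taken from that comment, and the function names are hypothetical. The istioctl step and the final unfiltered `helmfile -e staging-codfw -i apply` are not included and would still need to be run as listed above.

```shell
# Hypothetical release order: same releases as above, but with coredns moved
# ahead of calico (untested assumption based on the comment in the task).
admin_release_order() {
    cat <<'EOF'
rbac-rules
pod-security-policies
namespaces
calico-crds
coredns
calico
istio-gateways-networkpolicies
eventrouter
cert-manager-networkpolicies
cert-manager
cfssl-issuer-crds
cfssl-issuer
EOF
}

# Print (do not run) one helmfile command per release for the given environment.
print_helmfile_commands() {
    local env_name="$1" release
    admin_release_order | while read -r release; do
        printf 'helmfile -e %s -l name=%s -i apply\n' "$env_name" "$release"
    done
}
```

Usage: print_helmfile_commands staging-codfw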

Todos:

Alerts that were still firing

Event Timeline

Change 877990 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] k8s: Update staging-codfw to kubernetes 1.23

https://gerrit.wikimedia.org/r/877990

Icinga downtime and Alertmanager silence (ID=eff8a645-166c-412e-8f27-b7169d6aa830) set by jayme@cumin1001 for 1 day, 0:00:00 on 6 host(s) and their services with reason: Reinitialize staging-codfw with k8s 1.23

kubestage[2001-2002].codfw.wmnet,kubestagemaster2001.codfw.wmnet,kubestagetcd[2001-2003].codfw.wmnet

Cookbook cookbooks.sre.ganeti.reimage was started by jayme@cumin1001 for host kubestagetcd2001.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.ganeti.reimage started by jayme@cumin1001 for host kubestagetcd2001.codfw.wmnet with OS bullseye executed with errors:

  • kubestagetcd2001 (FAIL)
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Change 878047 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] install_server: Update kubestagetcd2* to bullseye

https://gerrit.wikimedia.org/r/878047

Change 878047 merged by JMeybohm:

[operations/puppet@production] install_server: Update kubestagetcd2* to bullseye

https://gerrit.wikimedia.org/r/878047

Change 877990 merged by JMeybohm:

[operations/puppet@production] k8s: Update staging-codfw to kubernetes 1.23

https://gerrit.wikimedia.org/r/877990

Change 878184 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] staging-codfw: Update coredns to 1.8.7-1

https://gerrit.wikimedia.org/r/878184

Change 878184 merged by jenkins-bot:

[operations/deployment-charts@master] staging-codfw: Update coredns to 1.8.7-1

https://gerrit.wikimedia.org/r/878184

Change 878190 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Add istio config for main/wikikube clusters on k8s 1.23

https://gerrit.wikimedia.org/r/878190

Change 878752 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] cert-manager: Set leader election namespace to cert-manager

https://gerrit.wikimedia.org/r/878752

JMeybohm updated the task description.
JMeybohm added a subscriber: elukey.

Change 878752 abandoned by JMeybohm:

[operations/deployment-charts@master] cert-manager: Set leader election namespace to cert-manager

Reason:

https://gerrit.wikimedia.org/r/878752

Change 878190 merged by jenkins-bot:

[operations/deployment-charts@master] Add istio config for main/wikikube clusters on k8s 1.23

https://gerrit.wikimedia.org/r/878190

Change 879063 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] staging-codfw: Unpin eventrouter, helm-state-metrics, coredns

https://gerrit.wikimedia.org/r/879063

Change 879063 merged by jenkins-bot:

[operations/deployment-charts@master] staging-codfw: Unpin eventrouter, helm-state-metrics, coredns

https://gerrit.wikimedia.org/r/879063

Change 879112 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] admin_ng: Don't pin image version of coredns

https://gerrit.wikimedia.org/r/879112

Change 879112 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Don't pin image version of coredns

https://gerrit.wikimedia.org/r/879112

JMeybohm updated the task description.
JMeybohm updated the task description.

Moved the rest of the open action items to T328291: Post Kubernetes v1.23 cleanup.
I have not seen the "failed to update managedFields" error again. It seems to happen only once, during the initial join of the node.