- Ensure PKI intermediates have been created
- Downtime: etcd, master, nodes
- Reimage etcd nodes with bullseye
- Merge hiera changes for 1.23: https://gerrit.wikimedia.org/r/c/operations/puppet/+/877990/2
- Reimage master
- Reimage nodes
- Verify basic k8s stuff working (nodes joining the cluster)
- Merge deployment-charts changes for 1.23: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/868389
- Deploy admin_ng & istio
- Deploy services (only miscweb so far); see the sketch after this list
- Lift downtimes
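For the service deployment step above, a rough sketch of how a single service (miscweb) would be applied from the deployment server; the path and environment name are assumptions based on the usual deployment-charts layout:
cd /srv/deployment-charts/helmfile.d/services/miscweb
helmfile -e staging-codfw -i apply  # adjust the environment name to whatever the service's helmfile.yaml defines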
Update staging-codfw
sudo cookbook sre.hosts.downtime -r 'Reinitialize staging-codfw with k8s 1.23' -t T326340 -H 24 'A:wikikube-staging-etcd-codfw or A:wikikube-staging-master-codfw or A:wikikube-staging-worker-codfw'
sudo cumin 'A:wikikube-staging-etcd-codfw or A:wikikube-staging-master-codfw or A:wikikube-staging-worker-codfw' "disable-puppet 'Reinitialize staging-codfw with k8s 1.23 - T326340 - ${USER}'"
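For later, the counterpart command once the work is done (assuming the standard enable-puppet helper, which expects the same message that was used to disable):
sudo cumin 'A:wikikube-staging-etcd-codfw or A:wikikube-staging-master-codfw or A:wikikube-staging-worker-codfw' "enable-puppet 'Reinitialize staging-codfw with k8s 1.23 - T326340 - ${USER}'"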
Reimage etcd hosts
Change dhcp pxe config to bullseye: https://gerrit.wikimedia.org/r/c/operations/puppet/+/878047
sudo cookbook -c spicerack_config.yaml sre.ganeti.reimage --no-downtime --os bullseye kubestagetcd2001
sudo cookbook -c spicerack_config.yaml sre.ganeti.reimage --no-downtime --os bullseye kubestagetcd2002
sudo cookbook -c spicerack_config.yaml sre.ganeti.reimage --no-downtime --os bullseye kubestagetcd2003
Note: etcd needed a manual restart (on kubestagetcd2003 at least) to pick up the new certificates.
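If that happens again, a minimal sketch for restarting etcd on the affected host so it picks up the renewed certificates (assuming the unit is simply named etcd):
sudo systemctl restart etcd
sudo systemctl status etcd  # then re-run the health checks below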
Check that etcd v2 and v3 are healthy:
etcdctl -C https://$(hostname -f):2379 cluster-health
ETCDCTL_API=3 etcdctl --endpoints https://$(hostname -f):2379 member list
Reimage master & nodes
https://gerrit.wikimedia.org/r/877990
sudo cookbook -c spicerack_config.yaml sre.ganeti.reimage --no-downtime --os bullseye kubestagemaster2001
sudo cookbook sre.hosts.reimage --os bullseye --no-downtime kubestage2001
sudo cookbook sre.hosts.reimage --os bullseye --no-downtime kubestage2002
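A rough way to confirm the reimaged master and nodes actually rejoined the cluster (with admin credentials, e.g. via kube_env admin staging-codfw as used below):
kubectl get nodes -o wide  # all hosts should show up; they will only go Ready once calico is deployed below
kubectl get pods -A --field-selector spec.nodeName=kubestage2001.codfw.wmnet  # spot-check that workloads get scheduled again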
In-cluster components
https://gerrit.wikimedia.org/r/868389
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/878190
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/878184 (no longer required; included in the new coredns chart version)
# Label master(s)
kube_env admin staging-codfw
kubectl label nodes kubestagemaster2001.codfw.wmnet node-role.kubernetes.io/master=""
helmfile -e staging-codfw -l name=rbac-rules -i apply
helmfile -e staging-codfw -l name=pod-security-policies -i apply
helmfile -e staging-codfw -l name=namespaces -i apply
helmfile -e staging-codfw -l name=calico-crds -i apply
helmfile -e staging-codfw -l name=calico -i apply
kubectl -n kube-system delete svc calico-typha # it had blocked the IP reserved for CoreDNS
helmfile -e staging-codfw -l name=coredns -i apply
helmfile -e staging-codfw -l name=calico -i apply # to get the calico-typha service back; coredns should probably go before calico
helmfile -e staging-codfw -l name=istio-gateways-networkpolicies -i apply
istioctl-1.15.3 manifest apply -f /srv/deployment-charts/custom_deploy.d/istio/main/config_k8s_1.23.yaml
helmfile -e staging-codfw -l name=eventrouter -i apply
helmfile -e staging-codfw -l name=cert-manager-networkpolicies -i apply
helmfile -e staging-codfw -l name=cert-manager -i apply
helmfile -e staging-codfw -l name=cfssl-issuer-crds -i apply
helmfile -e staging-codfw -l name=cfssl-issuer -i apply
helmfile -e staging-codfw -i apply
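A rough post-deploy sanity check (only assumes the istioctl binary name already used above):
kubectl get pods --all-namespaces | grep -viE 'running|completed'  # anything listed here (besides the header) needs a look
kubectl -n kube-system get svc  # calico-typha and the CoreDNS service should both be present with their reserved IPs
istioctl-1.15.3 proxy-status  # the ingress gateway proxies should show SYNCED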
Todos:
- Jan 10 18:33:42 kubestagemaster2001 kube-apiserver[1833]: E0110 18:33:42.162058 1833 fieldmanager.go:211] "[SHOULD NOT HAPPEN] failed to update managedFields" VersionKind="/, Kind=" namespace="" name="kubestage2001.codfw.wmnet"
- https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#Label_Kubernetes_Masters
- New default namespaces apart from kube-system (kube-node-lease, kube-public, default): do they need to be protected in admin_ng? kube- prefixed namespaces are already protected.
- Istio: "! values.global.jwtPolicy is deprecated; use Values.global.jwtPolicy=third-party-jwt instead. See http://istio.io/latest/docs/ops/best-practices/security/#configure-third-party-service-account-tokens for more information"
- coredns: spec.template.spec.nodeSelector[beta.kubernetes.io/os]: deprecated since v1.14; use "kubernetes.io/os" instead: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/878935
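To see which OS labels the nodes actually carry while that patch rolls out (on 1.23 the deprecated beta label should still be applied alongside the new one), something like:
kubectl get nodes -L kubernetes.io/os -L beta.kubernetes.io/os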
Alerts that were still firing
- [10.01.23 19:06] <icinga-wm> PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
- [10.01.23 19:07] <icinga-wm> PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
- [10.01.23 19:07] <jinxer-wm> (KubernetesCalicoDown) firing: kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestage2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
- Various JobUnavailable alerts; I ended up creating a silence with the matchers source="prometheus", prometheus="k8s-staging", site="codfw" (see the amtool sketch below).
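For reference, roughly how such a silence can be created from the CLI with amtool; the alertmanager URL, duration, and comment here are assumptions:
amtool --alertmanager.url=http://localhost:9093 silence add 'source="prometheus"' 'prometheus="k8s-staging"' 'site="codfw"' --comment='Reinitialize staging-codfw with k8s 1.23 - T326340' --duration=24h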