[x] Downtime: etcd, master, nodes
[x] Reimage etcd nodes with bullseye
[x] Merge hiera changes for 1.23: https://gerrit.wikimedia.org/r/c/operations/puppet/+/877990/2
[x] Reimage master
[x] Reimage nodes
[x] Verify basic k8s stuff working (nodes joining the cluster)
[x] Merge deployment-charts changes for 1.23: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/868389
[x] Deploy admin_ng & istio
[x] Deploy services (only miscweb so far)
[x] Lift downtimes
## Update staging-codfw
```
sudo cookbook sre.hosts.downtime -r 'Reinitialize staging-codfw with k8s 1.23' -t T326340 -H 24 'A:wikikube-staging-etcd-codfw or A:wikikube-staging-master-codfw or A:wikikube-staging-worker-codfw'
sudo cumin 'A:wikikube-staging-etcd-codfw or A:wikikube-staging-master-codfw or A:wikikube-staging-worker-codfw' "disable-puppet 'Reinitialize staging-codfw with k8s 1.23 - T326340 - ${USER}'"
```
### Reimage etcd hosts
Change dhcp pxe config to bullseye: https://gerrit.wikimedia.org/r/c/operations/puppet/+/878047
```
sudo cookbook -c spicerack_config.yaml sre.ganeti.reimage --no-downtime --os bullseye kubestagetcd2001
sudo cookbook -c spicerack_config.yaml sre.ganeti.reimage --no-downtime --os bullseye kubestagetcd2002
sudo cookbook -c spicerack_config.yaml sre.ganeti.reimage --no-downtime --os bullseye kubestagetcd2003
```
etcd needed a manual restart (at least on kubestagetcd2003) to pick up the new certificates.
#### etcd v2 and v3 healthy?
```
etcdctl -C https://$(hostname -f):2379 cluster-health
ETCDCTL_API=3 etcdctl --endpoints https://$(hostname -f):2379 member list
```
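Eyeballing the output works, but the v2 check can also be scripted; a minimal sketch (assuming etcdctl v2's usual `cluster-health` phrasing: `member ... is healthy`/`is unhealthy` lines followed by a `cluster is healthy` summary line):

```shell
# Succeeds only when the v2 summary line says the cluster is healthy
# and no member line reports unhealthy (etcdctl v2 output format).
etcd_v2_ok() {
  grep -qx 'cluster is healthy' <<< "$1" && ! grep -q 'is unhealthy' <<< "$1"
}
# Usage: etcd_v2_ok "$(etcdctl -C https://$(hostname -f):2379 cluster-health)"
```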
### Reimage master & nodes
https://gerrit.wikimedia.org/r/877990
```
sudo cookbook -c spicerack_config.yaml sre.ganeti.reimage --no-downtime --os bullseye kubestagemaster2001
sudo cookbook sre.hosts.reimage --os bullseye --no-downtime kubestage2001
sudo cookbook sre.hosts.reimage --os bullseye --no-downtime kubestage2002
```
### In-cluster components
https://gerrit.wikimedia.org/r/868389
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/878190
~~https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/878184~~ no longer required; included in the new coredns chart version
```
helmfile -e staging-codfw -l name=rbac-rules -i apply
helmfile -e staging-codfw -l name=pod-security-policies -i apply
helmfile -e staging-codfw -l name=namespaces -i apply
helmfile -e staging-codfw -l name=calico-crds -i apply
helmfile -e staging-codfw -l name=calico -i apply
kubectl -n kube-system delete svc calico-typha # it had blocked the IP reserved for CoreDNS
helmfile -e staging-codfw -l name=coredns -i apply
helmfile -e staging-codfw -l name=calico sync # to get the calico-typha service back; CoreDNS should probably go before calico next time
helmfile -e staging-codfw -l name=istio-gateways-networkpolicies -i apply
istioctl-1.15.3 manifest apply -f /srv/deployment-charts/custom_deploy.d/istio/main/config_k8s_1.23.yaml
helmfile -e staging-codfw -l name=eventrouter -i apply
helmfile -e staging-codfw -l name=cert-manager-networkpolicies -i apply
helmfile -e staging-codfw -l name=cert-manager -i apply
helmfile -e staging-codfw -l name=cfssl-issuer-crds -i apply
helmfile -e staging-codfw -l name=cfssl-issuer -i apply
helmfile -e staging-codfw -i apply
```
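The long run of per-component applies above can be compressed into a loop; a sketch only (assuming the same `-l name=` selectors, not how it was actually run):

```shell
# Apply a list of admin_ng components in order, stopping on the first failure.
apply_components() {
  local env="$1"; shift
  local component
  for component in "$@"; do
    helmfile -e "$env" -l "name=$component" -i apply || return 1
  done
}
# Usage: apply_components staging-codfw rbac-rules pod-security-policies namespaces
```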
### Todos:
[ ] Jan 10 18:33:42 kubestagemaster2001 kube-apiserver[1833]: E0110 18:33:42.162058 1833 fieldmanager.go:211] "[SHOULD NOT HAPPEN] failed to update managedFields" VersionKind="/, Kind=" namespace="" name="kubestage2001.codfw.wmnet"
[x] https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#Label_Kubernetes_Masters
[x] ~~New default namespace apart from kube-system: kube-node-lease, kube-public, default - do they need to be protected in admin_ng?~~ kube- prefixed namespaces are protected
[x] Istio: ! values.global.jwtPolicy is deprecated; use Values.global.jwtPolicy=third-party-jwt. See http://istio.io/latest/docs/ops/best-practices/security/#configure-third-party-service-account-tokens for more information instead
[x] coredns: spec.template.spec.nodeSelector[beta.kubernetes.io/os]: deprecated since v1.14; use "kubernetes.io/os" instead: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/878935
[ ] .Values.kubernetesApi hacks should no longer be needed; T326729
[ ] Remove obsolete tokens from private puppet and labs/private (at least the following):
[ ] profile::kubernetes::master::controllermanager_token
[ ] profile::kubernetes::node::kubelet_token
[ ] profile::kubernetes::node::kubeproxy_token
[ ] Remove cergen certificates
[ ] Fix grafana dashboards that are in bad shape; T322919
### Alerts that were still firing
* [10.01.23 19:06] <icinga-wm> PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
* [10.01.23 19:07] <icinga-wm> PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
* [10.01.23 19:07] <jinxer-wm> (KubernetesCalicoDown) firing: kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestage2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
* Various JobUnavailable alerts; I ended up creating a silence with the matchers `source="prometheus", prometheus="k8s-staging", site="codfw"`
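For the record, the same silence can also be created from the CLI; a sketch with amtool (the alertmanager URL is a placeholder, and the matcher labels are taken from the silence above):

```shell
# Silence the remaining staging alerts for 24h using the matchers above.
silence_staging() {
  amtool silence add \
    --alertmanager.url="$1" \
    --duration=24h \
    --comment='staging-codfw k8s 1.23 reinit - T326340' \
    source=prometheus prometheus=k8s-staging site=codfw
}
```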