[x] Downtime: etcd, master, nodes
[x] Reimage etcd nodes with bullseye
[x] Merge hiera changes for 1.23: https://gerrit.wikimedia.org/r/c/operations/puppet/+/877990/2
[x] Reimage master
[x] Reimage nodes
[x] Verify basic k8s stuff working (nodes joining the cluster)
[x] Merge deployment-charts changes for 1.23: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/868389
[x] Deploy admin_ng & istio
[x] Deploy services (only miscweb so far)
[x] Lift downtimes
## Update staging-codfw
```
sudo cookbook sre.hosts.downtime -r 'Reinitialize staging-codfw with k8s 1.23' -t T326340 -H 24 'A:wikikube-staging-etcd-codfw or A:wikikube-staging-master-codfw or A:wikikube-staging-worker-codfw'
sudo cumin 'A:wikikube-staging-etcd-codfw or A:wikikube-staging-master-codfw or A:wikikube-staging-worker-codfw' "disable-puppet 'Reinitialize staging-codfw with k8s 1.23 - T326340 - ${USER}'"
```
### Reimage etcd hosts
Change dhcp pxe config to bullseye: https://gerrit.wikimedia.org/r/c/operations/puppet/+/878047
```
sudo cookbook -c spicerack_config.yaml sre.ganeti.reimage --no-downtime --os bullseye kubestagetcd2001
sudo cookbook -c spicerack_config.yaml sre.ganeti.reimage --no-downtime --os bullseye kubestagetcd2002
sudo cookbook -c spicerack_config.yaml sre.ganeti.reimage --no-downtime --os bullseye kubestagetcd2003
```
etcd needed a manual restart (at least on kubestagetcd2003) to pick up the new certificates.
#### etcd v2 and v3 healthy?
```
etcdctl -C https://$(hostname -f):2379 cluster-health
ETCDCTL_API=3 etcdctl --endpoints https://$(hostname -f):2379 member list
```
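Eyeballing the output works, but the v2 check can also be scripted; a minimal sketch (assuming etcdctl v2's usual `cluster-health` phrasing: `member ... is healthy`/`is unhealthy` lines followed by a `cluster is healthy` summary line):

```shell
# Succeeds only when the v2 summary line says the cluster is healthy
# and no member line reports unhealthy (etcdctl v2 output format).
etcd_v2_ok() {
  grep -qx 'cluster is healthy' <<< "$1" && ! grep -q 'is unhealthy' <<< "$1"
}
# Usage: etcd_v2_ok "$(etcdctl -C https://$(hostname -f):2379 cluster-health)"
```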
### Reimage master & nodes
https://gerrit.wikimedia.org/r/877990
```
sudo cookbook -c spicerack_config.yaml sre.ganeti.reimage --no-downtime --os bullseye kubestagemaster2001
sudo cookbook sre.hosts.reimage --os bullseye --no-downtime kubestage2001
sudo cookbook sre.hosts.reimage --os bullseye --no-downtime kubestage2002
```
### In-cluster components
https://gerrit.wikimedia.org/r/868389
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/878190
~~https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/878184~~ no longer required; included in the new coredns chart version
```
helmfile -e staging-codfw -l name=rbac-rules -i apply
helmfile -e staging-codfw -l name=pod-security-policies -i apply
helmfile -e staging-codfw -l name=namespaces -i apply
helmfile -e staging-codfw -l name=calico-crds -i apply
helmfile -e staging-codfw -l name=calico -i apply
kubectl -n kube-system delete svc calico-typha # it had blocked the IP reserved for CoreDNS
helmfile -e staging-codfw -l name=coredns -i apply
helmfile -e staging-codfw -l name=calico sync # to get the calico-typha service back; CoreDNS should probably go before calico next time
helmfile -e staging-codfw -l name=istio-gateways-networkpolicies -i apply
istioctl-1.15.3 manifest apply -f /srv/deployment-charts/custom_deploy.d/istio/main/config_k8s_1.23.yaml
helmfile -e staging-codfw -l name=eventrouter -i apply
helmfile -e staging-codfw -l name=cert-manager-networkpolicies -i apply
helmfile -e staging-codfw -l name=cert-manager -i apply
helmfile -e staging-codfw -l name=cfssl-issuer-crds -i apply
helmfile -e staging-codfw -l name=cfssl-issuer -i apply
helmfile -e staging-codfw -i apply
```
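The long run of per-component applies above can be compressed into a loop; a sketch only (assuming the same `-l name=` selectors, not how it was actually run):

```shell
# Apply a list of admin_ng components in order, stopping on the first failure.
apply_components() {
  local env="$1"; shift
  local component
  for component in "$@"; do
    helmfile -e "$env" -l "name=$component" -i apply || return 1
  done
}
# Usage: apply_components staging-codfw rbac-rules pod-security-policies namespaces
```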
### Todos:
[ ] Jan 10 18:33:42 kubestagemaster2001 kube-apiserver[1833]: E0110 18:33:42.162058 1833 fieldmanager.go:211] "[SHOULD NOT HAPPEN] failed to update managedFields" VersionKind="/, Kind=" namespace="" name="kubestage2001.codfw.wmnet"
[x] https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#Label_Kubernetes_Masters
[x] ~~New default namespace apart from kube-system: kube-node-lease, kube-public, default - do they need to be protected in admin_ng?~~ kube- prefixed namespaces are protected
[x] Istio: ! values.global.jwtPolicy is deprecated; use Values.global.jwtPolicy=third-party-jwt. See http://istio.io/latest/docs/ops/best-practices/security/#configure-third-party-service-account-tokens for more information instead
[x] coredns: spec.template.spec.nodeSelector[beta.kubernetes.io/os]: deprecated since v1.14; use "kubernetes.io/os" instead: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/878935
[ ] .Values.kubernetesApi hacks should no longer be needed; T326729
[ ] Remove obsolete tokens from private puppet and labs/private (at least the following):
[ ] profile::kubernetes::master::controllermanager_token
[ ] profile::kubernetes::node::kubelet_token
[ ] profile::kubernetes::node::kubeproxy_token
[ ] Remove cergen certificates
[ ] Fix grafana dashboards that are in bad shape; T322919
### Alerts that were still firing
* [10.01.23 19:06] <icinga-wm> PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
* [10.01.23 19:07] <icinga-wm> PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
* [10.01.23 19:07] <jinxer-wm> (KubernetesCalicoDown) firing: kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestage2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
* Various JobUnavailable alerts; I ended up creating a silence with the matchers `source="prometheus", prometheus="k8s-staging", site="codfw"`
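For the record, the same silence can also be created from the CLI; a sketch with amtool (the alertmanager URL is a placeholder, and the matcher labels are taken from the silence above):

```shell
# Silence the remaining staging alerts for 24h using the matchers above.
silence_staging() {
  amtool silence add \
    --alertmanager.url="$1" \
    --duration=24h \
    --comment='staging-codfw k8s 1.23 reinit - T326340' \
    source=prometheus prometheus=k8s-staging site=codfw
}
```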