This will come in handy for future Kubernetes upgrades, where we might take the "reinit route" again.
This cookbook should:
- downtime all services (in the cluster, for its DC) (T277740)
- route all service traffic to "the other DC" (T260663)
- downtime all masters and nodes
- schedule various other downtimes, for example for these alerts:
  - PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers acrab.codfw.wmnet are marked down but pooled. Use https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=lvs1016&service=PyBal+backends+health+check and https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=lvs1015&service=PyBal+backends+health+check
  - PROBLEM - Prometheus k8s cache not updating on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus Use https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=k8s+cache#
  - PROBLEM - Confd template for /srv/config-master/pybal/eqiad/... on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/... is broken. Use https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=confd+template+for+%2Fsrv%2Fconfig-master%2Fpybal# (this needs to be downtimed for both DCs!)
- depool all nodes from pybal
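
The steps above could be sketched as an ordered plan that the cookbook walks through. This is only an illustration of the ordering and of which DC each step applies to. Every name in it (`Step`, `other_dc`, `build_plan`) is hypothetical and not part of Spicerack or conftool, and it assumes there are exactly two DCs (eqiad and codfw) and that the steps run in the order listed.

```python
from dataclasses import dataclass

# All names below are hypothetical; none of them are Spicerack or conftool APIs.
# Assumption: exactly two datacenters take part in the switch.
DATACENTERS = ("eqiad", "codfw")


@dataclass(frozen=True)
class Step:
    """One action of the cookbook and the datacenters it applies to."""
    description: str
    datacenters: tuple


def other_dc(dc):
    """Return the datacenter that takes over traffic for `dc`."""
    if dc not in DATACENTERS:
        raise ValueError(f"unknown datacenter: {dc}")
    return DATACENTERS[1 - DATACENTERS.index(dc)]


def build_plan(dc):
    """Build the ordered list of steps for reinitializing the cluster in `dc`."""
    target = other_dc(dc)
    return [
        Step("downtime all services in the cluster", (dc,)),
        Step(f"route all service traffic to {target}", (dc,)),
        Step("downtime all masters and nodes", (dc,)),
        Step("downtime PyBal backends health check", (dc,)),
        Step("downtime Prometheus k8s cache not updating", (dc,)),
        # The confd template alert has to be downtimed for both DCs.
        Step("downtime Confd template for pybal config", DATACENTERS),
        Step("depool all nodes from pybal", (dc,)),
    ]


if __name__ == "__main__":
    for number, step in enumerate(build_plan("codfw"), start=1):
        print(f"{number}. [{', '.join(step.datacenters)}] {step.description}")
```

A real cookbook would replace each `Step` with the actual downtime, traffic routing, and depool calls. The sketch only captures the sequence and the one step that spans both DCs.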