Earlier today there was a page for wikifunctions, and other non-paging wikikube codfw services alerted as well (like miscweb in T353211).
It seems there was an issue between Dec 11 23:55 UTC and Dec 12 00:17 UTC. See the network probes:
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes%2Fcustom&var-module=All&orgId=1&from=1702338398590&to=1702340802577
The Kubernetes event logs show a lot of unhealthy pods on kubernetes2047.
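For reference, a rough sketch of how to pull those events (assumes kubectl access to the codfw cluster; kubernetes2047 is the node from above):

```python
# Sketch: list Unhealthy pod events reported by the kubelet on one node.
# Assumes kubectl is configured against the affected (codfw) cluster.
import json
import subprocess

NODE = "kubernetes2047"

raw = subprocess.run(
    ["kubectl", "get", "events", "--all-namespaces", "-o", "json"],
    check=True, capture_output=True, text=True,
).stdout

for ev in json.loads(raw)["items"]:
    # Unhealthy events are emitted by the kubelet on the node running the pod,
    # so source.host identifies the node.
    if ev.get("reason") == "Unhealthy" and ev.get("source", {}).get("host") == NODE:
        print(ev.get("lastTimestamp"), ev["involvedObject"]["name"], ev["message"])
```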
From skimming through the syslog of kubernetes2047: the node got a lot of timeouts when trying to reach the kubemaster in codfw before the incident, calico crashlooped for quite some time during the incident, and there was also some OOM killing starting at Dec 12 00:13 UTC.
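Roughly what that syslog skim looked like, as a sketch (the patterns are illustrative, not the exact strings logged):

```python
# Sketch: grep /var/log/syslog on the node for apiserver timeouts,
# calico restarts, and oom-killer activity. Patterns are assumptions,
# not the literal log lines.
import re

PATTERNS = re.compile(
    r"(i/o timeout|context deadline exceeded|calico|oom-killer|Out of memory)",
    re.IGNORECASE,
)

with open("/var/log/syslog", errors="replace") as f:
    for line in f:
        if PATTERNS.search(line):
            print(line.rstrip())
```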
Before the incident (Dec 11 22:39:28) there was also an OOM kill and a significant increase in TCP errors. That's probably related to the timeouts to the kubemaster, but it might also just be eventrouter getting overwhelmed by too many Kubernetes events. Maybe somebody can connect the dots :)
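For whoever connects the dots: a quick sketch for charting the TCP error/retransmit rate on the node over the window (the Prometheus URL is a placeholder, the metric name assumes node_exporter defaults, and retransmits are only one proxy for "TCP errors"):

```python
# Sketch: pull the TCP retransmit rate for kubernetes2047 around the incident.
# PROM is a placeholder URL; the instance label pattern is an assumption.
import requests

PROM = "http://prometheus.example.org/api/v1/query_range"  # placeholder
QUERY = 'rate(node_netstat_Tcp_RetransSegs{instance=~"kubernetes2047.*"}[5m])'

resp = requests.get(PROM, params={
    "query": QUERY,
    "start": "2023-12-11T22:00:00Z",
    "end": "2023-12-12T01:00:00Z",
    "step": "60",
})
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    for ts, value in series["values"]:
        print(ts, value)
```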
Thanks @JMeybohm for helping identify this on IRC.
From calico logs we can see typha failing to connect to the apiservers around 22:36 (2023-12-11): https://logstash.wikimedia.org/goto/2d45ae99d7fe495907ba1252216e7aac
Around that time both apiservers were unreachable from Prometheus as well (gap in metrics): kubemaster2001 / kubemaster2002
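To double-check that gap, something like this against the Prometheus API should show missing samples for the window (placeholder URL; the instance label pattern is an assumption):

```python
# Sketch: fetch the `up` series for both masters over the incident window.
# Missing samples (rather than up == 0) mean Prometheus could not scrape at all.
import requests

PROM = "http://prometheus.example.org/api/v1/query_range"  # placeholder
resp = requests.get(PROM, params={
    "query": 'up{instance=~"kubemaster200[12].*"}',
    "start": "2023-12-11T22:00:00Z",
    "end": "2023-12-12T01:00:00Z",
    "step": "60",
})
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    # At a 60s step the 3h window should yield ~181 samples; fewer means a gap.
    print(series["metric"].get("instance"), len(series["values"]), "samples")
```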
- mediawiki train/backport ~22:15
- elevated API requests at 22:20 & 22:35
- high disk IO on the masters, probably due to logging
- we're running at the upper bound on memory (see the sketch after this list)
- kubemaster2001: OOM kills at 23:17 and 23:41, the second one killed kube-apiserver
- kubemaster2002: OOM kill took out kube-apiserver ~23:50
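For the memory item above, a rough way to check how close the apiservers sit to their limits. This assumes they run as pods with memory limits set; if kube-apiserver runs as a host-level service on the masters instead, the node_memory metrics would be the place to look. The Prometheus URL is a placeholder and the metric names assume cadvisor + kube-state-metrics:

```python
# Sketch: ratio of apiserver working-set memory to its configured limit.
# Values near 1.0 mean the container sits at its limit, consistent with
# the OOM kills at 23:17, 23:41, and ~23:50. Pod name pattern is an assumption.
import requests

PROM = "http://prometheus.example.org/api/v1/query"  # placeholder
QUERY = (
    'container_memory_working_set_bytes{pod=~"kube-apiserver.*"} '
    '/ on(namespace, pod, container) kube_pod_container_resource_limits'
    '{resource="memory", pod=~"kube-apiserver.*"}'
)

resp = requests.get(PROM, params={"query": QUERY})
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"].get("pod"), series["value"][1])
```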