Outage of wikikube codfw apiservers
Closed, ResolvedPublic

Description

Earlier today there was a page for wikifunctions, and other non-paging wikikube codfw services alerted as well (like miscweb in T353211).

It seems there was an issue between Dec 11 23:55 UTC and Dec 12 00:17 UTC. See network probes:
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes%2Fcustom&var-module=All&orgId=1&from=1702338398590&to=1702340802577
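For readers who cannot open the dashboard, the time range encoded in the link above can be decoded offline. This is a minimal sketch, assuming the `from` and `to` query parameters are epoch milliseconds (Grafana's usual encoding for absolute ranges); the helper name `grafana_window` is made up for illustration.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse, parse_qs

# The network-probes link from the task description.
URL = (
    "https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview"
    "?var-job=probes%2Fcustom&var-module=All&orgId=1"
    "&from=1702338398590&to=1702340802577"
)


def grafana_window(url):
    """Return the (start, end) of the link's time range as UTC datetimes.

    Assumes 'from' and 'to' are absolute epoch-millisecond values.
    """
    query = parse_qs(urlparse(url).query)
    return tuple(
        datetime.fromtimestamp(int(query[key][0]) / 1000, tz=timezone.utc)
        for key in ("from", "to")
    )


start, end = grafana_window(URL)
print(start.isoformat(timespec="seconds"), end.isoformat(timespec="seconds"))
```

Under that assumption the link covers roughly Dec 11 23:46 UTC to Dec 12 00:27 UTC, which brackets the 23:55 to 00:17 window mentioned above.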

The Kubernetes event logs show a lot of unhealthy pods on kubernetes2047.

From skimming through the syslog of kubernetes2047, the node got a lot of timeouts when trying to reach the kubemaster in codfw before the incident. Calico then crashlooped for quite some time during the incident, and some OOM killing started at Dec 12 00:13 UTC.

Before the incident (Dec 11 22:39:28) there was also an OOM kill and a significant increase in TCP errors. That's probably related to the timeouts for the kubemaster. But maybe it's also just eventrouter getting overwhelmed by too many Kubernetes events. Maybe somebody can connect the dots :)

Thanks @JMeybohm for helping identify this on IRC.


From calico logs we can see typha failing to connect to the apiservers around 22:36 (2023-12-11): https://logstash.wikimedia.org/goto/2d45ae99d7fe495907ba1252216e7aac
Around that time both apiservers were unreachable from prometheus as well (gap in metrics): kubemaster2001 / kubemaster2002

  • mediawiki train/backport ~22:15
  • elevated API requests at 22:20 and 22:35
  • high disk IO on masters, probably due to logging
  • we're running at the upper bound on memory
  • kubemaster2001: OOM kills at 23:17 and 23:41, the second one killed kube-apiserver
  • kubemaster2002: OOM kill of kube-apiserver ~23:50

Event Timeline

JMeybohm renamed this task from kubernetes2047 lost all pods (unhealthy) to Outage of wikikube codfw apiservers. Dec 12 2023, 11:50 AM
JMeybohm updated the task description.

Mentioned in SAL (#wikimedia-operations) [2023-12-12T12:45:27Z] <jayme> increasing memory of ganeti instance kubemaster2001.codfw.wmnet from 4G to 12G (requires reboot) - T353233

VM kubemaster2001.codfw.wmnet rebooted by jayme@cumin2002 with reason: increase from 4G to 12G

VM kubemaster1001.eqiad.wmnet rebooted by jayme@cumin2002 with reason: increase from 4G to 12G

VM kubemaster2002.codfw.wmnet rebooted by jayme@cumin2002 with reason: increase from 4G to 12G

VM kubemaster1002.eqiad.wmnet rebooted by jayme@cumin2002 with reason: increase from 4G to 12G

Change 982403 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] kubestagemaster: Add http probe

https://gerrit.wikimedia.org/r/982403

Change 982403 merged by JMeybohm:

[operations/puppet@production] kubestagemaster: Add http probe

https://gerrit.wikimedia.org/r/982403

Change 982819 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] kubernetes::master Add blackbox checks for kuber-apiserver

https://gerrit.wikimedia.org/r/982819

Change 982819 merged by JMeybohm:

[operations/puppet@production] kubernetes::master Add blackbox checks for kuber-apiserver

https://gerrit.wikimedia.org/r/982819

Change 982852 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] kubernetes::master Add blackbox checks for kube-apiserver

https://gerrit.wikimedia.org/r/982852

Change 982852 merged by JMeybohm:

[operations/puppet@production] kubernetes::master Add blackbox checks for kube-apiserver

https://gerrit.wikimedia.org/r/982852

Change 982858 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] kubernetes::master Group blackbox checks per cluster

https://gerrit.wikimedia.org/r/982858

Change 982858 merged by JMeybohm:

[operations/puppet@production] kubernetes::master Group blackbox checks per cluster

https://gerrit.wikimedia.org/r/982858

Change 982863 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] kubernetes::master Fix syntax error concatenating strings

https://gerrit.wikimedia.org/r/982863

Change 982863 merged by JMeybohm:

[operations/puppet@production] kubernetes::master Fix syntax error concatenating strings

https://gerrit.wikimedia.org/r/982863

Change 982889 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] kubernetes::master Fix logic for certificate_expiry_days

https://gerrit.wikimedia.org/r/982889

Change 982889 merged by JMeybohm:

[operations/puppet@production] kubernetes::master Fix logic for certificate_expiry_days

https://gerrit.wikimedia.org/r/982889

VM kubemaster2001.codfw.wmnet rebooted by jayme@cumin2002 with reason: increase vCPUs from 2 to 4

VM kubemaster2002.codfw.wmnet rebooted by jayme@cumin2002 with reason: increase vCPUs from 2 to 4

VM kubemaster1001.eqiad.wmnet rebooted by jayme@cumin2002 with reason: increase vCPUs from 2 to 4

Change 983164 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/alerts@master] Exclude custom probes from generic alerts

https://gerrit.wikimedia.org/r/983164

VM kubemaster1002.eqiad.wmnet rebooted by jayme@cumin2002 with reason: increase vCPUs from 2 to 4

Change 983164 merged by jenkins-bot:

[operations/alerts@master] Exclude custom probes from generic alerts

https://gerrit.wikimedia.org/r/983164

Change 983359 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] kubernetes::master: Switch blackbox check to paging

https://gerrit.wikimedia.org/r/983359

Change 983359 merged by JMeybohm:

[operations/puppet@production] kubernetes::master: Switch blackbox check to paging

https://gerrit.wikimedia.org/r/983359

Resolving this, as the immediate problem is fixed and the remaining follow-ups have their own tasks.