
KubernetesAPILatency alert fires on scap deploy
Closed, Resolved · Public

Authored By: fgiunchedi
Oct 5 2023, 9:59 AM

Description

I noticed KubernetesAPILatency firing relatively frequently, to the point that it is essentially noise.

On further inspection, it looks like there's a spike in API latency when running scap, e.g. in codfw:

2023-10-05-115540_2385x1051_scrot.png (225 KB)

And SAL entries related to scap line up pretty well:

2023-10-04
22:02	<brennen@deploy2002>	Finished scap: Backport for [[gerrit:963351|Revert "Deprecate TOC mutation in OutputPageParserOutput hook" (T348134)]] (duration: 09m 13s)	[production]
21:53	<brennen@deploy2002>	Started scap: Backport for [[gerrit:963351|Revert "Deprecate TOC mutation in OutputPageParserOutput hook" (T348134)]]	[production]
20:54	<urbanecm@deploy2002>	Finished scap: Backport for [[gerrit:963347|SpecialManageMentors: Skip OOUI initialization when transcluding (T346760)]], [[gerrit:963348|SpecialManageMentors: Skip OOUI initialization when transcluding (T346760)]], [[gerrit:963349|Fix phan for GrowthExperiments (T347571)]] (duration: 07m 49s)	[production]
20:46	<urbanecm@deploy2002>	Started scap: Backport for [[gerrit:963347|SpecialManageMentors: Skip OOUI initialization when transcluding (T346760)]], [[gerrit:963348|SpecialManageMentors: Skip OOUI initialization when transcluding (T346760)]], [[gerrit:963349|Fix phan for GrowthExperiments (T347571)]]	[production]
14:31	<lucaswerkmeister-wmde@deploy2002>	Finished scap: Backport for [[gerrit:963307|prod: Enable wgCampaignEventsEnableEmail in meta and officewiki (T347065)]] (duration: 18m 26s)	[production]
14:12	<lucaswerkmeister-wmde@deploy2002>	Started scap: Backport for [[gerrit:963307|prod: Enable wgCampaignEventsEnableEmail in meta and officewiki (T347065)]]	[production]
14:00	<lucaswerkmeister-wmde@deploy2002>	Finished scap: Backport for [[gerrit:963305|beta: Explicitly assign campaignevents-email-participants to all users (T336939)]], [[gerrit:963306|metawiki: Restrict campaignevents-email-participants right (T336939)]] (duration: 10m 40s)	[production]
13:49	<lucaswerkmeister-wmde@deploy2002>	Started scap: Backport for [[gerrit:963305|beta: Explicitly assign campaignevents-email-participants to all users (T336939)]], [[gerrit:963306|metawiki: Restrict campaignevents-email-participants right (T336939)]]	[production]
13:46	<lucaswerkmeister-wmde@deploy2002>	Finished scap: Backport for [[gerrit:963036|fonwiki: add wgSiteName, wgMetaNamespace and timezone (T347939)]] (duration: 13m 46s)	[production]
13:33	<lucaswerkmeister-wmde@deploy2002>	Started scap: Backport for [[gerrit:963036|fonwiki: add wgSiteName, wgMetaNamespace and timezone (T347939)]]	[production]
13:20	<lucaswerkmeister-wmde@deploy2002>	Finished scap: Backport for [[gerrit:963066|fonwiki: add logos (T347939)]] (duration: 11m 43s)	[production]
13:08	<lucaswerkmeister-wmde@deploy2002>	Started scap: Backport for [[gerrit:963066|fonwiki: add logos (T347939)]]

The spike could certainly be investigated; in the meantime, I think we can bump either the alert thresholds or the "for" clause to avoid spurious alerts.
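For reference, the mitigation would look roughly like this as a Prometheus alerting rule. This is a sketch, not the production definition: the metric is the standard kube-apiserver request-duration histogram, but the expression, label matchers, threshold, and "for" values here are illustrative assumptions.

```yaml
# Hypothetical shape of a KubernetesAPILatency-style rule (NOT the
# production rule). Raising the threshold (> 0.5) or lengthening the
# "for" clause would both ride out short scap-induced spikes.
- alert: KubernetesAPILatency
  expr: |
    histogram_quantile(0.95,
      sum by (cluster, le) (
        rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])
      )
    ) > 0.5
  for: 15m          # assumed previously 5m; longer window absorbs deploys
  labels:
    severity: warning
  annotations:
    summary: "High Kubernetes API latency"
```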

Event Timeline

Looks like there's some correlation with very bad etcd latency

image.png (59 KB)

(Confusingly, that graph gives the latency by caller, not by destination)

Looking at etcd metrics for the kubetcd cluster, we can see very bad fsync latencies. Unfortunately, they're Ganeti VMs and already using local storage, so I don't think there's much we can do there.

image.png (92 KB)
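The fsync observation comes from etcd's standard disk instrumentation; a sketch of the kind of query behind such a graph (the job label is an assumption about how kubetcd is scraped):

```promql
# p99 WAL fsync duration per etcd instance; etcd's docs suggest this
# should stay in the low-millisecond range on healthy storage
histogram_quantile(0.99,
  sum by (instance, le) (
    rate(etcd_disk_wal_fsync_duration_seconds_bucket{job="etcd"}[5m])
  )
)
```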

They're also pretty undersized CPU-wise (1 vCPU).
We can start with the easiest fix: give them more CPU and see if that helps.

The masters themselves only have 1 vCPU and seem a little undersized memory-wise (e.g. kubemaster1002):

image.png (93 KB)

image.png (64 KB)

I think we've got the capacity to grow them a little, say:

  • +1 vCPU, +1GB RAM for kubemaster
  • +1 vCPU for kubetcd
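On Ganeti, the resize would be applied per instance via backend parameters, followed by a reboot so the instance picks up the new shape. A sketch only (exact flags and the instances touched are assumptions; run on the cluster master):

```shell
# Sketch: bump vCPU/RAM via Ganeti backend params, one host at a time,
# rebooting each so the change takes effect.
gnt-instance modify -B vcpus=2,maxmem=4096M kubemaster1002.eqiad.wmnet
gnt-instance reboot kubemaster1002.eqiad.wmnet

gnt-instance modify -B vcpus=2 kubetcd1006.eqiad.wmnet
gnt-instance reboot kubetcd1006.eqiad.wmnet
```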
Clement_Goubert triaged this task as Medium priority.
Clement_Goubert moved this task from Incoming 🐫 to Doing 😎 on the serviceops board.
Clement_Goubert edited projects, added Prod-Kubernetes; removed MW-on-K8s.

Mentioned in SAL (#wikimedia-operations) [2023-10-05T14:38:19Z] <claime> Bumping kubectd100[4-6].eqiad.wmnet vcpu to 2 - T348228

Mentioned in SAL (#wikimedia-operations) [2023-10-05T14:38:34Z] <claime> Bumping kubetcd100[4-6].eqiad.wmnet vcpu to 2 - T348228

Icinga downtime and Alertmanager silence (ID=2578e1de-95e1-47af-871c-5fc14c29acc0) set by cgoubert@cumin1001 for 0:15:00 on 1 host(s) and their services with reason: Pick up vcpu change

kubetcd1006.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-10-05T14:41:35Z] <claime> rebooting kubetcd1006.eqiad.wmnet - T348228

Icinga downtime and Alertmanager silence (ID=4ee976bc-2144-44ed-a885-1325a5720050) set by cgoubert@cumin1001 for 0:15:00 on 1 host(s) and their services with reason: Pick up vcpu change

kubetcd1005.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-10-05T14:46:39Z] <claime> rebooted kubetcd1005.eqiad.wmnet - T348228

Icinga downtime and Alertmanager silence (ID=e5543a4c-cba1-4246-8253-ffe15dcde108) set by cgoubert@cumin1001 for 0:15:00 on 1 host(s) and their services with reason: Pick up vcpu change

kubetcd1004.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-10-05T14:47:24Z] <claime> rebooting kubetcd1004.eqiad.wmnet - T348228

Mentioned in SAL (#wikimedia-operations) [2023-10-05T14:52:45Z] <claime> Bumping kubemaster100[1-2].eqiad.wmnet vcpu to 2, ram to 4G - T348228

Icinga downtime and Alertmanager silence (ID=1c916c69-a73b-4df2-ab3f-a877744bdad0) set by cgoubert@cumin1001 for 0:15:00 on 1 host(s) and their services with reason: Pick up vcpu change

kubemaster1002.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-10-05T14:53:27Z] <claime> rebooting kubemaster1002.eqiad.wmnet - T348228

Icinga downtime and Alertmanager silence (ID=7031ccc2-0805-4db8-aa97-55d719b4d006) set by cgoubert@cumin1001 for 0:15:00 on 1 host(s) and their services with reason: Pick up vcpu change

kubemaster1001.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-10-05T15:03:19Z] <claime> rebooting kubemaster1001.eqiad.wmnet - T348228

Mentioned in SAL (#wikimedia-operations) [2023-10-05T15:06:48Z] <claime> Bumping kubetcd200[4-6].eqiad.wmnet vcpu to 2 - T348228

Icinga downtime and Alertmanager silence (ID=d85168b6-bad9-4157-b743-a5047d55be03) set by cgoubert@cumin1001 for 0:15:00 on 1 host(s) and their services with reason: Pick up vcpu change

kubetcd2006.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-10-05T15:07:39Z] <claime> rebooting kubetcd2006.codfw.wmnet - T348228

Mentioned in SAL (#wikimedia-operations) [2023-10-05T15:09:38Z] <claime> rebooting kubetcd2005.codfw.wmnet - T348228

Icinga downtime and Alertmanager silence (ID=2b7029e7-cb66-4d09-a3ae-01b8726b028e) set by cgoubert@cumin1001 for 0:15:00 on 1 host(s) and their services with reason: Pick up vcpu change

kubetcd2005.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=52db74f7-7d98-4a04-8418-376cc7ab1496) set by cgoubert@cumin1001 for 0:15:00 on 1 host(s) and their services with reason: Pick up vcpu change

kubetcd2004.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-10-05T15:13:54Z] <claime> rebooting kubetcd2004.codfw.wmnet - T348228

Icinga downtime and Alertmanager silence (ID=ae3368e2-fef2-4be8-b17e-8e3d2bb121cd) set by cgoubert@cumin1001 for 0:15:00 on 1 host(s) and their services with reason: Pick up vcpu change

kubemaster2002.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-10-05T15:20:26Z] <claime> rebooting kubemaster2002.codfw.wmnet - T348228

Mentioned in SAL (#wikimedia-operations) [2023-10-05T15:25:57Z] <claime> rebooting kubemaster2001.codfw.wmnet - T348228

Icinga downtime and Alertmanager silence (ID=91da260a-99b3-46b1-b016-fe2edefaff1e) set by cgoubert@cumin1001 for 0:15:00 on 1 host(s) and their services with reason: Pick up vcpu change

kubemaster2001.codfw.wmnet

Bumped all kubetcd to 2 vCPU and all kubemasters to 2 vCPU and 4 GB RAM.
If that isn't sufficient, we may need to think about migrating these to real hardware, because I fear the limiting factor is disk speed for etcd.

Mentioned in SAL (#wikimedia-operations) [2023-10-05T16:05:50Z] <cgoubert@deploy2002> Started scap: Testing mw-on-k8s deployment for T348228

Mentioned in SAL (#wikimedia-operations) [2023-10-05T16:07:55Z] <cgoubert@deploy2002> Finished scap: Testing mw-on-k8s deployment for T348228 (duration: 02m 15s)

It's a bit better, but still borderline: the alert fires after 5 minutes over 0.5s, and we're sitting just under that threshold.

image.png (50 KB)
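To quantify how borderline it is, one could look at the worst p95 seen around a deploy window; a sketch, assuming the same apiserver histogram the alert is presumably based on (label matchers are illustrative):

```promql
# Highest 5m-rate p95 over the last hour, via a PromQL subquery;
# compare the result against the 0.5s alert threshold
max_over_time(
  histogram_quantile(0.95,
    sum by (le) (
      rate(apiserver_request_duration_seconds_bucket{site="codfw"}[5m])
    )
  )[1h:1m]
)
```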

etcd latency is a lot better, but an 800ms p99 still feels really high.

image.png (69 KB)

Clement_Goubert added a subscriber: elukey.

The alert thresholds were changed by @elukey in https://gerrit.wikimedia.org/r/c/operations/alerts/+/964534
Resolving this task as it is about the alert itself, but keeping the subtask open for the Kubernetes API performance investigation.