
Re-evaluate kubelet operation latencies alerts
Open, Low, Public

Description

Since 2019-03-26 we've seen an increased rate of kubelet operational latency alerts from Kubernetes. Those usually recover quickly, but in a number of cases, especially lately, they seem to be flapping a lot. Some rough numbers indicate 525 individual alerts from that date to 2019-04-12, most of which are informational and indicative of an issue but not directly actionable. We should re-evaluate how we currently alert on this and implement better alerts.
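For context, these alerts boil down to thresholding the kubelet's per-operation latency metric as exposed to Prometheus. A minimal sketch of that kind of check follows; the endpoint, the metric name (the pre-1.14 kubelet latency summary) and the threshold are assumptions, not the actual puppet/Icinga configuration.

```
# Rough sketch only of the kind of query behind these alerts, NOT the actual
# puppet/Icinga check. The Prometheus endpoint, the metric name (pre-1.14
# kubelet latency summary) and the threshold are all assumptions.
import requests

PROMETHEUS = "http://prometheus.example.org/api/v1/query"  # hypothetical endpoint
THRESHOLD_SECONDS = 0.4                                    # hypothetical threshold

# p90 kubelet operation latency per node and operation type, converted to seconds
QUERY = 'kubelet_runtime_operations_latency_microseconds{quantile="0.9"} / 1e6'

resp = requests.get(PROMETHEUS, params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    labels = series["metric"]
    latency = float(series["value"][1])
    if latency > THRESHOLD_SECONDS:
        print(f'WARN {labels.get("instance")} {labels.get("operation_type")}: {latency:.3f}s')
```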

For posterity's sake, when those alerts were introduced back in 2017-12-11, they were added in the spirit of "We have no experience with this, we don't know what exactly to alert on, so let's monitor and alert on all latency increases and improve from there".

The thresholds have been bumped a number of times since then, namely in 50fc9afe2489a4 and bacbc62d909, but more in a reactionary manner than as a re-evaluation. T219696 has also been opened, and has been resolved since the root cause was identified (the latter git commit above was the resolution).

Event Timeline

Restricted Application added a subscriber: Aklapper. · Apr 12 2019, 11:27 AM
akosiaris triaged this task as High priority. · Apr 12 2019, 11:47 AM

Today (2019-04-12), I've raised the possibility that T220661 is related to the reason these alerts are flapping so much.

Judging from
https://grafana.wikimedia.org/d/000000436/kubernetes-kubelets?orgId=1&from=1555033207152&to=1555035134345 for eqiad, it becomes apparent that some pod needed to be stopped and that took 30s (this needs to be investigated more, probably in T220661; I am guessing SIGKILL was required after 30s), with another 3.9s for the replacement pod to be created. This bumps the kubernetes1001 latencies close to 300ms, which isn't enough to alert in this specific hand-picked case, but it could indicate a contributing factor.
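To make the arithmetic concrete: a single very slow operation can drag a per-node aggregate up noticeably. The sketch below uses made-up values for the operation count and baseline latency (only the 30s stop and 3.9s create come from the dashboard), just to show one way such a spike can land an aggregate latency near 300ms.

```
# Purely illustrative arithmetic: the operation count and baseline latency are
# made-up numbers; only the ~30s stop and 3.9s create come from the dashboard.
fast_ops = 114                  # hypothetical number of "normal" operations in the window
fast_latency_s = 0.005          # hypothetical baseline latency per operation
slow_latencies_s = [30.0, 3.9]  # the pod stop and replacement-pod create seen above

total_s = fast_ops * fast_latency_s + sum(slow_latencies_s)
mean_s = total_s / (fast_ops + len(slow_latencies_s))
print(f"mean operation latency: {mean_s * 1000:.0f}ms")  # ~297ms
```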

jijiki added a subscriber: jijiki. · Apr 12 2019, 12:08 PM
akosiaris added a comment (edited). · Apr 12 2019, 12:32 PM

A breakdown of the alerts per host, from 2019-03-26 to 2019-04-12, follows:

89 instance=kubernetes2001
84 instance=kubernetes1001
74 instance=kubernetes2002
70 instance=kubernetes2004
62 instance=kubernetes1002
49 instance=kubernetes2003
48 instance=kubernetes1004
48 instance=kubernetes1003
DC      # alerts
codfw   282
eqiad   242
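For reproducibility, the per-DC totals and the relative difference quoted below fall straight out of the per-host counts above (1xxx hosts are eqiad, 2xxx hosts are codfw):

```
# Aggregate the per-host alert counts above into per-DC totals.
per_host = {
    "kubernetes2001": 89, "kubernetes1001": 84, "kubernetes2002": 74,
    "kubernetes2004": 70, "kubernetes1002": 62, "kubernetes2003": 49,
    "kubernetes1004": 48, "kubernetes1003": 48,
}

per_dc = {"eqiad": 0, "codfw": 0}
for host, count in per_host.items():
    per_dc["eqiad" if host.startswith("kubernetes1") else "codfw"] += count

print(per_dc)  # {'eqiad': 242, 'codfw': 282}
print(f"codfw alerted {per_dc['codfw'] / per_dc['eqiad'] - 1:.1%} more often")  # 16.5% more often
```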

So codfw seems to be flapping 16% more often. This is reinforced by https://grafana.wikimedia.org/d/000000436/kubernetes-kubelets?panelId=26&fullscreen&orgId=1&from=now-30d&to=now-1m and https://grafana.wikimedia.org/d/000000436/kubernetes-kubelets?panelId=25&fullscreen&orgId=1&from=now-30d&to=now-1m which point to a greater variance in, as well as a higher mean execution time for, exec_sync operations.

Namely, since 2019-04-09 (which is the point after which eqiad's pattern changes considerably):

DC      avg    max    min    total
eqiad   1.63   12.97  1.35   290
codfw   3.27   7.43   1.33   581

The max for eqiad is actually a spike, whereas in codfw it does look like a periodic increase.
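For the record, stats like the ones in the table above can be pulled out of Prometheus directly; a sketch follows. The per-site endpoint URLs, the metric name and the quantile are assumptions, not necessarily what the Grafana dashboard or the production check uses.

```
# Sketch of pulling per-DC exec_sync latency stats out of Prometheus.
# The endpoint URLs, metric name (pre-1.14 kubelet summary) and quantile are
# assumptions; they are not copied from the dashboard or the production check.
import time
import requests

SITES = {
    "eqiad": "http://prometheus.svc.eqiad.wmnet/k8s/api/v1/query_range",  # assumed URL
    "codfw": "http://prometheus.svc.codfw.wmnet/k8s/api/v1/query_range",  # assumed URL
}
QUERY = 'kubelet_runtime_operations_latency_microseconds{operation_type="exec_sync",quantile="0.9"} / 1e6'

end = time.time()
start = end - 3 * 24 * 3600  # roughly "since 2019-04-09" at the time of the comment

for site, url in SITES.items():
    resp = requests.get(url, params={"query": QUERY, "start": start, "end": end, "step": "5m"},
                        timeout=30)
    resp.raise_for_status()
    values = [float(v) for series in resp.json()["data"]["result"] for _, v in series["values"]]
    if values:
        print(f"{site}: avg={sum(values)/len(values):.2f} max={max(values):.2f} "
              f"min={min(values):.2f} total={sum(values):.0f}")
```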

I am thinking about excluding exec_sync operations for a while from the checks to restore faith in the alerts.

Change 503344 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Omit exec_sync operations from kubelet alerts

https://gerrit.wikimedia.org/r/503344
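(Not the contents of change 503344, which lives in puppet; just an illustration of how excluding one operation type typically boils down to a negative label matcher in the alerting query. The metric and label names here are assumptions.)

```
# Illustration only: excluding one operation type from a latency alert query
# usually means adding a negative label matcher. Metric/label names assumed.
ALERT_QUERY_BEFORE = (
    'kubelet_runtime_operations_latency_microseconds{quantile="0.9"} / 1e6'
)
ALERT_QUERY_AFTER = (
    'kubelet_runtime_operations_latency_microseconds'
    '{quantile="0.9",operation_type!="exec_sync"} / 1e6'
)
print(ALERT_QUERY_AFTER)
```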


FWIW, it might make sense to create multiple operational latency alerts, one per operation type that we care about (or per group of operation types, alternatively).
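Something along these lines, sketched with hypothetical groupings and thresholds (the operation type names are examples, not a vetted list, and no such rules were actually implemented):

```
# Hypothetical sketch of per-operation-type alerting: separate thresholds per
# group of operation types instead of one blanket threshold. Groupings,
# thresholds and operation type names are examples, not an implemented proposal.
THRESHOLDS_SECONDS = {
    ("exec_sync", "pull_image"): 10.0,                               # slow by nature
    ("create_container", "start_container", "stop_container"): 3.0,  # lifecycle ops
    ("list_containers", "container_status"): 0.5,                    # cheap status calls
}

for op_types, threshold in THRESHOLDS_SECONDS.items():
    selector = "|".join(op_types)
    print(f'alert if kubelet_runtime_operations_latency_microseconds'
          f'{{quantile="0.9",operation_type=~"{selector}"}} / 1e6 > {threshold}')
```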

Change 503344 merged by Alexandros Kosiaris:
[operations/puppet@production] Omit exec_sync operations from kubelet alerts

https://gerrit.wikimedia.org/r/503344

akosiaris lowered the priority of this task from High to Low. · Apr 12 2019, 1:10 PM

Change merged and shepherded into production. I am lowering the priority but not resolving, as we probably want to evaluate this more.