Page MenuHomePhabricator

Alert "kubelet operational latencies"
Closed, ResolvedPublic

Description

For the past 2 days, the following is flooding wikimedia-operations for about 3 hours long, twice on both days:

PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1

Thresholds in Puppet (source): 300 ms (warning), 450 ms (critical).

Dashboard:

Last 24 hours
Last 7 daysEqiad/Codfw (last 7 days)

Event Timeline

Krinkle created this task.Mar 30 2019, 3:49 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 30 2019, 3:49 AM
Krinkle updated the task description. (Show Details)Mar 30 2019, 3:50 AM
akosiaris added a subscriber: akosiaris.
crusnov added a subscriber: crusnov.Apr 2 2019, 2:08 PM

Minor suggestion, perhaps we could increase the alert threshold if operation isn't actually affected at these levels. Quite often kubelet will sit on the alert threshold and flap alerts.

Mentioned in SAL (#wikimedia-operations) [2019-04-02T16:00:23Z] <mutante> icinga - schedule (30d) downtime for kubernetes operational latencies alerts (T219696) on kubernetes1004

akosiaris closed this task as Resolved.EditedApr 3 2019, 7:54 PM

Culprit identified.

On Thu Mar 28 15:07:55 2019 a new version of the eventgate-analytics chart was deployed to both codfw and eqiad. That new version introduced a different readinessProbe. That probe was added in 66a62a59a6a7d050aa8a. This is a valid use case and pattern and was thoroughly discussed before being introduced. That form of the readiness probe uses the exec_sync operation. The latencies across all hosts increased for this operation, from ~0 to ~2.5s. This is unknown ground for us, but the value doesn't seem unreasonable given that it involves producing a test event to kafka. I think that bacbc62d90 is indeed the way to go on this one. Thanks @crusnov.