
Kubernetes pods are being periodically evicted because of Disk Space pressure caused by cpjobqueue
Stalled, Lowest priority, Public

Description

I've been looking a bit more into the fact that we seem to have many Evicted[1] pods in our clusters. Per the docs, this is expected and is essentially a self-healing mechanism: when a node is under pressure, pods are evicted and scheduled on other nodes. However, I got to wondering why the pressure happens in the first place, and after digging a bit more into e.g. kubernetes2002 I got:

kubectl describe node kubernetes2002.codfw.wmnet
[snip]
OutOfDisk        False   Wed, 30 Sep 2020 14:08:40 +0000   Fri, 17 Jul 2020 13:56:09 +0000   KubeletHasSufficientDisk     kubelet has sufficient disk space available
MemoryPressure   False   Wed, 30 Sep 2020 14:08:40 +0000   Fri, 17 Jul 2020 13:56:09 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
DiskPressure     False   Wed, 30 Sep 2020 14:08:40 +0000   Tue, 29 Sep 2020 08:48:24 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
PIDPressure      False   Wed, 30 Sep 2020 14:08:40 +0000   Wed, 18 Dec 2019 13:08:21 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
Ready            True    Wed, 30 Sep 2020 14:08:40 +0000   Wed, 09 Sep 2020 08:45:56 +0000   KubeletReady                 kubelet is posting ready status
[snip]

That disk pressure on the 29th coincides with this https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=28&orgId=1&from=1601334645429&to=1601375283466&var-server=kubernetes2002&var-datasource=thanos&var-cluster=kubernetes
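
As an aside, a quick cluster-wide list of the evicted pods mentioned above can be obtained with something along these lines (just a sketch; evicted pods end up in the Failed phase and kubectl reports them with an "Evicted" status):

deploy1001$ kubectl get pods --all-namespaces --field-selector=status.phase=Failed | grep Evicted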

Looking into it a bit more, this is a recurring pattern:

https://grafana.wikimedia.org/d/TSqGhbKGz/kubernetes-changeprop-killing?viewPanel=2&orgId=1.

Note that in that graph there are at least 2 different patterns regarding "/" space; I am focusing on just one of the two for now.

Correlating a bit and figuring out which node is currently showing increasing disk space usage, I managed to figure out the following:

kubernetes2007$ du -h /var/lib/docker/containers/216995e1131f880290ade24e7fa1cf7d6d00cd6bff03466e6279e46dd2f4f41d/216995e1131f880290ade24e7fa1cf7d6d00cd6bff03466e6279e46dd2f4f41d-json.log
4.4G	/var/lib/docker/containers/216995e1131f880290ade24e7fa1cf7d6d00cd6bff03466e6279e46dd2f4f41d/216995e1131f880290ade24e7fa1cf7d6d00cd6bff03466e6279e46dd2f4f41d-json.log
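
For reference, a generic way to spot which container logs are eating the disk on a node like this is something like the following sketch, assuming the default json-file Docker logging driver and root access on the node:

kubernetes2007$ sudo du -sh /var/lib/docker/containers/*/*-json.log 2>/dev/null | sort -h | tail -n 5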

4.4G seems like a lot for a pod whose log only starts at 2020-09-30T11:19:04.543Z this morning. And sure enough:

deploy1001$ kubectl logs changeprop-production-5657c7f9f7-6ml2c changeprop-production|jq .message | sort | uniq -c | sort -rn | head -50
1867744 "Old event processed"
     65 "Error during deduplication"
     24 "Subscribing based on basic topics"
      8 "Exec error in cpjobqueue"
      4 null
      1 "Subscribing based on regex"

A quick look into one of the "Old event processed" entries gives:

{
  "name": "cpjobqueue",
  "hostname": "changeprop-production-5657c7f9f7-6ml2c",
  "pid": 141,
  "level": "WARN",
  "message": "Old event processed",
  "event_str": "{\"$schema\":\"/mediawiki/job/1.0.0\",\"meta\":{\"uri\":\"https://placeholder.invalid/wiki/Special:Badtitle\",\"request_id\":\"a6b348bc-ddf6-43d7-ab6e-c95a05a673f9\",\"id\":\"92ae1a2e-2c94-4026-ab17-0e86a2b68b20\",\"dt\":\"2020-09-23T18:22:03Z\",\"domain\":\"www.wikidata.org\",\"stream\":\"cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite\"},\"database\":\"wikidatawiki\",\"type\":\"cirrusSearchElasticaWrite\",\"sha1\":\"cf374dcb968a0cb4bc1cb0c4454f42a2f7d5a78f\",\"params\":{\"method\":\"sendData\",\"arguments\":[\"content\",[{\"data\":{\"version\":1281239629,\"wiki\":\"wikidatawiki\",\"namespace\":0,\"namespace_text\":\"\",\"title\":\"Q3514444\",\"timestamp\":\"2020-09-23T18:22:02Z\",\"create_timestamp\":\"2013-01-22T06:26:09Z\",\"redirect\":[],\"incoming_links\":1},\"params\":{\"_id\":\"3346458\",\"_type\":\"\",\"_index\":\"\",\"_cirrus_hints\":{\"BuildDocument_flags\":0,\"noop\":{\"version\":\"documentVersion\",\"incoming_links\":\"within 20%\"}}},\"upsert\":true}]],\"cluster\":\"cloudelastic\",\"createdAt\":1600885323,\"errorCount\":0,\"retryCount\":0,\"requestId\":\"a6b348bc-ddf6-43d7-ab6e-c95a05a673f9\"},\"mediawiki_signature\":\"e7a227104fbd3b0bbf97fd600180b646df862281\"}",
  "stream": "cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite",
  "rule_name": "cirrusSearchElasticaWrite",
  "executor": "RuleExecutor",
  "levelPath": "warn/oldevent",
  "msg": "Old event processed",
  "time": "2020-09-30T11:21:16.717Z",
  "v": 0
}
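
Out of curiosity, since every such entry embeds the original event in event_str, the age of the events being reprocessed can be pulled straight out of the log. A rough sketch (assuming every log line is valid JSON, as above) that counts "Old event processed" entries by the day of the original event:

deploy1001$ kubectl logs changeprop-production-5657c7f9f7-6ml2c changeprop-production \
    | jq -r 'select(.message == "Old event processed") | .event_str | fromjson | .meta.dt' \
    | cut -c1-10 | sort | uniq -c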

My gut feeling is that there is no way 1.8M log entries in the space of 5 hours are of any use to anyone. I'd suggest we lower this message to INFO instead of WARN and bump the config to not log INFO. For data's sake, here's the allocation per level in that pod:

deploy1001$ kubectl -n changeprop-jobqueue logs changeprop-production-5657c7f9f7-6ml2c changeprop-production | jq .level | sort | uniq -c | sort -rn
2362407 "WARN"
     65 "ERROR"
     40 "INFO"

[1] https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Sep 30 2020, 2:30 PM
akosiaris updated the task description. (Show Details) · Sep 30 2020, 2:33 PM
hnowlan claimed this task. · Sep 30 2020, 2:43 PM

Change 631200 had a related patch set uploaded (by Hnowlan; owner: Hnowlan):
[operations/deployment-charts@master] changeprop: lower log level to error

https://gerrit.wikimedia.org/r/631200

akosiaris updated the task description. (Show Details) · Sep 30 2020, 3:51 PM

Change 631200 merged by jenkins-bot:
[operations/deployment-charts@master] changeprop: lower log level to error

https://gerrit.wikimedia.org/r/631200

The change above for the changeprop service seems to have solved the daily saw-like pattern.

That allowed the 2nd pattern I alluded to earlier in the task to become visible. The 3 hosts that seem to converge together on the right-hand side are all running cpjobqueue, so I guess a similar change is required for cpjobqueue.

Change 632289 had a related patch set uploaded (by Hnowlan; owner: Hnowlan):
[operations/deployment-charts@master] changeprop-jobqueue: Turn down loglevel

https://gerrit.wikimedia.org/r/632289

Change 632289 merged by jenkins-bot:
[operations/deployment-charts@master] changeprop-jobqueue: Turn down loglevel

https://gerrit.wikimedia.org/r/632289

cpjobqueue's log level has now been turned down as well.
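
As a quick spot check alongside the Grafana graphs, "/" usage on the previously affected nodes can also be eyeballed directly; with the log levels lowered it should stay roughly flat rather than saw-tooth (a sketch):

kubernetes2007$ df -h /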

akosiaris changed the task status from Open to Stalled. · Oct 6 2020, 9:07 AM
akosiaris triaged this task as Lowest priority.

I think the effect is already showing in the graphs.

I'll switch this to stalled and let it be for a week or so (maybe some third pattern will slowly arise) and then resolve.

Many thanks for taking care of this!

BPirkle added a subscriber: BPirkle.

Please retag Platform Engineering if it turns out we are needed.