I've been looking a bit more into the fact that we seem to have many Evicted[1] pods in our clusters. Per the docs, this is expected and essentially a self-healing mechanism: when a node is under pressure, pods are evicted and rescheduled on other nodes. However, I started wondering why the pressure arises in the first place, and after digging into why this is happening on e.g. kubernetes2002 I got:
```
kubectl describe node kubernetes2002.codfw.wmnet
[snip]
OutOfDisk False Wed, 30 Sep 2020 14:08:40 +0000 Fri, 17 Jul 2020 13:56:09 +0000 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Wed, 30 Sep 2020 14:08:40 +0000 Fri, 17 Jul 2020 13:56:09 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 30 Sep 2020 14:08:40 +0000 Tue, 29 Sep 2020 08:48:24 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Wed, 30 Sep 2020 14:08:40 +0000 Wed, 18 Dec 2019 13:08:21 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Wed, 30 Sep 2020 14:08:40 +0000 Wed, 09 Sep 2020 08:45:56 +0000 KubeletReady kubelet is posting ready status
[snip]
```
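For the record, evicted pods are easy to enumerate cluster-wide, since an eviction leaves the pod behind in phase `Failed` with `status.reason` set to `Evicted`. The flags below are standard kubectl, not part of the original investigation:

```shell
# List evicted pods across all namespaces; evicted pods linger in
# phase Failed with status.reason "Evicted" until cleaned up.
kubectl get pods --all-namespaces --field-selector=status.phase=Failed \
  -o custom-columns=NS:.metadata.namespace,POD:.metadata.name,REASON:.status.reason \
  | grep -i evicted
```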
That DiskPressure transition on the 29th coincides with this: https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=28&orgId=1&from=1601334645429&to=1601375283466&var-server=kubernetes2002&var-datasource=thanos&var-cluster=kubernetes
Looking into it a bit more, this is a pattern:
{F32368895} https://grafana.wikimedia.org/d/TSqGhbKGz/kubernetes-changeprop-killing?viewPanel=2&orgId=1
Note that the graph shows at least 2 different patterns regarding / space; I am focusing on just 1 of the two for now.
By correlating a bit and figuring out which node is currently seeing increasing disk space usage, I found the following:
```
kubernetes2007$ du -h /var/lib/docker/containers/216995e1131f880290ade24e7fa1cf7d6d00cd6bff03466e6279e46dd2f4f41d/216995e1131f880290ade24e7fa1cf7d6d00cd6bff03466e6279e46dd2f4f41d-json.log
4.4G /var/lib/docker/containers/216995e1131f880290ade24e7fa1cf7d6d00cd6bff03466e6279e46dd2f4f41d/216995e1131f880290ade24e7fa1cf7d6d00cd6bff03466e6279e46dd2f4f41d-json.log
```
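The same hunt can be repeated without guessing container IDs. Assuming the default `json-file` log driver and Docker's standard on-disk layout (the path is generic, not taken from the original investigation), something like:

```shell
# Rank container stdout/stderr log files by size; the json-file driver
# stores them under /var/lib/docker/containers/<id>/<id>-json.log.
du -h /var/lib/docker/containers/*/*-json.log 2>/dev/null \
  | sort -rh | head -5
```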
4.4G seems like a lot for a pod that has only been logging since `2020-09-30T11:19:04.543Z` this morning. And sure enough:
```
deploy1001$ kubectl logs changeprop-production-5657c7f9f7-6ml2c changeprop-production|jq .message | sort | uniq -c | sort -rn | head -50
1867744 "Old event processed"
65 "Error during deduplication"
24 "Subscribing based on basic topics"
8 "Exec error in cpjobqueue"
4 null
1 "Subscribing based on regex"
```
A quick look at 1 of the "Old event processed" entries:
```lang=json
{
"name": "cpjobqueue",
"hostname": "changeprop-production-5657c7f9f7-6ml2c",
"pid": 141,
"level": "WARN",
"message": "Old event processed",
"event_str": "{\"$schema\":\"/mediawiki/job/1.0.0\",\"meta\":{\"uri\":\"https://placeholder.invalid/wiki/Special:Badtitle\",\"request_id\":\"a6b348bc-ddf6-43d7-ab6e-c95a05a673f9\",\"id\":\"92ae1a2e-2c94-4026-ab17-0e86a2b68b20\",\"dt\":\"2020-09-23T18:22:03Z\",\"domain\":\"www.wikidata.org\",\"stream\":\"cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite\"},\"database\":\"wikidatawiki\",\"type\":\"cirrusSearchElasticaWrite\",\"sha1\":\"cf374dcb968a0cb4bc1cb0c4454f42a2f7d5a78f\",\"params\":{\"method\":\"sendData\",\"arguments\":[\"content\",[{\"data\":{\"version\":1281239629,\"wiki\":\"wikidatawiki\",\"namespace\":0,\"namespace_text\":\"\",\"title\":\"Q3514444\",\"timestamp\":\"2020-09-23T18:22:02Z\",\"create_timestamp\":\"2013-01-22T06:26:09Z\",\"redirect\":[],\"incoming_links\":1},\"params\":{\"_id\":\"3346458\",\"_type\":\"\",\"_index\":\"\",\"_cirrus_hints\":{\"BuildDocument_flags\":0,\"noop\":{\"version\":\"documentVersion\",\"incoming_links\":\"within 20%\"}}},\"upsert\":true}]],\"cluster\":\"cloudelastic\",\"createdAt\":1600885323,\"errorCount\":0,\"retryCount\":0,\"requestId\":\"a6b348bc-ddf6-43d7-ab6e-c95a05a673f9\"},\"mediawiki_signature\":\"e7a227104fbd3b0bbf97fd600180b646df862281\"}",
"stream": "cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite",
"rule_name": "cirrusSearchElasticaWrite",
"executor": "RuleExecutor",
"levelPath": "warn/oldevent",
"msg": "Old event processed",
"time": "2020-09-30T11:21:16.717Z",
"v": 0
}
```
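A back-of-the-envelope check ties the numbers together: an entry like the one above weighs in at roughly 2.5 KB because it embeds the verbatim `event_str` payload, and dividing the 4.4G file by the 1,867,744 "Old event processed" lines lands in the same ballpark (figures taken from the output above):

```shell
# Average bytes per "Old event processed" line, treating the 4.4G file
# as if it held only those entries (counts from the jq output above).
awk 'BEGIN { printf "%.0f\n", 4.4 * 1024^3 / 1867744 }'
# prints a value around 2500, i.e. ~2.5 KB per line
```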
My gut feeling is that 1.8M log entries in the space of 5 hours are of no use to anyone. I'd suggest we lower this message to INFO instead of WARN and bump the config to not log INFO. For data's sake, here's the allocation per level in that pod:
```
deploy1001$ kubectl -n changeprop-jobqueue logs changeprop-production-5657c7f9f7-6ml2c changeprop-production | jq .level | sort | uniq -c | sort -rn
2362407 "WARN"
65 "ERROR"
40 "INFO"
```
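Independent of lowering the log level, capping the `json-file` driver would bound the worst case per container. Docker's standard rotation options in `/etc/docker/daemon.json` look like this (the sizes are illustrative, not a vetted recommendation for our clusters):

```lang=json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "256m",
    "max-file": "3"
  }
}
```

A hard cap truncates history, though, so it complements rather than replaces fixing the noisy logger.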
[1] https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/