I've been looking a bit more into the fact that we seem to have many Evicted[1] pods in our clusters. Per the docs, this is expected and essentially a self-healing mechanism: when a node is under pressure, pods are evicted and rescheduled on other nodes. However, I started wondering why the pressure arises in the first place, and after digging into why this is happening on e.g. kubernetes2002 I got:
```
kubectl describe node kubernetes2002.codfw.wmnet
[snip]
OutOfDisk False Wed, 30 Sep 2020 14:08:40 +0000 Fri, 17 Jul 2020 13:56:09 +0000 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Wed, 30 Sep 2020 14:08:40 +0000 Fri, 17 Jul 2020 13:56:09 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 30 Sep 2020 14:08:40 +0000 Tue, 29 Sep 2020 08:48:24 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Wed, 30 Sep 2020 14:08:40 +0000 Wed, 18 Dec 2019 13:08:21 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Wed, 30 Sep 2020 14:08:40 +0000 Wed, 09 Sep 2020 08:45:56 +0000 KubeletReady kubelet is posting ready status
[snip]
```
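For the record, evicted pods are easy to enumerate cluster-wide, since an eviction leaves the pod behind in phase `Failed` with `status.reason` set to `Evicted`. The flags below are standard kubectl, not part of the original investigation:

```shell
# List evicted pods across all namespaces; evicted pods linger in
# phase Failed with status.reason "Evicted" until cleaned up.
kubectl get pods --all-namespaces --field-selector=status.phase=Failed \
  -o custom-columns=NS:.metadata.namespace,POD:.metadata.name,REASON:.status.reason \
  | grep -i evicted
```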
That DiskPressure transition on the 29th coincides with this: https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=28&orgId=1&from=1601334645429&to=1601375283466&var-server=kubernetes2002&var-datasource=thanos&var-cluster=kubernetes
Looking into it a bit more, this is a pattern:
{F32368895} https://grafana.wikimedia.org/d/TSqGhbKGz/kubernetes-changeprop-killing?viewPanel=2&orgId=1
Note that the graph shows at least 2 different patterns regarding / space; I am focusing on just 1 of the two for now.
By correlating a bit and figuring out which node is currently seeing increasing disk space usage, I found the following:
```
kubernetes2007$ du -h /var/lib/docker/containers/216995e1131f880290ade24e7fa1cf7d6d00cd6bff03466e6279e46dd2f4f41d/216995e1131f880290ade24e7fa1cf7d6d00cd6bff03466e6279e46dd2f4f41d-json.log
4.4G /var/lib/docker/containers/216995e1131f880290ade24e7fa1cf7d6d00cd6bff03466e6279e46dd2f4f41d/216995e1131f880290ade24e7fa1cf7d6d00cd6bff03466e6279e46dd2f4f41d-json.log
```
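The same hunt can be repeated without guessing container IDs. Assuming the default `json-file` log driver and Docker's standard on-disk layout (the path is generic, not taken from the original investigation), something like:

```shell
# Rank container stdout/stderr log files by size; the json-file driver
# stores them under /var/lib/docker/containers/<id>/<id>-json.log.
du -h /var/lib/docker/containers/*/*-json.log 2>/dev/null \
  | sort -rh | head -5
```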
4.4G seems like a lot for a pod that has only been logging since `2020-09-30T11:19:04.543Z` this morning. And sure enough:
```
deploy1001$ kubectl logs changeprop-production-5657c7f9f7-6ml2c changeprop-production|jq .message | sort | uniq -c | sort -rn | head -50
1867744 "Old event processed"
65 "Error during deduplication"
24 "Subscribing based on basic topics"
8 "Exec error in cpjobqueue"
4 null
1 "Subscribing based on regex"
```
A quick look at 1 of the "Old event processed" entries:
```lang=json
{
"name": "cpjobqueue",
"hostname": "changeprop-production-5657c7f9f7-6ml2c",
"pid": 141,
"level": "WARN",
"message": "Old event processed",
"event_str": "{\"$schema\":\"/mediawiki/job/1.0.0\",\"meta\":{\"uri\":\"https://placeholder.invalid/wiki/Special:Badtitle\",\"request_id\":\"a6b348bc-ddf6-43d7-ab6e-c95a05a673f9\",\"id\":\"92ae1a2e-2c94-4026-ab17-0e86a2b68b20\",\"dt\":\"2020-09-23T18:22:03Z\",\"domain\":\"www.wikidata.org\",\"stream\":\"cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite\"},\"database\":\"wikidatawiki\",\"type\":\"cirrusSearchElasticaWrite\",\"sha1\":\"cf374dcb968a0cb4bc1cb0c4454f42a2f7d5a78f\",\"params\":{\"method\":\"sendData\",\"arguments\":[\"content\",[{\"data\":{\"version\":1281239629,\"wiki\":\"wikidatawiki\",\"namespace\":0,\"namespace_text\":\"\",\"title\":\"Q3514444\",\"timestamp\":\"2020-09-23T18:22:02Z\",\"create_timestamp\":\"2013-01-22T06:26:09Z\",\"redirect\":[],\"incoming_links\":1},\"params\":{\"_id\":\"3346458\",\"_type\":\"\",\"_index\":\"\",\"_cirrus_hints\":{\"BuildDocument_flags\":0,\"noop\":{\"version\":\"documentVersion\",\"incoming_links\":\"within 20%\"}}},\"upsert\":true}]],\"cluster\":\"cloudelastic\",\"createdAt\":1600885323,\"errorCount\":0,\"retryCount\":0,\"requestId\":\"a6b348bc-ddf6-43d7-ab6e-c95a05a673f9\"},\"mediawiki_signature\":\"e7a227104fbd3b0bbf97fd600180b646df862281\"}",
"stream": "cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite",
"rule_name": "cirrusSearchElasticaWrite",
"executor": "RuleExecutor",
"levelPath": "warn/oldevent",
"msg": "Old event processed",
"time": "2020-09-30T11:21:16.717Z",
"v": 0
}
```
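A back-of-the-envelope check ties the numbers together: an entry like the one above weighs in at roughly 2.5 KB because it embeds the verbatim `event_str` payload, and dividing the 4.4G file by the 1,867,744 "Old event processed" lines lands in the same ballpark (figures taken from the output above):

```shell
# Average bytes per "Old event processed" line, treating the 4.4G file
# as if it held only those entries (counts from the jq output above).
awk 'BEGIN { printf "%.0f\n", 4.4 * 1024^3 / 1867744 }'
# prints a value around 2500, i.e. ~2.5 KB per line
```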
My gut feeling is that 1.8M log entries in the space of 5 hours are of no use to anyone. I'd suggest we lower this message to INFO instead of WARN and bump the config to not log INFO. For data's sake, here's the allocation per level in that pod:
```
deploy1001$ kubectl -n changeprop-jobqueue logs changeprop-production-5657c7f9f7-6ml2c changeprop-production | jq .level | sort | uniq -c | sort -rn
2362407 "WARN"
65 "ERROR"
40 "INFO"
```
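Independent of lowering the log level, capping the `json-file` driver would bound the worst case per container. Docker's standard rotation options in `/etc/docker/daemon.json` look like this (the sizes are illustrative, not a vetted recommendation for our clusters):

```lang=json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "256m",
    "max-file": "3"
  }
}
```

A hard cap truncates history, though, so it complements rather than replaces fixing the noisy logger.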
[1] https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/