Change Details

= Intro/Summary = During the investigation of the sessionstore incident [1] on 2020-06-11 we observed that the kubernetes cluster was under IO load. Considering back then the issue as a contributing factor to the incident, we added to the action items[2] and went on to perform an investigation on the next day. This task has been filed after the investigation so it's a summary/recap of the actions taken. With hindsight being 20/20, it seems that the issue did not have an exacerbating factor in the issue, nevertheless the issue was dealt with. The IOwait across the cluster is visible during the timeperiod in {F31874333} = Investigation = On the next day of the sessionstore incident, we went ahead and tried to find out what caused the IO wait on the kubernetes nodes. Interestingly, the load seemed to shift between nodes and was always local to 1 node. It also seemed to be transient, making it difficult to pinpoint. Using different methods, 2 different SRE people managed to go ahead and figure out that it was caused by changeprop, and in fact always 1 specific changeprop pod. In changeprop, each worker is responsible for 1 topic. When @hnowlan, joined the investigation the rest of the team became aware of the following, that a specific topic had an increasing backlog and the corresponding pod was being frequently killed for OOM errors (out of memory errors). The graph of the backlog is below {F31874747} It also became evident pretty quickly that changeprop was also routinely CPU throttled. Graph below {F31874753} A plan was hatched. It was first to bump CPU to counter throttling, and depending on the outcome proceed with a memory increase. The first efforts to resolve the issue went ahead and @akosiaris went ahead and deploy 2 different CPU changes to increase the CPU limits for changeprop. Those were 078593121fe89dc019936c201e3ccca0de6dd4b7 (100% increase) and (72a08b3554d008d7a05d6a90e339af39ce8f64cc). Things started getting better. Overall CPU decreased as less throttling was happening as well as memory usage. The issue was still persisting though. Nobody could yet account for the high IOwait. When @Pchelolo joined, it became clear that 1 specific worker was dying repeatedly. It was in the same pod that caused the issues and was being OOM killed every now and then. The worker was being killed by the service-runner master because it failed to answer to heartbeat requests. The plausible cause for not answering to heartbeat requests being the excessive throttling. However on every worker restart, a lot of code had to be loaded from the disk, as nodejs loads libraries and code dynamically and not on startup. Given the throttling and the fact that the worker had to wait because of it, this manifested as IOwait. A couple of more things contributed to the problem. * The CPU throttling caused memory increases as the worker was unable to execute the jobs fast enough. * Memory limits for that worker were at 1G (as the rest of changeprop) and eventually the limit was being reached. When that happened, GC would go into a pathological situation were it was actively trying to perform garbage collections, only to be throttled causing more of the same. With ada1e3734982, @hnowlan increased the memory limits as well, at which point the issue stopped. The effect is evident also in {F31874830} where it's obvious that once the memory limit stopped being hit, the remaining CPU throttling stopped as well. The backlog of jobs cleared over the weekend and the service is now functioning normally. = Caveat = One thing that made this more confusing to reason about and debug is that, since every worker/pod cares to 1 topic, resource usage across the various pods is not uniform. So grafana graphs that used methods like `avg()` failed to catch the outlier in CPU and memory usage which was that specific offending pod. GC duration graph before and after the changes is below {F31874799} = Conclusion = Changeprop was being CPU throttled and memory limited, but only for 1 of the pods and thus only 1 of the jobs (there is a 1-1 mapping between pod, worker and job/kafka topic). = References = [1] https://wikitech.wikimedia.org/wiki/Incident_documentation/20200611-sessionstore%2Bkubernetes#Actionables [1] https://grafana.wikimedia.org/d/000000607/cluster-overview?panelId=87&fullscreen&orgId=1&from=1590449757220&to=1592001815401&var-datasource=eqiad%20prometheus%2Fops&var-cluster=kubernetes&var-instance=All

= Intro/Summary = During the investigation of the sessionstore incident [1] on 2020-06-11 we observed that the kubernetes cluster was under IO load. Considering back then the issue as a contributing factor to the incident, we added to the action items[2] and went on to perform an investigation on the next day. This task has been filed after the investigation so it's a summary/recap of the actions taken. With hindsight being 20/20, it seems that the issue did not have an exacerbating factor in the issue, nevertheless the issue was dealt with. The IOwait across the cluster is visible during the timeperiod in {F31874333} = Investigation = On the next day of the sessionstore incident, we went ahead and tried to find out what caused the IO wait on the kubernetes nodes. Interestingly, the load seemed to shift between nodes and was always local to 1 node. It also seemed to be transient, making it difficult to pinpoint. Using different methods, 2 different SRE people managed to go ahead and figure out that it was caused by changeprop, and in fact always 1 specific changeprop pod. In changeprop, each worker is responsible for 1 topic. When @hnowlan, joined the investigation the rest of the team became aware of the following, that a specific topic had an increasing backlog and the corresponding pod was being frequently killed for OOM errors (out of memory errors). The graph of the backlog is below {F31874747} It also became evident pretty quickly that changeprop was also routinely CPU throttled. Graph below {F31874753} A plan was hatched. It was first to bump CPU to counter throttling, and depending on the outcome proceed with a memory increase. The first efforts to resolve the issue went ahead and @akosiaris went ahead and deploy 2 different CPU changes to increase the CPU limits for changeprop. Those were 078593121fe89dc019936c201e3ccca0de6dd4b7 (100% increase) and (72a08b3554d008d7a05d6a90e339af39ce8f64cc). Things started getting better. Overall CPU decreased as less throttling was happening as well as memory usage. The issue was still persisting though. Nobody could yet account for the high IOwait. When @Pchelolo joined, it became clear that 1 specific worker was dying repeatedly. It was in the same pod that caused the issues and was being OOM killed every now and then. The worker was being killed by the service-runner master because it failed to answer to heartbeat requests. The plausible cause for not answering to heartbeat requests being the excessive throttling. However on every worker restart, a lot of code had to be loaded from the disk, as nodejs loads libraries and code dynamically and not on startup. Given the throttling and the fact that the worker had to wait because of it, this manifested as IOwait. A couple of more things contributed to the problem. * The CPU throttling caused memory increases as the worker was unable to execute the jobs fast enough. * Memory limits for that worker were at 1G (as the rest of changeprop) and eventually the limit was being reached. When that happened, GC would go into a pathological situation were it was actively trying to perform garbage collections, only to be throttled causing more of the same. With ada1e3734982, @hnowlan increased the memory limits as well, at which point the issue stopped. The effect is evident also in {F31874830} where it's obvious that once the memory limit stopped being hit, the remaining CPU throttling stopped as well. The backlog of jobs cleared over the weekend and the service is now functioning normally. = Caveat = One thing that made this more confusing to reason about and debug is that, since every worker/pod cares to 1 topic, resource usage across the various pods is not uniform. So grafana graphs that used methods like `avg()` failed to catch the outlier in CPU and memory usage which was that specific offending pod. GC duration graph before and after the changes is below {F31874799} = Actionables = * The worker heartbeat functionality failed to help and in fact was a big contributing factor to the issue. @Pchelolo/@Hnowlan have been kind enough to add support to service-runner for disabling it and even proceeded with a deploy that turned off the heartbeat. Those were https://github.com/wikimedia/service-runner/pull/229 (enabling turning off the worker heartbeat) and dc9f5918c470d33d89282faa57bec9792f6225e2 (deploying that). = Conclusion = Changeprop was being CPU throttled and memory limited, but only for 1 of the pods and thus only 1 of the jobs (there is a 1-1 mapping between pod, worker and job/kafka topic). = References = [1] https://wikitech.wikimedia.org/wiki/Incident_documentation/20200611-sessionstore%2Bkubernetes#Actionables [1] https://grafana.wikimedia.org/d/000000607/cluster-overview?panelId=87&fullscreen&orgId=1&from=1590449757220&to=1592001815401&var-datasource=eqiad%20prometheus%2Fops&var-cluster=kubernetes&var-instance=All