
Investigate the iowait issues plaguing kubernetes nodes since 2020-05-29
Closed, Resolved · Public

Description

Intro/Summary

During the investigation of the sessionstore incident [1] on 2020-06-11 we observed that the kubernetes cluster was under IO load. Considering the issue at the time to be a contributing factor to the incident, we added it to the incident's action items [1] and went on to investigate the next day. This task was filed after the investigation, so it is a summary/recap of the actions taken. With hindsight being 20/20, it seems the IO load was not actually an exacerbating factor in the incident; nevertheless, the underlying issue was tracked down and dealt with.

The IO wait across the cluster during that time period is visible in the cluster overview graphs [2].

Investigation

The day after the sessionstore incident, we tried to find out what was causing the IO wait on the kubernetes nodes. Interestingly, the load seemed to shift between nodes and was always local to 1 node. It also seemed to be transient, making it difficult to pinpoint. Using different methods, 2 SREs independently figured out that it was caused by changeprop, and in fact always by 1 specific changeprop pod.

In changeprop, each worker is responsible for 1 topic. When @hnowlan joined the investigation, the rest of the team became aware that a specific topic had an increasing backlog and that the corresponding pod was frequently being killed with OOM (out of memory) errors. The graph of the backlog is below.

It also became evident pretty quickly that changeprop was routinely being CPU throttled. Graph below.

A plan was hatched: first bump the CPU limits to counter the throttling and, depending on the outcome, proceed with a memory increase.

As a first effort to resolve the issue, @akosiaris deployed 2 different changes increasing the CPU limits for changeprop: 078593121fe89dc019936c201e3ccca0de6dd4b7 (100% increase) and 72a08b3554d008d7a05d6a90e339af39ce8f64cc. Things started getting better: overall CPU usage decreased as less throttling was happening, and memory usage dropped as well. The issue still persisted though, and nobody could yet account for the high IO wait.

When @Pchelolo joined, it became clear that 1 specific worker was dying repeatedly. It belonged to the same pod that was causing the issues and being OOM killed every now and then. The worker was being killed by the service-runner master because it failed to answer heartbeat requests, the plausible cause for the missed heartbeats being the excessive CPU throttling. On every worker restart a lot of code had to be loaded from disk, as nodejs loads libraries and code dynamically rather than all at startup. Combined with the throttling, which made the worker wait even longer, this manifested as IO wait.
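
To illustrate the mechanism, below is a minimal sketch of a master/worker heartbeat of the kind described above. This is not service-runner's actual code; the interval and timeout values and the message shape are made up for the example. The point is that a worker throttled hard enough to miss its heartbeats looks dead to the master, gets killed and reforked, and the fresh worker then has to read all of its code back in from disk.

```typescript
// Hedged sketch of a master/worker heartbeat; NOT service-runner's real implementation.
import cluster from 'cluster';

const HEARTBEAT_INTERVAL_MS = 7500;  // hypothetical values
const HEARTBEAT_TIMEOUT_MS = 15000;

if (cluster.isPrimary) {
  const lastBeat = new Map<number, number>();

  const startWorker = () => {
    const worker = cluster.fork();
    lastBeat.set(worker.id, Date.now());
    worker.on('message', (msg: { type?: string }) => {
      if (msg.type === 'heartbeat') lastBeat.set(worker.id, Date.now());
    });
  };

  startWorker();

  // Refork dead workers; every restart re-reads modules from disk,
  // which is where the IO wait described above comes from.
  cluster.on('exit', (worker) => {
    lastBeat.delete(worker.id);
    startWorker();
  });

  setInterval(() => {
    for (const [id, seen] of lastBeat) {
      // A heavily CPU-throttled worker may never get scheduled in time to
      // answer, so from the master's point of view it looks unresponsive.
      if (Date.now() - seen > HEARTBEAT_TIMEOUT_MS) cluster.workers?.[id]?.kill();
    }
  }, HEARTBEAT_INTERVAL_MS);
} else {
  // Worker: report liveness back to the master over IPC.
  setInterval(() => process.send?.({ type: 'heartbeat' }), HEARTBEAT_INTERVAL_MS);
}
```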

A couple more things contributed to the problem:

  • The CPU throttling caused memory increases as the worker was unable to execute the jobs fast enough.
  • Memory limits for that worker were at 1G (as for the rest of changeprop) and eventually the limit was being reached. When that happened, GC would go into a pathological situation where it was actively trying to perform garbage collections, only to be throttled, causing more of the same (a sketch of how such GC pauses can be observed is below).
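
For reference, GC pauses like the ones in the graph further down can be observed from inside a nodejs process with the standard perf_hooks API. The snippet below is only a minimal, self-contained sketch of how one could watch for the back-to-back collections described in the last bullet; it is not something changeprop necessarily ships.

```typescript
// Minimal sketch: log every GC pause V8 performs in this process.
import { PerformanceObserver } from 'perf_hooks';

const observer = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    // In the pathological case described above these entries arrive almost
    // back to back, each pause freeing very little memory.
    const heapUsedMb = process.memoryUsage().heapUsed / 1024 / 1024;
    console.log(`GC pause: ${entry.duration.toFixed(1)} ms, heapUsed: ${heapUsedMb.toFixed(0)} MB`);
  }
});

// 'gc' entries are emitted by Node.js itself; no extra instrumentation is needed.
observer.observe({ entryTypes: ['gc'] });
```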

With ada1e3734982, @hnowlan increased the memory limits as well, at which point the issue stopped. The effect is also evident in the graph below,

where it's obvious that once the memory limit stopped being hit, the remaining CPU throttling stopped as well.

The backlog of jobs cleared over the weekend and the service is now functioning normally.

Caveat

One thing that made this more confusing to reason about and debug is that, since every worker/pod caters to 1 topic, resource usage across the various pods is not uniform. As a result, grafana graphs that used aggregations like avg() failed to catch the outlier in CPU and memory usage, namely that specific offending pod.
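
To make that concrete, here is a hedged sketch of the difference. The Prometheus URL, the label selectors and the choice of metric (cAdvisor's container_cpu_cfs_throttled_seconds_total) are assumptions for the example rather than the exact dashboard queries used, but they show why an avg() across pods smooths a single offending pod away while a per-pod topk()/max() keeps it visible.

```typescript
// Hedged sketch: compare an averaged query with a per-pod outlier query.
// The Prometheus endpoint and label selectors below are assumptions, not the real dashboards.
const PROMETHEUS = 'http://prometheus.example.org/api/v1/query'; // hypothetical URL

// avg() across all changeprop pods: one throttled pod barely moves the line.
const averaged =
  'avg(rate(container_cpu_cfs_throttled_seconds_total{namespace="changeprop"}[5m]))';

// topk(1, ...) per pod: the offending pod stands out immediately.
const worstPod =
  'topk(1, sum by (pod) (rate(container_cpu_cfs_throttled_seconds_total{namespace="changeprop"}[5m])))';

async function instantQuery(query: string): Promise<unknown> {
  const res = await fetch(`${PROMETHEUS}?query=${encodeURIComponent(query)}`);
  const body = await res.json();
  return body.data.result; // array of series with their labels and current value
}

// Run both and compare how visible the outlier is in each result.
Promise.all([instantQuery(averaged), instantQuery(worstPod)]).then(([avg, top]) => {
  console.log('averaged:', JSON.stringify(avg));
  console.log('worst pod:', JSON.stringify(top));
});
```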

The GC duration graph before and after the changes is below.

Actionables

Conclusion

Changeprop was being CPU throttled and memory limited, but only for 1 of the pods and thus only 1 of the jobs (there is a 1-1 mapping between pod, worker and job/kafka topic).

References

[1] https://wikitech.wikimedia.org/wiki/Incident_documentation/20200611-sessionstore%2Bkubernetes#Actionables
[2] https://grafana.wikimedia.org/d/000000607/cluster-overview?panelId=87&fullscreen&orgId=1&from=1590449757220&to=1592001815401&var-datasource=eqiad%20prometheus%2Fops&var-cluster=kubernetes&var-instance=All

Event Timeline

akosiaris updated the task description.
akosiaris added subscribers: hnowlan, Pchelolo.

@JMeybohm @Pchelolo @hnowlan: This is an account of the investigation we went through for the changeprop memory/cpu limiting.
It's part of the sessionstore incident [1] actionables, although in the end it turned out to be largely unrelated. Please proofread and correct any errors I might have made.

Many thanks!

[1] https://wikitech.wikimedia.org/wiki/Incident_documentation/20200611-sessionstore%2Bkubernetes

This looks great, thanks @akosiaris! The only thing I'd note (not sure if it even warrants inclusion in the body of the report but it's an interesting correlation) is how closely CPU throttling was linked with memory maxing out in the container. The troughing seen is of course the pod being OOM killed but the peaks correspond quite closely with memory usage peaking.

> This looks great, thanks @akosiaris! The only thing I'd note (not sure if it even warrants inclusion in the body of the report but it's an interesting correlation) is how closely CPU throttling was linked with memory maxing out in the container. The troughing seen is of course the pod being OOM killed but the peaks correspond quite closely with memory usage peaking.

Indeed, added in the description. Thanks!

Thanks for writing this up @akosiaris! I think it would be nice to have the follow-up tasks linked here. Like the removal of the service-runner and splitting up changeprop into multiple deployments (one per topic?).
Maybe we should also add a follow-up to alert/warn on OOMKs / container restarts?

> Thanks for writing this up @akosiaris! I think it would be nice to have the follow-up tasks linked here. Like the removal of the service-runner and splitting up changeprop into multiple deployments (one per topic?).

Good points. I guess you mean the removal of the worker heartbeat with the first one. That's been done in https://github.com/wikimedia/service-runner/pull/229 (making it possible to turn off the worker heartbeat) and dc9f5918c470d33d89282faa57bec9792f6225e2 (deploying that).

For the latter (splitting changeprop into multiple deployments), I'll defer to @Pchelolo on whether that's an actionable they'd like to pursue.

> Maybe we should also add a follow-up to alert/warn on OOMKs / container restarts?

Yes, +1

I'll mark this as resolved; I think we've investigated this convincingly. Feel free to reopen though.