https://grafana.wikimedia.org/d/35vIuGpZk/wikifeeds?orgId=1&from=1581018813182&to=1581025628155&var-dc=eqiad%20prometheus%2Fk8s&var-service=wikifeeds is a graph of wikifeeds during yesterday's outage. CPU throttling was very aggressive there, meaning the service did not have adequate resources to serve requests in time. That ended up depooling the pods one by one until none were left to serve the load. That triggered the obvious alerts, upon which we investigated and resolved the issue by restarting all pods, as they were probably not salvageable in any decent amount of time. In fact, judging by the output of kubectl get pods, some were occasionally repooled, only to be flooded with requests once more, quickly rendered unable to serve traffic, and depooled again, a self-sustaining downward spiral that was difficult to get out of.
One interesting thing to note is that there is CPU throttling even under normal operating conditions. This should not happen either, as it puts the service under stress before any incident occurs. Bumping the limits overall should alleviate even that.
Limits have been increased to 2.5 cores. However, the app is still mildly throttled. Given the limit is about 1.5 times the current total usage, I am inclined to think this is a scheduler artifact. We've seen it before with kask and there's a lot of talk about it; it's essentially a recap of Linux kernel commit 512ac999. Interestingly, after the deploy, latencies (avg and p99) dropped by some 25ms.
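For reference, a bump like this is expressed in the service's Kubernetes resource stanza, roughly as below (key layout is illustrative; the actual values live in the service's chart, and the request/memory figures here are placeholders, not taken from this incident):

```yaml
# Illustrative resources stanza: limits raised to 2.5 cores.
resources:
  requests:
    cpu: 1000m       # placeholder value
  limits:
    cpu: 2500m       # 2.5 cores, the new limit
```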
See https://lkml.org/lkml/2019/5/17/581 and, for the overall picture, https://bugzilla.kernel.org/show_bug.cgi?id=198197
The capacity increase did not fix anything, nor did further efforts to increase requests/limits. In fact, the sum of throttled times increased by 50%, which lends more weight to the hypothesis about CFS quota issues. TL;DR: all pods, regardless of the amount of work they do, get mildly throttled because the Linux CFS scheduler's bandwidth accounting charges every chunk of quota handed out to a task, even if the task has yielded before using it.
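A toy model of that accounting (my own illustration, not the kernel code): each per-CPU runqueue pulls a fixed slice from the cgroup's global quota pool when a task runs there, and with the pre-fix behaviour the unused remainder of a slice effectively expires instead of returning to the pool. A multi-threaded service that briefly touches many CPUs can therefore drain its quota and get throttled while its real usage is well below the limit:

```python
# Toy model of the CFS bandwidth accounting issue addressed by kernel
# commit 512ac999. Numbers are illustrative, not the wikifeeds settings.
QUOTA_MS = 20      # quota per 100ms enforcement period (0.2 cores)
SLICE_MS = 5       # slice a per-CPU runqueue grabs from the global pool
NCPUS = 8

def simulate_period(per_cpu_usage_ms):
    """Return (actual_usage_ms, throttled) for one enforcement period."""
    pool = QUOTA_MS
    usage = 0.0
    throttled = False
    for need in per_cpu_usage_ms:
        remaining = need
        while remaining > 0:
            if pool <= 0:
                # Global pool is empty: the cgroup is throttled even
                # though its real CPU usage is far below the quota.
                throttled = True
                break
            grant = min(SLICE_MS, pool)
            pool -= grant              # the whole slice leaves the pool...
            used = min(remaining, grant)
            usage += used
            remaining -= used
            # ...and the unused part (grant - used) simply expires at the
            # per-CPU runqueue instead of being returned to the pool.
        if throttled:
            break
    return usage, throttled

# 8 CPUs each run the task for ~1ms: real usage is well under the 20ms
# quota, but each CPU pulls a full 5ms slice, draining the pool early.
usage, throttled = simulate_period([1.0] * NCPUS)
# usage == 4.0 (ms of real work), throttled == True
```

The point of the sketch is only that throttling here is a function of how quota is distributed across CPUs, not of how much work the pods actually do, which matches what we observed.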
The fixes are in 4.19, but we are on 4.9 and it is not certain they will be backported.
I don't think there is much more we can do for now. The situation has vastly improved, and given we also have 50% more capacity, we should be in far better shape if another incident like this happens. Also, latencies have dropped by ~25ms, especially at the p99.
I'll resolve this for now; the overall CFS issue is probably worth following up in a separate task.