wikifeeds - fix the CPU limits so that it doesn't get starved
Closed, Resolved · Public

Description

A follow-up to incident 20200206-mediawiki where
wikifeeds pods had to be killed after a MediaWiki deploy gone bad.

"Fix the CPU limits so that it doesn't get starved."

Event Timeline

Change 570726 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] wikifeeds: Redefine CPU limits

https://gerrit.wikimedia.org/r/570726

Change 570726 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] wikifeeds: Redefine CPU limits

https://gerrit.wikimedia.org/r/570726

akosiaris triaged this task as High priority.

https://grafana.wikimedia.org/d/35vIuGpZk/wikifeeds?orgId=1&from=1581018813182&to=1581025628155&var-dc=eqiad%20prometheus%2Fk8s&var-service=wikifeeds is a graph of wikifeeds during the outage yesterday. The CPU throttling there is very aggressive, meaning the service did not have adequate resources to serve requests in time. That ended up depooling the pods one by one until none were left to serve the load. That triggered the obvious alerts, upon which we investigated and resolved the issue by restarting all pods, as they were probably not salvageable in any decent amount of time. In fact, judging by the output of kubectl get pods, some were occasionally repooled, only to be flooded with requests once more, quickly rendered unable to serve more traffic, and depooled again, leading to a self-sustaining downward spiral out of which it was difficult to get.

One interesting thing to note is that even under normal operating conditions[1] there is CPU throttling as well. This should not happen either, as it already puts the service under stress. Bumping the limits overall should alleviate even that.

[1] https://grafana.wikimedia.org/d/35vIuGpZk/wikifeeds?orgId=1&from=1578547205103&to=1579767975552&var-dc=eqiad%20prometheus%2Fk8s&var-service=wikifeeds
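
For reference, the knob in question lives in the wikifeeds chart values. A rough sketch of the kind of change 570726 (and the follow-up below) makes; the key layout is from memory of the usual service chart scaffold and the numbers are placeholders, not the exact diff:

```
main_app:
  requests:
    cpu: 1500m   # placeholder; what each pod is guaranteed
  limits:
    cpu: 2500m   # placeholder; 2.5 cores is where we eventually landed, per the follow-up comment below
```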

Change 570831 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] wikifeeds: slightly lower the CPU limits

https://gerrit.wikimedia.org/r/570831

Change 570831 merged by jenkins-bot:
[operations/deployment-charts@master] wikifeeds: slightly lower the CPU limits

https://gerrit.wikimedia.org/r/570831

Limits have been increased to 2.5 cores. However, the app is still mildly throttled [1]. Given the limit is roughly 1.5 times the current total usage, I am inclined to think this is a scheduler artifact. We've seen it before with kask and there's a lot of talk about it. It's essentially a recap of Linux kernel commit 512ac999[2]. Interestingly, after the deploy, latencies dropped (avg [3], p99 [4]) by some 25ms.

[1] https://grafana.wikimedia.org/d/35vIuGpZk/wikifeeds?orgId=1&from=1581065740120&to=1581067495826&var-dc=eqiad%20prometheus%2Fk8s&var-service=wikifeeds
[2] https://lkml.org/lkml/2019/5/17/581 and overall https://bugzilla.kernel.org/show_bug.cgi?id=198197
[3] https://grafana.wikimedia.org/d/35vIuGpZk/wikifeeds?orgId=1&from=1581065740120&to=1581067495826&var-dc=eqiad%20prometheus%2Fk8s&var-service=wikifeeds&fullscreen&panelId=23
[4] https://grafana.wikimedia.org/d/35vIuGpZk/wikifeeds?orgId=1&from=now-1h&to=now&var-dc=eqiad%20prometheus%2Fk8s&var-service=wikifeeds&refresh=1m&fullscreen&panelId=34
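
For the record, the throttling in [1] comes from the cAdvisor CFS counters that prometheus/k8s already scrapes. A hypothetical recording rule (not something we deploy; the namespace label is an assumption) showing how the "fraction of periods throttled" can be computed:

```
groups:
  - name: wikifeeds_cfs_throttling
    rules:
      # Fraction of CFS enforcement periods in which wikifeeds containers hit their quota.
      # Anything persistently above zero means the limit is actively being enforced against us.
      - record: wikifeeds:container_cpu_cfs_throttled_periods:ratio
        expr: |
          sum(rate(container_cpu_cfs_throttled_periods_total{namespace="wikifeeds"}[5m]))
          /
          sum(rate(container_cpu_cfs_periods_total{namespace="wikifeeds"}[5m]))
```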

We are definitely better off than we used to be, but I am still not happy. I'll increase the capacity as well, from 4 pods to 6 pods, i.e. by 50%.
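
As a sketch, the capacity bump is just the replica count in the chart values (key name from memory of the scaffold, not the exact diff):

```
resources:
  replicas: 6   # was 4; +50% capacity
```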

Change 570864 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] wikifeeds: Bump capacity by 50%

https://gerrit.wikimedia.org/r/570864

Change 570864 merged by jenkins-bot:
[operations/deployment-charts@master] wikifeeds: Bump capacity by 50%

https://gerrit.wikimedia.org/r/570864

Mentioned in SAL (#wikimedia-operations) [2020-02-07T10:02:18Z] <akosiaris> increase capacity for wikifeeds by 50% T244535

The capacity increase did not fix anything, and neither did some further attempts at increasing requests/limits. In fact, the sum of throttled time increased by 50%, which lends more weight to the hypothesis about CFS quota issues. The TL;DR is that all pods, regardless of the amount of work they do, get mildly throttled because the Linux CFS scheduler accounts for every chunk of time allocated to a task, even if the task has yielded it.
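
To make the mechanism above concrete, here is (in numbers, not actual config) how the 2.5-core limit gets enforced by CFS bandwidth control; the period and slice sizes are the kernel defaults, everything else follows from the limit:

```
# Illustration only, assuming kernel defaults
cpu_limit: 2500m                      # from the chart values
cfs_period_us: 100000                 # default enforcement period (100ms)
cfs_quota_us: 250000                  # kubelet sets this to limit * period = 2.5 * 100000
sched_cfs_bandwidth_slice_us: 5000    # default per-CPU slice handed out from the quota pool
# Every CPU a wikifeeds thread touches grabs a 5ms slice from the 250ms pool;
# with the accounting behaviour described above, slices a task did not fully
# use are still charged against the quota, so the cgroup can be throttled
# while actual usage is well below 2.5 cores.
```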

The fixes are in 4.19, but we are on 4.9 and it's not certain they will be backported to it.

I don't think there is much more we can do for now. The situation has vastly improved[1], and given we also have 50% more capacity, we should be in far better shape if another incident like this one happens. Also, latencies have dropped by ~25ms, especially at the p99.

I'll resolve this for now; the overall CFS issue is probably worth following up in a different task.

[1] https://grafana.wikimedia.org/d/35vIuGpZk/wikifeeds?orgId=1&from=1581047469357&to=1581073151832&var-dc=eqiad%20prometheus%2Fk8s&var-service=wikifeeds&fullscreen&panelId=28