
Page allocation stalls on scb1001, scb1002
Stalled, Normal, Public

Description

On March 30th, March 31st and April 1st 2018 we got the following on scb1001, scb1002 boxes

Mar 30 15:54:53 scb1001 kernel: [2697812.137512] nodejs: page allocation stalls for 10464ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
...
Mar 31 16:37:32 scb1001 kernel: [2786787.587176] nodejs: page allocation stalls for 17700ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
...
Apr  1 10:31:29 scb1001 kernel: [2850864.303408] AudioThread: page allocation stalls for 11860ms, order:0, mode:0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD)
Apr  1 10:31:29 scb1001 kernel: [2850864.309376] nodejs: page allocation stalls for 11088ms, order:0, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)

Overall, there were 147 such incidents on scb1001 and another 302 on scb1002 over the course of those 3 days. The week prior to that, however, was clean.
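
For reference, a minimal sketch of how such incidents can be counted from syslog (the file path is hypothetical; it simply matches the kernel message text shown above):

```typescript
// Count "page allocation stalls" kernel messages in a syslog extract.
// Hypothetical sketch: the file path is an assumption, and matching is done
// on the message text shown in the task description.
import { readFileSync } from "fs";

function countStalls(syslogPath: string): number {
  return readFileSync(syslogPath, "utf8")
    .split("\n")
    .filter((line) => line.includes("page allocation stalls")).length;
}

// e.g. 147 for scb1001 and 302 for scb1002 over those 3 days
console.log(countStalls("./syslog.scb1001"));
```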

The OOM killer did show up and killed nodejs and electron processes. It does seem like those 2 boxes are under extra stress, as the CPU and memory graphs show on https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?from=now-7d&to=now&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=scb&var-instance=All, but the cause is not yet known. I'll lower their weight a bit for the various services, as they have less memory than scb1003 and scb1004.

Event Timeline

akosiaris created this task. Apr 2 2018, 1:50 PM
Restricted Application added a subscriber: Aklapper. Apr 2 2018, 1:50 PM
akosiaris triaged this task as High priority. Apr 2 2018, 2:07 PM

Mentioned in SAL (#wikimedia-operations) [2018-04-02T14:10:15Z] <akosiaris> lower weight for scb1001, scb1002 from 10 to 8 for all services. T191199. scb1003, scb1004 have a weight of 15 already
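
For illustration only (assuming the load balancer spreads connections roughly in proportion to each host's pooled weight), a rough sketch of how that change shifts traffic away from scb1001 and scb1002:

```typescript
// Rough sketch: approximate share of traffic per scb host, assuming the
// load balancer distributes connections proportionally to each host's weight.
function shares(weights: Record<string, number>): Record<string, string> {
  const total = Object.values(weights).reduce((a, b) => a + b, 0);
  return Object.fromEntries(
    Object.entries(weights).map(([host, w]) => [host, (100 * w / total).toFixed(1) + "%"])
  );
}

const before = { scb1001: 10, scb1002: 10, scb1003: 15, scb1004: 15 };
const after  = { scb1001: 8,  scb1002: 8,  scb1003: 15, scb1004: 15 };

console.log(shares(before)); // scb1001/scb1002 get 20.0% each
console.log(shares(after));  // scb1001/scb1002 drop to ~17.4% each
```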

I believe this, or something similar related to memory stalls, happened on scb2006.

Indeed, this seems to be the case. The workers of all services stopped sending heartbeats to their masters and were consequently killed. A worker can stop sending heartbeats if its event loop is stalled, which can happen either because of busy CPU cycles (which did not manifest in this case) or because of stalled memory allocation.
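
To make the mechanism concrete, here is a minimal sketch (not the actual service-runner code; the interval and timeout values are made up) of a master killing a worker whose heartbeats stop arriving because its event loop is blocked:

```typescript
// Illustrative sketch of the heartbeat mechanism described above; not the
// actual service-runner implementation. Interval/timeout values are made up.
import cluster from "cluster";

const HEARTBEAT_INTERVAL_MS = 5000;
const HEARTBEAT_TIMEOUT_MS = 15000;

if (cluster.isMaster) {
  const worker = cluster.fork();
  let lastBeat = Date.now();
  worker.on("message", (msg) => {
    if (msg === "heartbeat") lastBeat = Date.now();
  });
  setInterval(() => {
    // If the worker's event loop is stalled (busy CPU or blocked memory
    // allocation), heartbeats stop arriving and the master kills the worker
    // (a real master would also fork a replacement).
    if (Date.now() - lastBeat > HEARTBEAT_TIMEOUT_MS) {
      worker.kill();
    }
  }, HEARTBEAT_INTERVAL_MS);
} else {
  // Worker: heartbeats only go out while the event loop can still run timers.
  setInterval(() => process.send!("heartbeat"), HEARTBEAT_INTERVAL_MS);
}
```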

Volans added a subscriber: Volans. Jul 17 2018, 10:23 AM

I've found the same logs in syslog for the affected hosts, so yes, definitely the same issue.

elukey added a subscriber: elukey. Jul 17 2018, 10:28 AM

It has happened yet again today on scb2003. So far it looks like EventStreams is swallowing memory, cf. T199813: EventStreams accumulates too much memory on SCB nodes in CODFW.

akosiaris changed the task status from Open to Stalled. Jul 25 2018, 11:16 AM

T199813 was closed today (nice work on it). I am thinking (and hoping) it was the root cause. I'll stall this task for a week for monitoring purposes before we resolve it.

akosiaris lowered the priority of this task from High to Normal. Jul 25 2018, 11:16 AM