
paws: memory allocation errors in tools-paws-master-01
Closed, Resolved (Public)

Description

There are many memory allocation errors in tools-paws-master-01.eqiad.wmflabs:

aborrero@tools-paws-master-01:~$ sudo dmesg -T | egrep "Out of memory"\|"page allocation stall"
[Wed Jan 24 00:32:57 2018] Out of memory: Kill process 26977 (kube-lego) score 1358 or sacrifice child
[Tue Feb  6 21:03:41 2018] Out of memory: Kill process 20398 (kube-lego) score 1348 or sacrifice child
[Wed Feb 14 12:36:23 2018] sshd: page allocation stalls for 11072ms, order:1, mode:0x27080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK)
[Wed Feb 14 12:36:37 2018] cron: page allocation stalls for 11236ms, order:1, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)
[Wed Feb 14 12:36:47 2018] cron: page allocation stalls for 20776ms, order:1, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)
[Wed Feb 14 12:36:57 2018] cron: page allocation stalls for 30956ms, order:1, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)
[Wed Feb 14 12:37:13 2018] python: page allocation stalls for 10240ms, order:1, mode:0x27080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK)
[Wed Feb 14 12:37:27 2018] python: page allocation stalls for 24124ms, order:1, mode:0x27080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK)
[Wed Feb 14 12:37:37 2018] cron: page allocation stalls for 11876ms, order:1, mode:0x27080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK)
[Wed Feb 14 12:38:46 2018] sshd: page allocation stalls for 11468ms, order:1, mode:0x27080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK)
[Wed Feb 14 12:38:56 2018] sshd: page allocation stalls for 21708ms, order:1, mode:0x27080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK)
[Wed Feb 14 12:39:29 2018] Out of memory: Kill process 2663 (kube-lego) score 1316 or sacrifice child
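
The "score" in these messages is the kernel's OOM badness value: when memory is exhausted, the OOM killer kills the process with the highest score, which in every case above was kube-lego. If this recurs, the live score of a suspect process can be read directly from procfs (here <pid> is a placeholder for a real PID):

cat /proc/<pid>/oom_score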

Gathering this same report via clush from all nodes shows that this is the only instance with these issues.
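
For reference, a fleet-wide check along these lines can be run with clush; the sketch below assumes a configured node group named "paws" (the group name is hypothetical):

clush -g paws 'sudo dmesg -T | egrep "Out of memory|page allocation stall"'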

This affects arbitrary processes:

[Wed Feb 14 12:38:56 2018] sshd: page allocation stalls for 21708ms, order:1, mode:0x27080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK)
[Wed Feb 14 12:38:56 2018] CPU: 1 PID: 16147 Comm: sshd Not tainted 4.9.0-5-amd64 #1 Debian 4.9.65-3+deb9u2
[Wed Feb 14 12:38:56 2018] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Ubuntu-1.8.2-1ubuntu1~cloud0 04/01/2014
[Wed Feb 14 12:38:56 2018]  0000000000000000 ffffffff88d29964 ffffffff893feae0 ffffa0b0c12bfc80
[Wed Feb 14 12:38:56 2018]  ffffffff88b8569a 027080c0027080c0 ffffffff893feae0 ffffa0b0c12bfc20
[Wed Feb 14 12:38:56 2018]  ffff91a600000010 ffffa0b0c12bfc90 ffffa0b0c12bfc40 9ff13b276b53b63d
[Wed Feb 14 12:38:56 2018] Call Trace:
[Wed Feb 14 12:38:56 2018]  [<ffffffff88d29964>] ? dump_stack+0x5c/0x78
[Wed Feb 14 12:38:56 2018]  [<ffffffff88b8569a>] ? warn_alloc+0x13a/0x160
[Wed Feb 14 12:38:56 2018]  [<ffffffff88b853da>] ? __alloc_pages_direct_compact+0x4a/0xf0
[Wed Feb 14 12:38:56 2018]  [<ffffffff88b860c5>] ? __alloc_pages_slowpath+0x995/0xbf0
[Wed Feb 14 12:38:56 2018]  [<ffffffff88b8651e>] ? __alloc_pages_nodemask+0x1fe/0x260
[Wed Feb 14 12:38:56 2018]  [<ffffffff88bd6981>] ? alloc_pages_current+0x91/0x140
[Wed Feb 14 12:38:56 2018]  [<ffffffff88b81e3a>] ? __get_free_pages+0xa/0x30
[Wed Feb 14 12:38:56 2018]  [<ffffffff88a6456a>] ? pgd_alloc+0x1a/0x160
[Wed Feb 14 12:38:56 2018]  [<ffffffff88a72d59>] ? mm_init+0x179/0x210
[Wed Feb 14 12:38:56 2018]  [<ffffffff88c0a585>] ? do_execveat_common.isra.37+0x255/0x790
[Wed Feb 14 12:38:56 2018]  [<ffffffff88d57d00>] ? strncpy_from_user+0x10/0x160
[Wed Feb 14 12:38:56 2018]  [<ffffffff88c0ace5>] ? SyS_execve+0x35/0x40
[Wed Feb 14 12:38:56 2018]  [<ffffffff88a03b1c>] ? do_syscall_64+0x7c/0xf0
[Wed Feb 14 12:38:56 2018]  [<ffffffff890076ee>] ? entry_SYSCALL64_slow_path+0x25/0x25
[Wed Feb 14 12:38:56 2018] Mem-Info:
[Wed Feb 14 12:38:56 2018] active_anon:603849 inactive_anon:11901 isolated_anon:0
                            active_file:751 inactive_file:757 isolated_file:0
                            unevictable:2123 dirty:0 writeback:0 unstable:0
                            slab_reclaimable:17539 slab_unreclaimable:52162
                            mapped:3880 shmem:12504 pagetables:4748 bounce:0
                            free:22281 free_pcp:0 free_cma:0

In this case, the sshd daemon could not promptly allocate memory while setting up a new process (pgd_alloc called from the execve path in the trace above), which produced the page allocation stall.
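
For scale, assuming the standard 4 KiB page size on x86-64: free:22281 pages is about 87 MiB of free memory, while active_anon:603849 pages is roughly 2.3 GiB of anonymous memory, and each stalled request above is order:1, i.e. two contiguous pages (8 KiB). The host was effectively out of memory, so even these small contiguous allocations stalled.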

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2018-02-14T13:04:53Z] <arturo> reboot tools-paws-master-01 for T187315

Mentioned in SAL (#wikimedia-cloud) [2018-02-14T13:09:21Z] <arturo> the reboot was OK, the server seems working and kubectl sees all the pods running in the deployment (T187315)

aborrero triaged this task as Medium priority.

I think we could leave this task open for a few days and then close it if nothing weird is observed on the server.

When I try to start my session I see:

500 : Internal Server Error

at the URL https://paws.wmflabs.org/paws/hub/user/arturoborrero/

But I guess this is unrelated to this issue?

Closing this now, as the server seems healthy.