
mw* servers memory leaks (12 Aug)
Closed, Resolved · Public

Description

We noticed that some servers have been leaking memory, possibly starting around the 4th of August. There are suspicions that it might be related to the following, but nothing has been proven yet:

06:13	<tstarling@deploy1001>	Synchronized wmf-config/CommonSettings.php: re-enabling lilypond execution in safe mode 3rd attempt (duration: 00m 58s)

See https://sal.toolforge.org/log/TsQauHMBLkHzneNNt1NY.

Another potential trigger is the Puppet change "MW firejail: blacklist /run and conf cache", given that many calls to mem_cgroup_create were found to be leaking.

We are keeping mw1357 depooled for investigation. @CDanis has installed bpfcc-tools on some appservers, and we have started rebooting both API and app servers.

https://grafana.wikimedia.org/d/000000377/host-overview?panelId=4&fullscreen&orgId=1&from=1596412800000&to=1597276799000&var-server=mw1357&var-datasource=thanos&var-cluster=api_appserver

TLDR for third parties: if you run MediaWiki with $wgShellRestrictionMethod = 'firejail'; you should do so with a relatively recent kernel, 4.19+ or 5.3+ IIRC.

Event Timeline

Here's a summary of my findings so far, although keep in mind that I'm well out of my depth here and this is an amalgamation of guesswork.

mw1357 is a server that was depooled and left in the leaky state without being rebooted, to preserve for investigation.

The only thing that shows up in /proc/meminfo as obviously absurd is a very high number of 4 kB-sized pages allocated in the TLB. There is an abnormal amount of kernel slab allocator memory shown as being consumed (~2.4 GB, roughly 10x other servers), but that's not nearly enough to account for the usage.
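For reference, the accounting buckets worth eyeballing when comparing an affected host against a freshly rebooted one (a minimal sketch; which fields are most telling is my own guess):

# Compare the main /proc/meminfo accounting buckets against a healthy, recently rebooted host.
grep -E '^(MemTotal|MemFree|MemAvailable|PageTables|Slab|SReclaimable|SUnreclaim):' /proc/meminfo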

So since it isn't userspace processes, nor the slab allocator, nor anything else shown in meminfo, the issue basically has to be lower-level than all of that. Beneath the slab allocator, and beneath all physical page allocation in Linux, is the buddy allocator. It reports stats in /proc/buddyinfo and /proc/pagetypeinfo. Each row is one flavor of memory (e.g. NUMA node X, movable pages); each column within a row is the number of free blocks of a given power-of-two number of pages, starting with 2^0 == 1 page == 4 kB and progressing up to 2^10 == 1024 pages == 4 MB of RAM.

On a healthy system without fragmentation you'll see something like this:

Node    0, zone   Normal, type      Movable      0      0      0    407    646    189    201    289    290    179   5272

Lots of large, contiguous blocks of pages are free (> 20 GB worth). There aren't many small blocks of free memory, but that's fine, as splitting larger blocks in half is trivial and fast.

On mw1357, it looks like this:

Node    0, zone   Normal, type      Movable 484607 101153   3866      0      0      0      0      0      0      0      0

This is severe external fragmentation: the available memory exists only in small 4 kB/8 kB/16 kB blocks, and there is no free memory at all in any of the larger orders.
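To put rough numbers on that, here's a small sketch that sums the free blocks per order from /proc/buddyinfo (the rows quoted above are the per-migratetype view from /proc/pagetypeinfo; /proc/buddyinfo has the same order columns, aggregated per zone):

# Sum free blocks per order across all zones; an order-N block is 4*2^N kB.
awk '{ for (i = 5; i <= NF; i++) blocks[i-5] += $i }
     END { for (o = 0; o <= 10; o++)
             printf "order %2d (%6d kB blocks): %8d free, %11d kB total\n",
                    o, 4 * 2^o, blocks[o], blocks[o] * 4 * 2^o }' /proc/buddyinfo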

I tried sudo sysctl -w vm/compact_memory=1 on mw1357, which forces an attempt at coalescing pages, and while that created more free blocks of small orders, it didn't coalesce any small chunks into larger orders.
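For reference, compaction can also be triggered directly through procfs, and the buddy allocator state re-checked afterwards (trivial sketch):

# Trigger compaction directly via procfs, then re-inspect the free-block distribution.
echo 1 > /proc/sys/vm/compact_memory
cat /proc/buddyinfo /proc/pagetypeinfo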

At this point I was suspicious of many small objects allocated in the kernel causing this excessive fragmentation.

I looked at slabtop output on mw1357 and saw this:

Active / Total Objects (% used)    : 4335412 / 4607316 (94.1%)
Active / Total Slabs (% used)      : 601085 / 601100 (100.0%)
Active / Total Caches (% used)     : 77 / 120 (64.2%)
Active / Total Size (% used)       : 2322856.84K / 2358137.30K (98.5%)
Minimum / Average / Maximum Object : 0.02K / 0.51K / 4096.00K

   OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
1254784 1252990   0%    0.12K  39212       32    156848K kmalloc-node
1096904 1073266   0%    0.03K   8846      124     35384K kmalloc-32
 302253  301941   0%    0.19K  14393       21     57572K kmem_cache
 289418  289418 100%    4.00K 289418        1   1157672K kmalloc-4096
 252441  236332   0%    0.19K  12021       21     48084K dentry
 211238  211215   0%    2.00K 105619        2    422476K kmalloc-2048
 198816   69740   0%    0.12K   6213       32     24852K kmalloc-96
 175788  175285   0%    1.05K  58596        3    234384K ext4_inode_cache
 136360  136042   0%    1.00K  34090        4    136360K kmalloc-1024
  89472   66605   0%    0.06K   1398       64      5592K kmalloc-64
  87686   64934   0%    0.12K   2579       34     10316K kernfs_node_cache
  68608   64147   0%    0.25K   4288       16     17152K kmalloc-256
  65832   65489   0%    0.10K   1688       39      6752K buffer_head
  60992   47459   0%    0.06K    953       64      3812K anon_vma_chain
  54820   48303   0%    0.20K   2741       20     10964K vm_area_struct
  44408   43349   0%    0.07K    793       56      3172K Acpi-Operand

For comparison, let's also look at slabtop on mw1359, another apiserver that was rebooted and has been serving requests for 9.5 hours: P12234

Things that stand out:

  • There are about one million more objects on mw1357 than on mw1359, and almost 3x the number of slabs on mw1357 (see the sketch after this list for a quick way to compare slab usage across hosts).
  • There are many, many more kmalloc-node and kmalloc-32 slabs in use on mw1357, by over an order of magnitude; kmalloc-32 corresponds to kernel-internal kmalloc() calls for objects of up to 32 bytes. There is also a huge increase in kmalloc-4096 objects, by several orders of magnitude, which I somehow missed at first.
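A hedged sketch for doing that comparison systematically across two hosts (assumes sudo on both, and the standard /proc/slabinfo layout where column 3 is num_objs):

# Dump "<cache name> <num_objs>" from /proc/slabinfo on both hosts, then diff the sorted lists.
for h in mw1357.eqiad.wmnet mw1359.eqiad.wmnet; do
    ssh "$h" "sudo awk 'NR > 2 { print \$1, \$3 }' /proc/slabinfo" | sort > "/tmp/slabs.${h%%.*}"
done
diff -u /tmp/slabs.mw1357 /tmp/slabs.mw1359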

I did some work to backport bpfcc-tools from Buster to Stretch so that I could use the memleak tracing tool it provides, which instruments kernel tracepoints and reports stacktraces of still-live allocations older than a specified duration. (More details here and code here.) I installed those packages on mw1359, which is serving traffic, and for about five hours now I've been running memleak-bpfcc -o 60000 -z 31 -Z 33 30. That hooks into the kernel tracepoints for kmalloc, saves a stacktrace whenever the allocation size falls between 31 and 33 bytes, and every 30 seconds prints the stacktraces of any allocation that has not yet been kfree()d and is older than 60,000 milliseconds.

Here's the output accumulated over ~5h, started at 16:24: P12237

There are many sets of traces involving memcg_create_kmem_cache, and I'm a bit suspicious of it. The one set of traces that doesn't involve it involves kernfs_iop_mkdir instead.

Anyway, after noticing that the 4096-byte allocations are at least as much of a problem as the 32-byte ones, I'm now going to instrument 4 kB allocations in the same way and leave that running for some hours.
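Presumably that means re-running memleak with the size window moved up to 4096 bytes; mirroring the 32-byte invocation above, something like this (the exact flags for that run are my guess):

# Same idea as the 32-byte run: report, every 30s, 4096-byte allocations still live after 60s.
memleak-bpfcc -o 60000 -z 4095 -Z 4097 30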

There are some preliminary results from tracing 4096-byte allocations:

[21:27:29] Top 10 stacks with outstanding allocations:
    2416640 bytes in 590 allocations from stack
        __alloc_pages_nodemask+0x1c2 [kernel]
        __alloc_pages_nodemask+0x1c2 [kernel]
        alloc_pages_current+0x91 [kernel]
        pagecache_get_page+0xb4 [kernel]
        grab_cache_page_write_begin+0x1c [kernel]
        ext4_da_write_begin+0x9f [ext4]
        ext4_da_write_end+0x142 [ext4]
        generic_perform_write+0xcf [kernel]
        __generic_file_write_iter+0x16a [kernel]
        ext4_file_write_iter+0x90 [ext4]
        mem_cgroup_try_charge+0x65 [kernel]
        new_sync_write+0xe0 [kernel]
        vfs_write+0xb0 [kernel]
        sys_write+0x5a [kernel]
        do_syscall_64+0x8d [kernel]
        entry_SYSCALL_64_after_swapgs+0x58 [kernel]
    2609152 bytes in 637 allocations from stack
        __alloc_pages_nodemask+0x1c2 [kernel]
        __alloc_pages_nodemask+0x1c2 [kernel]
        alloc_pages_vma+0xaa [kernel]
        handle_mm_fault+0x10ff [kernel]
        new_sync_read+0xdd [kernel]
        __do_page_fault+0x255 [kernel]
        page_fault+0x28 [kernel]
    3035136 bytes in 741 allocations from stack
        __alloc_pages_nodemask+0x1c2 [kernel]
        __alloc_pages_nodemask+0x1c2 [kernel]
        alloc_pages_current+0x91 [kernel]
        pagecache_get_page+0xb4 [kernel]
        grab_cache_page_write_begin+0x1c [kernel]
        ext4_da_write_begin+0x9f [ext4]
        ext4_da_write_end+0x142 [ext4]
        generic_perform_write+0xcf [kernel]
        __generic_file_write_iter+0x16a [kernel]
        ext4_file_write_iter+0x90 [ext4]
        perf_trace_run_bpf_submit+0x4a [kernel]
        new_sync_write+0xe0 [kernel]
        vfs_write+0xb0 [kernel]
        sys_write+0x5a [kernel]
        do_syscall_64+0x8d [kernel]
        entry_SYSCALL_64_after_swapgs+0x58 [kernel]
    3244032 bytes in 792 allocations from stack
        __alloc_pages_nodemask+0x1c2 [kernel]
        __alloc_pages_nodemask+0x1c2 [kernel]
        pcpu_populate_chunk+0xbd [kernel]
        pcpu_balance_workfn+0x59a [kernel]
        __switch_to_asm+0x41 [kernel]
        __switch_to_asm+0x35 [kernel]
        process_one_work+0x18a [kernel]
        worker_thread+0x4d [kernel]
        worker_thread+0x0 [kernel]
        kthread+0xd9 [kernel]
        kthread+0x0 [kernel]
        ret_from_fork+0x57 [kernel]
    4063232 bytes in 992 allocations from stack
        __alloc_pages_nodemask+0x1c2 [kernel]
        __alloc_pages_nodemask+0x1c2 [kernel]
        alloc_pages_vma+0xaa [kernel]
        handle_mm_fault+0x10ff [kernel]
        do_nanosleep+0x8d [kernel]
        __do_page_fault+0x255 [kernel]
        page_fault+0x28 [kernel]
    9363456 bytes in 2286 allocations from stack
        __alloc_pages_nodemask+0x1c2 [kernel]
        __alloc_pages_nodemask+0x1c2 [kernel]
        alloc_pages_vma+0xaa [kernel]
        handle_mm_fault+0x10ff [kernel]
        __do_page_fault+0x255 [kernel]
        sys_brk+0x160 [kernel]
        page_fault+0x28 [kernel]
    21508096 bytes in 5251 allocations from stack
        __alloc_pages_nodemask+0x1c2 [kernel]
        __alloc_pages_nodemask+0x1c2 [kernel]
        alloc_pages_current+0x91 [kernel]
        pagecache_get_page+0xb4 [kernel]
        grab_cache_page_write_begin+0x1c [kernel]
        ext4_da_write_begin+0x9f [ext4]
        ext4_da_write_end+0x142 [ext4]
        generic_perform_write+0xcf [kernel]
        __generic_file_write_iter+0x16a [kernel]
        ext4_file_write_iter+0x90 [ext4]
        do_readv_writev+0x151 [kernel]
        new_sync_write+0xe0 [kernel]
        vfs_write+0xb0 [kernel]
        sys_write+0x5a [kernel]
        do_syscall_64+0x8d [kernel]
        entry_SYSCALL_64_after_swapgs+0x58 [kernel]
    47345664 bytes in 11559 allocations from stack
        __alloc_pages_nodemask+0x1c2 [kernel]
        __alloc_pages_nodemask+0x1c2 [kernel]
        alloc_pages_vma+0xaa [kernel]
        handle_mm_fault+0x10ff [kernel]
        __do_page_fault+0x255 [kernel]
        page_fault+0x28 [kernel]
    59744256 bytes in 14586 allocations from stack
        __alloc_pages_nodemask+0x1c2 [kernel]
        __alloc_pages_nodemask+0x1c2 [kernel]
        alloc_pages_current+0x91 [kernel]
        pagecache_get_page+0xb4 [kernel]
        grab_cache_page_write_begin+0x1c [kernel]
        ext4_da_write_begin+0x9f [ext4]
        ext4_da_write_end+0x142 [ext4]
        generic_perform_write+0xcf [kernel]
        __generic_file_write_iter+0x16a [kernel]
        ext4_file_write_iter+0x90 [ext4]
        new_sync_write+0xe0 [kernel]
        vfs_write+0xb0 [kernel]
        sys_write+0x5a [kernel]
        do_syscall_64+0x8d [kernel]
        entry_SYSCALL_64_after_swapgs+0x58 [kernel]
    61636608 bytes in 15048 allocations from stack
        __alloc_pages_nodemask+0x1c2 [kernel]
        __alloc_pages_nodemask+0x1c2 [kernel]
        pcpu_populate_chunk+0xbd [kernel]
        pcpu_balance_workfn+0x59a [kernel]
        __switch_to_asm+0x41 [kernel]
        __switch_to_asm+0x35 [kernel]
        process_one_work+0x18a [kernel]
        worker_thread+0x4d [kernel]
        worker_thread+0x0 [kernel]
        kthread+0xd9 [kernel]
        __switch_to_asm+0x41 [kernel]
        kthread+0x0 [kernel]
        ret_from_fork+0x57 [kernel]
Some of the stack traces make me think I need to make sure the script is tracing only the kmalloc codepaths, though.

A thing that someone daring in EUTZ might want to try, using perf probe, by modifying the bpfcc memleak script, or by writing a trivial bpftrace script: attach a tracepoint to memcg_schedule_kmem_cache_create and gather calling stacktraces. That's the function that creates the work item that results in a worker thread calling memcg_create_kmem_cache, as seen in the stack traces we saw for 32-byte mallocs.
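For the bpftrace option, a minimal sketch (assuming bpftrace is installed and the symbol is kprobe-able on this kernel, which the systemtap experiment below suggests it is):

# Count the kernel stacks that lead into memcg_schedule_kmem_cache_create; Ctrl-C prints the tally.
bpftrace -e 'kprobe:memcg_schedule_kmem_cache_create { @stacks[kstack] = count(); }'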

Joe triaged this task as Unbreak Now! priority.Aug 13 2020, 5:51 AM
Joe subscribed.

I'm not 100% sure that slabs are the problem here, but I'll try to followup later.

In the meantime, the servers we rebooted yesterday are definitely showing the same behaviour again. This means that in a week we'll have to reboot them all again.

I am therefore:
1 - raising this task to UBN! priority, as this has the potential to constantly disrupt the work of all SREs
2 - focusing on what was changed on the morning of August 4th, mostly in code, and trying to rule things out. This might result in some reverts.
3 - trying to build better automation around the task of rebooting a large fleet of servers.

Also adding tags for Platform Engineering and Wikidata, as the two potentially worrisome code changes concern Score's enablement and a change to Wikibase.

Mentioned in SAL (#wikimedia-operations) [2020-08-13T08:43:31Z] <_joe_> downgrading imagemagick on mw1378 T260281

> attach a tracepoint to memcg_schedule_kmem_cache_create and gather calling stacktraces. That's the function that creates the work item that results in a worker thread calling memcg_create_kmem_cache, as seen in the stack traces we saw for 32-byte mallocs.

I've installed systemtap on mw1357 and started instrumenting memcg_schedule_kmem_cache_create. The function does not seem to be called at present. Same goes for memcg_create_kmem_cache.

root@mw1357:~# stap -ve 'probe kernel.function("memcg_schedule_kmem_cache_create") { print_backtrace() ; exit() }'
Pass 1: parsed user script and 477 library scripts using 153836virt/86440res/6576shr/80088data kb, in 210usr/20sys/241real ms.
Pass 2: analyzed script: 1 probe, 2 functions, 0 embeds, 0 globals using 195668virt/129392res/7460shr/121920data kb, in 560usr/70sys/620real ms.
Pass 3: using cached /root/.systemtap/cache/8a/stap_8aaac3604a689cbbd6ae1226de2f15db_1317.c
Pass 4: using cached /root/.systemtap/cache/8a/stap_8aaac3604a689cbbd6ae1226de2f15db_1317.ko
Pass 5: starting run.
# ema's note: output stops here

To make sure that nothing basic is wrong with the systemtap setup, I also probed icmp_reply, which does print a backtrace.
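That sanity check would have looked something like this, mirroring the probe syntax above (icmp_reply fires on incoming pings):

# This probe does fire and print a kernel backtrace, confirming systemtap itself works.
stap -ve 'probe kernel.function("icmp_reply") { print_backtrace(); exit() }'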

> I've installed systemtap on mw1357

Never mind, I've only now seen that mw1357 is depooled. Here are some preliminary results from mw1359: P12251. Note that the calls happen in bursts, but rarely (12:16:30, 12:16:45, 12:17:31). Also note that the traces include calls to perf_trace_run_bpf_submit; it might be wise to run the two different instrumentation tools on different hosts for clarity.

node_vmstat_nr_slab_unreclaimable is going up indefinitely on nodes affected by the issue, following a pattern that matches the general memory usage. However, the actual amount of "lost" memory does not match the size of unreclaimable slabs, which is only ~2G on mw1357:

root@mw1357:~# grep SUnreclaim /proc/meminfo 
SUnreclaim:      2097548 kB
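A crude way to watch that counter climb on an affected host (simple sketch, sampling once a minute):

# Sample unreclaimable slab usage once a minute; on a leaking host this climbs steadily.
while sleep 60; do
    printf '%s ' "$(date +%FT%T)"
    grep SUnreclaim /proc/meminfo
done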

[Attachment: Screenshot from 2020-08-13 15-04-23.png]

We discussed two things to try out next:

Mentioned in SAL (#wikimedia-operations) [2020-08-13T14:38:52Z] <ema> reboot mw1382 with kernel memory accounting disabled T260281

Mentioned in SAL (#wikimedia-operations) [2020-08-13T14:45:55Z] <ema> repool mw1382 with kernel memory accounting disabled T260281
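The task doesn't record the exact mechanism used, but the usual way to disable cgroup kernel memory accounting is the cgroup.memory=nokmem kernel boot parameter; here's a sketch of doing that by hand on a Debian host (in production this was presumably rolled out via Puppet):

# Sketch: disable cgroup kernel memory accounting via the kernel command line.
# 1. In /etc/default/grub, append the parameter to the existing kernel arguments:
#      GRUB_CMDLINE_LINUX="... cgroup.memory=nokmem"
# 2. Regenerate the bootloader configuration and reboot:
update-grub
reboot
# 3. After boot, confirm the parameter took effect:
grep -o 'cgroup\.memory=nokmem' /proc/cmdline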

I suggest marking this as a security issue, since this may allow people to intentionally trigger memory leaks to damage servers running this software.

> I suggest marking this as a security issue, since this may allow people to intentionally trigger memory leaks to damage servers running this software.

If it's related to Score/Lilypond as speculated, then no third-party site should be running it anyway, due to the security issues found.

Script wmf-auto-reimage was launched by cdanis on cumin1001.eqiad.wmnet for hosts:

mw1359.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202008171607_cdanis_15670_mw1359_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['mw1359.eqiad.wmnet']

and were ALL successful.

eprodromou subscribed.

We're tracking this, but are unsure about next steps. Let us know if more active investigation from the Platform team is needed.

Joe closed subtask Restricted Task as Resolved.Aug 18 2020, 9:09 PM

Reporting here in brief:

  • We confirmed the problem had to do with activating firejail for all executions of external programs, which triggered a kernel bug.
  • This kernel bug can be worked around by disabling kernel memory accounting in cgroups, or by moving to a newer kernel such as 4.19.
  • For now we've gone through a full rolling reboot of our mediawiki servers to disable kernel memory accounting in cgroups
  • We might move to 4.19 at a later stage.

We're seeing roughly stable memory usage across all clusters, so the issue can be considered resolved.

TLDR for third parties: if you run MediaWiki with $wgShellRestrictionMethod = 'firejail'; you should do so with a relatively recent kernel, 4.19+ or 5.3+ IIRC.