Page MenuHomePhabricator

OOM killer killed varnihsd cache-main on cp3053
Closed, DuplicatePublic

Description

The Memory RSS by Service graph on grafana shows a spike on varnish memory usage at 11:59 up to 352GiB:

image.png (611×1 px, 74 KB)

kern.log
Mar  8 11:59:06 cp3053 kernel: [2177718.395042] traffic_manager invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Mar  8 11:59:06 cp3053 kernel: [2177718.395043] traffic_manager cpuset=/ mems_allowed=0-1
Mar  8 11:59:06 cp3053 kernel: [2177718.395048] CPU: 0 PID: 965 Comm: traffic_manager Not tainted 4.19.0-8-amd64 #1 Debian 4.19.98-1
Mar  8 11:59:06 cp3053 kernel: [2177718.395050] Hardware name: Dell Inc. PowerEdge R440/08CYF7, BIOS 2.3.10 08/15/2019
Mar  8 11:59:06 cp3053 kernel: [2177718.395052] Call Trace:
Mar  8 11:59:06 cp3053 kernel: [2177718.395061]  dump_stack+0x66/0x90
Mar  8 11:59:06 cp3053 kernel: [2177718.395065]  dump_header+0x6b/0x283
Mar  8 11:59:06 cp3053 kernel: [2177718.395070]  oom_kill_process.cold.30+0xb/0x1cf
Mar  8 11:59:06 cp3053 kernel: [2177718.395072]  ? oom_badness+0x23/0x140
Mar  8 11:59:06 cp3053 kernel: [2177718.395074]  out_of_memory+0x1a5/0x450
Mar  8 11:59:06 cp3053 kernel: [2177718.395083]  __alloc_pages_slowpath+0xbd8/0xcb0
Mar  8 11:59:06 cp3053 kernel: [2177718.395088]  __alloc_pages_nodemask+0x28b/0x2b0
Mar  8 11:59:06 cp3053 kernel: [2177718.395090]  filemap_fault+0x3bd/0x780
Mar  8 11:59:06 cp3053 kernel: [2177718.395093]  ? alloc_set_pte+0xf2/0x560
Mar  8 11:59:06 cp3053 kernel: [2177718.395098]  ? filemap_map_pages+0x1ed/0x3a0
Mar  8 11:59:06 cp3053 kernel: [2177718.395127]  ext4_filemap_fault+0x2c/0x40 [ext4]
Mar  8 11:59:06 cp3053 kernel: [2177718.395130]  __do_fault+0x36/0x130
Mar  8 11:59:06 cp3053 kernel: [2177718.395133]  __handle_mm_fault+0xe6c/0x1270
Mar  8 11:59:06 cp3053 kernel: [2177718.395136]  ? common_interrupt+0xa/0xf
Mar  8 11:59:06 cp3053 kernel: [2177718.395138]  handle_mm_fault+0xd6/0x200
Mar  8 11:59:06 cp3053 kernel: [2177718.395140]  __do_page_fault+0x249/0x4f0
Mar  8 11:59:06 cp3053 kernel: [2177718.395143]  ? page_fault+0x8/0x30
Mar  8 11:59:06 cp3053 kernel: [2177718.395144]  page_fault+0x1e/0x30
Mar  8 11:59:06 cp3053 kernel: [2177718.395147] RIP: 0033:0x7f12b896a860
Mar  8 11:59:06 cp3053 kernel: [2177718.395151] Code: Bad RIP value.
Mar  8 11:59:06 cp3053 kernel: [2177718.395152] RSP: 002b:00007ffcc90d3c08 EFLAGS: 00010206
Mar  8 11:59:06 cp3053 kernel: [2177718.395153] RAX: 0000000000000000 RBX: 00007ffcc90d3f10 RCX: 000000000000000a
Mar  8 11:59:06 cp3053 kernel: [2177718.395154] RDX: 0000000000000e10 RSI: 0000000000000000 RDI: 0000555583097420
Mar  8 11:59:06 cp3053 kernel: [2177718.395155] RBP: 00007ffcc90d4450 R08: 0000000000000000 R09: 0000000000000000
Mar  8 11:59:06 cp3053 kernel: [2177718.395155] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000e10
Mar  8 11:59:06 cp3053 kernel: [2177718.395156] R13: 0000000000000064 R14: 0000000000000e10 R15: 00007ffcc90d4050
Mar  8 11:59:06 cp3053 kernel: [2177718.395158] Mem-Info:
Mar  8 11:59:06 cp3053 kernel: [2177718.395165] active_anon:95678997 inactive_anon:165370 isolated_anon:0
Mar  8 11:59:06 cp3053 kernel: [2177718.395165]  active_file:0 inactive_file:0 isolated_file:0
Mar  8 11:59:06 cp3053 kernel: [2177718.395165]  unevictable:0 dirty:0 writeback:0 unstable:0
Mar  8 11:59:06 cp3053 kernel: [2177718.395165]  slab_reclaimable:76401 slab_unreclaimable:241171
Mar  8 11:59:06 cp3053 kernel: [2177718.395165]  mapped:53926 shmem:418585 pagetables:339566 bounce:0
Mar  8 11:59:06 cp3053 kernel: [2177718.395165]  free:653414 free_pcp:1412 free_cma:0
Mar  8 11:59:06 cp3053 kernel: [2177718.395168] Node 0 active_anon:192437320kB inactive_anon:154152kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:21944kB
 dirty:0kB writeback:0kB shmem:388536kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2820096kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Mar  8 11:59:06 cp3053 kernel: [2177718.395171] Node 1 active_anon:190278668kB inactive_anon:507328kB active_file:16kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:193760
kB dirty:0kB writeback:0kB shmem:1285804kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 6107136kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Mar  8 11:59:06 cp3053 kernel: [2177718.395176] Node 0 DMA free:15900kB min:84kB low:104kB high:124kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB p
resent:15984kB managed:15900kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Mar  8 11:59:06 cp3053 kernel: [2177718.395179] lowmem_reserve[]: 0 1393 191851 191851 191851
Mar  8 11:59:06 cp3053 kernel: [2177718.395182] Node 0 DMA32 free:767832kB min:7736kB low:9668kB high:11600kB active_anon:675624kB inactive_anon:76kB active_file:128kB inactive_file:0kB unevictable:0kB wr
itepending:0kB present:1521664kB managed:1456096kB mlocked:0kB kernel_stack:0kB pagetables:104kB bounce:0kB free_pcp:3016kB local_pcp:0kB free_cma:0kB
Mar  8 11:59:06 cp3053 kernel: [2177718.395188] lowmem_reserve[]: 0 0 190457 190457 190457
Mar  8 11:59:06 cp3053 kernel: [2177718.395191] Node 0 Normal free:777376kB min:1036316kB low:1295392kB high:1554468kB active_anon:191762756kB inactive_anon:154076kB active_file:0kB inactive_file:212kB un
evictable:0kB writepending:0kB present:198180864kB managed:195032876kB mlocked:0kB kernel_stack:15800kB pagetables:883132kB bounce:0kB free_pcp:2636kB local_pcp:244kB free_cma:0kB
Mar  8 11:59:06 cp3053 kernel: [2177718.395195] lowmem_reserve[]: 0 0 0 0 0
Mar  8 11:59:06 cp3053 kernel: [2177718.395201] Node 1 Normal free:1052548kB min:1053012kB low:1316264kB high:1579516kB active_anon:190278668kB inactive_anon:507328kB active_file:16kB inactive_file:0kB un
evictable:0kB writepending:0kB present:201326592kB managed:198175296kB mlocked:0kB kernel_stack:19784kB pagetables:475028kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Mar  8 11:59:06 cp3053 kernel: [2177718.395205] lowmem_reserve[]: 0 0 0 0 0
Mar  8 11:59:06 cp3053 kernel: [2177718.395214] Node 0 DMA: 1*4kB (U) 1*8kB (U) 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15900kB
Mar  8 11:59:06 cp3053 kernel: [2177718.395222] Node 0 DMA32: 3445*4kB (UME) 8045*8kB (UME) 5975*16kB (UM) 3841*32kB (UM) 2061*64kB (UME) 892*128kB (UME) 321*256kB (UME) 131*512kB (UM) 75*1024kB (UME) 0*2
048kB 0*4096kB = 768780kB
Mar  8 11:59:06 cp3053 kernel: [2177718.395240] Node 0 Normal: 7192*4kB (UME) 1211*8kB (UME) 3312*16kB (UME) 1604*32kB (UME) 3688*64kB (UME) 1852*128kB (UME) 231*256kB (UME) 97*512kB (M) 51*1024kB (M) 0*2
048kB 0*4096kB = 776888kB
Mar  8 11:59:06 cp3053 kernel: [2177718.395255] Node 1 Normal: 42795*4kB (UME) 17904*8kB (UME) 4967*16kB (UME) 7748*32kB (UME) 1198*64kB (UME) 250*128kB (UME) 136*256kB (ME) 86*512kB (UME) 38*1024kB (ME)
17*2048kB (E) 38*4096kB (U) = 1058716kB
Mar  8 11:59:06 cp3053 kernel: [2177718.395263] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Mar  8 11:59:06 cp3053 kernel: [2177718.395264] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Mar  8 11:59:06 cp3053 kernel: [2177718.395265] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Mar  8 11:59:06 cp3053 kernel: [2177718.395266] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Mar  8 11:59:06 cp3053 kernel: [2177718.395266] 420233 total pagecache pages
Mar  8 11:59:06 cp3053 kernel: [2177718.395267] 0 pages in swap cache
Mar  8 11:59:06 cp3053 kernel: [2177718.395268] Swap cache stats: add 0, delete 0, find 0/0
Mar  8 11:59:06 cp3053 kernel: [2177718.395269] Free swap  = 0kB
Mar  8 11:59:06 cp3053 kernel: [2177718.395269] Total swap = 0kB
Mar  8 11:59:06 cp3053 kernel: [2177718.395270] 100261276 pages RAM
Mar  8 11:59:06 cp3053 kernel: [2177718.395271] 0 pages HighMem/MovableOnly
Mar  8 11:59:06 cp3053 kernel: [2177718.395271] 1591234 pages reserved
Mar  8 11:59:06 cp3053 kernel: [2177718.395272] 0 pages hwpoisoned
Mar  8 11:59:06 cp3053 kernel: [2177718.395272] Tasks state (memory values in pages):
Mar  8 11:59:06 cp3053 kernel: [2177718.395273] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Mar  8 11:59:06 cp3053 kernel: [2177718.395296] [    738]     0   738    92881    64417   770048        0             0 systemd-journal
Mar  8 11:59:06 cp3053 kernel: [2177718.395298] [    752]     0   752     5614      352    65536        0         -1000 systemd-udevd
Mar  8 11:59:06 cp3053 kernel: [2177718.395302] [    956]     0   956      771       52    49152        0             0 mdadm
Mar  8 11:59:06 cp3053 kernel: [2177718.395304] [   1000]   101  1000    23270      232    81920        0             0 systemd-timesyn
Mar  8 11:59:06 cp3053 kernel: [2177718.395306] [   1029]   108  1029    64413     4343   135168        0             0 prometheus-varn
Mar  8 11:59:06 cp3053 kernel: [2177718.395308] [   1031]     0  1031     2126       67    57344        0             0 cron
Mar  8 11:59:06 cp3053 kernel: [2177718.395310] [   1035]   106  1035     7706      295    90112        0             0 lldpd
Mar  8 11:59:06 cp3053 kernel: [2177718.395312] [   1040]   108  1040    28359     4966   126976        0             0 prometheus-traf
Mar  8 11:59:06 cp3053 kernel: [2177718.395314] [   1046]   106  1046     7707      302    86016        0             0 lldpd
Mar  8 11:59:06 cp3053 kernel: [2177718.395316] [   1051]   105  1051     2331      229    57344        0          -900 dbus-daemon
Mar  8 11:59:06 cp3053 kernel: [2177718.395318] [   1056]     0  1056     4877      265    77824        0             0 systemd-logind
Mar  8 11:59:06 cp3053 kernel: [2177718.395320] [   1079]   108  1079  1322096    10477   745472        0             0 prometheus-node
Mar  8 11:59:06 cp3053 kernel: [2177718.395321] [   1089]     0  1089     2526      369    61440        0             0 smartd
Mar  8 11:59:06 cp3053 kernel: [2177718.395323] [   1090]     0  1090     1359       61    53248        0             0 rasdaemon
Mar  8 11:59:06 cp3053 kernel: [2177718.395325] [   1106]     0  1106  1249684     6365   692224        0             0 confd
Mar  8 11:59:06 cp3053 kernel: [2177718.395327] [   1137]     0  1137     3963      216    69632        0         -1000 sshd
Mar  8 11:59:06 cp3053 kernel: [2177718.395331] [   1203]     0  1203     1346       29    53248        0             0 agetty
Mar  8 11:59:06 cp3053 kernel: [2177718.395332] [   1210]     0  1210     1403       28    45056        0             0 agetty
Mar  8 11:59:06 cp3053 kernel: [2177718.395334] [   1232]   110  1232     5013      211    81920        0             0 exim4
Mar  8 11:59:06 cp3053 kernel: [2177718.395348] [   1249]   111  1249  1172233     7191   630784        0             0 prometheus-varn
Mar  8 11:59:06 cp3053 kernel: [2177718.395349] [   1293]   108  1293    28358     5001   122880        0             0 prometheus-traf
Mar  8 11:59:06 cp3053 kernel: [2177718.395351] [   3836]   114  3836     7632     1232    90112        0             0 vhtcpd
Mar  8 11:59:06 cp3053 kernel: [2177718.395353] [  23927]     0 23927  1363076    17557   892928        0             0 cadvisor
Mar  8 11:59:06 cp3053 kernel: [2177718.395355] [  36422]     0 36422   280811     1936   221184        0             0 rsyslogd
Mar  8 11:59:06 cp3053 kernel: [2177718.395356] [  43522]     0 43522    62531     1512   114688        0             0 traffic_manager
Mar  8 11:59:06 cp3053 kernel: [2177718.395358] [  43524]   116 43524   118909      929   114688        0             0 fifo-log-demux
Mar  8 11:59:06 cp3053 kernel: [2177718.395359] [  43525]   116 43525   118557      929   122880        0             0 fifo-log-demux
Mar  8 11:59:06 cp3053 kernel: [2177718.395361] [  43526]     0 43526      597       17    45056        0             0 sh
Mar  8 11:59:06 cp3053 kernel: [2177718.395364] [  43527]     0 43527      597       16    49152        0             0 atslog-backend
Mar  8 11:59:06 cp3053 kernel: [2177718.395366] [  43528]     0 43528  1282410     7805   684032        0             0 mtail
Mar  8 11:59:06 cp3053 kernel: [2177718.395368] [  43529]     0 43529   118582      908   114688        0             0 fifo-log-tailer
Mar  8 11:59:06 cp3053 kernel: [2177718.395369] [  43559]   116 43559  3265398  1874027 15798272        0             0 [TS_MAIN]
Mar  8 11:59:06 cp3053 kernel: [2177718.395372] [  48852]     0 48852    25316     1657    73728        0             0 varnishd
Mar  8 11:59:06 cp3053 kernel: [2177718.395374] [  48883]   112 48883 196669898 92556442 1361555456        0             0 cache-main
Mar  8 11:59:06 cp3053 kernel: [2177718.395379] [    567]     0   567   175635    26041   344064        0             0 varnishkafka
Mar  8 11:59:06 cp3053 kernel: [2177718.395381] [    568]     0   568     5049     1506    73728        0             0 python3
Mar  8 11:59:06 cp3053 kernel: [2177718.395383] [    570]     0   570     1682       56    53248        0             0 varnishmtail
Mar  8 11:59:06 cp3053 kernel: [2177718.395385] [    571]     0   571     5026     1506    73728        0             0 python3
Mar  8 11:59:06 cp3053 kernel: [2177718.395386] [    572]     0   572    22351    21013   212992        0             0 varnishncsa
Mar  8 11:59:06 cp3053 kernel: [2177718.395388] [    573]     0   573  1319596     8423   700416        0             0 mtail
Mar  8 11:59:06 cp3053 kernel: [2177718.395393] [    619]     0   619    22346    21011   208896        0             0 varnishlog
Mar  8 11:59:06 cp3053 kernel: [2177718.395396] [    621]     0   621    22351    21021   212992        0             0 varnishlog
Mar  8 11:59:06 cp3053 kernel: [2177718.395397] [    965]     0   965    62523      985   110592        0             0 traffic_manager
Mar  8 11:59:06 cp3053 kernel: [2177718.395399] [    967]   116   967   118557      945   114688        0             0 fifo-log-demux
Mar  8 11:59:06 cp3053 kernel: [2177718.395401] [    968]   116   968   118557     1464   122880        0             0 fifo-log-demux
Mar  8 11:59:06 cp3053 kernel: [2177718.395403] [    985]   116   985  1888016   899484  8122368        0             0 [TS_MAIN]
Mar  8 11:59:06 cp3053 kernel: [2177718.395405] [    994]   116   994     3975      255    73728        0             0 traffic_crashlo
Mar  8 11:59:06 cp3053 kernel: [2177718.395409] [  25220]     0 25220     5026     1498    69632        0             0 python3
Mar  8 11:59:06 cp3053 kernel: [2177718.395411] [  25223]     0 25223    21923    20541   217088        0             0 varnishlog
Mar  8 11:59:06 cp3053 kernel: [2177718.395415] [  43511]     0 43511  1129770     5799   610304        0             0 prometheus-rsys
Mar  8 11:59:06 cp3053 kernel: [2177718.395425] Out of memory: Kill process 48883 (cache-main) score 941 or sacrifice child
Mar  8 11:59:06 cp3053 kernel: [2177718.406694] Killed process 48883 (cache-main) total-vm:786679592kB, anon-rss:370143496kB, file-rss:0kB, shmem-rss:82304kB
Mar  8 11:59:07 cp3053 kernel: [2177718.698121]  nvme0n1: p1
Mar  8 11:59:38 cp3053 kernel: [2177750.432291] oom_reaper: reaped process 48883 (cache-main), now anon-rss:0kB, file-rss:0kB, shmem-rss:81996kB

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2020-03-09T05:34:17Z] <vgutierrez> restart ats-tls, ats-be and varnish-fe on cp3053 to clean up daemon restart alerts - T247195

Vgutierrez triaged this task as Medium priority.Mar 9 2020, 6:46 AM
Vgutierrez moved this task from Triage to Caching on the Traffic board.