
Cassandra/restbase2029-a (and others) oom-killed (kernel)
Closed, Resolved · Public

Assigned To: Eevans
Authored By: Eevans, Dec 14 2023, 3:05 PM
Referenced Files
F41619804: image.png
Dec 22 2023, 6:50 PM
F41617841: image.png
Dec 22 2023, 1:02 AM
F41617844: image.png
Dec 22 2023, 1:02 AM
F41617832: image.png
Dec 22 2023, 1:02 AM
F41617776: image.png
Dec 22 2023, 1:02 AM
F41611324: system-a.log-restbase2030.gz
Dec 18 2023, 4:11 PM
F41611322: dmesg.log-restbase2030.gz
Dec 18 2023, 4:11 PM
F41611326: system-b.log-restbase2029.gz
Dec 18 2023, 4:11 PM

Description

At around 2023-12-14T07:39:02, the Cassandra instance restbase2029-a was killed by the kernel (OOM). It was eventually restarted by Puppet and returned to service at approximately 2023-12-14T07:56:54.

This host was recently added as part of a (currently ongoing) refresh; it has only been online a few days.

1[Dec14 07:36] ReadStage-2 invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
2[ +0.000007] CPU: 45 PID: 1947633 Comm: ReadStage-2 Not tainted 5.10.0-26-amd64 #1 Debian 5.10.197-1
3[ +0.000002] Hardware name: Dell Inc. PowerEdge R450/073H50, BIOS 1.11.2 08/10/2023
4[ +0.000001] Call Trace:
5[ +0.000012] dump_stack+0x6b/0x83
6[ +0.000005] dump_header+0x4a/0x1f4
7[ +0.000002] oom_kill_process.cold+0xb/0x10
8[ +0.000009] out_of_memory+0x1bd/0x4e0
9[ +0.000006] __alloc_pages_slowpath.constprop.0+0xbcc/0xc90
10[ +0.000003] __alloc_pages_nodemask+0x2de/0x310
11[ +0.000005] alloc_page_interleave+0x13/0x70
12[ +0.000004] pagecache_get_page+0x175/0x390
13[ +0.000002] filemap_fault+0x6a2/0x900
14[ +0.000006] ? xas_load+0x5/0x80
15[ +0.000049] ext4_filemap_fault+0x2d/0x50 [ext4]
16[ +0.000004] __do_fault+0x34/0x170
17[ +0.000002] handle_mm_fault+0x124d/0x1c00
18[ +0.000007] do_user_addr_fault+0x1b8/0x400
19[ +0.000006] exc_page_fault+0x78/0x160
20[ +0.000007] ? asm_exc_page_fault+0x8/0x30
21[ +0.000002] asm_exc_page_fault+0x1e/0x30
22[ +0.000004] RIP: 0033:0x7f4173fe6228
23[ +0.000004] Code: 66 90 89 84 24 00 c0 fe ff 55 48 83 ec 30 44 8b 56 18 44 8b 46 1c 45 2b c2 41 83 f8 02 7c 2d 4c 8b 5e 10 0f b6 6e 2a 4d 63 c2 <43> 0f bf 04 03 41 83 c2 02 44 89 56 18 85 ed 75 34 0f c8 c1 f8 10
24[ +0.000001] RSP: 002b:00007f415befbba0 EFLAGS: 00010206
25[ +0.000003] RAX: 00000007c04b7fa0 RBX: 00007f40fa09b1a0 RCX: 000000000000003c
26[ +0.000001] RDX: 00000000ffffffe0 RSI: 000000068c7955f0 RDI: 00007f414aa63caa
27[ +0.000001] RBP: 0000000000000000 R08: 00000000001bea73 R09: 000000068c7955f0
28[ +0.000001] R10: 00000000001bea73 R11: 00007f3e6753a593 R12: 0000000000000000
29[ +0.000002] R13: 00000000ffffffe0 R14: 000000068c7955b8 R15: 00007f40ebe78000
30[ +0.000002] Mem-Info:
31[ +0.000013] active_anon:8992390 inactive_anon:11254972 isolated_anon:0
32 active_file:7441 inactive_file:648 isolated_file:0
33 unevictable:10658092 dirty:0 writeback:52
34 slab_reclaimable:639587 slab_unreclaimable:168059
35 mapped:14226 shmem:42840 pagetables:792145 bounce:0
36 free:189277 free_pcp:393 free_cma:0
37[ +0.000003] Node 0 active_anon:6454492kB inactive_anon:33721376kB active_file:11844kB inactive_file:2600kB unevictable:21380872kB isolated(anon):0kB isolated(file):0kB mapped:31048kB dirty:0kB writeback:124kB shmem:93424kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 12144640kB writeback_tmp:0kB kernel_stack:35136kB all_unreclaimable? no
38[ +0.000003] Node 1 active_anon:29515068kB inactive_anon:11298512kB active_file:17920kB inactive_file:0kB unevictable:21251496kB isolated(anon):0kB isolated(file):0kB mapped:25856kB dirty:0kB writeback:84kB shmem:77936kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 9302016kB writeback_tmp:0kB kernel_stack:26880kB all_unreclaimable? no
39[ +0.000003] Node 0 DMA free:11800kB min:8kB low:20kB high:32kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15980kB managed:15896kB mlocked:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
40[ +0.000003] lowmem_reserve[]: 0 1384 63846 63846 63846
41[ +0.000005] Node 0 DMA32 free:252856kB min:972kB low:2388kB high:3804kB reserved_highatomic:2048KB active_anon:11660kB inactive_anon:877124kB active_file:8kB inactive_file:0kB unevictable:197368kB writepending:0kB present:1519588kB managed:1454052kB mlocked:197368kB pagetables:1916kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
42[ +0.000003] lowmem_reserve[]: 0 0 62461 62461 62461
43[ +0.000005] Node 0 Normal free:197508kB min:240472kB low:304432kB high:368392kB reserved_highatomic:2048KB active_anon:6442832kB inactive_anon:32844252kB active_file:11840kB inactive_file:3112kB unevictable:21183504kB writepending:0kB present:65011712kB managed:63960996kB mlocked:21183504kB pagetables:1500212kB bounce:0kB free_pcp:792kB local_pcp:144kB free_cma:0kB
44[ +0.000004] lowmem_reserve[]: 0 0 0 0 0
45[ +0.000008] Node 1 Normal free:294944kB min:45260kB low:111260kB high:177260kB reserved_highatomic:2048KB active_anon:29515068kB inactive_anon:11298512kB active_file:17920kB inactive_file:0kB unevictable:21251496kB writepending:36kB present:67108864kB managed:66007432kB mlocked:21251496kB pagetables:1666452kB bounce:0kB free_pcp:864kB local_pcp:60kB free_cma:0kB
46[ +0.000004] lowmem_reserve[]: 0 0 0 0 0
47[ +0.000003] Node 0 DMA: 0*4kB 1*8kB (U) 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11800kB
48[ +0.000013] Node 0 DMA32: 1731*4kB (UMEH) 1777*8kB (UMEH) 669*16kB (UMEH) 642*32kB (UMEH) 372*64kB (UMEH) 187*128kB (UMEH) 121*256kB (UMEH) 70*512kB (UME) 84*1024kB (UM) 0*2048kB 0*4096kB = 252964kB
49[ +0.000013] Node 0 Normal: 1249*4kB (UMEH) 11110*8kB (UMEH) 4343*16kB (UEH) 1287*32kB (UEH) 1*64kB (H) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 204612kB
50[ +0.000007] Node 1 Normal: 6274*4kB (UMEH) 16185*8kB (UMEH) 6598*16kB (UEH) 1205*32kB (UEH) 3*64kB (H) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 298896kB
51[ +0.000022] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
52[ +0.000001] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
53[ +0.000001] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
54[ +0.000001] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
55[ +0.000000] 65063 total pagecache pages
56[ +0.000003] 3881 pages in swap cache
57[ +0.000002] Swap cache stats: add 1024128, delete 1020326, find 528702/716598
58[ +0.000001] Free swap = 0kB
59[ +0.000001] Total swap = 975868kB
60[ +0.000001] 33414036 pages RAM
61[ +0.000000] 0 pages HighMem/MovableOnly
62[ +0.000001] 554442 pages reserved
63[ +0.000001] 0 pages hwpoisoned
64[ +0.000001] Tasks state (memory values in pages):
65[ +0.000001] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
66[ +0.000058] [ 827] 0 827 20850 827 184320 120 -250 systemd-journal
67[ +0.000004] [ 849] 0 849 5563 757 65536 76 -1000 systemd-udevd
68[ +0.000005] [ 1074] 0 1074 842 205 45056 17 0 mdadm
69[ +0.000004] [ 1116] 105 1116 22109 256 77824 2 0 systemd-timesyn
70[ +0.000002] [ 1162] 0 1162 1496216 12353 888832 3477 0 cadvisor
71[ +0.000002] [ 1169] 0 1169 1686 565 49152 1 0 cron
72[ +0.000003] [ 1175] 104 1175 2098 496 53248 6 -900 dbus-daemon
73[ +0.000003] [ 1189] 0 1189 21215 189 61440 237 0 ipmiseld
74[ +0.000002] [ 1191] 106 1191 4651 776 73728 4 0 lldpd
75[ +0.000003] [ 1194] 110 1194 43741 3522 102400 48 0 python3
76[ +0.000002] [ 1196] 106 1196 4651 241 69632 4 0 lldpd
77[ +0.000003] [ 1197] 110 1197 955619 4184 528384 3604 0 prometheus-ipmi
78[ +0.000002] [ 1200] 0 1200 1369 461 53248 1 0 rasdaemon
79[ +0.000003] [ 1203] 0 1203 2953 759 65536 7 0 smartd
80[ +0.000003] [ 1213] 0 1213 3634 561 69632 10 0 systemd-logind
81[ +0.000003] [ 1218] 113 1218 1486 516 53248 1 0 ulogd
82[ +0.000002] [ 1235] 0 1235 3338 674 65536 3 -1000 sshd
83[ +0.000003] [ 1255] 0 1255 1461 391 53248 0 0 agetty
84[ +0.000002] [ 1304] 0 1304 1369 504 49152 32 0 agetty
85[ +0.000003] [ 1349] 109 1349 4650 301 77824 3 0 exim4
86[ +0.000002] [ 1352] 0 1352 3854 521 73728 7 0 systemd
87[ +0.000002] [ 1355] 0 1355 41818 619 98304 178 0 (sd-pam)
88[ +0.000003] [ 1479] 0 1479 1418533 7042 761856 100 0 confd
89[ +0.000002] [ 2263] 499 2263 3855 542 65536 2 0 systemd
90[ +0.000003] [ 2264] 499 2264 41818 598 102400 199 0 (sd-pam)
91[ +0.000004] [ 215141] 11774 215141 3856 135 65536 221 0 systemd
92[ +0.000002] [ 215143] 11774 215143 41856 332 102400 524 0 (sd-pam)
93[ +0.000003] [1103991] 110 1103991 2432582 13081 1323008 3327 0 prometheus-node
94[ +0.000005] [1523632] 114 1523632 4102 940 61440 184 0 python3
95[ +0.000003] [1523638] 114 1523638 836258 53113 1163264 380 0 envoy
96[ +0.000003] [1523654] 0 1523654 867249 4036 487424 191 0 rsyslogd
97[ +0.000002] [1527089] 0 1527089 2161 600 57344 3 -500 nrpe
98[ +0.000027] [1530808] 498 1530808 1387 439 49152 104 0 firejail
99[ +0.000003] [1530810] 498 1530810 1390 453 49152 116 0 firejail
100[ +0.000004] [1530830] 498 1530830 230376 11076 1785856 1507 0 nodejs
101[ +0.000003] [1535456] 115 1535456 187368136 5349014 1055092736 112166 0 java
102[ +0.000003] [1947254] 115 1947254 165026957 4999859 743841792 32206 0 java
103[ +0.000010] [2167712] 498 2167712 466424 245266 10825728 733 0 node
104[ +0.000003] [2167728] 498 2167728 471943 256217 11067392 1133 0 node
105[ +0.000002] [2167825] 498 2167825 466250 246813 11055104 498 0 node
106[ +0.000002] [2167829] 498 2167829 491933 263580 11350016 7403 0 node
107[ +0.000003] [2167845] 498 2167845 464195 246454 10776576 210 0 node
108[ +0.000009] [2168000] 498 2168000 471585 248828 11059200 715 0 node
109[ +0.000015] [2168010] 498 2168010 433672 213794 10686464 865 0 node
110[ +0.000009] [2168022] 498 2168022 517528 287282 11493376 635 0 node
111[ +0.000003] [2168131] 498 2168131 500395 283896 11231232 712 0 node
112[ +0.000002] [2168143] 498 2168143 469079 248697 11071488 1471 0 node
113[ +0.000002] [2168158] 498 2168158 507745 280501 11272192 233 0 node
114[ +0.000002] [2168219] 498 2168219 472406 253401 11157504 984 0 node
115[ +0.000006] [2168226] 498 2168226 507261 290541 11296768 1221 0 node
116[ +0.000004] [2168243] 498 2168243 506479 276106 11288576 1691 0 node
117[ +0.000006] [2168297] 498 2168297 487157 257163 11251712 1668 0 node
118[ +0.000005] [2168350] 498 2168350 485062 260971 10866688 326 0 node
119[ +0.000009] [2168380] 498 2168380 481668 256102 10903552 3101 0 node
120[ +0.000009] [2169963] 115 2169963 149513155 4867441 767066112 15938 0 java
121[ +0.000009] [2171928] 498 2171928 504180 272276 11350016 4197 0 node
122[ +0.000010] [2177957] 498 2177957 477712 250209 10760192 2352 0 node
123[ +0.000009] [2197191] 498 2197191 449542 229138 10465280 2455 0 node
124[ +0.000011] [2201841] 498 2201841 513776 293486 11198464 1005 0 node
125[ +0.000011] [2202814] 498 2202814 459962 239748 10698752 1618 0 node
126[ +0.000011] [2637305] 498 2637305 482167 262073 10911744 296 0 node
127[ +0.000009] [2637306] 498 2637306 466387 238854 10727424 2777 0 node
128[ +0.000007] [2637325] 498 2637325 456391 233586 10719232 1151 0 node
129[ +0.000003] [2637327] 498 2637327 475114 244669 10993664 440 0 node
130[ +0.000003] [2637340] 498 2637340 464500 244244 10395648 1763 0 node
131[ +0.000003] [2637358] 498 2637358 480203 255528 10993664 284 0 node
132[ +0.000002] [2637368] 498 2637368 492600 266495 11018240 1419 0 node
133[ +0.000003] [2637376] 498 2637376 722106 481947 20185088 1706 0 node
134[ +0.000002] [2637386] 498 2637386 460792 238764 10747904 1087 0 node
135[ +0.000003] [2637388] 498 2637388 510913 286403 11264000 722 0 node
136[ +0.000002] [2637401] 498 2637401 485055 262631 11202560 2135 0 node
137[ +0.000003] [2637418] 498 2637418 454320 241039 10993664 126 0 node
138[ +0.000003] [2637423] 498 2637423 527441 304036 11612160 63 0 node
139[ +0.000003] [2637449] 498 2637449 485716 268196 11272192 868 0 node
140[ +0.000010] [2637486] 498 2637486 473869 257401 11182080 537 0 node
141[ +0.000009] [2637507] 498 2637507 468617 243185 10973184 749 0 node
142[ +0.000008] [2637549] 498 2637549 466817 239098 11083776 694 0 node
143[ +0.000008] [2637592] 498 2637592 466898 246853 10907648 220 0 node
144[ +0.000007] [2637605] 498 2637605 469777 245670 11014144 1515 0 node
145[ +0.000009] [2637607] 498 2637607 473757 256467 11001856 285 0 node
146[ +0.000004] [2854469] 498 2854469 492895 274333 11292672 70 0 node
147[ +0.000003] [2904938] 498 2904938 479045 255341 11194368 244 0 node
148[ +0.000005] [2988539] 498 2988539 440872 217748 10661888 540 0 node
149[ +0.000007] [3042631] 498 3042631 464942 240412 10903552 218 0 node
150[ +0.000009] [3066068] 498 3066068 435791 212620 10817536 352 0 node
151[ +0.000003] [3093511] 498 3093511 469193 250839 10895360 968 0 node
152[ +0.000009] [3100079] 498 3100079 452255 241082 10543104 309 0 node
153[ +0.000007] [3103226] 498 3103226 423421 210322 10588160 154 0 node
154[ +0.000009] [3276729] 498 3276729 436932 215825 10498048 357 0 node
155[ +0.000008] [3311694] 498 3311694 430201 209912 10661888 88 0 node
156[ +0.000009] [3343248] 498 3343248 451057 232825 10330112 33 0 node
157[ +0.000004] [3365411] 498 3365411 438962 215973 10571776 17 0 node
158[ +0.000003] [3365422] 498 3365422 443413 214794 10661888 6359 0 node
159[ +0.000003] [3439458] 498 3439458 412234 197818 10629120 185 0 node
160[ +0.000003] [3688560] 498 3688560 472065 258495 10797056 119 0 node
161[ +0.000002] [3743542] 498 3743542 395875 179137 10059776 90 0 node
162[ +0.000003] [3822927] 498 3822927 372708 153158 10166272 46 0 node
163[ +0.000002] [3822959] 498 3822959 371737 150981 10051584 448 0 node
164[ +0.000008] [3823017] 498 3823017 375957 157795 10375168 644 0 node
165[ +0.000009] [3832928] 498 3832928 412391 193376 10780672 770 0 node
166[ +0.000002] [3833199] 498 3833199 363063 143985 9474048 762 0 node
167[ +0.000004] [3966752] 498 3966752 415068 194437 10231808 14 0 node
168[ +0.000008] [4019421] 0 4019421 1156890 2906 606208 50 0 prometheus-rsys
169[ +0.000040] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/system.slice/cassandra-a.service,task=java,pid=1535456,uid=115
170[ +0.002400] Out of memory: Killed process 1535456 (java) total-vm:749472544kB, anon-rss:21396056kB, file-rss:0kB, shmem-rss:0kB, UID:115 pgtables:1030364kB oom_score_adj:0

Also of interest in the dmesg output are errors from the SAS controller driver. There are several instances of these spanning days in the full output (attached); an example is below.

1[Dec12 23:10] mpt3sas_cm0: CurrentHostPageSize is 0: Setting default host page size to 4k
2[ +0.318874] mpt3sas_cm0: _base_display_fwpkg_version: complete
3[ +0.000008] mpt3sas_cm0: FW Package Ver(24.15.10.00)
4[ +0.000750] mpt3sas_cm0: SAS3816: FWVersion(24.15.03.00), ChipRevision(0x00), BiosVersion(09.47.01.00)
5[ +0.000002] NVMe
6[ +0.000002] mpt3sas_cm0: Protocol=(Initiator,Target), Capabilities=(TLR,EEDP,Diag Trace Buffer,Task Set Full,NCQ)
7[ +0.000199] mpt3sas_cm0: Enable interrupt coalescing only for first 8 reply queues
8[ +0.000115] mpt3sas_cm0: performance mode: balanced
9[ +0.000016] mpt3sas_cm0: sending port enable !!
10[ +9.422469] mpt3sas_cm0: port enable: SUCCESS
11[ +0.000265] mpt3sas_cm0: search for end-devices: start
12[ +0.000986] scsi target0:0:3: handle(0x0012), sas_addr(0x3f4fe0806260f508)
13[ +0.000004] scsi target0:0:3: enclosure logical id(0x3f4ee08062092108), slot(8)
14[ +0.000061] scsi target0:0:1: handle(0x0017), sas_addr(0x3f4ee0806260f50d)
15[ +0.000003] scsi target0:0:1: enclosure logical id(0x3f4ee08062092108), slot(2)
16[ +0.000061] scsi target0:0:2: handle(0x0018), sas_addr(0x3f4ee0806260f50e)
17[ +0.000002] scsi target0:0:2: enclosure logical id(0x3f4ee08062092108), slot(1)
18[ +0.000003] handle changed from(0x0019)!!!
19[ +0.000061] scsi target0:0:0: handle(0x0019), sas_addr(0x3f4ee0806260f50f)
20[ +0.000002] scsi target0:0:0: enclosure logical id(0x3f4ee08062092108), slot(0)
21[ +0.000002] handle changed from(0x0018)!!!
22[ +0.000062] mpt3sas_cm0: search for end-devices: complete
23[ +0.000002] mpt3sas_cm0: search for end-devices: start
24[ +0.000001] mpt3sas_cm0: search for PCIe end-devices: complete
25[ +0.000002] mpt3sas_cm0: search for expanders: start
26[ +0.000002] mpt3sas_cm0: search for expanders: complete
27[ +0.000012] mpt3sas_cm0: mpt3sas_base_hard_reset_handler: SUCCESS
28[ +0.000002] mpt3sas_cm0: _base_fault_reset_work: hard reset: success
29[ +0.000013] mpt3sas_cm0: removing unresponding devices: start
30[ +0.000005] mpt3sas_cm0: removing unresponding devices: end-devices
31[ +0.000003] mpt3sas_cm0: Removing unresponding devices: pcie end-devices
32[ +0.000003] mpt3sas_cm0: removing unresponding devices: expanders
33[ +0.000002] mpt3sas_cm0: removing unresponding devices: complete
34[ +0.000007] mpt3sas_cm0: scan devices: start
35[ +0.000356] mpt3sas_cm0: scan devices: expanders start
36[ +0.000066] mpt3sas_cm0: break from expander scan: ioc_status(0x0022), loginfo(0x310f0400)
37[ +0.000003] mpt3sas_cm0: scan devices: expanders complete
38[ +0.000003] mpt3sas_cm0: scan devices: end devices start
39[ +0.001232] mpt3sas_cm0: break from end device scan: ioc_status(0x0022), loginfo(0x310f0400)
40[ +0.000003] mpt3sas_cm0: scan devices: end devices complete
41[ +0.000002] mpt3sas_cm0: scan devices: pcie end devices start
42[ +0.000058] mpt3sas_cm0: break from pcie end device scan: ioc_status(0x0022), loginfo(0x310f0400)
43[ +0.000002] mpt3sas_cm0: pcie devices: pcie end devices complete
44[ +0.000002] mpt3sas_cm0: scan devices: complete
45[ +0.122612] sd 0:0:1:0: Power-on or device reset occurred
46[ +0.000040] sd 0:0:2:0: Power-on or device reset occurred
47[ +0.000179] sd 0:0:0:0: Power-on or device reset occurred
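
For anyone digging through the attached dmesg captures, here is a minimal sketch for tallying how often these reset-related messages occur per host. It assumes the attachments are plain gzip'd text dumps in the same format as the excerpt above; the match patterns are taken from that excerpt.

```
#!/usr/bin/env python3
# Tally mpt3sas reset-related events in one or more gzip'd dmesg captures.
import gzip
import re
import sys
from collections import Counter

PATTERNS = {
    "hard reset":   re.compile(r"_base_fault_reset_work: hard reset"),
    "port enable":  re.compile(r"port enable: SUCCESS"),
    "device reset": re.compile(r"Power-on or device reset occurred"),
}

def tally(path):
    counts = Counter()
    with gzip.open(path, "rt", errors="replace") as fh:
        for line in fh:
            for name, pat in PATTERNS.items():
                if pat.search(line):
                    counts[name] += 1
    return counts

if __name__ == "__main__":
    # e.g. ./tally_resets.py dmesg.log-restbase2030.gz
    for path in sys.argv[1:]:
        print(path, dict(tally(path)))
```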



Update:

This has continued to happen: a total of nine times so far. Thus far it has only happened on the recently added Dell R450s (see: T352468), and only on nodes in rack (row) b.

host           | ooms | rack
restbase2028-a | ✔✔✔  | b
restbase2028-b | ✔✔   | b
restbase2028-c |      | b
restbase2029-a | ✔✔   | b
restbase2029-b |      | b
restbase2029-c |      | b
restbase2030-a |      | b
restbase2030-b |      | b
restbase2030-c |      | b
restbase2031-a |      | c
restbase2031-b |      | c
restbase2031-c |      | c
restbase2032-a |      | c
restbase2032-b |      | c
restbase2032-c |      | c

Event Timeline

Eevans triaged this task as Medium priority. Dec 14 2023, 3:29 PM
Eevans updated the task description. (Show Details)

Over the weekend, there were four additional instances killed by the kernel (OOM). Like the original, these are all on the new batch of hardware (Dell R450s), and all are logging resets on their SAS controllers.

restbase2028-a    2023-12-16T09:08:31
restbase2028-b    2023-12-17T02:03:59
restbase2029-b    2023-12-17T03:49:29
restbase2030-a    2023-12-17T11:58:11

And two more to add to the list:

restbase2030-c    2023-12-18T03:10:22...
restbase2028-c    2023-12-19T15:40:04...

One interesting thing of note: of the 7 examples of this happening thus far, each has been a unique instance; no single instance has experienced this twice.

restbase2028-a
restbase2028-b
restbase2028-c
restbase2029-a
restbase2029-b
restbase2029-c
restbase2030-a
restbase2030-b
restbase2030-c
restbase2031-a
restbase2031-b
restbase2031-c
restbase2032-a
restbase2032-b
restbase2032-c
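
A quick way to double-check that per-instance tally against the attached system logs is to count oom-kill lines by cgroup. A minimal sketch, assuming gzip'd text logs and the task_memcg=/system.slice/cassandra-X.service format visible in the dmesg excerpt above:

```
#!/usr/bin/env python3
# Count OOM kills per Cassandra instance in gzip'd syslog/dmesg captures.
import gzip
import re
import sys
from collections import Counter

OOM_RE = re.compile(r"task_memcg=/system\.slice/(cassandra-[abc])\.service")

def count_oom_kills(paths):
    counts = Counter()
    for path in paths:
        with gzip.open(path, "rt", errors="replace") as fh:
            for line in fh:
                m = OOM_RE.search(line)
                if m:
                    counts[(path, m.group(1))] += 1
    return counts

if __name__ == "__main__":
    # e.g. ./count_ooms.py system-b.log-restbase2029.gz system-a.log-restbase2030.gz
    for (path, instance), n in sorted(count_oom_kills(sys.argv[1:]).items()):
        print(f"{path}\t{instance}\t{n}")
```
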
Eevans renamed this task from Cassandra/restbase2029-a oom-killed (kernel) to Cassandra/restbase2029-a (and others) oom-killed (kernel). Dec 19 2023, 7:35 PM
Eevans updated the task description. (Show Details)
Eevans updated the task description. (Show Details)
[Thu Dec 21 11:29:36 2023] Out of memory: Killed process 2690640 (java) total-vm:983367232kB, anon-rss:19016436kB, file-rss:0kB, shmem-rss:0kB, UID:115 pgtables:850624kB oom_score_adj:0

Another OOMKill for cassandra-a on restbase2028

Mentioned in SAL (#wikimedia-operations) [2023-12-21T11:37:50Z] <claime> Manually restarted cassandra-a service on restbase2028 following OOM - T353456

I had paused the hardware refresh after discovering this, but since we haven't had any OOMKills in row c, I've started bootstrapping the final three instances (on restbase2033). I'm interested to see whether this remains isolated to row b, even after achieving parity there.

Eevans raised the priority of this task from Medium to High. Dec 22 2023, 1:02 AM
Eevans added a subscriber: hnowlan.

The graph below shows the RSS associated with the restbase.service unit in codfw over the last week. The red, purple, and yellow lines hovering around 60GB are for restbase2028, restbase2029, and restbase2030 respectively.

image.png (771×1 px, 256 KB)
codfw/restbase

Here is eqiad for comparison:

image.png (771×1 px, 262 KB)
eqiad/restbase

Here are plots for Cassandra covering the last four weeks:

image.png (771×1 px, 468 KB)
codfw/cassandra
image.png (771×1 px, 553 KB)
eqiad/cassandra

Cassandra doesn't appear to be the culprit here; it's RESTBase memory that is aberrant. RESTBase forks workers on startup, so its total memory consumption is distributed over that many PIDs. However, once memory becomes constrained, it ends up being one of the three Cassandra processes that presents as the lowest-hanging fruit.
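
To illustrate why the Cassandra JVMs get picked: the kernel's badness heuristic is roughly resident pages plus swap entries plus page-table pages (shifted by oom_score_adj), so any single ~20 GiB JVM scores far above any single ~1 GiB RESTBase worker, even though the workers collectively hold more memory. A rough sketch using a few rows from the task dump above; the per-worker figures are approximations:

```
# Approximate the kernel's oom_badness() for a few tasks from the dump above.
PAGE = 4096  # bytes per page

tasks = [
    # (label, rss pages, pgtables bytes, swap entries) from the task table
    ("java pid 1535456 (cassandra-a, the task killed)", 5_349_014, 1_055_092_736, 112_166),
    ("java pid 1947254 (another Cassandra instance)",   4_999_859,   743_841_792,  32_206),
    ("node (typical RESTBase worker, approximate)",        250_000,    11_000_000,   1_000),
]

def badness(rss_pages, pgtables_bytes, swapents, oom_score_adj=0,
            totalpages=33_414_036):
    # Roughly: rss + swap entries + page-table pages, plus adj scaled by total RAM.
    points = rss_pages + swapents + pgtables_bytes // PAGE
    return points + (oom_score_adj * totalpages) // 1000

for name, rss, pgt, swp in tasks:
    gib = badness(rss, pgt, swp) * PAGE / 2**30
    print(f"{name:48s} ~{gib:5.1f} GiB badness")

# The workers are individually small but collectively large (approximate):
print(f"{'60 RESTBase workers combined':48s} ~{60 * 250_000 * PAGE / 2**30:5.1f} GiB RSS")
```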

The reason this is only manifesting on the new nodes is that RESTBase is configured to launch as many workers as there are CPUs; on the new nodes that is 64, on the older ones it is 40.
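
Back-of-envelope, using figures from the dmesg dump above (~127 GiB of RAM from "33414036 pages RAM", three Cassandra instances at roughly 20 GiB anon-rss each, and RESTBase workers at roughly 1 GiB each; the per-worker figure is an approximation):

```
# Rough per-host memory budget for 40 vs 64 RESTBase workers.
PAGE = 4096
ram_gib = 33_414_036 * PAGE / 2**30   # ~127 GiB, from the dmesg dump above
cassandra_gib = 3 * 20                # three Cassandra instances per host, ~20 GiB each
worker_gib = 1.0                      # approximate RSS per RESTBase worker

for workers in (40, 64):
    used = cassandra_gib + workers * worker_gib
    print(f"{workers} workers: ~{used:.0f} GiB of ~{ram_gib:.0f} GiB used "
          f"(~{ram_gib - used:.0f} GiB left for page cache, heap spikes, everything else)")
```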

@hnowlan I realize now is not the time for deployments (and tomorrow being Friday only makes this worse), but do you think we could make an exception and deploy a configuration change? The most conservative change would be to set num_workers to 40 (to match the other nodes). If we don't, I think we can expect these OOMkills to recur throughout the holiday break. :(

Change 985116 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/restbase/deploy@master] Make num_workers configurable. Reduce workers in prod.

https://gerrit.wikimedia.org/r/985116

Change 985154 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] RESTBase: Hardcode no_worker patch to stop OOM

https://gerrit.wikimedia.org/r/985154

Change 985154 merged by Alexandros Kosiaris:

[operations/puppet@production] RESTBase: Hardcode no_worker patch to stop OOM

https://gerrit.wikimedia.org/r/985154

Change 985116 abandoned by Jgiannelos:

[mediawiki/services/restbase/deploy@master] Make num_workers configurable. Reduce workers in prod.

Reason:

https://gerrit.wikimedia.org/r/985116

Change 985116 restored by Jgiannelos:

[mediawiki/services/restbase/deploy@master] Make num_workers configurable. Reduce workers in prod.

https://gerrit.wikimedia.org/r/985116

Change 985159 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/restbase/deploy@master] Use num_workers var in config template

https://gerrit.wikimedia.org/r/985159

Change 985116 abandoned by Jgiannelos:

[mediawiki/services/restbase/deploy@master] Make num_workers configurable. Reduce workers in prod.

Reason:

https://gerrit.wikimedia.org/r/985116

Change 985159 merged by Jgiannelos:

[mediawiki/services/restbase/deploy@master] Use num_workers var in config template

https://gerrit.wikimedia.org/r/985159

Change 985161 had a related patch set uploaded (by Eevans; author: Eevans):

[mediawiki/services/restbase/deploy@master] Add new (refreshed) hosts to targets

https://gerrit.wikimedia.org/r/985161

Thanks for taking care of this @Jgiannelos! Unfortunately, it looks like targets didn't get updated for the newly added hosts (the ones where the OOMkills are happening), and so the change wasn't deployed there. See: https://gerrit.wikimedia.org/r/c/mediawiki/services/restbase/deploy/+/985161

Can we run the deploy again (after merging r985161)?

Change 985161 merged by Eevans:

[mediawiki/services/restbase/deploy@master] Add new (refreshed) hosts to targets

https://gerrit.wikimedia.org/r/985161

Change 985163 had a related patch set uploaded (by Eevans; author: Eevans):

[mediawiki/services/restbase/deploy@master] Remove restbase203[4-5] (not yet ready for deploy)

https://gerrit.wikimedia.org/r/985163

Change 985163 merged by Eevans:

[mediawiki/services/restbase/deploy@master] Remove restbase203[4-5] (not yet ready for deploy)

https://gerrit.wikimedia.org/r/985163

Eevans claimed this task.

Memory consumption is now consistent with the rest of the cluster; closing this as done.

image.png (771×1 px, 226 KB)
codfw/restbase