Page MenuHomePhabricator

Post pc1013 crash
Closed, DuplicatePublic

Description

pc1013 silently crashed with no alarming or icinga or prometheus, despite its notifications being enabled. P69388 contains the notes taken going through the server events and logs

  • diagnosis P69389
  • create a monitoring improvement task T373037
  • upgrade and reinsert in production pool as a spare in former stead of pc1015 T376387

Event Timeline

ABran-WMF changed the task status from Open to In Progress.Mon, Sep 23, 12:11 PM
ABran-WMF triaged this task as Medium priority.
ABran-WMF moved this task from Triage to In progress on the DBA board.
ABran-WMF updated the task description. (Show Details)
ABran-WMF added a subscriber: Marostegui.

Mentioned in SAL (#wikimedia-operations) [2024-09-23T12:12:56Z] <jynus> restarting replication on pc1013 after crash T375382

It appears it was a hw error on memory leading to an uncorrectable memory error, leading to killing mysql:

1[Mon Sep 23 10:49:40 2024] hrtimer: interrupt took 5564317359 ns
2[Mon Sep 23 10:49:41 2024] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 506.505 msecs
3[Mon Sep 23 10:49:43 2024] INFO: NMI handler (ghes_notify_nmi) took too long to run: 1003.720 msecs
4[Mon Sep 23 10:49:46 2024] perf: interrupt took too long (3963905 > 7733), lowering kernel.perf_event_max_sample_rate to 250
5[Mon Sep 23 10:49:55 2024] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
6[Mon Sep 23 10:49:56 2024] rcu: 0-....: (62 ticks this GP) idle=26fc/1/0x4000000000000000 softirq=2442529261/2442529261 fqs=26
7[Mon Sep 23 10:49:56 2024] rcu: 2-....: (109 ticks this GP) idle=db84/1/0x4000000000000000 softirq=2340670043/2340670043 fqs=26
8[Mon Sep 23 10:49:56 2024] (detected by 23, t=5455 jiffies, g=8359567565, q=879 ncpus=32)
9[Mon Sep 23 10:49:56 2024] Sending NMI from CPU 23 to CPUs 0:
10[Mon Sep 23 10:49:56 2024] NMI backtrace for cpu 0
11[Mon Sep 23 10:49:56 2024] CPU: 0 PID: 15 Comm: ksoftirqd/0 Not tainted 6.1.0-13-amd64 #1 Debian 6.1.55-1
12[Mon Sep 23 10:49:56 2024] Hardware name: Dell Inc. PowerEdge R440/04JN2K, BIOS 2.11.2 04/22/2021
13[Mon Sep 23 10:49:56 2024] RIP: 0010:enqueue_hrtimer+0x5/0x80
14[Mon Sep 23 10:49:56 2024] Code: 49 c6 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 49 89 44 24 28 eb 82 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f 44 00 00 <55> 48 89 fd 53 48 89 f3 66 90 8b 4b 08 48 8b 13 b8 01 00 00 00 48
15[Mon Sep 23 10:49:56 2024] RSP: 0018:ffffb72180003f28 EFLAGS: 00000046
16[Mon Sep 23 10:49:56 2024] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 005e961a15735e00
17[Mon Sep 23 10:49:56 2024] RDX: 0000000000000000 RSI: ffff9f7c80e23040 RDI: ffff9f7c80e23580
18[Mon Sep 23 10:49:56 2024] RBP: ffff9f7c80e23000 R08: 005e961a15b06700 R09: 7fffffffffffffff
19[Mon Sep 23 10:49:56 2024] R10: 005e961a15735e00 R11: 0000000002a70604 R12: 0000000000000082
20[Mon Sep 23 10:49:56 2024] R13: ffff9f7c80e23060 R14: ffff9f7c80e23580 R15: ffff9f7c80e23040
21[Mon Sep 23 10:49:56 2024] FS: 0000000000000000(0000) GS:ffff9f7c80e00000(0000) knlGS:0000000000000000
22[Mon Sep 23 10:49:56 2024] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
23[Mon Sep 23 10:49:57 2024] CR2: 00007f459638802a CR3: 0000005462010004 CR4: 00000000007706f0
24[Mon Sep 23 10:49:57 2024] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
25[Mon Sep 23 10:49:57 2024] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
26[Mon Sep 23 10:49:57 2024] PKRU: 55555554
27[Mon Sep 23 10:49:57 2024] Call Trace:
28[Mon Sep 23 10:49:57 2024] <NMI>
29[Mon Sep 23 10:49:57 2024] ? nmi_cpu_backtrace.cold+0x1b/0x76
30[Mon Sep 23 10:49:57 2024] ? nmi_cpu_backtrace_handler+0xd/0x20
31[Mon Sep 23 10:49:57 2024] ? nmi_handle+0x5a/0x120
32[Mon Sep 23 10:49:57 2024] ? enqueue_hrtimer+0x5/0x80
33[Mon Sep 23 10:49:57 2024] ? default_do_nmi+0x69/0x170
34[Mon Sep 23 10:49:57 2024] ? exc_nmi+0x13c/0x170
35[Mon Sep 23 10:49:57 2024] ? end_repeat_nmi+0x16/0x67
36[Mon Sep 23 10:49:57 2024] ? enqueue_hrtimer+0x5/0x80
37[Mon Sep 23 10:49:57 2024] ? enqueue_hrtimer+0x5/0x80
38[Mon Sep 23 10:49:57 2024] ? enqueue_hrtimer+0x5/0x80
39[Mon Sep 23 10:49:57 2024] </NMI>
40[Mon Sep 23 10:49:57 2024] <IRQ>
41[Mon Sep 23 10:49:57 2024] __hrtimer_run_queues+0x141/0x2b0
42[Mon Sep 23 10:49:57 2024] hrtimer_interrupt+0xfe/0x220
43[Mon Sep 23 10:49:57 2024] __sysvec_apic_timer_interrupt+0x7c/0x170
44[Mon Sep 23 10:49:57 2024] sysvec_apic_timer_interrupt+0x99/0xc0
45[Mon Sep 23 10:49:57 2024] </IRQ>
46[Mon Sep 23 10:49:57 2024] <TASK>
47[Mon Sep 23 10:49:57 2024] asm_sysvec_apic_timer_interrupt+0x16/0x20
48[Mon Sep 23 10:49:57 2024] RIP: 0010:fq_codel_enqueue+0x144/0x3e0
49[Mon Sep 23 10:49:57 2024] Code: 00 00 49 83 3e 00 0f 84 11 02 00 00 49 8b 46 08 4c 89 28 4d 89 6e 08 41 8b 55 28 49 c7 45 00 00 00 00 00 48 8b 83 98 01 00 00 <42> 01 14 b8 41 8b 45 28 4d 8d 7e 10 01 83 c4 00 00 00 49 8b 46 10
50[Mon Sep 23 10:49:57 2024] RSP: 0018:ffffb7218026f5e0 EFLAGS: 00000246
51[Mon Sep 23 10:49:57 2024] RAX: ffff9f7dc8d7d000 RBX: ffff9f7dc8d70000 RCX: 0000000000000018
52[Mon Sep 23 10:49:57 2024] RDX: 000000000000004a RSI: 000000001916052c RDI: 00205d9c1851d6ba
53[Mon Sep 23 10:49:57 2024] RBP: ffffb7218026f658 R08: 0000000000000000 R09: ffff9f2ebe7fa000
54[Mon Sep 23 10:49:57 2024] R10: ffff9f2ff3a6eba8 R11: ffff9f437eb00600 R12: ffffb7218026f6b0
55[Mon Sep 23 10:49:57 2024] R13: ffff9f437eb00700 R14: ffff9f7dde2b0e80 R15: 000000000000003a
56[Mon Sep 23 10:49:57 2024] dev_qdisc_enqueue+0x19/0x90
57[Mon Sep 23 10:49:57 2024] __dev_queue_xmit+0x7a9/0xdf0
58[Mon Sep 23 10:49:57 2024] ? nf_hook_slow+0x3e/0xc0
59[Mon Sep 23 10:49:57 2024] ip_finish_output2+0x2ac/0x580
60[Mon Sep 23 10:49:57 2024] tcp_v4_send_synack+0xb6/0x1f0
61[Mon Sep 23 10:49:57 2024] tcp_conn_request+0xac7/0xd20
62[Mon Sep 23 10:49:57 2024] ? tcp_rcv_state_process+0x424/0xfc0
63[Mon Sep 23 10:49:57 2024] tcp_rcv_state_process+0x424/0xfc0
64[Mon Sep 23 10:49:57 2024] ? security_sock_rcv_skb+0x31/0x50
65[Mon Sep 23 10:49:57 2024] ? sk_filter_trim_cap+0x12b/0x260
66[Mon Sep 23 10:49:57 2024] tcp_v4_do_rcv+0xc4/0x280
67[Mon Sep 23 10:49:57 2024] tcp_v4_rcv+0xe37/0xf10
68[Mon Sep 23 10:49:57 2024] ip_protocol_deliver_rcu+0x32/0x230
69[Mon Sep 23 10:49:57 2024] ip_local_deliver_finish+0x72/0x90
70[Mon Sep 23 10:49:57 2024] ip_sublist_rcv_finish+0x7e/0x90
71[Mon Sep 23 10:49:57 2024] ip_sublist_rcv+0x192/0x240
72[Mon Sep 23 10:49:57 2024] ? ip_sublist_rcv+0x240/0x240
73[Mon Sep 23 10:49:57 2024] ip_list_rcv+0x139/0x170
74[Mon Sep 23 10:49:57 2024] __netif_receive_skb_list_core+0x29e/0x2c0
75[Mon Sep 23 10:49:57 2024] netif_receive_skb_list_internal+0x1cd/0x300
76[Mon Sep 23 10:49:57 2024] ? __alloc_skb+0xd9/0x1a0
77[Mon Sep 23 10:49:57 2024] napi_gro_receive+0xcc/0x200
78[Mon Sep 23 10:49:57 2024] tg3_poll_work+0x64e/0xea0 [tg3]
79[Mon Sep 23 10:49:57 2024] ? sysvec_apic_timer_interrupt+0xab/0xc0
80[Mon Sep 23 10:49:57 2024] tg3_poll_msix+0x39/0x170 [tg3]
81[Mon Sep 23 10:49:57 2024] __napi_poll+0x28/0x160
82[Mon Sep 23 10:49:57 2024] net_rx_action+0x2a5/0x360
83[Mon Sep 23 10:49:57 2024] __do_softirq+0xed/0x2fe
84[Mon Sep 23 10:49:57 2024] ? sort_range+0x20/0x20
85[Mon Sep 23 10:49:57 2024] run_ksoftirqd+0x34/0x40
86[Mon Sep 23 10:49:57 2024] smpboot_thread_fn+0x185/0x220
87[Mon Sep 23 10:49:57 2024] kthread+0xe6/0x110
88[Mon Sep 23 10:49:57 2024] ? kthread_complete_and_exit+0x20/0x20
89[Mon Sep 23 10:49:57 2024] ret_from_fork+0x1f/0x30
90[Mon Sep 23 10:49:57 2024] </TASK>
91[Mon Sep 23 10:49:57 2024] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to run: 734.383 msecs
92[Mon Sep 23 10:50:00 2024] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
93[Mon Sep 23 10:50:00 2024] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
94[Mon Sep 23 10:50:00 2024] {1}[Hardware Error]: event severity: corrected
95[Mon Sep 23 10:50:00 2024] {1}[Hardware Error]: Error 0, type: corrected
96[Mon Sep 23 10:50:00 2024] {1}[Hardware Error]: fru_text: A9
97[Mon Sep 23 10:50:00 2024] {1}[Hardware Error]: section_type: memory error
98[Mon Sep 23 10:50:00 2024] {1}[Hardware Error]: error_status: Storage error in DRAM memory (0x0000000000000400)
99[Mon Sep 23 10:50:00 2024] {1}[Hardware Error]: physical_address: 0x00000050015e95c0
100[Mon Sep 23 10:50:00 2024] {1}[Hardware Error]: node:1 card:0 module:1 rank:1 bank:1 device:0 row:64155 column:776
101[Mon Sep 23 10:50:00 2024] {1}[Hardware Error]: error_type: 2, single-bit ECC
102[Mon Sep 23 10:50:00 2024] {1}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
103[Mon Sep 23 10:50:00 2024] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 65534
104[Mon Sep 23 10:50:00 2024] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
105[Mon Sep 23 10:50:00 2024] {2}[Hardware Error]: event severity: corrected
106[Mon Sep 23 10:50:00 2024] {2}[Hardware Error]: Error 0, type: corrected
107[Mon Sep 23 10:50:00 2024] {2}[Hardware Error]: section type: unknown, 330f1140-72a5-11df-9690-0002a5d5c51b
108[Mon Sep 23 10:50:00 2024] {2}[Hardware Error]: section length: 0x38
109[Mon Sep 23 10:50:00 2024] {2}[Hardware Error]: 00000000: 01010001 00000000 77030000 00000000 ...........w....
110[Mon Sep 23 10:50:00 2024] {2}[Hardware Error]: 00000010: 00001000 00000000 77030fff 00000000 ...........w....
111[Mon Sep 23 10:50:00 2024] {2}[Hardware Error]: 00000020: 00000080 00000000 00000000 00000000 ................
112[Mon Sep 23 10:50:00 2024] {2}[Hardware Error]: 00000030: 00000000 00000000 ........
113[Mon Sep 23 10:50:00 2024] mce: [Hardware Error]: Machine check events logged
114[Mon Sep 23 10:50:00 2024] EDAC skx MC1: HANDLING MCE MEMORY ERROR
115[Mon Sep 23 10:50:00 2024] EDAC skx MC1: CPU 0: Machine Check Event: 0x0 Bank 255: 0x9c0000000000009f
116[Mon Sep 23 10:50:00 2024] EDAC skx MC1: TSC 0x0
117[Mon Sep 23 10:50:00 2024] EDAC skx MC1: ADDR 0x50015e95c0
118[Mon Sep 23 10:50:00 2024] EDAC skx MC1: MISC 0x8c
119[Mon Sep 23 10:50:00 2024] EDAC skx MC1: PROCESSOR 0:0x50657 TIME 1727089071 SOCKET 0 APIC 0x0
120[Mon Sep 23 10:50:00 2024] EDAC MC1: 0 CE memory read error on CPU_SrcID#0_MC#1_Chan#0_DIMM#1 (channel:0 slot:1 page:0x50015e9 offset:0x5c0 grain:32 syndrome:0x0 - err_code:0x0000:0x009f ProcessorSocketId:0x0 MemoryControllerId:0x1 PhysicalRankId:0x1 Row:0x1fa9b Column:0x708 Bank:0x1 BankGroup:0x1 retry_rd_err_log[0000b80f 00000000 2c1b4de3 040c5140 00000d38] correrrcnt[0000 0000 0000 0000 0000 0000 0000 0000])
121[Mon Sep 23 10:50:02 2024] TCP: request_sock_TCP: Possible SYN flooding on port 3306. Sending cookies. Check SNMP counters.
122[Mon Sep 23 10:50:51 2024] Disabling lock debugging due to kernel taint
123[Mon Sep 23 10:50:51 2024] mce: [Hardware Error]: Machine check events logged
124[Mon Sep 23 10:50:51 2024] mce: Uncorrected hardware memory error in user-access at 3c3ecfd5c0
125[Mon Sep 23 10:50:51 2024] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
126[Mon Sep 23 10:50:51 2024] Memory failure: 0x3c3ecfd: Sending SIGBUS to mysqld:1332 due to hardware memory corruption
127[Mon Sep 23 10:50:51 2024] {3}[Hardware Error]: event severity: recoverable
128[Mon Sep 23 10:50:51 2024] {3}[Hardware Error]: Error 0, type: recoverable
129[Mon Sep 23 10:50:51 2024] {3}[Hardware Error]: fru_text: A9
130[Mon Sep 23 10:50:51 2024] {3}[Hardware Error]: section_type: memory error
131[Mon Sep 23 10:50:51 2024] {3}[Hardware Error]: error_status: Storage error in DRAM memory (0x0000000000000400)
132[Mon Sep 23 10:50:51 2024] Memory failure: 0x3c3ecfd: recovery action for dirty LRU page: Recovered
133[Mon Sep 23 10:50:51 2024] {3}[Hardware Error]: physical_address: 0x0000003c3ecfd5c0
134[Mon Sep 23 10:50:51 2024] {3}[Hardware Error]: node:1 card:0 module:1 rank:1 bank:1 device:0 row:24553 column:1016
135[Mon Sep 23 10:50:51 2024] {3}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
136[Mon Sep 23 10:50:51 2024] Memory failure: 0x3c3ecfd: already hardware poisoned
137[Mon Sep 23 10:50:51 2024] EDAC skx MC1: HANDLING MCE MEMORY ERROR
138[Mon Sep 23 10:50:51 2024] EDAC skx MC1: CPU 0: Machine Check Event: 0x0 Bank 255: 0xbc0000000000009f
139[Mon Sep 23 10:50:51 2024] EDAC skx MC1: TSC 0x0
140[Mon Sep 23 10:50:51 2024] EDAC skx MC1: ADDR 0x3c3ecfd5c0
141[Mon Sep 23 10:50:51 2024] EDAC skx MC1: MISC 0x8c
142[Mon Sep 23 10:50:51 2024] EDAC skx MC1: PROCESSOR 0:0x50657 TIME 1727089122 SOCKET 0 APIC 0x0
143[Mon Sep 23 10:50:51 2024] EDAC MC1: 1 UE memory read error on CPU_SrcID#0_MC#1_Chan#0_DIMM#1 (channel:0 slot:1 page:0x3c3ecfd offset:0x5c0 grain:32 - err_code:0x0000:0x009f ProcessorSocketId:0x0 MemoryControllerId:0x1 PhysicalRankId:0x1 Row:0x15fe9 Column:0x7f8 Bank:0x1 BankGroup:0x1 retry_rd_err_log[0000b80f 00000000 2c1b4de3 040c5140 00000d38] correrrcnt[0000 0000 0000 0000 0000 0000 0000 0000])
144[Mon Sep 23 10:50:54 2024] MCE: Killing mysqld:1332 due to hardware memory corruption fault at 7f4e020fd5c0

as @jcrespo found on P69389 this crash is due to a memory issue on channel:0 slot:1

This confirm the position of the stick that is in error in DIMM slot A9:

image.png (812×931 px, 68 KB)

image.png (85×2 px, 28 KB)

AFAIK pc1015 should be the candidate host if we want to fail it over, from dbctl:

"note": "Hot spare for pc4 and cold spare for pc3",

good catch, let's then start by moving replication from pc4 to: pc3: pc1013 -> pc1015, in the earliest binlog possible, for warmup (this should be a noop), and later we can patch/run dbctl if everybody agrees that's the right approach.

Change #1075024 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mariadb: Move pc1015 configuration to master of pc3 section

https://gerrit.wikimedia.org/r/1075024

Mentioned in SAL (#wikimedia-operations) [2024-09-23T14:30:15Z] <jynus> restarting and moving replication source of pc1015 T375382

Change #1075024 merged by Jcrespo:

[operations/puppet@production] mariadb: Move pc1015 configuration to master of pc3 section

https://gerrit.wikimedia.org/r/1075024

@MoritzMuehlenhoff mentionned that we might have spare parts available for this server from decommssioned, but not yet recycled servers : @wiki_willy I'm not sure what @ would be best, so I apologize for this one! Could you please help us route our message?

Change #1075036 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mariadb: Disable pc1013 notifications

https://gerrit.wikimedia.org/r/1075036

Change #1075036 merged by Jcrespo:

[operations/puppet@production] mariadb: Disable pc1013 notifications

https://gerrit.wikimedia.org/r/1075036

I've created T375395 to reflect that, despite being prometed from a replica to a master, and from passive to active, it now has less monitoring than before. I think parsercache should have similar alerting (without paging) than active-active x2.

1{
2 "pc1013": {
3 "host_ip": "10.64.32.163",
4 "note": "hw issues, see: T375382",
5 "port": 3306,
6 "sections": {
7 "pc3": {
8 "percentage": 100,
9 "pooled": false,
10 "weight": 1
11 }
12 }
13 },
14 "tags": "datacenter=eqiad"
15}

jcrespo raised the priority of this task from Medium to High.Mon, Sep 23, 4:27 PM

was unbreak now, high now that issues has been mitigated after pc1013 failover.

++ @Jclark-ctr & @VRiley-WMF, who can see if there are any parts available from decommissioned servers

@MoritzMuehlenhoff mentionned that we might have spare parts available for this server from decommssioned, but not yet recycled servers : @wiki_willy I'm not sure what @ would be best, so I apologize for this one! Could you please help us route our message?

Hi! We do have a spare DIMM (32 gig, 2666mts) that we can swap at anytime for this unit. Please let us know when is the best time to proceed with this. Thanks!

Hi! We do have a spare DIMM (32 gig, 2666mts) that we can swap at anytime for this unit. Please let us know when is the best time to proceed with this. Thanks!

thanks @VRiley-WMF, I think you can go anytime

Replaced failed dimm @ABran-WMF and updated Idrac Bios firmware

great @Jclark-ctr thanks! will take it over from here :-)

ABran-WMF lowered the priority of this task from High to Medium.Wed, Sep 25, 6:35 AM
ABran-WMF updated the task description. (Show Details)
Jclark-ctr claimed this task.
jcrespo removed Jclark-ctr as the assignee of this task.
jcrespo removed projects: DC-Ops, ops-eqiad.
jcrespo moved this task from Done to Ready on the DBA board.

Let's not close it yet, pc1013 is still not working as intended.

Mentioned in SAL (#wikimedia-operations) [2024-10-01T07:54:39Z] <jynus@cumin1002> START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on pc1013.eqiad.wmnet with reason: T375382

Mentioned in SAL (#wikimedia-operations) [2024-10-01T07:58:23Z] <jynus@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on pc1013.eqiad.wmnet with reason: T375382