
db1234 crashed - faulty memory stick on A6 (0x4E42)
Closed, Resolved (Public)

Description

Hi,

db1234 crashed around 16:23 UTC. It's a replica, so I depooled it and have downtimed it for 48 hours.

The kernel log has a lot of MCE and similar hardware complaints.

The corresponding VictorOps incident is https://portal.victorops.com/ui/wikimedia/incident/5357/details

Event Timeline

ABran-WMF changed the task status from Open to In Progress. Oct 28 2024, 7:59 AM
ABran-WMF triaged this task as Medium priority.
ABran-WMF moved this task from Triage to In progress on the DBA board.
ABran-WMF added a project: ops-eqiad.
ABran-WMF subscribed.

It seems it has a faulty memory stick:

A critical diagnostic event occurred in the memory device at A6. Contact your service provider for assistance in replacing the device. (Extended ID: 0x4E42).

ABran-WMF renamed this task from db1234 crashed to db1234 crashed - faulty memory stick on A6 (0x4E42). Oct 28 2024, 8:00 AM

Mentioned in SAL (#wikimedia-operations) [2024-10-28T08:01:25Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1234.eqiad.wmnet with reason: maintenance T378267

Mentioned in SAL (#wikimedia-operations) [2024-10-28T08:01:37Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1234.eqiad.wmnet with reason: maintenance T378267

fwiw, that slot has a bit of a history: T363102

ah indeed good catch @taavi thanks

This DIMM (A6) has been replaced and the server has been powered back on.

@VRiley-WMF should we maybe try swapping its motherboard if the slot is faulty? Is there any firmware upgrade that Dell would recommend in that case? Please let me know if I can provide more info, hth!

Reached out to Dell for recommendations. For now, they would like to try replacing the memory one more time before proceeding with a motherboard replacement.

Service request number: 200040188
Work order number: 455463658
Replacement part shipped: 1 x DIMM,32GB,3200,2RX8,16G,DDR4,R.

The memory has been swapped.

ABran-WMF claimed this task.

I'm unable to connect to the server, whether over IPMI or SSH. Reopening to troubleshoot.

My apologies. Please try again. It seems the Ethernet cable wasn't fully seated.

Ah! That will simplify the debugging indeed. db1234 is up and alive, thanks @VRiley-WMF!

Mentioned in SAL (#wikimedia-operations) [2024-10-31T10:03:02Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'Cloning db1232 in db1234 for T378267', diff saved to https://phabricator.wikimedia.org/P70737 and previous config saved to /var/cache/conftool/dbconfig/20241031-100301-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-10-31T14:05:00Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1234 (re)pooling @ 1%: post T378267 reclone', diff saved to https://phabricator.wikimedia.org/P70754 and previous config saved to /var/cache/conftool/dbconfig/20241031-140459-arnaudb.json

ABran-WMF moved this task from In progress to Done on the DBA board.

host is repooling after a reclone

Mentioned in SAL (#wikimedia-operations) [2024-10-31T14:20:05Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1234 (re)pooling @ 2%: post T378267 reclone', diff saved to https://phabricator.wikimedia.org/P70757 and previous config saved to /var/cache/conftool/dbconfig/20241031-142004-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-10-31T14:35:10Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1234 (re)pooling @ 4%: post T378267 reclone', diff saved to https://phabricator.wikimedia.org/P70760 and previous config saved to /var/cache/conftool/dbconfig/20241031-143510-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-10-31T14:50:16Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1234 (re)pooling @ 5%: post T378267 reclone', diff saved to https://phabricator.wikimedia.org/P70763 and previous config saved to /var/cache/conftool/dbconfig/20241031-145015-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-10-31T15:05:21Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1234 (re)pooling @ 10%: post T378267 reclone', diff saved to https://phabricator.wikimedia.org/P70765 and previous config saved to /var/cache/conftool/dbconfig/20241031-150521-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-10-31T15:20:27Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1234 (re)pooling @ 25%: post T378267 reclone', diff saved to https://phabricator.wikimedia.org/P70767 and previous config saved to /var/cache/conftool/dbconfig/20241031-152026-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-10-31T15:35:32Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1234 (re)pooling @ 50%: post T378267 reclone', diff saved to https://phabricator.wikimedia.org/P70769 and previous config saved to /var/cache/conftool/dbconfig/20241031-153531-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-10-31T15:50:41Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1234 (re)pooling @ 75%: post T378267 reclone', diff saved to https://phabricator.wikimedia.org/P70770 and previous config saved to /var/cache/conftool/dbconfig/20241031-155037-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-10-31T16:05:43Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1234 (re)pooling @ 100%: post T378267 reclone', diff saved to https://phabricator.wikimedia.org/P70772 and previous config saved to /var/cache/conftool/dbconfig/20241031-160542-arnaudb.json
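
For reference, the repooling ramp visible in the SAL entries above (1%, 2%, 4%, 5%, 10%, 25%, 50%, 75%, 100%, roughly 15 minutes apart) can be sketched as a simple schedule. This is only an illustration of the observed steps, not the tooling that produced them: the function name `repool_schedule`, the fixed 15-minute interval, and the rounded start time are assumptions read off the log timestamps, and nothing here calls dbctl.

```python
from datetime import datetime, timedelta, timezone

# Percentages as they appear in the SAL entries of this task.
REPOOL_STEPS = [1, 2, 4, 5, 10, 25, 50, 75, 100]


def repool_schedule(start, steps=REPOOL_STEPS, interval=timedelta(minutes=15)):
    """Return (timestamp, percentage) pairs, one step per interval.

    Hypothetical helper: the 15-minute spacing is an assumption
    inferred from the log timestamps, not a documented setting.
    """
    return [(start + i * interval, pct) for i, pct in enumerate(steps)]


if __name__ == "__main__":
    # First repool step in the log was at 14:05:00Z on 2024-10-31.
    start = datetime(2024, 10, 31, 14, 5, tzinfo=timezone.utc)
    for when, pct in repool_schedule(start):
        print(f"{when:%H:%M}Z  db1234 @ {pct}%")
```

With nine steps at 15-minute spacing the ramp takes two hours, which matches the log: first step at 14:05Z, full weight at 16:05Z.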