Hi,
db1234 crashed around 16:23 UTC. It's a replica, so I depooled it and have downtimed it for 48 hours.
The kernel log has a lot of MCE and similar hardware complaints.
This is https://portal.victorops.com/ui/wikimedia/incident/5357/details
It seems it has a faulty memory stick:
A critical diagnostic event occurred in the memory device at A6. Contact your service provider for assistance in replacing the device. (Extended ID: 0x4E42).
Mentioned in SAL (#wikimedia-operations) [2024-10-28T08:01:25Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1234.eqiad.wmnet with reason: maintenance T378267
Mentioned in SAL (#wikimedia-operations) [2024-10-28T08:01:37Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1234.eqiad.wmnet with reason: maintenance T378267
ah indeed good catch @taavi thanks
@VRiley-WMF should we maybe try swapping its motherboard if the slot is faulty? Is there any firmware upgrade that Dell would recommend in that case? Please let me know if I can provide more info, hth!
Reached out to Dell for recommendations. Currently, they would like to try replacing the memory one more time before proceeding with a motherboard replacement.
Service request number: 200040188
Work order number: 455463658
Replacement part shipped: 1 x DIMM,32GB,3200,2RX8,16G,DDR4,R.
I'm unable to connect to the server, whether over IPMI or SSH; reopening to troubleshoot.
ah! this will simplify the debugging indeed, db1234 is up and alive, thanks @VRiley-WMF !
Mentioned in SAL (#wikimedia-operations) [2024-10-31T10:03:02Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'Cloning db1232 in db1234 for T378267', diff saved to https://phabricator.wikimedia.org/P70737 and previous config saved to /var/cache/conftool/dbconfig/20241031-100301-arnaudb.json
Mentioned in SAL (#wikimedia-operations) [2024-10-31T14:05:00Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1234 (re)pooling @ 1%: post T378267 reclone', diff saved to https://phabricator.wikimedia.org/P70754 and previous config saved to /var/cache/conftool/dbconfig/20241031-140459-arnaudb.json
Mentioned in SAL (#wikimedia-operations) [2024-10-31T14:20:05Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1234 (re)pooling @ 2%: post T378267 reclone', diff saved to https://phabricator.wikimedia.org/P70757 and previous config saved to /var/cache/conftool/dbconfig/20241031-142004-arnaudb.json
Mentioned in SAL (#wikimedia-operations) [2024-10-31T14:35:10Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1234 (re)pooling @ 4%: post T378267 reclone', diff saved to https://phabricator.wikimedia.org/P70760 and previous config saved to /var/cache/conftool/dbconfig/20241031-143510-arnaudb.json
Mentioned in SAL (#wikimedia-operations) [2024-10-31T14:50:16Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1234 (re)pooling @ 5%: post T378267 reclone', diff saved to https://phabricator.wikimedia.org/P70763 and previous config saved to /var/cache/conftool/dbconfig/20241031-145015-arnaudb.json
Mentioned in SAL (#wikimedia-operations) [2024-10-31T15:05:21Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1234 (re)pooling @ 10%: post T378267 reclone', diff saved to https://phabricator.wikimedia.org/P70765 and previous config saved to /var/cache/conftool/dbconfig/20241031-150521-arnaudb.json
Mentioned in SAL (#wikimedia-operations) [2024-10-31T15:20:27Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1234 (re)pooling @ 25%: post T378267 reclone', diff saved to https://phabricator.wikimedia.org/P70767 and previous config saved to /var/cache/conftool/dbconfig/20241031-152026-arnaudb.json
Mentioned in SAL (#wikimedia-operations) [2024-10-31T15:35:32Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1234 (re)pooling @ 50%: post T378267 reclone', diff saved to https://phabricator.wikimedia.org/P70769 and previous config saved to /var/cache/conftool/dbconfig/20241031-153531-arnaudb.json
Mentioned in SAL (#wikimedia-operations) [2024-10-31T15:50:41Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1234 (re)pooling @ 75%: post T378267 reclone', diff saved to https://phabricator.wikimedia.org/P70770 and previous config saved to /var/cache/conftool/dbconfig/20241031-155037-arnaudb.json
Mentioned in SAL (#wikimedia-operations) [2024-10-31T16:05:43Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1234 (re)pooling @ 100%: post T378267 reclone', diff saved to https://phabricator.wikimedia.org/P70772 and previous config saved to /var/cache/conftool/dbconfig/20241031-160542-arnaudb.json
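For reference, the repool entries above step through 1, 2, 4, 5, 10, 25, 50, 75 and 100% at roughly 15-minute intervals. A minimal sketch of that schedule follows, assuming a fixed 15-minute interval. The function and constant names are hypothetical, and the sketch only prints the planned steps; it does not call dbctl or reproduce how the actual repooling tooling works.

```python
from datetime import datetime, timedelta, timezone

# Percentages taken from the SAL entries in this task.
REPOOL_STEPS = [1, 2, 4, 5, 10, 25, 50, 75, 100]


def repool_schedule(start, interval=timedelta(minutes=15), steps=REPOOL_STEPS):
    """Return (timestamp, percentage) pairs for a stepped repool.

    Hypothetical helper: the fixed interval is an assumption based on the
    spacing of the logged commits, not a documented setting.
    """
    return [(start + i * interval, pct) for i, pct in enumerate(steps)]


if __name__ == "__main__":
    # First repool commit in the log was at about 14:05 UTC on 2024-10-31.
    start = datetime(2024, 10, 31, 14, 5, tzinfo=timezone.utc)
    for when, pct in repool_schedule(start):
        print(f"{when:%H:%M}Z  db1234 pooled at {pct}%")
```

With nine steps spaced 15 minutes apart, the last step lands two hours after the first, which matches the 14:05 to 16:05 window in the log.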