db1179 crashed - hardware issues
Closed, Resolved (Public)

Description

17:49:57 <+icinga-wm> PROBLEM - Host db1179 #page is DOWN: PING CRITICAL - Packet loss = 100%
17:52:13 <+logmsgbot> !log rzl@cumin2002 dbctl commit (dc=all): 'db1179 depooled', diff saved to https://phabricator.wikimedia.org/P66324 and previous config saved to /var/cache/conftool/dbconfig/20240711-175212-rzl.json

Event Timeline

RLazarus created this task.
ABran-WMF changed the task status from Open to In Progress. Jul 12 2024, 8:50 AM
ABran-WMF claimed this task.
ABran-WMF lowered the priority of this task from High to Medium.
ABran-WMF moved this task from Triage to Ready on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2024-07-12T08:59:22Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on db1179.eqiad.wmnet with reason: T369855

Mentioned in SAL (#wikimedia-operations) [2024-07-12T08:59:35Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on db1179.eqiad.wmnet with reason: T369855

I am unable to reach it via the management interface either; it might need some hands-on work.

Change #1054055 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] db1179: Disable notification for db1179

https://gerrit.wikimedia.org/r/1054055

Also noting that this is a candidate master.

Mentioned in SAL (#wikimedia-operations) [2024-07-15T07:17:50Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1179.eqiad.wmnet with reason: T369855

Mentioned in SAL (#wikimedia-operations) [2024-07-15T07:17:54Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1179.eqiad.wmnet with reason: T369855

Change #1054055 merged by Marostegui:

[operations/puppet@production] db1179: Disable notification for db1179

https://gerrit.wikimedia.org/r/1054055

All hosts in x1 are potential candidate masters. They all run the same binlog format, so there's not really a fixed candidate master like on the other sections.

This server has been down for a few days, @wiki_willy please let me know if I can help

Hi @ABran-WMF - can you work with the onsite engineers on this? cc'ing @VRiley-WMF & @Jclark-ctr

sure thing!

@VRiley-WMF @Jclark-ctr the host has been depooled and is downtimed, you should be able to take it from here. Feel free to ping if needed!

Hey @ABran-WMF Thanks. I will be looking into this now.

Noted that the server doesn't want to power on. Tried to power cycle it, attempted a flea power drain. Reseated the power cable from the motherboard. Removed all the RAM except A1, and there was no change. No change when removing all the RAM. Will need to investigate further.

Marostegui renamed this task from db1179 stopped answering ping, depooled to db1179 crashed - hardware issues. Jul 18 2024, 1:35 PM

Attempted a bare-minimum setup (CPU1, A1 RAM, no additional cards): no change. Attempted swapping out the power button module: still no change. Will attempt swapping out the MB.

After swapping the mainboard, it was finally able to boot. It is currently having issues with RAM, which I will continue to troubleshoot tomorrow.

After swapping out the MB, it boots. However, it's consistently throwing errors on a few DIMM slots (even after replacing the memory in those slots). Since this server is out of warranty, I am continuing to investigate options.

Jclark-ctr claimed this task.
Jclark-ctr added a subscriber: ABran-WMF.

Server is back up, no errors on iDRAC.

Thanks!

Mentioned in SAL (#wikimedia-operations) [2024-07-22T01:07:46Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'db1179 (re)pooling @ 10%: Maint over (T369855 T370304)', diff saved to https://phabricator.wikimedia.org/P66872 and previous config saved to /var/cache/conftool/dbconfig/20240722-010745-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2024-07-22T01:22:51Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'db1179 (re)pooling @ 25%: Maint over (T369855 T370304)', diff saved to https://phabricator.wikimedia.org/P66873 and previous config saved to /var/cache/conftool/dbconfig/20240722-012251-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2024-07-22T01:37:57Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'db1179 (re)pooling @ 75%: Maint over (T369855 T370304)', diff saved to https://phabricator.wikimedia.org/P66874 and previous config saved to /var/cache/conftool/dbconfig/20240722-013756-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2024-07-22T01:53:02Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'db1179 (re)pooling @ 100%: Maint over (T369855 T370304)', diff saved to https://phabricator.wikimedia.org/P66875 and previous config saved to /var/cache/conftool/dbconfig/20240722-015302-ladsgroup.json
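
For readers following the log: the four commits above bring the host back in stages (10%, 25%, 75%, then 100%), roughly 15 minutes apart. Below is a minimal sketch of that kind of schedule, for illustration only. The function name `repool_schedule` and the fixed 15-minute interval are assumptions inferred from the timestamps; the sketch does not call dbctl and is not a description of how the actual repool tooling works.

```python
from datetime import datetime, timedelta, timezone


def repool_schedule(start, percentages=(10, 25, 75, 100),
                    interval=timedelta(minutes=15)):
    """Return (time, percentage) pairs for a staged repool.

    Hypothetical helper: the stages mirror the percentages seen in the log,
    and the interval is an assumption based on the spacing of the entries.
    """
    return [(start + i * interval, pct) for i, pct in enumerate(percentages)]


if __name__ == "__main__":
    # Start time taken from the first repool entry in the log above.
    first_step = datetime(2024, 7, 22, 1, 7, 46, tzinfo=timezone.utc)
    for when, pct in repool_schedule(first_step):
        print(f"{when.isoformat()} pool db1179 at {pct}%")
```

Ramping up in steps gives the host time to warm its caches and lets an operator stop early if errors appear before it takes full traffic.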

Did this server get the data checksummed or cloned before repooling it back?

No, but it had ten days of replication replayed (with RBR), and if it had issues, replication would have broken really quickly. The logs also said Aria recovery was done and was successful. On top of that, I ran several queries to make sure the data is the same before repooling. I can still do a checksum on top if you think it is needed, but replication not breaking with RBR is usually a good enough sign for me.

Sure, that's fine (remember we don't use Aria, so in this case that can be misleading).
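
For context, the checksum discussed above amounts to hashing the same rows on both hosts and comparing the digests chunk by chunk. The following is a minimal sketch under stated assumptions: rows have already been fetched as tuples in primary-key order, and `table_checksum` and `diff_chunks` are hypothetical helpers written for this illustration, not part of any WMF or MariaDB tooling.

```python
import hashlib


def table_checksum(rows):
    """Hash an ordered sequence of row tuples into a single hex digest."""
    digest = hashlib.sha256()
    for row in rows:
        # repr() keeps None (NULL), integers, and strings distinguishable.
        digest.update(repr(tuple(row)).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def diff_chunks(primary_rows, replica_rows, chunk_size=1000):
    """Return the start offsets of chunks whose checksums differ."""
    mismatched = []
    longest = max(len(primary_rows), len(replica_rows))
    for start in range(0, longest, chunk_size):
        primary_chunk = primary_rows[start:start + chunk_size]
        replica_chunk = replica_rows[start:start + chunk_size]
        if table_checksum(primary_chunk) != table_checksum(replica_chunk):
            mismatched.append(start)
    return mismatched


if __name__ == "__main__":
    primary = [(i, f"row{i}") for i in range(10)]
    replica = list(primary)
    replica[7] = (7, "corrupt")  # simulate silent drift on the replica
    print(diff_chunks(primary, replica, chunk_size=4))
```

Comparing per chunk rather than per table narrows a mismatch down to a small range of rows, which is cheaper to inspect than a whole table.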