db1246 crashed
Closed, ResolvedPublic

Description

We got a page for this host this evening:

PROBLEM Host db1246 #page - PING CRITICAL - Packet loss = 100%

I depooled it using dbctl. Independently of that, it came back up:

RECOVERY Host db1246 #page - PING WARNING - Packet loss = 90%, RTA = 162.36 ms

Given the above, I will still leave it depooled overnight and let the team take over.

Event Timeline

It recovered on its own, so it might be a network issue. I will take a look.

ABran-WMF changed the task status from Open to In Progress.Apr 23 2024, 9:57 AM
ABran-WMF moved this task from Triage to In progress on the DBA board.
ABran-WMF subscribed.

it seems to be a hardware issue as well:

-------------------------------------------------------------------------------
Record:      51
Date/Time:   04/23/2024 00:19:19
Source:      system
Severity:    Critical
Description: The system board BP1 PG voltage is outside of range.
-------------------------------------------------------------------------------
Record:      52
Date/Time:   04/23/2024 00:19:39
Source:      system
Severity:    Ok
Description: The system board BP1 PG voltage is within range.
-------------------------------------------------------------------------------
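
For anyone who wants to compare this event with the earlier crashes, here is a minimal sketch of how the excerpt above could be parsed to measure how long the voltage stayed out of range. It assumes the log is available as plain text in exactly the layout shown (dashed separators, `Key: value` lines, `MM/DD/YYYY` dates). The names `SAMPLE_LOG`, `parse_records` and `fault_durations` are hypothetical helpers for illustration, not part of any existing tooling.

```python
import re
from datetime import datetime

# Copy of the two records pasted above (sample input for the sketch).
SAMPLE_LOG = """\
-------------------------------------------------------------------------------
Record:      51
Date/Time:   04/23/2024 00:19:19
Source:      system
Severity:    Critical
Description: The system board BP1 PG voltage is outside of range.
-------------------------------------------------------------------------------
Record:      52
Date/Time:   04/23/2024 00:19:39
Source:      system
Severity:    Ok
Description: The system board BP1 PG voltage is within range.
-------------------------------------------------------------------------------
"""


def parse_records(text):
    """Split the log on dashed separator lines and return one dict per record."""
    records = []
    for chunk in re.split(r"^-{5,}[ \t]*$", text, flags=re.MULTILINE):
        fields = {}
        for line in chunk.strip().splitlines():
            # Split on the first colon only, so timestamps keep their colons.
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if fields:
            records.append(fields)
    return records


def fault_durations(records):
    """Pair each Critical record with the next Ok record; return seconds between them."""
    fmt = "%m/%d/%Y %H:%M:%S"  # assumed month-first format, as in the excerpt
    durations = []
    opened = None
    for rec in records:
        ts = datetime.strptime(rec["Date/Time"], fmt)
        if rec["Severity"] == "Critical" and opened is None:
            opened = ts
        elif rec["Severity"] == "Ok" and opened is not None:
            durations.append((ts - opened).total_seconds())
            opened = None
    return durations


records = parse_records(SAMPLE_LOG)
print(fault_durations(records))
```

For the two records above, this reports a single out-of-range window of 20 seconds.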
Jclark-ctr subscribed.

Opened a ticket with Dell: SR 189290647

@Jclark-ctr this is the third time this host has crashed with the exact same HW error (see T361968 and T359940). Hopefully Dell won't ask us again to upgrade firmware and BIOS, and will instead replace whatever piece of HW is broken, because it seems pretty clear that something is wrong there.

@Marostegui When creating the ticket, I requested that Dell not repeat any troubleshooting steps that were not effective.
I followed up with Dell again; they should be sending out parts shortly.

Change #1026083 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] installserver: Allowing formatting db1246

https://gerrit.wikimedia.org/r/1026083

Change #1026083 merged by Marostegui:

[operations/puppet@production] installserver: Allowing formatting db1246

https://gerrit.wikimedia.org/r/1026083

I have merged the above patch. It is very likely that this host will have its filesystem corrupted once it is back up - this was the case in the previous two crashes with that same hardware error.
We'll need to reimage and reclone this host.

Change #1026259 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1246: Disable notifications

https://gerrit.wikimedia.org/r/1026259

Change #1026259 merged by Marostegui:

[operations/puppet@production] db1246: Disable notifications

https://gerrit.wikimedia.org/r/1026259

On Friday Dell agreed to replace the backplane and cables. The parts shipped out Monday, with expected arrival Tuesday.

Replaced the backplane and the cable that connects the RAID card <-> backplane / power control board. I did find a cable with a loose pin on the power control board (not replaced), and I will be reaching out to Dell regarding it. It has been reseated in its connector and should be fine for the time being.

Do you believe it is all good for us now to start getting this host back to production, or do you still want to test something else?

I am powering it up now and will check idrac.

I believe we are good to reimage the server; the OS looks corrupt. Please just wait until tomorrow before putting it back in production, while I wait for Dell to respond on whether they will send out a new cable.

Absolutely - just close this task once your part is done and we will take it from there

@Marostegui you can put the server back in rotation. Even though I uploaded multiple photos to Dell yesterday, they replied this morning requesting the part number so they can send the correct part.

Screenshot 2024-05-09 at 10.31.55 AM.png (988×1 px, 2 MB)
I attached the photo that was sent to Dell. I do not expect the part to arrive until next week. We can work on this at a later date.

Thanks John, I will create a subtask for us to work on the formatting, reimage and recloning. Will leave this open until you've finished your side.

@Jclark-ctr is there anything else left from your side or can this be closed too?

The replacement cable arrived yesterday, after multiple rounds of back and forth with Dell. Can we leave this open for one more week to make sure the error does not return? Leave the server running and I will reach out to you next week to schedule downtime for the replacement.

@ABran-WMF can you coordinate with @Jclark-ctr to schedule downtime for this host whenever he needs it?

Mentioned in SAL (#wikimedia-operations) [2024-06-05T12:49:19Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'depool db1246 T363119', diff saved to https://phabricator.wikimedia.org/P64101 and previous config saved to /var/cache/conftool/dbconfig/20240605-124918-arnaudb.json

Replaced the broken cable. The server went 2 weeks without the fault returning.

leaving the host depooled until tomorrow to see if it stays stable, will close the task upon repool.

host is repooling