
db2088 crashed
Closed, Resolved
Public

Description

db2088 crashed on Saturday:

12:55:21 <+icinga-wm> PROBLEM - Host db2088 is DOWN: PING CRITICAL - Packet loss = 100%

Event Timeline

Marostegui triaged this task as Medium priority. May 30 2022, 5:31 AM
Marostegui moved this task from Triage to In progress on the DBA board.

Change 801171 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db2088: Disable notifications

https://gerrit.wikimedia.org/r/801171

Change 801171 merged by Marostegui:

[operations/puppet@production] db2088: Disable notifications

https://gerrit.wikimedia.org/r/801171

Mentioned in SAL (#wikimedia-operations) [2022-05-30T05:35:00Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2088 (s1 and s2) T309485', diff saved to https://phabricator.wikimedia.org/P28913 and previous config saved to /var/cache/conftool/dbconfig/20220530-053459-marostegui.json

Marostegui added a project: ops-codfw.
Marostegui added a subscriber: Papaul.

@Papaul db2088's mgmt interface is also unavailable, so I cannot check the logs or tell whether the host is up and only the network failed.
Can you check on-site?

Thank you!

I removed the power for 10 minutes and the server came back up. The IDRAC log is not showing any HW issues. I upgraded the BIOS and IDRAC on the node. The server is back up.

Thanks Papaul. I can indeed access the host now.
MySQL seems to be fine.

I am going to repool this host once it catches up and close this. If it happens again, we can probably decommission it as it is scheduled for refresh and the replacement hardware has been ordered and will arrive in a few months. We'll probably not have a DC switchover before that anyways.

Mentioned in SAL (#wikimedia-operations) [2022-06-02T05:14:52Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Repool db2088 (s1 and s2) T309485', diff saved to https://phabricator.wikimedia.org/P29327 and previous config saved to /var/cache/conftool/dbconfig/20220602-051451-marostegui.json

Marostegui reassigned this task from Marostegui to Papaul.

db2088 is back in sync with both the s1 and s2 masters. I have repooled it. Closing this for now. If it happens again we should probably just decommission it.
Thank you Papaul!