
db2140 broken storage
Closed, Resolved (Public)

Description

Linux db2140 5.10.0-13-amd64 #1 SMP Debian 5.10.106-1 (2022-03-17) x86_64
Debian GNU/Linux 11 (bullseye)
db2140 is a Core DB Server (mariadb::core)
DB section s4
Bare Metal Rack: D6
The last Puppet run was at Tue May 10 06:55:10 UTC 2022 (20 minutes ago).
Last puppet commit: (fb812f1bad) Manuel Arostegui - db2109: Disable notifications
Debian GNU/Linux 11 auto-installed on Thu Mar 3 04:08:37 UTC 2022.
-bash: /usr/share/bash-completion/bash_completion: Input/output error
-bash: /home/marostegui/.bash_profile: Input/output error

It is impossible to SSH into the host or to run commands on it via cumin:

1 hosts will be targeted:
db2140.codfw.wmnet
Ok to proceed on 1 hosts? Enter the number of affected hosts to confirm or "q" to quit 1
----- OUTPUT of 'sudo dmesg' -----
sudo: unable to execute /usr/bin/dmesg: Input/output error

There is nothing in the hardware logs: getsel shows nothing.
@Papaul, can you take a look on site?

Event Timeline

Marostegui triaged this task as Medium priority. May 12 2022, 5:10 AM
Marostegui moved this task from Triage to In progress on the DBA board.

Change 791112 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db2140: Broken host

https://gerrit.wikimedia.org/r/791112

Mentioned in SAL (#wikimedia-operations) [2022-05-12T05:11:06Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2140 T308202', diff saved to https://phabricator.wikimedia.org/P27791 and previous config saved to /var/cache/conftool/dbconfig/20220512-051106-marostegui.json

Change 791112 merged by Marostegui:

[operations/puppet@production] db2140: Broken host

https://gerrit.wikimedia.org/r/791112

Upgrading the BIOS seems to have fixed the SSH issue. @Marostegui, I was getting the error below when rebooting the server:

systemd-journal [301]: failed to write entry (22 items, 747 bytes), ignoring: Read-only file system

Please check that all is good; if it is, you can resolve the task.

Thanks

Thanks Papaul, that error is strange indeed.
I have double-checked the RAID status and all the controller logs, and they look fine (I can see the firmware upgrade there too). I also found no trace of errors in the controller log, either before the reboot or during boot.
The same goes for the iDRAC logs: they are all clean.
I just started MySQL and the recovery went well. I am going to leave it replicating for a few hours (or until Monday) before considering this fixed. I just want to make sure that it doesn't crash again and that the filesystem doesn't go read-only again.

Thanks for the fast response.
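
For anyone wanting to watch for the filesystem going read-only again, here is a minimal sketch and not the check actually used on db2140. It assumes a Linux host where `/proc/mounts` lists each mount as `device mountpoint fstype options ...`. The function name `read_only_mounts` and the `/dev/` device filter are illustrative choices, and the sample text in the usage block is made up.

```python
from typing import List, Sequence


def read_only_mounts(mounts_text: str,
                     device_prefixes: Sequence[str] = ("/dev/",)) -> List[str]:
    """Return mount points whose mount options include 'ro'.

    mounts_text is expected to be in /proc/mounts format. Only devices whose
    name starts with one of device_prefixes are considered (an assumption made
    here to skip pseudo-filesystems such as proc or sysfs).
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, _fstype, options = fields[:4]
        if not device.startswith(tuple(device_prefixes)):
            continue
        # Options are comma-separated; match 'ro' exactly, not e.g. 'errors=remount-ro'.
        if "ro" in options.split(","):
            found.append(mount_point)
    return found


if __name__ == "__main__":
    # Hypothetical sample data, not taken from db2140.
    sample = (
        "/dev/sda1 / ext4 rw,relatime,errors=remount-ro 0 0\n"
        "/dev/sdb1 /srv xfs ro,relatime 0 0\n"
        "proc /proc proc rw,nosuid 0 0\n"
    )
    for mount_point in read_only_mounts(sample):
        print(f"read-only: {mount_point}")
```

On a real host, one would pass in the contents of `/proc/mounts` instead of the sample string. Note that when the root filesystem itself is failing, as in this task, the check may not be able to run at all, so it is only useful as an early warning.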

Marostegui reassigned this task from Marostegui to Papaul.

Thanks Papaul!

db2140 has caught up and seems to be replicating just fine. Closing this for now; if it crashes again, we'll reopen it.