investigate RAID failure on beryllium.frack.eqiad.wmnet
Closed, Resolved · Public

Description

Notification Type: PROBLEM

Service: check_raid
Host: beryllium
Address: 10.64.40.68
State: CRITICAL

Date/Time: Thu May 12 21:00:08 UTC 2016

Additional Info:

CRITICAL: LinuxRAID /dev/md/0: act=2, wk=2, fail=0, sp=0: /dev/md/1: act=2, wk=2, fail=0, sp=0: /dev/md/2: act=1, wk=1, fail=1, sp=0:
/dev/md/3: act=2, wk=2, fail=0, sp=0
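The alert above packs the state of all four arrays into one line. A minimal sketch for picking out the arrays that report a failed member, assuming the `act=…, wk=…, fail=…, sp=…` layout shown here (the function name `degraded_arrays` is hypothetical and not part of check_raid):

```shell
#!/bin/sh
# Print every md array in a check_raid status line whose fail= count is > 0.
# Assumes the segment layout seen in the alert above; grep -o is a GNU/BSD
# extension rather than strict POSIX.
degraded_arrays() {
  printf '%s\n' "$1" |
    grep -o '/dev/md/[0-9]*: act=[0-9]*, wk=[0-9]*, fail=[0-9]*, sp=[0-9]*' |
    while IFS= read -r seg; do
      dev=${seg%%:*}        # text before the first colon, e.g. /dev/md/2
      fail=${seg##*fail=}   # drop everything up to and including "fail="
      fail=${fail%%,*}      # keep only the number
      if [ "$fail" -gt 0 ]; then
        printf '%s\n' "$dev"
      fi
    done
}

# Status text copied from the alert, joined onto one line
status='CRITICAL: LinuxRAID /dev/md/0: act=2, wk=2, fail=0, sp=0: /dev/md/1: act=2, wk=2, fail=0, sp=0: /dev/md/2: act=1, wk=1, fail=1, sp=0: /dev/md/3: act=2, wk=2, fail=0, sp=0'
degraded_arrays "$status"   # prints /dev/md/2
```

For this alert only /dev/md/2 is degraded; the other three arrays still report two active members.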

Event Timeline

Restricted Application added subscribers: Zppix, Southparkfan, Aklapper.
Jgreen triaged this task as Unbreak Now! priority.

closed by accident

looks like /dev/sda failed:

[1509411.577517] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[1509411.577519] sd 0:0:0:0: [sda] CDB:
[1509411.577520] Read(10): 28 00 07 f3 80 08 00 00 08 00
[1509411.577525] end_request: I/O error, dev sda, sector 133398536

please replace ASAP, it's:

Vendor: ATA Model: SAMSUNG HE502HJ Rev: 1AJ3

Jgreen removed a project: fundraising-tech-ops.

tossing your way @Cmjohnson as I'm guessing this is all you :)

@Jgreen, we will need to schedule downtime to replace the disk. Also, please make sure grub is on /dev/sdb.

After removing /dev/sda* from various RAID1 mdadm devices, I did grub-install --recheck /dev/sdb and got:

root@beryllium:~# grub-install --recheck /dev/sdb
Installing for i386-pc platform.
grub-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
grub-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..
Installation finished. No error reported.

Hopefully it worked...
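One rough way to sanity-check the result is to look for the string "GRUB" in the first sector of the disk, since GRUB 2's boot code embeds it. This is only a sketch: the helper name `has_grub_mbr` is hypothetical, it needs read access to the device, and a match shows only that the boot sector was written, not that the core image is complete (which is what the warnings above are about).

```shell
#!/bin/sh
# Return 0 if the first 512 bytes of the given device or image file
# contain the string "GRUB", non-zero otherwise.
has_grub_mbr() {
  dd if="$1" bs=512 count=1 2>/dev/null | LC_ALL=C grep -q GRUB
}

# Example (run as root on the host):
#   has_grub_mbr /dev/sdb && echo "GRUB boot code present on /dev/sdb"
```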

Regarding downtime you should be fine to do it anytime, since the peer auth server at codfw will seamlessly handle requests while beryllium is down.
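For reference, the removal of the /dev/sda* partitions from the arrays mentioned earlier in this comment can be sketched as a dry run that only prints the mdadm commands. The array-to-partition pairing in the example call is an assumption (the actual layout on beryllium is not recorded in this task), and `plan_remove` is a hypothetical helper name.

```shell
#!/bin/sh
# Dry run: print (do not execute) the mdadm commands that mark a member
# partition as failed and then remove it from its array.
# Each argument is an "array:partition" pair.
plan_remove() {
  for pair in "$@"; do
    md=${pair%%:*}     # array device, e.g. /dev/md/2
    part=${pair#*:}    # member partition, e.g. /dev/sda3
    printf 'mdadm --manage %s --fail %s\n' "$md" "$part"
    printf 'mdadm --manage %s --remove %s\n' "$md" "$part"
  done
}

# Hypothetical pairing, for illustration only
plan_remove /dev/md/2:/dev/sda3
```

Printing the commands first makes it easy to review them against /proc/mdstat before running anything on a degraded array.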

@Jgreen, we will need to schedule downtime to replace the disk.

@Cmjohnson / @Jgreen: Has that happened?

Ended up replacing both disks and Jeff will re-install.

Rebuilt; Kerberos data restored.