Degraded RAID on dbproxy2005
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host dbproxy2005. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sdb2[1](F) sda2[0]
      468425728 blocks super 1.2 [2/1] [U_]
      
unused devices: <none>
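The member map at the end of the `blocks` line (`[U_]` here) is what marks the array as degraded: each `U` is a working member and each `_` is a missing or failed one. As a rough illustration of how that can be read mechanically, here is a minimal sketch. It is not the `get-raid-status-md` plugin used by the event handler; the function name `degraded_arrays` is hypothetical, and it only handles the simple layout shown in this snapshot.

```shell
#!/bin/sh
# Hypothetical helper, not the Icinga plugin: list md arrays whose member
# map (e.g. [U_]) contains "_", meaning at least one member is missing.
# Reads /proc/mdstat-formatted text on stdin.
degraded_arrays() {
    awk '
        # Array header line, e.g. "md0 : active raid1 sdb2[1](F) sda2[0]"
        $2 == ":" && $1 ~ /^md/ { name = $1; next }
        # Status line: find a map made only of U and _ characters
        name != "" && match($0, /\[[U_]+\]/) {
            if (index(substr($0, RSTART, RLENGTH), "_") > 0) print name
            name = ""
        }
    '
}

# Demo with the snapshot from this task; prints "md0".
degraded_arrays <<'EOF'
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb2[1](F) sda2[0]
      468425728 blocks super 1.2 [2/1] [U_]

unused devices: <none>
EOF
```

On a live host the same function could be fed with `degraded_arrays < /proc/mdstat`.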

Event Timeline

Restricted Application added a subscriber: Aklapper.

This is the main m1 dbproxy in codfw. I wouldn't fail it over in the middle of the night since it's codfw, but maybe tomorrow if the DBAs think we should.

There is nothing to fail over to: we don't have another dbproxy host in codfw, since misc isn't used there at the moment.

@Papaul @Jhancock.wm do you have a spare disk for this host? We'd need to replace it + reimage the host.

Marostegui triaged this task as Medium priority. Wed, May 20, 5:23 AM
Marostegui moved this task from Triage to In progress on the DBA board.

@Marostegui I do have a spare drive, but I'm honestly having a hard time telling which physical drive is down. I'm getting mixed indicators from the server and from the error in the description of this ticket. Is it possible to blink the one that has failed?

@Jhancock.wm the broken disk (sdb) has already been removed from the array. I can try to make it blink.
I am not sure how long the blinking lasts, so it is probably better to start it once you are ready to check.
Let me know when you want me to start it.
Thanks!

Broken disk should be blinking now

The disk has been replaced but it needs a bit of work to make it part of the array as it seems to have old metadata:

root@dbproxy2005:~# cat /proc/mdstat
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] [raid10]
md125 : inactive sdb3[1](S)
      460125184 blocks super 1.2

md126 : inactive sdb2[1](S)
      7806976 blocks super 1.2

md127 : inactive sdb1[1](S)
      779264 blocks super 1.2

md0 : active raid1 sda2[0]
      468425728 blocks super 1.2 [2/1] [U_]

unused devices: <none>
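The three inactive arrays (md125 to md127) were assembled from the old RAID metadata found on the replacement disk's partitions. A minimal sketch of how such leftovers could be listed from mdstat output follows. The helper name `inactive_arrays` is hypothetical, and the ticket does not show the exact commands that were run, so the `mdadm --stop` lines are only printed here as a dry run.

```shell
#!/bin/sh
# Hypothetical helper: print the names of md arrays reported as "inactive".
# Reads /proc/mdstat-formatted text on stdin.
inactive_arrays() {
    awk '$2 == ":" && $3 == "inactive" { print $1 }'
}

# Dry run: only print the stop commands, nothing is executed.
# Sample input copied from the mdstat output in this task.
inactive_arrays <<'EOF' | while read -r md; do echo "mdadm --stop /dev/$md"; done
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] [raid10]
md125 : inactive sdb3[1](S)
      460125184 blocks super 1.2

md126 : inactive sdb2[1](S)
      7806976 blocks super 1.2

md127 : inactive sdb1[1](S)
      779264 blocks super 1.2

md0 : active raid1 sda2[0]
      468425728 blocks super 1.2 [2/1] [U_]

unused devices: <none>
EOF
```

For this input the loop prints one `mdadm --stop` line each for md125, md126 and md127, and leaves md0 alone.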

I've stopped those stale arrays, copied the partition table from sda to sdb, and added sdb2 back to md0:

root@dbproxy2005:~# sfdisk -d /dev/sda | sfdisk /dev/sdb
Checking that no-one is using this disk right now ... OK

Disk /dev/sdb: 447.13 GiB, 480103981056 bytes, 937703088 sectors
Disk model: MTFDDAK480TDC
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x1e63ecf3

Old situation:

Device     Boot    Start       End   Sectors   Size Id Type
/dev/sdb1           2048   1562623   1560576   762M fd Linux raid autodetect
/dev/sdb2        1562624  17186815  15624192   7.5G fd Linux raid autodetect
/dev/sdb3  *    17186816 937701375 920514560 438.9G fd Linux raid autodetect

>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Created a new GPT disklabel (GUID: C459DAA9-7EC0-4A42-AB2E-38CF45F911D0).
/dev/sdb1: Created a new partition 1 of type 'BIOS boot' and of size 285 MiB.
/dev/sdb2: Created a new partition 2 of type 'Linux RAID' and of size 446.9 GiB.
/dev/sdb3: Done.

New situation:
Disklabel type: gpt
Disk identifier: C459DAA9-7EC0-4A42-AB2E-38CF45F911D0

Device      Start       End   Sectors   Size Type
/dev/sdb1    2048    585727    583680   285M BIOS boot
/dev/sdb2  585728 937701375 937115648 446.9G Linux RAID

The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
root@dbproxy2005:~# mdadm /dev/md0 --add /dev/sdb2
mdadm: added /dev/sdb2
root@dbproxy2005:~# cat /proc/mdstat
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb2[2] sda2[0]
      468425728 blocks super 1.2 [2/1] [U_]
      [>....................]  recovery =  0.2% (1323264/468425728) finish=376.5min speed=20676K/sec

unused devices: <none>
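The `finish` figure in the recovery line can be sanity-checked by hand: it is roughly the remaining blocks divided by the current speed, where both the block counts and the speed are in KiB. A minimal sketch of that arithmetic, with a hypothetical function name `recovery_eta_min`, is below. It is a back-of-the-envelope estimate, and the truncation to tenths of a minute is an assumption chosen to match how the figure is displayed.

```shell
#!/bin/sh
# Hypothetical helper: rough recovery ETA in minutes (one decimal, truncated).
# Arguments: blocks done, blocks total (both KiB), speed in KiB/s.
recovery_eta_min() {
    done_k=$1
    total_k=$2
    speed_k=$3
    # Avoid dividing by zero when the resync is stalled.
    [ "$speed_k" -gt 0 ] || return 1
    secs=$(( (total_k - done_k) / speed_k ))
    echo "$(( secs / 60 )).$(( (secs % 60) / 6 ))"
}

# Values from the recovery line above: (1323264/468425728) speed=20676K/sec
recovery_eta_min 1323264 468425728 20676   # prints 376.5
```

With the numbers from the recovery line this gives 376.5 minutes, a bit over six hours, which agrees with the `finish=376.5min` reported by the kernel.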

Let's see if it finishes cleanly.

Progressing nicely:

root@dbproxy2005:~# cat /proc/mdstat
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb2[2] sda2[0]
      468425728 blocks super 1.2 [2/1] [U_]
      [====>................]  recovery = 21.0% (98628096/468425728) finish=298.9min speed=20615K/sec

unused devices: <none>

All good

root@dbproxy2005:~# cat /proc/mdstat
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb2[2] sda2[0]
      468425728 blocks super 1.2 [2/2] [UU]

unused devices: <none>