Degraded RAID on thumbor2002
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host thumbor2002. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 5, Working: 5, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_md
Personalities : [raid1] 
md2 : active raid1 sda3[0] sdb3[1]
      438449152 blocks super 1.2 [2/2] [UU]
      bitmap: 2/4 pages [8KB], 65536KB chunk

md1 : active raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]
      
md0 : active raid1 sda1[0](F) sdb1[1]
      48794624 blocks super 1.2 [2/1] [_U]
      
unused devices: <none>
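
The snapshot above can be read mechanically: an array is degraded when the second number in its `[n/m]` status is lower than the first, and a failed member carries an `(F)` suffix. A minimal Python sketch of that reading follows. The function name `degraded_arrays` and the `SAMPLE` constant are illustrative assumptions, and this is not the logic of the actual `get_raid_status_md` plugin.

```python
import re

# Excerpt of the /proc/mdstat snapshot from this task.
SAMPLE = """\
Personalities : [raid1]
md2 : active raid1 sda3[0] sdb3[1]
      438449152 blocks super 1.2 [2/2] [UU]
      bitmap: 2/4 pages [8KB], 65536KB chunk

md1 : active raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sda1[0](F) sdb1[1]
      48794624 blocks super 1.2 [2/1] [_U]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return {array: [members flagged (F)]} for every degraded md array."""
    result = {}
    current = None
    members = []
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+) : (.*)$", line)
        if header:
            current = header.group(1)
            # Members look like "sda1[0]" or "sda1[0](F)".
            members = re.findall(r"(\w+)\[\d+\](\(F\))?", header.group(2))
            continue
        status = re.search(r"\[(\d+)/(\d+)\] \[([U_]+)\]", line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                result[current] = [name for name, flag in members if flag]
            current = None
    return result


if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

For the snapshot above this reports only `md0`, with `sda1` as the faulty member.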

Event Timeline

Restricted Application added a subscriber: Aklapper. Jan 28 2019, 4:05 AM

Mentioned in SAL (#wikimedia-operations) [2019-01-29T21:52:55Z] <jijiki> Depooling thumbor2002 due to disc failure - T214813

jijiki triaged this task as Medium priority. Jan 29 2019, 9:56 PM
RobH assigned this task to Papaul. Jan 29 2019, 9:59 PM
RobH added a subscriber: RobH.

This system just left warranty this month, so any disk swaps will have to be done with on-site spares (so there is nothing to do remotely and no support case to file, just a disk swap).

Papaul reassigned this task from Papaul to RobH. Feb 4 2019, 4:55 PM
Papaul added a subscriber: Papaul.

Can you please update this task with which disk failed? Thanks.

RobH reassigned this task from RobH to Papaul. Feb 4 2019, 5:05 PM

OK, here are the full commands (so you can also run them in the future as needed):

robh@thumbor2002:~$ cat /proc/mdstat 
Personalities : [raid1] 
md2 : active raid1 sda3[0](F) sdb3[1]
      438449152 blocks super 1.2 [2/1] [_U]
      bitmap: 3/4 pages [12KB], 65536KB chunk

md1 : active raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]
      
md0 : active raid1 sda1[0](F) sdb1[1]
      48794624 blocks super 1.2 [2/1] [_U]
      
unused devices: <none>
robh@thumbor2002:~$ mdadm --detail /dev/md0
-bash: mdadm: command not found
robh@thumbor2002:~$ sudo mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Fri Jun 23 08:33:45 2017
     Raid Level : raid1
     Array Size : 48794624 (46.53 GiB 49.97 GB)
  Used Dev Size : 48794624 (46.53 GiB 49.97 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Mon Feb  4 17:02:01 2019
          State : clean, degraded 
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           Name : thumbor2002:0  (local to host thumbor2002)
           UUID : db0dd5f7:81b98aef:d95c05ee:4017a42d
         Events : 318286

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       17        1      active sync   /dev/sdb1

       0       8        1        -      faulty   /dev/sda1
robh@thumbor2002:~$ sudo mdadm --detail /dev/md2
/dev/md2:
        Version : 1.2
  Creation Time : Fri Jun 23 08:33:45 2017
     Raid Level : raid1
     Array Size : 438449152 (418.14 GiB 448.97 GB)
  Used Dev Size : 438449152 (418.14 GiB 448.97 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Mon Feb  4 14:17:16 2019
          State : clean, degraded 
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           Name : thumbor2002:2  (local to host thumbor2002)
           UUID : a7c34a52:b3e287a5:8893f0fb:fe2f8943
         Events : 13002

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       19        1      active sync   /dev/sdb3

       0       8        3        -      faulty   /dev/sda3

robh@thumbor2002:~$ sudo lshw  -class disk
  *-disk                  
       description: ATA Disk
       product: WDC WD5003ABYX-1
       vendor: Western Digital
       physical id: 0.0.0
       bus info: scsi@0:0.0.0
       logical name: /dev/sda
       version: 1S05
       serial: WD-WMAYP0E607DT
       size: 465GiB (500GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=512 signature=78051cb0
  *-disk
       description: ATA Disk
       product: WDC WD5003ABYX-1
       vendor: Western Digital
       physical id: 0.0.0
       bus info: scsi@1:0.0.0
       logical name: /dev/sdb
       version: 1S05
       serial: WD-WMAYP0EAAZ87
       size: 465GiB (500GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=512 signature=fa47f97d

So we can see that sda has failed on this system: both /dev/sda1 (md0) and /dev/sda3 (md2) are marked faulty. Do you have any on-site spares that would work to replace it? It needs to be a 500GB SATA disk (or larger). If you don't have any spares (500GB-1TB) on the shelf, we'll need to re-evaluate whether we want to keep swapping hardware on an out-of-warranty host or replace the entire host. (This seems like the first disk failure on this host in a while though, so it is likely best to just replace the faulty disk.)
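
To hand the on-site technician a serial number rather than a device name, the faulty members from `mdadm --detail` can be matched against the `lshw -class disk` output. The following Python sketch shows one way to do that. The helper names `disk_serials` and `parent_disk` are hypothetical, and `parent_disk` assumes classic `sdX` naming (it would not handle names such as `nvme0n1p1`).

```python
import re

# Trimmed excerpt of the `lshw -class disk` output from this task.
LSHW_SAMPLE = """\
  *-disk
       description: ATA Disk
       logical name: /dev/sda
       version: 1S05
       serial: WD-WMAYP0E607DT
  *-disk
       description: ATA Disk
       logical name: /dev/sdb
       version: 1S05
       serial: WD-WMAYP0EAAZ87
"""


def disk_serials(lshw_text):
    """Map each logical name (e.g. /dev/sda) to the serial listed after it."""
    serials = {}
    name = None
    for line in lshw_text.splitlines():
        key, _, value = line.strip().partition(": ")
        if key == "logical name":
            name = value
        elif key == "serial" and name:
            serials[name] = value
            name = None
    return serials


def parent_disk(partition):
    """Strip the trailing partition number: sda3 -> /dev/sda (sdX naming only)."""
    return "/dev/" + re.sub(r"\d+$", "", partition)


if __name__ == "__main__":
    serials = disk_serials(LSHW_SAMPLE)
    for member in ("sda1", "sda3"):  # faulty members reported by mdadm above
        disk = parent_disk(member)
        print(member, disk, serials[disk])
```

Both faulty members resolve to the same physical disk, which is why a single swap is enough here.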

RobH added a comment. Feb 4 2019, 5:15 PM

In checking the DC spares tracking, it shows 11 500GB SATA disks in codfw spare hardware. If this isn't right, please update this task and the tracking sheet.

Thanks!

Papaul added a comment. Feb 4 2019, 5:48 PM

The disk with serial number WMAYP0E607DT has been replaced. The server cannot find a boot device, so it cannot boot into the OS after the disk replacement.

Papaul added a subscriber: jcrespo. Feb 4 2019, 6:18 PM

I put back the bad disk and booted the system, and it booted into the OS with no problem. As @jcrespo and others mentioned on IRC, it looks like GRUB is installed only on /dev/sda, which is the disk that needs to be replaced. We need to fix this issue first so that I can replace the disk.
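
The usual remedy for this situation is to run `grub-install` against the surviving disk as well (here `/dev/sdb`), so that either RAID1 member can boot the host. The thread does not record which commands were actually run. As a rough way to check beforehand which disks carry GRUB boot code, the first sector of each disk can be inspected. The Python sketch below is a heuristic only: it assumes BIOS/MBR booting (`lshw` reports `partitioned:dos`) and relies on GRUB's boot sector containing the literal string `GRUB`. The function names are hypothetical.

```python
def mbr_has_grub(sector):
    """Heuristic: boot signature present and GRUB's marker string in the boot code."""
    if len(sector) < 512:
        raise ValueError("need at least 512 bytes")
    has_signature = sector[510:512] == b"\x55\xaa"
    # The bootstrap code area occupies the first 446 bytes of an MBR.
    return has_signature and b"GRUB" in sector[:446]


def read_mbr(device):
    """Read the first sector of a block device; needs root on a real host."""
    with open(device, "rb") as handle:
        return handle.read(512)


if __name__ == "__main__":
    for device in ("/dev/sda", "/dev/sdb"):
        try:
            print(device, mbr_has_grub(read_mbr(device)))
        except OSError as error:
            print(device, "unreadable:", error)
```

A `False` for `/dev/sdb` would be consistent with the boot failure seen after `/dev/sda` was pulled.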

jijiki added a subscriber: jijiki. Feb 4 2019, 6:28 PM

Ack, I will do it tomorrow, thank you @Papaul!

@Papaul At your convenience, please replace the drive again; the machine should boot. The server has already been depooled since last week.

If the server fails to boot, please leave it as is. We are planning to upgrade those servers to stretch, so I reckon we could kick this process off with thumbor2002.

jijiki moved this task from Backlog to Doing on the serviceops board. Feb 6 2019, 12:10 PM
Papaul reassigned this task from Papaul to jijiki. Feb 6 2019, 3:53 PM

Disk replaced, server didn't boot up.

jijiki added a comment. Feb 6 2019, 3:59 PM

@Papaul Thank you! I will reimage this server, so there is no need to spend more time on it.

jijiki moved this task from Backlog/Radar to St on the User-jijiki board. Feb 8 2019, 10:55 AM
jijiki moved this task from St to In Progress on the User-jijiki board. Feb 14 2019, 11:01 AM

Server will be re-imaged to stretch as part of upgrading Thumbor servers to stretch - T214597

jijiki reopened this task as Open. Feb 18 2019, 12:51 PM
jijiki added a subscriber: elukey.

@Papaul I am unable to reimage the server because PXE boot is failing. Server says:

Broadcom UNDI PXE-2.1 v16.4.3
Copyright (C) 2000-2014 Broadcom Corporation
Copyright (C) 1997-2000 Intel Corporation
All rights reserved.
PXE-E61: Media test failure, check cable
PXE-M0F: Exiting Broadcom PXE ROM.

I also checked the status of its port on the switch with @elukey, where the switch reports no link. Can you please take a look if you get the chance?

The network cable was plugged back in after the disk replacement. Should be good now.

jijiki closed this task as Resolved. Feb 19 2019, 5:16 PM

@Papaul Thank you, it works now :)