Degraded RAID on thumbor2002
Closed, Resolved (Public)

Assigned To

Authored By

	ops-monitoring-bot
	Jan 28 2019, 4:05 AM

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host thumbor2002. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 5, Working: 5, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_md
Personalities : [raid1] 
md2 : active raid1 sda3[0] sdb3[1]
      438449152 blocks super 1.2 [2/2] [UU]
      bitmap: 2/4 pages [8KB], 65536KB chunk

md1 : active raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]
      
md0 : active raid1 sda1[0](F) sdb1[1]
      48794624 blocks super 1.2 [2/1] [_U]
      
unused devices: <none>

Related Objects

Status	Assigned	Task
Resolved	jijiki	T170817 Upgrade Thumbor servers to Stretch
Resolved	jijiki	T214597 Thumbor upgrade to stretch plan
Resolved	jijiki	T214813 Degraded RAID on thumbor2002
Resolved	jijiki	T216494 Deploy 3d2png to thumbor servers (stretch)
Resolved	fgiunchedi	T216807 Meta Swift container rights incorrect for thumbor user

Event Timeline

ops-monitoring-bot added projects: ops-codfw, SRE. Jan 28 2019, 4:05 AM

ops-monitoring-bot subscribed.

Restricted Application added a subscriber: Aklapper. Jan 28 2019, 4:05 AM

Peachey88 merged a task: T214814: Degraded RAID on thumbor2002. Jan 28 2019, 5:32 AM

Kizule merged a task: T214815: Degraded RAID on thumbor2002. Jan 28 2019, 4:31 PM

jijiki added projects: User-jijiki, serviceops. Jan 29 2019, 9:51 PM

Mentioned in SAL (#wikimedia-operations) [2019-01-29T21:52:55Z] <jijiki> Depooling thumbor2002 due to disc failure - T214813

jijiki triaged this task as Medium priority. Jan 29 2019, 9:56 PM

This system just left warranty this month, so any disk swaps will have to be done with on-site spares (so nothing to do remotely, no support cases to file, just disk swap.)

Can you please update this task with which disk failed? Thanks

OK, here are the full commands (so you can also run them in future as needed):

robh@thumbor2002:~$ cat /proc/mdstat 
Personalities : [raid1] 
md2 : active raid1 sda3[0](F) sdb3[1]
      438449152 blocks super 1.2 [2/1] [_U]
      bitmap: 3/4 pages [12KB], 65536KB chunk

md1 : active raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]
      
md0 : active raid1 sda1[0](F) sdb1[1]
      48794624 blocks super 1.2 [2/1] [_U]
      
unused devices: <none>
robh@thumbor2002:~$ mdadm --detail /dev/md0
-bash: mdadm: command not found
robh@thumbor2002:~$ sudo mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Fri Jun 23 08:33:45 2017
     Raid Level : raid1
     Array Size : 48794624 (46.53 GiB 49.97 GB)
  Used Dev Size : 48794624 (46.53 GiB 49.97 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Mon Feb  4 17:02:01 2019
          State : clean, degraded 
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           Name : thumbor2002:0  (local to host thumbor2002)
           UUID : db0dd5f7:81b98aef:d95c05ee:4017a42d
         Events : 318286

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       17        1      active sync   /dev/sdb1

       0       8        1        -      faulty   /dev/sda1
robh@thumbor2002:~$ sudo mdadm --detail /dev/md2
/dev/md2:
        Version : 1.2
  Creation Time : Fri Jun 23 08:33:45 2017
     Raid Level : raid1
     Array Size : 438449152 (418.14 GiB 448.97 GB)
  Used Dev Size : 438449152 (418.14 GiB 448.97 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Mon Feb  4 14:17:16 2019
          State : clean, degraded 
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           Name : thumbor2002:2  (local to host thumbor2002)
           UUID : a7c34a52:b3e287a5:8893f0fb:fe2f8943
         Events : 13002

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       19        1      active sync   /dev/sdb3

       0       8        3        -      faulty   /dev/sda3

robh@thumbor2002:~$ sudo lshw  -class disk
  *-disk                  
       description: ATA Disk
       product: WDC WD5003ABYX-1
       vendor: Western Digital
       physical id: 0.0.0
       bus info: scsi@0:0.0.0
       logical name: /dev/sda
       version: 1S05
       serial: WD-WMAYP0E607DT
       size: 465GiB (500GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=512 signature=78051cb0
  *-disk
       description: ATA Disk
       product: WDC WD5003ABYX-1
       vendor: Western Digital
       physical id: 0.0.0
       bus info: scsi@1:0.0.0
       logical name: /dev/sdb
       version: 1S05
       serial: WD-WMAYP0EAAZ87
       size: 465GiB (500GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=512 signature=fa47f97d

So we can see that sda has failed on this system: both of its degraded arrays (md0 and md2) list an sda partition as faulty. Do you have any on-site spares that would work to replace it? It needs to be a 500GB SATA disk (or larger). If you don't have any spares (500GB-1TB) on the shelf, we'll need to re-evaluate whether we want to keep swapping hardware on an out-of-warranty host or replace the entire host. (This seems like the first disk failure on this host in a while though, so it is likely best to just replace the faulty disk.)

In checking the DC spares tracking, it shows 11 500GB SATA disks in codfw spare hardware. If this isn't right, please update the task and the tracking sheet.

Thanks!
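
As a minimal sketch of the check above, the failed members can be pulled out of mdstat-formatted text with awk. The function name `failed_md_members` is hypothetical (it is not an existing tool on these hosts), and the sketch relies only on the `(F)` flag that /proc/mdstat appends to a faulty member.

```shell
#!/bin/sh
# Hypothetical helper: list failed md members, one "array member" pair per line.
# Reads mdstat-formatted text on stdin, so it also works on saved output.
failed_md_members() {
    awk '
        # Array lines look like: "md0 : active raid1 sda1[0](F) sdb1[1]"
        $1 ~ /^md[0-9]+$/ && $2 == ":" {
            for (i = 3; i <= NF; i++) {
                if ($i ~ /\(F\)$/) {
                    member = $i
                    sub(/\[[0-9]+\]\(F\)$/, "", member)  # strip "[0](F)"
                    print $1, member
                }
            }
        }
    '
}

# Usage on a live host:
#   failed_md_members < /proc/mdstat
```

Fed the /proc/mdstat output pasted in this task, it would print `md2 sda3` and `md0 sda1`, both partitions of sda.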

The disk with serial number WMAYP0E607DT has been replaced. The server cannot find a boot device, so it does not boot into the OS after the disk replacement.

RobH mentioned this in T215183: Redundant bootloaders for software RAID. Feb 4 2019, 6:06 PM

I put the bad disk back and booted the system, and it booted into the OS with no problem. It looks like, as @jcrespo and others mentioned on IRC, GRUB is installed only on /dev/sda, which is the disk that needs to be replaced. We need to fix this first so that I can replace the disk.
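
One common way to avoid this on a RAID1 boot setup is to install GRUB on every member disk, so that either disk can boot alone. The sketch below only prints the `grub-install` commands for review; it does not run them. It assumes BIOS/MBR boot with GRUB 2 and sdX-style device names. The function name `grub_install_commands` is hypothetical, and this task does not record which commands were actually run on thumbor2002.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) one grub-install command per
# physical disk backing the md arrays. Reads mdstat-formatted text on stdin.
# Assumption: members are named like sda1 or sdb3 (disk name + partition
# number); NVMe-style names such as nvme0n1p1 are not handled.
grub_install_commands() {
    awk '
        $1 ~ /^md[0-9]+$/ && $2 == ":" {
            for (i = 3; i <= NF; i++) {
                if ($i ~ /^sd[a-z]+[0-9]+\[/) {
                    disk = $i
                    sub(/[0-9]+\[.*$/, "", disk)   # "sda1[0](F)" -> "sda"
                    disks[disk] = 1
                }
            }
        }
        END { for (d in disks) print "grub-install /dev/" d }
    ' | sort
}

# Usage on a live host (review the output before running anything as root):
#   grub_install_commands < /proc/mdstat
```

On this host the output would name /dev/sda and /dev/sdb. Before the swap, only the command for the surviving disk (/dev/sdb) matters, and it would be run with sudo.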

Ack, I will do it tomorrow, thank you @Papaul !

jijiki merged a task: T215185: Degraded RAID on thumbor2002. Feb 5 2019, 1:33 PM

@Papaul At your convenience, please replace the drive again; the machine should boot. The server has been depooled since last week.

If the server fails to boot, please leave it as is. We are planning to upgrade those servers to stretch, I reckon we could kick this process off with thumbor2002.

jijiki moved this task from Incoming 🐫 to Doing 😎 on the serviceops board. Feb 6 2019, 12:10 PM

Disk replaced, server didn't boot up.

@Papaul Thank you! I will reimage this server, no need to spend more time on it

jijiki moved this task from Incoming 🐅 to St on the User-jijiki board. Feb 8 2019, 10:55 AM

jijiki moved this task from St to In Progress 🏋️‍♀️ on the User-jijiki board. Feb 14 2019, 11:01 AM

Server will be re-imaged to stretch as part of upgrading Thumbor servers to stretch - T214597

jijiki closed this task as Resolved. Feb 18 2019, 10:53 AM

jijiki added parent tasks: T214597: Thumbor upgrade to stretch plan, T170817: Upgrade Thumbor servers to Stretch.

@Papaul I am unable to reimage the server because PXE boot is failing. Server says:

Broadcom UNDI PXE-2.1 v16.4.3
Copyright (C) 2000-2014 Broadcom Corporation
Copyright (C) 1997-2000 Intel Corporation
All rights reserved.
PXE-E61: Media test failure, check cable
PXE-M0F: Exiting Broadcom PXE ROM.

I also checked the status of its port on the switch with @elukey, and the switch reports no link. Can you please take a look if you get the chance?

The network cable was plugged back in after the disk replacement. Should be good now.

@Papaul Thank you, it works now:)
