
Degraded RAID on restbase2014
Closed, Resolved (Public)

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host restbase2014. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 6, Working: 6, Failed: 0, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md1 : active raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [3/2] [UU_]
      
md0 : active raid1 sda1[0] sdb1[1]
      29279232 blocks super 1.2 [3/2] [UU_]
      
md2 : active raid1 sda3[0] sdb3[1]
      43912192 blocks super 1.2 [3/2] [UU_]
      
unused devices: <none>
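
For context on the notation above: "[3/2]" means each array expects three member devices but only two are active, and the trailing underscore in "[UU_]" marks the missing slot. A minimal check (a sketch, not part of the auto-generated report) to confirm which member dropped out of each array:

$ for md in md0 md1 md2; do sudo mdadm --detail /dev/$md | grep -E 'State :|removed'; done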

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2020-04-13T06:36:19Z] <elukey> temporary stopped puppet on restbase2014 to avoid attempts to start cassandra on each run - T250050

MoritzMuehlenhoff added a subscriber: hnowlan.

@Eevans this is the weekend of broken cassandra hosts, adding you as FYI :)

Thanks :) And thank you for taking a look over the weekend, it is much appreciated!

OK, so it seems like we have a failed SSD (/dev/sdc) and, as a result, some degraded arrays. Ideally we'd be able to replace the SSD and rebuild the arrays, but we are using the /dev/sd[x]4 partitions on these machines as a JBOD for Cassandra. Unfortunately, Cassandra distributes its own system tables over these devices as well, and isn't recoverable after losing a chunk of them like this.

TL;DR: From a Cassandra perspective, this host is a total loss.
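
"JBOD for Cassandra" here means each instance lists the per-disk mount points as separate data_file_directories entries in its cassandra.yaml, so losing one disk takes out a slice of every table, including the system tables. A sketch of what that looks like (the paths in the example output are illustrative assumptions, not the actual host configuration):

$ grep -A 3 data_file_directories /etc/cassandra-a/cassandra.yaml   # output below is illustrative
data_file_directories:
    - /srv/sda4/cassandra-a
    - /srv/sdb4/cassandra-a
    - /srv/sdc4/cassandra-a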

I have done the following:

I removed the guard files to prevent Cassandra from restarting...

eevans@restbase2014:~$ for i in a b c; do sudo rm /etc/cassandra-$i/service-enabled; done
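
As a sanity check (a sketch, not from the original comment; the cassandra-a/b/c unit names are an assumption inferred from the per-instance config directories above), one could confirm the guard files are gone and the instances are down:

$ for i in a b c; do test -e /etc/cassandra-$i/service-enabled && echo "guard still present: $i"; systemctl is-active cassandra-$i; done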

...and started a removenode operation of restbase2014-a (from restbase2013):

eevans@restbase2013:~$ c-any-nt status -r | grep DN
DN  restbase2014-a.codfw.wmnet  570.55 GiB  256          7.0%              307ab9bb-a301-4672-9077-52e8326f8325  b
DN  restbase2014-b.codfw.wmnet  575.79 GiB  256          7.1%              0c2e3790-7d07-4857-b7d4-4078fc58478e  b
DN  restbase2014-c.codfw.wmnet  566.74 GiB  256          6.8%              625d2903-ba2b-48ee-942a-de2220e8968b  b
eevans@restbase2013:~$ c-any-nt removenode 307ab9bb-a301-4672-9077-52e8326f8325
...

This will cause tokens belonging to restbase2014-a to be redistributed to the remaining nodes, and the data streamed from the surviving replicas. It can be monitored using:

eevans@restbase2013:~$ c-any-nt removenode status
RemovalStatus: Removing token (-9212181637719911284). Waiting for replication confirmation from [/10.192.32.192,/10.192.32.193,/10.192.48.68,/10.192.48.69,/10.192.48.70,/10.192.16.82,/10.192.16.83,/10.192.16.84,/10.192.16.98,/10.64.48.98,/10.192.16.99,/10.64.48.99,/10.192.16.100,/10.64.48.100,/10.64.0.101,/10.64.0.102,/10.64.0.103,/10.64.0.105,/10.192.32.105,/10.64.0.106,/10.192.32.108,/10.192.32.111,/10.64.16.114,/10.64.16.115,/10.64.16.116,/10.64.16.118,/10.64.16.119,/10.192.32.119,/10.64.16.120,/10.192.32.120,/10.192.48.121,/10.192.32.121,/10.64.16.122,/10.192.48.122,/10.192.48.123,/10.64.16.123,/10.64.16.124,/10.192.48.124,/10.192.48.125,/10.64.16.126,/10.192.48.126,/10.64.48.126,/10.64.16.127,/10.64.48.127,/10.64.48.128,/10.64.16.128,/10.192.48.142,/10.192.48.143,/10.192.48.144,/10.64.0.146,/10.64.0.148,/10.64.0.149,/10.64.0.150,/10.192.32.22,/10.192.32.152,/10.192.16.153,/10.192.32.153,/10.192.32.25,/10.192.16.154,/10.192.32.154,/10.192.16.155,/10.64.0.32,/10.64.0.33,/10.64.0.34,/10.192.32.175,/10.64.48.180,/10.64.48.181,/10.192.48.54,/10.64.48.182,/10.192.48.55,/10.192.48.56,/10.64.48.184,/10.64.48.185,/10.192.16.186,/10.64.48.186,/10.192.16.187,/10.192.16.188,/10.192.32.191].
eevans@restbase2013:~$
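
Since removenode status has to be polled by hand, a small loop can wait for the removal to finish before starting on the next instance (a sketch; it assumes c-any-nt passes these subcommands through to nodetool, whose idle output is "RemovalStatus: No token removals in process."):

$ until c-any-nt removenode status | grep -q 'No token removals in process'; do sleep 300; done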

Once complete, we'll need to do the b & c instances as well. Once that is complete, we can either re-image the node entirely, or replace the SSD, rebuild the arrays, and then completely wipe Cassandra state (I'd prefer the former for repeatability's sake, but defer to SRE here). I'll update the ticket when we're at that point.
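
For reference, the "replace the SSD and rebuild the arrays" path would look roughly like the sketch below (it assumes the replacement disk enumerates as /dev/sdc again; in the end the host was re-imaged instead):

$ sudo sfdisk -d /dev/sda | sudo sfdisk /dev/sdc    # clone the MBR partition layout onto the new disk
$ sudo mdadm /dev/md0 --add /dev/sdc1               # re-add the members; each array resyncs
$ sudo mdadm /dev/md1 --add /dev/sdc2
$ sudo mdadm /dev/md2 --add /dev/sdc3
$ cat /proc/mdstat                                  # watch the rebuild progress

The Cassandra JBOD partition (/dev/sdc4) is not part of any md array, so it would be reformatted and mounted separately as part of wiping the Cassandra state.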

@Eevans the iDRAC is not showing any failed drive. Is it possible for you to get me some system logs showing the bad disk, so I can upload them when I ask for a disk replacement? The last log I have for this system from the iDRAC is from 2018.

Also, I need to clear the log and upgrade the firmware on this system.

Current versions:

BIOS Version 1.5.6
iDRAC Firmware Version 3.21.21.21

New versions:

BIOS Version 2.5.4
iDRAC Firmware Version 4.10.10

Please let me know when I can do this.

Thanks

[ ... ]

This is now done; Cassandra on restbase2014 has been decommissioned from the cluster.
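
A quick confirmation (a sketch, using the same wrapper as the earlier status check) that the decommission took: the ring should no longer show any down instances.

$ c-any-nt status -r | grep DN || echo "no down instances remain in the ring"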

It looks like the machine was just rebooted:

eevans@restbase2014:~$ date -R; uptime
Wed, 15 Apr 2020 16:38:20 +0000
 16:38:20 up 14 min,  1 user,  load average: 1.96, 2.02, 1.56
eevans@restbase2014:~$
RAID status
eevans@restbase2014:~$ for i in 0 1 2; do sudo mdadm --detail /dev/md$i; done
/dev/md0:
        Version : 1.2
  Creation Time : Mon Dec  3 09:06:13 2018
     Raid Level : raid1
     Array Size : 29279232 (27.92 GiB 29.98 GB)
  Used Dev Size : 29279232 (27.92 GiB 29.98 GB)
   Raid Devices : 3
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Wed Apr 15 16:30:38 2020
          State : clean, degraded 
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : restbase2014:0  (local to host restbase2014)
           UUID : 24e4c097:b9503e39:0def7b3f:d892cc3c
         Events : 104064

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       -       0        0        2      removed
/dev/md1:
        Version : 1.2
  Creation Time : Mon Dec  3 09:06:13 2018
     Raid Level : raid1
     Array Size : 976320 (953.44 MiB 999.75 MB)
  Used Dev Size : 976320 (953.44 MiB 999.75 MB)
   Raid Devices : 3
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Mon Apr 13 03:11:24 2020
          State : clean, degraded 
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : restbase2014:1  (local to host restbase2014)
           UUID : 546eff57:2569a3d2:3bdc9bf7:1e04610e
         Events : 49

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       8       18        1      active sync   /dev/sdb2
       -       0        0        2      removed
/dev/md2:
        Version : 1.2
  Creation Time : Mon Dec  3 09:06:13 2018
     Raid Level : raid1
     Array Size : 43912192 (41.88 GiB 44.97 GB)
  Used Dev Size : 43912192 (41.88 GiB 44.97 GB)
   Raid Devices : 3
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Wed Apr 15 16:24:04 2020
          State : clean, degraded 
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : restbase2014:2  (local to host restbase2014)
           UUID : cb4dc21c:3d595ced:995f5d3e:02afe99a
         Events : 475

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       1       8       19        1      active sync   /dev/sdb3
       -       0        0        2      removed
eevans@restbase2014:~$
SSDs
eevans@restbase2014:~$ for i in a b c; do sudo mdadm --examine /dev/sd$i; done
/dev/sda:
   MBR Magic : aa55
Partition[0] :     58591232 sectors at         2048 (type fd)
Partition[1] :      1953792 sectors at     58593280 (type fd)
Partition[2] :     87889920 sectors at     60547072 (type fd)
Partition[3] :   3602311168 sectors at    148436992 (type 83)
/dev/sdb:
   MBR Magic : aa55
Partition[0] :     58591232 sectors at         2048 (type fd)
Partition[1] :      1953792 sectors at     58593280 (type fd)
Partition[2] :     87889920 sectors at     60547072 (type fd)
Partition[3] :   3602311168 sectors at    148436992 (type 83)
mdadm: No md superblock detected on /dev/sdc.
eevans@restbase2014:~$

From the logs, prior to the reboot:

Apr 15 03:30:20 restbase2014 smartd[914]: Device: /dev/sdc [SAT], open() failed: No such device
Apr 15 03:30:20 restbase2014 smartd[914]: Sending warning via /usr/share/smartmontools/smartd-runner to root ...
Apr 15 03:30:20 restbase2014 smart_failure: This message was generated by the smartd daemon running on:
Apr 15 03:30:20 restbase2014 smart_failure: 
Apr 15 03:30:20 restbase2014 smart_failure:    host name:  restbase2014
Apr 15 03:30:20 restbase2014 smart_failure:    DNS domain: codfw.wmnet
Apr 15 03:30:20 restbase2014 smart_failure: 
Apr 15 03:30:20 restbase2014 smart_failure: The following warning/error was logged by the smartd daemon:
Apr 15 03:30:20 restbase2014 smart_failure: 
Apr 15 03:30:20 restbase2014 smart_failure: Device: /dev/sdc [SAT], unable to open device
Apr 15 03:30:20 restbase2014 smart_failure: 
Apr 15 03:30:20 restbase2014 smart_failure: Device info:
Apr 15 03:30:20 restbase2014 smart_failure: MZ7LM1T9HMJP0D3, S/N:S37PNB0K901485, WWN:5-002538-c40b5f21f, FW:GC5B, 1.92 TB
Apr 15 03:30:20 restbase2014 smart_failure: 
Apr 15 03:30:20 restbase2014 smart_failure: For details see host's SYSLOG.
Apr 15 03:30:20 restbase2014 smart_failure: 
Apr 15 03:30:20 restbase2014 smart_failure: You can also use the smartctl utility for further investigation.
Apr 15 03:30:20 restbase2014 smart_failure: The original message about this issue was sent at Mon Apr 13 03:30:20 2020 UTC
Apr 15 03:30:20 restbase2014 smart_failure: Another message will be sent in 24 hours if the problem persists.

@Eevans yes it was, and I am working on it. I logged the message about this at 15:19 today. Thanks.
15:19 papaul: upgrading firmware on restbase2014

Create Dispatch: Success
You have successfully submitted request SR1023108711.

@Eevans /dev/sdc has been replaced. Let me know if you have any questions.

Thanks @Papaul; /cc @hnowlan

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

restbase2014.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202004211512_dzahn_66242_restbase2014_codfw_wmnet.log.

Completed auto-reimage of hosts:

['restbase2014.codfw.wmnet']

Of which those FAILED:

['restbase2014.codfw.wmnet']

Mentioned in SAL (#wikimedia-operations) [2020-04-21T16:49:18Z] <urandom> bootstrapping restbase2014-a — T250050

Mentioned in SAL (#wikimedia-operations) [2020-04-21T19:02:14Z] <urandom> bootstrapping restbase2014-b — T250050

Mentioned in SAL (#wikimedia-operations) [2020-04-21T21:50:46Z] <urandom> bootstrapping restbase2014-c — T250050
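
While an instance is bootstrapping, progress can be followed with the same wrapper used earlier (a sketch): the joining instance shows up with state UJ in the status output and flips to UN once streaming completes.

$ c-any-nt status -r | grep restbase2014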

AFAIK, this is complete.

@Eevans Thank you.

Below is the tracking information for the returned disk: