Degraded RAID on restbase1018
Closed, Duplicate · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host restbase1018. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

Personalities : [raid1] [raid0] 
md2 : active raid0 sda3[0] sdd3[3] sdc3[2] sdb3[1]
      6129635328 blocks super 1.2 512k chunks
      
md1 : active (auto-read-only) raid1 sda2[0] sdd2[3] sdc2[2] sdb2[1]
      976320 blocks super 1.2 [4/4] [UUUU]
      
md0 : active raid1 sda1[0] sdd1[3] sdc1[2](F) sdb1[1]
      29279232 blocks super 1.2 [4/3] [UU_U]
      
unused devices: <none>
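
For readers unfamiliar with the snapshot format, here is a minimal Python sketch of how a degraded array can be picked out of /proc/mdstat-style text. It is not the Icinga RAID event handler's actual code; the function names `parse_mdstat` and `degraded_arrays` are hypothetical, and the parser only covers the line shapes that appear in the snapshot above. A `[4/3]` status means 4 members are expected and 3 are active, and `(F)` marks a member flagged as faulty. RAID0 arrays carry no such status, so they never show up as degraded here.

```python
import re

# Snapshot copied from the task description above.
MDSTAT = """\
Personalities : [raid1] [raid0]
md2 : active raid0 sda3[0] sdd3[3] sdc3[2] sdb3[1]
      6129635328 blocks super 1.2 512k chunks

md1 : active (auto-read-only) raid1 sda2[0] sdd2[3] sdc2[2] sdb2[1]
      976320 blocks super 1.2 [4/4] [UUUU]

md0 : active raid1 sda1[0] sdd1[3] sdc1[2](F) sdb1[1]
      29279232 blocks super 1.2 [4/3] [UU_U]

unused devices: <none>
"""

MEMBER_RE = re.compile(r"\w+\[\d+\](\(\w\))*")        # e.g. sdc1[2](F)
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")  # e.g. [4/3] [UU_U]


def parse_mdstat(text):
    """Return {array: {level, members, failed, expected, active}}."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"^(md\d+) : (.*)$", line)
        if header:
            name, rest = header.groups()
            tokens = rest.split()
            members = [t for t in tokens if MEMBER_RE.fullmatch(t)]
            current = {
                "level": next((t for t in tokens if t.startswith("raid")), None),
                "members": [m.split("[")[0] for m in members],
                "failed": [m.split("[")[0] for m in members if "(F)" in m],
                "expected": None,  # stays None for RAID0 (no [n/m] status)
                "active": None,
            }
            arrays[name] = current
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            current["expected"] = int(status.group(1))
            current["active"] = int(status.group(2))
    return arrays


def degraded_arrays(arrays):
    """Names of arrays with fewer active members than expected."""
    return [
        name for name, a in arrays.items()
        if a["expected"] is not None and a["active"] < a["expected"]
    ]


arrays = parse_mdstat(MDSTAT)
print(degraded_arrays(arrays))   # ['md0']
print(arrays["md0"]["failed"])   # ['sdc1']
```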

Event Timeline

Restricted Application added a subscriber: Aklapper. Apr 18 2017, 11:01 PM
Dzahn added a subscriber: Dzahn. Apr 18 2017, 11:07 PM

depooled - 16:07 <+logmsgbot> !log dzahn@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1018.eqiad.wmnet

16:05 < icinga-wm> PROBLEM - cassandra-b service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
16:05 < icinga-wm> PROBLEM - cassandra-c service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
16:05 < icinga-wm> PROBLEM - cassandra-a SSL 10.64.48.98:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
16:05 < icinga-wm> PROBLEM - cassandra-a service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
...

exactly half an hour later
...

16:35 < icinga-wm> RECOVERY - Check systemd state on restbase1018 is OK: OK - running: The system is fully operational
16:35 < icinga-wm> RECOVERY - cassandra-b service on restbase1018 is OK: OK - cassandra-b is active
16:35 < icinga-wm> RECOVERY - cassandra-c service on restbase1018 is OK: OK - cassandra-c is active
16:35 < mutante> oh really
16:36 < mutante> self-healing is always appreciated

Eevans added a subscriber: Eevans. Edited Apr 19 2017, 12:45 AM

According to mdadm, only /dev/md0 (mounted at /) is degraded, but /dev/md2 (aka /srv) is inaccessible as well; I think /dev/sdc has failed. What is the ETA for replacement?

All Cassandra instances on this host have been decommissioned; it can be taken down for repair at any time, without any coordination from Services.
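
To illustrate why /dev/md2 can be inaccessible even though only /dev/md0 is reported as degraded: every array on this host has a partition on each of the four disks, and RAID0 has no redundancy, so losing one disk takes the whole striped array with it. The Python sketch below uses the array membership from the mdstat snapshot in the task description; the helper names (`parent_disk`, `arrays_on_disk`, `at_risk`) are hypothetical, and the partition-to-disk rule assumes plain sdXN naming.

```python
import re

# Array membership copied from the mdstat snapshot in the task description.
ARRAYS = {
    "md0": {"level": "raid1", "members": ["sda1", "sdd1", "sdc1", "sdb1"]},
    "md1": {"level": "raid1", "members": ["sda2", "sdd2", "sdc2", "sdb2"]},
    "md2": {"level": "raid0", "members": ["sda3", "sdd3", "sdc3", "sdb3"]},
}


def parent_disk(partition):
    """Strip the trailing partition number, e.g. sdc1 -> sdc.

    Assumes sdXN naming; NVMe-style names (nvme0n1p1) would need another rule.
    """
    return re.sub(r"\d+$", "", partition)


def arrays_on_disk(disk, arrays):
    """Map array name -> RAID level for every array with a partition on `disk`."""
    return {
        name: info["level"]
        for name, info in arrays.items()
        if any(parent_disk(m) == disk for m in info["members"])
    }


def at_risk(disk, arrays):
    """Arrays without redundancy (RAID0) that include a partition of `disk`."""
    return sorted(
        name for name, level in arrays_on_disk(disk, arrays).items()
        if level == "raid0"
    )


failed_disk = parent_disk("sdc1")                    # member flagged (F) in md0
print(failed_disk)                                   # sdc
print(sorted(arrays_on_disk(failed_disk, ARRAYS)))   # ['md0', 'md1', 'md2']
print(at_risk(failed_disk, ARRAYS))                  # ['md2']
```

The RAID1 arrays can keep running on the three remaining mirrors, whereas md2 cannot survive the loss of any single member.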

If I recall, these have special SSDs in them, correct?

Not this one, no; Model: Intel SSDSC2BX016T4R

Eevans added a comment. Edited Apr 24 2017, 3:22 PM

I realize things are probably hectic there with the switchover, and this isn't currently causing any issues, but do we have an ETA on this? At a minimum, can you tell me if we can expect to have this host up prior to switching back, with enough time to reprovision the instances (approximately 2 working days)?

A replacement disk has been ordered from Dell.

Create Service Request: Service Tag 753NMD2

Confirmed: Request 947500398 was successfully submitted.

The part has been dispatched.

Dispatch Reference Number #325751063
Scheduled to arrive: 4/26/2017

Incoming ticket opened with Equinix. Order Number: 1-101081657220

The SSD has been received and swapped. @Marostegui or someone else, please fix the RAID config and resolve. Thanks

Return shipping number is USPS 9202 3946 5301 2435 4073 66
FEDEX (9611918) 2393026 72157380

Eevans triaged this task as Medium priority. May 1 2017, 3:52 PM

Ping?

The disk has been replaced; can someone rebuild the RAID, please?

The SSD has been received and swapped. @Marostegui or someone else, please fix the RAID config and resolve. Thanks

I would prefer it if someone more familiar with this host touched the RAID instead of me :-)

Mentioned in SAL (#wikimedia-operations) [2017-05-03T07:42:32Z] <_joe_> rebuilding RAIDs on restbase1018 T163280

Mentioned in SAL (#wikimedia-operations) [2017-05-03T08:24:39Z] <_joe_> deactivating restbase1018-vg for RAID failover and rebuild T163280

Mentioned in SAL (#wikimedia-operations) [2017-05-03T08:53:08Z] <_joe_> rebooting restbase1018 T163280