Degraded RAID on ms-be1012
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host ms-be1012. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md1 : active raid1 sdn2[1](F) sdm2[0]
      976320 blocks super 1.2 [2/1] [U_]
      
md0 : active raid1 sdm1[0] sdn1[1](F)
      58559360 blocks super 1.2 [2/1] [U_]
      
unused devices: <none>
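The snapshot above can be read mechanically. What follows is a minimal sketch, not the actual Nagios/Icinga event handler: the function name `degraded_arrays` and the `SNAPSHOT` constant are hypothetical, and the parsing assumes the mdstat layout shown above (an `mdN :` header line followed by a status line containing `[expected/active]` and a `[U_]` map).

```python
import re


def degraded_arrays(mdstat_text):
    """Return {array_name: [members flagged (F)]} for every degraded md array."""
    result = {}
    current = None
    failed = []
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:\s*(.*)$", line)
        if header:
            current = header.group(1)
            # Members followed by (F) have been marked faulty by the kernel.
            failed = re.findall(r"(\w+)\[\d+\]\(F\)", header.group(2))
            continue
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if current and status:
            expected, active = int(status.group(1)), int(status.group(2))
            # "_" in the map marks a slot with no working device.
            if active < expected or "_" in status.group(3):
                result[current] = failed
            current = None
    return result


# The snapshot attached to this task.
SNAPSHOT = """\
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sdn2[1](F) sdm2[0]
      976320 blocks super 1.2 [2/1] [U_]

md0 : active raid1 sdm1[0] sdn1[1](F)
      58559360 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""

print(degraded_arrays(SNAPSHOT))  # {'md1': ['sdn2'], 'md0': ['sdn1']}
```

Both arrays report `[2/1]`, and in both the failed member is a partition of `sdn`, which points at a single failed disk rather than two independent faults.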
Restricted Application added a subscriber: Aklapper. Feb 5 2017, 10:59 PM
Volans added a subscriber: Volans. Feb 5 2017, 11:42 PM

Puppet is failing too; I've ack'ed the alarm:

Error: xfs_admin -L swift-sdn3 /dev/sdn3 returned 1 instead of one of [0]
Error: /Stage[main]/Role::Swift::Storage/Swift::Label_filesystem[/dev/sdn3]/Exec[xfs_label-/dev/sdn3]/returns: change from notrun to 0 failed: xfs_admin -L swift-sdn3 /dev/sdn3 returned 1 instead of one of [0]

Mentioned in SAL (#wikimedia-operations) [2017-02-07T05:41:34Z] <volans> ms-be1012 running out of space on /, manually compressed /var/log/swift/server.log.1 and cleaned up apt cache T157237

Volans added a comment. Feb 7 2017, 5:43 AM

@fgiunchedi swift is logging ~1GB/hour, so / will be full again in ~15h. Could you take a look at it today, please?
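The estimate in the comment above is a simple division of free space by growth rate. A minimal sketch follows; the function name `hours_until_full` is hypothetical, and the 15 GB free-space figure is only inferred from the numbers in the comment (~1GB/hour for ~15h), not measured on the host.

```python
def hours_until_full(free_gb, growth_gb_per_hour):
    """Estimate the hours left before a filesystem fills at a steady growth rate."""
    if growth_gb_per_hour <= 0:
        raise ValueError("growth rate must be positive")
    return free_gb / growth_gb_per_hour


# Assumed inputs: ~15 GB free (inferred), ~1 GB/hour of log growth (from the comment).
print(hours_until_full(15.0, 1.0))  # 15.0
```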

Mentioned in SAL (#wikimedia-operations) [2017-02-07T11:53:12Z] <godog> stop puppet on ms-be1012 and change rsyslog to avoid local syslog spam - T157237

I am assuming this is one of the SSDs: when I pull the PD list with megacli, an SSD is missing. Please confirm. The system is out of warranty, but we have spares on-site.

Correct, this is one of the SSDs.

faidon assigned this task to Cmjohnson. Feb 24 2017, 5:38 PM

Change 340142 had a related patch set uploaded (by Filippo Giunchedi):
swift: ignore spammy 507s from container-server

https://gerrit.wikimedia.org/r/340142

The SSD has been swapped. It will need to be added back to the RAID config.

Cmjohnson closed this task as "Resolved". Mar 3 2017, 6:07 PM

RAID is back to normal. Resolving this task.

RECOVERY - MD RAID on ms-be1012 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0

Change 340142 merged by Filippo Giunchedi:
[operations/puppet] swift: ignore spammy 507s from container-server

https://gerrit.wikimedia.org/r/340142