Degraded RAID on ms-be1012
Closed, Resolved · Public
Assigned To

Authored By

	ops-monitoring-bot
	Feb 5 2017, 10:59 PM

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host ms-be1012. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md1 : active raid1 sdn2[1](F) sdm2[0]
      976320 blocks super 1.2 [2/1] [U_]
      
md0 : active raid1 sdm1[0] sdn1[1](F)
      58559360 blocks super 1.2 [2/1] [U_]
      
unused devices: <none>
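
For readers unfamiliar with the mdstat format, the snapshot above can be read programmatically. The following is a minimal sketch, not the actual Nagios/Icinga event handler: the function name `parse_mdstat` and the regular expressions are illustrative assumptions, and the parser handles only the simple layout shown in this task (a member flagged `(F)` has failed, and `[2/1]` means 2 expected members with 1 active).

```python
import re

# Snapshot copied from this task's description.
MDSTAT_SNAPSHOT = """\
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sdn2[1](F) sdm2[0]
      976320 blocks super 1.2 [2/1] [U_]

md0 : active raid1 sdm1[0] sdn1[1](F)
      58559360 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""

# Hypothetical patterns covering only the layout seen above.
ARRAY_RE = re.compile(r"^(md\d+) : (\S+) (\S+) (.*)$")
MEMBER_RE = re.compile(r"(\w+)\[\d+\]((?:\([A-Z]\))*)")
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")


def parse_mdstat(text):
    """Return {array_name: info} with failed members and degraded state."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            name, state, level, members = header.groups()
            # Members carrying the (F) flag have been marked faulty.
            failed = [dev for dev, flags in MEMBER_RE.findall(members)
                      if "(F)" in flags]
            current = {"state": state, "level": level,
                       "failed": failed, "degraded": bool(failed)}
            arrays[name] = current
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            expected, active, _ = status.groups()
            current["expected"] = int(expected)
            current["active"] = int(active)
            # Fewer active members than expected also means degraded.
            if int(active) < int(expected):
                current["degraded"] = True
    return arrays


if __name__ == "__main__":
    for name, info in sorted(parse_mdstat(MDSTAT_SNAPSHOT).items()):
        print(name, "degraded" if info["degraded"] else "ok", info["failed"])
```

Applied to this snapshot, both md0 and md1 are reported as degraded, with the failed members on sdn (sdn1 and sdn2 respectively).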

Details

	Subject	Repo	Branch	Lines +/-
	swift: ignore spammy 507s from container-server	operations/puppet	production	+6 -0

Related Objects

Duplicates Merged Here: T159540: Degraded RAID on ms-be1012

Event Timeline

ops-monitoring-bot added projects: SRE, ops-eqiad. Feb 5 2017, 10:59 PM

ops-monitoring-bot subscribed.

Restricted Application added a subscriber: Aklapper. Feb 5 2017, 10:59 PM

Volans added a subscriber: fgiunchedi. Feb 5 2017, 11:01 PM

Puppet is failing too; I've ack'ed the alarm:

Error: xfs_admin -L swift-sdn3 /dev/sdn3 returned 1 instead of one of [0]
Error: /Stage[main]/Role::Swift::Storage/Swift::Label_filesystem[/dev/sdn3]/Exec[xfs_label-/dev/sdn3]/returns: change from notrun to 0 failed: xfs_admin -L swift-sdn3 /dev/sdn3 returned 1 instead of one of [0]

Mentioned in SAL (#wikimedia-operations) [2017-02-07T05:41:34Z] <volans> ms-be1012 running out of space on /, manually compressed /var/log/swift/server.log.1 and cleaned up apt cache T157237

@fgiunchedi swift is logging ~1 GB/hour, so / will be full again in ~15h. Could you take a look at it today please?
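
The estimate above is simple arithmetic: free space divided by growth rate. A minimal sketch follows; the function name `hours_until_full` is hypothetical, and the ~15 GB of free space is an assumption inferred from the quoted rate and ETA, not a figure stated in this task.

```python
def hours_until_full(free_gb, growth_gb_per_hour):
    """Estimate hours until a filesystem fills at a constant growth rate."""
    if growth_gb_per_hour <= 0:
        raise ValueError("growth rate must be positive")
    return free_gb / growth_gb_per_hour


if __name__ == "__main__":
    # Assumed inputs: ~15 GB free on /, logs growing ~1 GB/hour.
    print(hours_until_full(free_gb=15, growth_gb_per_hour=1))
```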

Mentioned in SAL (#wikimedia-operations) [2017-02-07T11:53:12Z] <godog> stop puppet on ms-be1012 and change rsyslog to avoid local syslog spam - T157237

I am assuming this is one of the SSDs: when I pull the PD list with megacli, an SSD is missing. Please confirm. The system is out of warranty, but we have spares on-site.

In T157237#3005905, @Cmjohnson wrote:

I am assuming this is one of the SSDs: when I pull the PD list with megacli, an SSD is missing. Please confirm. The system is out of warranty, but we have spares on-site.

Correct, this is one of the SSDs

faidon assigned this task to Cmjohnson. Feb 24 2017, 5:38 PM

Change 340142 had a related patch set uploaded (by Filippo Giunchedi):
swift: ignore spammy 507s from container-server

https://gerrit.wikimedia.org/r/340142

gerritbot added a project: Patch-For-Review. Feb 27 2017, 4:42 PM

The SSD has been swapped; it will need to be added back to the RAID config.

fgiunchedi merged a task: T159540: Degraded RAID on ms-be1012. Mar 3 2017, 4:04 PM

RAID is back to normal. Resolving this task.

RECOVERY - MD RAID on ms-be1012 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0

Change 340142 merged by Filippo Giunchedi:
[operations/puppet] swift: ignore spammy 507s from container-server

https://gerrit.wikimedia.org/r/340142