
Degraded RAID on elastic1039
Closed, Resolved (Public)

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host elastic1039. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid1] [raid0] [linear] [multipath] [raid6] [raid5] [raid4] [raid10] 
md1 : active raid0 sda2[0] sdb2[1]
      1503967232 blocks super 1.2 512k chunks
      
md0 : active raid1 sda1[0](F) sdb1[1]
      29279232 blocks super 1.2 [2/1] [_U]
      
unused devices: <none>
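
As a rough illustration of how the snapshot above can be read programmatically, the following sketch parses the mdstat text and flags degraded arrays. It is not the actual `get-raid-status-md` plugin, whose implementation is not shown in this task. The function name `parse_mdstat` and the regular expressions are assumptions made for this example, and they only handle simple `active` arrays like the two listed here.

```python
import re

# Snapshot copied from the task description above.
MDSTAT = """\
Personalities : [raid1] [raid0] [linear] [multipath] [raid6] [raid5] [raid4] [raid10]
md1 : active raid0 sda2[0] sdb2[1]
      1503967232 blocks super 1.2 512k chunks

md0 : active raid1 sda1[0](F) sdb1[1]
      29279232 blocks super 1.2 [2/1] [_U]

unused devices: <none>
"""

# Hypothetical patterns for this sketch; real mdstat output has more variants
# (inactive arrays, resync progress lines, spares) that are not handled here.
ARRAY_RE = re.compile(r"^(md\d+) : (\w+) (\w+) (.+)$")
MEMBER_RE = re.compile(r"(\w+)\[\d+\](\([A-Z]\))?")
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")


def parse_mdstat(text):
    """Return a dict of array name -> level, members, failed members, degraded flag."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            name, state, level, member_text = header.groups()
            members = MEMBER_RE.findall(member_text)
            current = {
                "state": state,
                "level": level,
                "members": [dev for dev, _flag in members],
                # "(F)" marks a member the kernel has flagged as faulty.
                "failed": [dev for dev, flag in members if flag == "(F)"],
                "degraded": False,
            }
            arrays[name] = current
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            # In the [U_] map, "_" marks a slot with no working member.
            current["degraded"] = "_" in status.group(3)
    return arrays


if __name__ == "__main__":
    for name, info in sorted(parse_mdstat(MDSTAT).items()):
        print(name, info["level"], "degraded" if info["degraded"] else "ok", info["failed"])
```

For this snapshot the sketch reports md0 (raid1) as degraded with `sda1` failed. md1 (raid0) shows no failed member, although it also has a partition on the same disk.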

Event Timeline

This is causing the mjolnir deploy directory to become unavailable/missing and is also causing Puppet to fail.

This server is scheduled to be replaced, let's not fix anything.

wiki_willy subscribed.

Per my conversation with Guillaume, this system will be decommissioned, so assigning it to @Gehel for now.

> This server is scheduled to be replaced, let's not fix anything.

Oops, we're actually only replacing elastic1017-1031, so we'll need to keep this one alive for a bit longer.

@wiki_willy: can you order a new disk?

wiki_willy added a subtask: Unknown Object (Task). Oct 28 2019, 6:51 PM

Thanks @MoritzMuehlenhoff - no worries though, since this task looks like it was autogenerated. (I'll have to talk to Ricardo about how we can modify the autogenerated ones.) @Gehel - child task T236725 created to order the replacement disk for the out-of-warranty system. Thanks, Willy

Gehel triaged this task as High priority. Oct 30 2019, 3:19 PM
Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.

Mentioned in SAL (#wikimedia-operations) [2019-10-30T15:23:43Z] <gehel> shutting down elastic1039 to be ready for disk swap - T236601

Cmjohnson closed subtask Unknown Object (Task) as Resolved. Nov 8 2019, 5:29 PM
Cmjohnson subscribed.

@Jclark-ctr I have the disk and will leave it in the data center for you to replace next week.

@Gehel The disk is here. Can this be done anytime, or does it need to be coordinated?

@Cmjohnson the server is already shut down, so do whatever you want, whenever you want! RAID0, so it will require a reimage once the new disk is in place, but I can do that anytime.
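
The reasoning behind "RAID0, so it will require a reimage" can be sketched as follows: the failed disk `sda` backs both md0 (raid1, which keeps running on `sdb1`) and md1 (raid0, which has no redundancy and is lost once any member is removed). The array layout below is copied from the RAID snapshot in the description. The helper names `parent_disk` and `arrays_lost_with` are hypothetical, and the set of redundant levels is a simplification that only considers the loss of a single disk.

```python
import re

# md levels that can survive the loss of one member (simplified assumption).
REDUNDANT_LEVELS = {"raid1", "raid4", "raid5", "raid6", "raid10"}

# Layout taken from the RAID snapshot in the task description.
ARRAYS = {
    "md0": {"level": "raid1", "members": ["sda1", "sdb1"]},
    "md1": {"level": "raid0", "members": ["sda2", "sdb2"]},
}


def parent_disk(partition):
    """Map a partition such as 'sda2' to its disk 'sda' (assumes sdX-style names)."""
    return re.sub(r"\d+$", "", partition)


def arrays_lost_with(disk, arrays):
    """List arrays that cannot survive removal of `disk` because they lack redundancy."""
    return sorted(
        name
        for name, info in arrays.items()
        if info["level"] not in REDUNDANT_LEVELS
        and any(parent_disk(part) == disk for part in info["members"])
    )


if __name__ == "__main__":
    print(arrays_lost_with("sda", ARRAYS))
```

Under these assumptions, swapping `sda` leaves md0 recoverable by resyncing from `sdb1`, while md1 has to be recreated, which is why a reimage is planned.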

Resolving this task for the failed RAID. @Gehel, you may want to create a new one for the re-image.