Degraded RAID on elastic1039
Closed, Resolved (Public)


TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host elastic1039. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid1] [raid0] [linear] [multipath] [raid6] [raid5] [raid4] [raid10] 
md1 : active raid0 sda2[0] sdb2[1]
      1503967232 blocks super 1.2 512k chunks
md0 : active raid1 sda1[0](F) sdb1[1]
      29279232 blocks super 1.2 [2/1] [_U]
unused devices: <none>
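
For readers unfamiliar with the mdstat format, the snapshot above can be read mechanically. The following is a minimal sketch, not the actual `get-raid-status-md` plugin (whose implementation is not shown in this task). The function names `parse_mdstat`, `disk_of` and `raid0_on_failed_disks` are hypothetical, and the partition-to-disk mapping assumes `sdXN`-style device names. It flags arrays with a member marked `(F)` or with fewer active than expected devices, then lists raid0 arrays that share a physical disk with a failed member, which is why a reimage comes up later in this task.

```python
import re

# Snapshot copied from the task description (trailing whitespace trimmed).
MDSTAT = """\
Personalities : [raid1] [raid0] [linear] [multipath] [raid6] [raid5] [raid4] [raid10]
md1 : active raid0 sda2[0] sdb2[1]
      1503967232 blocks super 1.2 512k chunks
md0 : active raid1 sda1[0](F) sdb1[1]
      29279232 blocks super 1.2 [2/1] [_U]
unused devices: <none>
"""

# "md0 : active raid1 sda1[0](F) sdb1[1]" -> name, state, level, member list
ARRAY_RE = re.compile(r"^(md\d+) : (\w+) (\w+) (.*)$")
# "sda1[0](F)" -> ("sda1", "F"); "sdb1[1]" -> ("sdb1", "")
MEMBER_RE = re.compile(r"(\w+)\[\d+\](?:\((\w)\))?")
# "[2/1] [_U]" -> expected devices, active devices, per-slot map
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")


def parse_mdstat(text):
    """Return {array_name: info} for each md array found in mdstat text."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            name, state, level, member_text = header.groups()
            members = MEMBER_RE.findall(member_text)
            failed = [dev for dev, flag in members if flag == "F"]
            current = {
                "state": state,
                "level": level,
                "members": [dev for dev, _ in members],
                "failed": failed,
                "degraded": bool(failed),
            }
            arrays[name] = current
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            expected, active, _slots = status.groups()
            if int(active) < int(expected):
                current["degraded"] = True
    return arrays


def disk_of(partition):
    """Map a partition to its disk. Assumes sdXN-style names only."""
    return re.sub(r"\d+$", "", partition)


def raid0_on_failed_disks(arrays):
    """List raid0 arrays that use a disk hosting a failed member elsewhere."""
    bad_disks = {disk_of(p) for a in arrays.values() for p in a["failed"]}
    return sorted(
        name
        for name, a in arrays.items()
        if a["level"] == "raid0"
        and any(disk_of(p) in bad_disks for p in a["members"])
    )


if __name__ == "__main__":
    arrays = parse_mdstat(MDSTAT)
    for name, info in sorted(arrays.items()):
        print(name, info["level"], "degraded" if info["degraded"] else "ok", info["failed"])
    print("raid0 arrays on a failed disk:", raid0_on_failed_disks(arrays))
```

On this snapshot the sketch reports md0 (raid1) as degraded with sda1 failed, and md1 (raid0) as sharing the sda disk. raid0 has no redundancy, so swapping sda loses md1 even though mdstat still lists it as active.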

Event Timeline

This is causing the mjolnir deploy directory to become unavailable/missing, and is also causing puppet to fail.

Gehel added a comment.Oct 28 2019, 5:58 PM

This server is scheduled to be replaced, let's not fix anything.

wiki_willy assigned this task to Gehel.Oct 28 2019, 5:58 PM
wiki_willy added a subscriber: wiki_willy.

Per my conversation with Guillaume, this system will be decommissioned, so assigning it to @Gehel for now.

Gehel added a comment.Oct 28 2019, 6:45 PM

This server is scheduled to be replaced, let's not fix anything.

Oops, we're actually only replacing elastic1017-1031, so we'll need to keep this one alive for a bit longer.

@wiki_willy : can you order a new disk?

@Gehel : See the SRE meeting doc from today, there's now a new form for these requests:

wiki_willy added a subtask: Unknown Object (Task).Oct 28 2019, 6:51 PM

Thanks @MoritzMuehlenhoff - no worries though, since this task looks like it was autogenerated. (I'll have to talk to Ricardo about how we can modify the autogenerated ones.) @Gehel - child task T236725 created to order the replacement disk for the out-of-warranty system. Thanks, Willy

wiki_willy reassigned this task from Gehel to Jclark-ctr.Oct 28 2019, 6:54 PM
Gehel triaged this task as High priority.Oct 30 2019, 3:19 PM
Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.

Mentioned in SAL (#wikimedia-operations) [2019-10-30T15:23:43Z] <gehel> shutting down elastic1039 to be ready for disk swap - T236601

Cmjohnson closed subtask Unknown Object (Task) as Resolved.Nov 8 2019, 5:29 PM
Cmjohnson added a subscriber: Cmjohnson.

@Jclark-ctr I have the disk and will leave it in the data center for you to replace next week.

@Gehel The disk is here. Can this be done anytime, or does it need to be coordinated?

Gehel added a comment.Nov 13 2019, 4:11 PM

@Cmjohnson the server is already shut down, do whatever you want, whenever you want! The data array is RAID0, so it will require a reimage once the new disk is in place, but I can do that anytime.

Cmjohnson closed this task as Resolved.Nov 25 2019, 1:25 PM

Resolving this task for the failed RAID. @Gehel, you may want to create a new one for the reimage.