Degraded RAID on elastic1039
Closed, Resolved, Public

Assigned To

Authored By

	ops-monitoring-bot
	Oct 26 2019, 8:12 PM

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host elastic1039. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid1] [raid0] [linear] [multipath] [raid6] [raid5] [raid4] [raid10] 
md1 : active raid0 sda2[0] sdb2[1]
      1503967232 blocks super 1.2 512k chunks
      
md0 : active raid1 sda1[0](F) sdb1[1]
      29279232 blocks super 1.2 [2/1] [_U]
      
unused devices: <none>
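To pull the failed member out of a snapshot like the one above, a minimal Python sketch follows. It is not the logic of the `get-raid-status-md` plugin, whose source does not appear in this task. `parse_mdstat` and `failed_members` are hypothetical names, and the regular expressions assume the simple `mdN : active <level> <members>` layout seen in this snapshot (other states, such as inactive arrays, are not handled).

```python
import re

# Snapshot copied from the task description above.
MDSTAT = """\
Personalities : [raid1] [raid0] [linear] [multipath] [raid6] [raid5] [raid4] [raid10]
md1 : active raid0 sda2[0] sdb2[1]
      1503967232 blocks super 1.2 512k chunks

md0 : active raid1 sda1[0](F) sdb1[1]
      29279232 blocks super 1.2 [2/1] [_U]

unused devices: <none>
"""

# Assumed line shape: "<name> : <state> <level> <member list>".
ARRAY_RE = re.compile(r"^(md\d+) : (\S+) (\S+) (.*)$")
# A member looks like "sda1[0]" optionally followed by flags such as "(F)".
MEMBER_RE = re.compile(r"(\w+)\[\d+\]((?:\(\w\))*)")


def parse_mdstat(text):
    """Return {array: {"state", "level", "members": {device: is_failed}}}."""
    arrays = {}
    for line in text.splitlines():
        match = ARRAY_RE.match(line)
        if not match:
            continue  # personalities, block counts, blank lines
        name, state, level, rest = match.groups()
        members = {dev: "(F)" in flags for dev, flags in MEMBER_RE.findall(rest)}
        arrays[name] = {"state": state, "level": level, "members": members}
    return arrays


def failed_members(arrays):
    """Return {array: [failed devices]} for arrays with at least one failure."""
    return {
        name: sorted(dev for dev, failed in info["members"].items() if failed)
        for name, info in arrays.items()
        if any(info["members"].values())
    }


if __name__ == "__main__":
    print(failed_members(parse_mdstat(MDSTAT)))
```

For this snapshot the sketch reports `sda1` as the only failed member, in `md0`, which matches the `(F)` flag and the `[2/1] [_U]` status line.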

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Jclark-ctr	T236601 Degraded RAID on elastic1039
					Unknown Object (Task)

Event Timeline

ops-monitoring-bot added projects: SRE, ops-eqiad. Oct 26 2019, 8:12 PM

ops-monitoring-bot subscribed.

Peachey88 added a project: Discovery-Search. Oct 27 2019, 2:13 AM

This is causing the mjolnir deploy directory to become unavailable/missing and is also causing Puppet to fail.

Mathew.onipe added a subscriber: Gehel. Oct 27 2019, 6:53 PM

This server is scheduled to be replaced, let's not fix anything.

Per my conversation with Guillaume, this system will be decommissioned, so assigning it to @Gehel for now.

In T236601#5612513, @Gehel wrote:

This server is scheduled to be replaced, let's not fix anything.

Oops, we're actually only replacing elastic1017-1031, so we'll need to keep this one alive for a bit longer.

@wiki_willy : can you order a new disk?

@Gehel : See the SRE meeting doc from today, there's now a new form for these requests: https://phabricator.wikimedia.org/maniphest/task/edit/form/55/

wiki_willy added a subtask: Unknown Object (Task). Oct 28 2019, 6:51 PM

Thanks @MoritzMuehlenhoff - no worries though, since this task looks like it was autogenerated. (I'll have to talk to Ricardo about how we can modify the autogenerated ones.) @Gehel - child task T236725 created to order the replacement disk for the out-of-warranty system. Thanks, Willy

wiki_willy reassigned this task from Gehel to Jclark-ctr. Oct 28 2019, 6:54 PM

Gehel triaged this task as High priority. Oct 30 2019, 3:19 PM

Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.

Gehel moved this task from Incoming to Blocked/Waiting on the Discovery-Search (Current work) board. Oct 30 2019, 3:22 PM

Mentioned in SAL (#wikimedia-operations) [2019-10-30T15:23:43Z] <gehel> shutting down elastic1039 to be ready for disk swap - T236601

@Jclark-ctr I have the disk and will leave it in the data center for you to replace next week.

Cmjohnson moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board. Nov 13 2019, 3:53 PM

@Gehel The disk is here. Can this be done anytime, or does it need to be coordinated?

@Cmjohnson The server is already shut down, so do whatever you want, whenever you want! It's RAID0, so it will require a reimage once the new disk is in place, but I can do that anytime.
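The reimage requirement follows from the array layout in the task description: `md1` is a RAID0 stripe across `sda2` and `sdb2`, so swapping the physical disk `sda` destroys it, while the RAID1 `md0` can be rebuilt from `sdb1`. A minimal Python sketch of that reasoning is below. The `ARRAYS` table is transcribed from the snapshot; `parent_disk` and `impact_of_disk_swap` are hypothetical helper names, and the partition-to-disk mapping assumes plain `sdXN` device naming.

```python
import re

# Array layout transcribed from the mdstat snapshot in the task description.
ARRAYS = {
    "md0": {"level": "raid1", "members": ["sda1", "sdb1"]},
    "md1": {"level": "raid0", "members": ["sda2", "sdb2"]},
}

# md levels that keep no redundant copy: losing any member loses the array.
NON_REDUNDANT = {"raid0", "linear"}


def parent_disk(partition):
    """Map a partition such as 'sda2' to its disk 'sda' (assumes sdXN naming)."""
    return re.sub(r"\d+$", "", partition)


def impact_of_disk_swap(arrays, disk):
    """Classify each array that has a member on `disk` as 'lost' or 'rebuildable'."""
    impact = {}
    for name, info in arrays.items():
        if any(parent_disk(part) == disk for part in info["members"]):
            lost = info["level"] in NON_REDUNDANT
            impact[name] = "lost" if lost else "rebuildable"
    return impact


if __name__ == "__main__":
    print(impact_of_disk_swap(ARRAYS, "sda"))
```

For `sda`, the sketch classifies `md0` as rebuildable and `md1` as lost, which is why a reimage is needed rather than a simple resync.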

@Gehel replaced drive

Resolving this task for the failed raid, @Gehel you may want to create a new one for the re-image.

Gehel mentioned this in T239116: reimage elastic1039 now that disk has been replaced. Nov 25 2019, 2:59 PM
