Degraded RAID on cloudvirt1014
Closed, Resolved
Public

Assigned To

Authored By

	ops-monitoring-bot
	Dec 27 2019, 5:32 PM

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (hpssacli) was detected on host cloudvirt1014. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:1:5, 2I:1:6 - Controller: OK - Battery count: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-hpssacli

Smart Array P440ar in Slot 0 (Embedded)

   array A

      Logical Drive: 1
         Size: 2.9 TB
         Fault Tolerance: 1+0
         Strip Size: 256 KB
         Full Stripe Size: 512 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Disabled
         Disk Name: /dev/sda 
         Mount Points: / 85.7 GB Partition Number 2
         OS Status: LOCKED
         Mirror Group 1:
            physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:2 (port 1I:box 1:bay 2, Solid State SATA, 1600.3 GB, OK)
         Mirror Group 2:
            physicaldrive 1I:1:3 (port 1I:box 1:bay 3, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:4 (port 1I:box 1:bay 4, Solid State SATA, 1600.3 GB, OK)
         Drive Type: Data
         LD Acceleration Method: HPE SSD Smart Path
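The snapshot above can be checked mechanically rather than by eye. The following is a minimal sketch, assuming the report keeps the `physicaldrive <id> (..., <status>)` layout shown in this task; the function names `drive_statuses` and `failed_drives` are hypothetical helpers, not part of the Icinga plugin.

```python
import re

# Matches lines such as:
#   physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SATA, 1600.3 GB, OK)
DRIVE_RE = re.compile(r"physicaldrive\s+(\S+)\s+\((.*)\)")


def drive_statuses(report):
    """Map each physical drive ID to the last comma-separated field of its line."""
    statuses = {}
    for line in report.splitlines():
        match = DRIVE_RE.search(line)
        if match:
            drive_id, details = match.groups()
            # The status is assumed to be the final field inside the parentheses.
            statuses[drive_id] = details.rsplit(",", 1)[-1].strip()
    return statuses


def failed_drives(report):
    """Return the IDs of drives whose status is anything other than OK."""
    return [d for d, s in drive_statuses(report).items() if s != "OK"]
```

Run against the snapshot in this task, `failed_drives` would return an empty list, since all four listed drives report OK.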

Details

	Subject	Repo	Branch	Lines +/-
	nova: add cloudvirt1014 to scheduler pool	operations/puppet	production	+2 -1
	nova: depool cloudvirt1022 and cloudvirt1014	operations/puppet	production	+2 -4


Related Objects

Status	Assigned	Task
Resolved	Cmjohnson	T138509 rack/setup/install/deploy labvirt1012 labvirt1013 labvirt1014 nodes (cloudvirt1012 cloudvirt1013 cloudvirt1014)
Duplicate	None	T241492 cloudvirt1014 crash
Resolved	Jclark-ctr	T241494 Degraded RAID on cloudvirt1014

Event Timeline

ops-monitoring-bot added projects: SRE, ops-eqiad. Dec 27 2019, 5:32 PM

ops-monitoring-bot subscribed.

aborrero added a parent task: T241492: cloudvirt1014 crash. Dec 27 2019, 5:35 PM

This seems to refer to a missing battery. The disks seem to be OK, right?
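A quick way to confirm that reading is to split the check's one-line summary into its parts. This is only a sketch, assuming the summary keeps the `Slot N: OK: <drives> - Controller: <state> - Battery count: <n>` shape quoted in the description; `parse_raid_summary` is a hypothetical helper, not something the plugin provides.

```python
import re


def parse_raid_summary(line):
    """Extract drive IDs, controller state and battery count from the summary line."""
    # Drive IDs look like 1I:1:1 (port:box:bay).
    drives = re.findall(r"\b\d+I:\d+:\d+\b", line)
    controller = re.search(r"Controller:\s*(\w+)", line)
    battery = re.search(r"Battery count:\s*(\d+)", line)
    return {
        "drives": drives,
        "controller": controller.group(1) if controller else None,
        "battery_count": int(battery.group(1)) if battery else None,
    }


summary = (
    "CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:1:5, 2I:1:6"
    " - Controller: OK - Battery count: 0"
)
result = parse_raid_summary(summary)
```

For the line in this task, `result` shows six drives listed under OK, a controller state of OK, and a battery count of 0, which fits the missing-battery interpretation.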

wiki_willy moved this task from Backlog to Cloud Tasks on the ops-eqiad board. Dec 28 2019, 9:05 PM

herron triaged this task as High priority. Jan 3 2020, 7:44 PM

wiki_willy assigned this task to Jclark-ctr. Jan 6 2020, 4:48 PM

@Jclark-ctr I believe this server may need the BBU checked/replaced, but I may be wrong.

BTW, this server has active workloads (it is pooled) at the moment. @Jclark-ctr, please coordinate with WMCS before shutting the server down.

aborrero added a project: cloud-services-team (Kanban). Jan 24 2020, 1:14 PM

aborrero moved this task from Inbox to Watching on the cloud-services-team (Kanban) board.

aborrero edited projects, added cloud-services-team (Hardware); removed cloud-services-team (Kanban).

aborrero moved this task from Backlog to Hardware faults on the cloud-services-team (Hardware) board.

@aborrero It is out of warranty, but I do have a spare BBU and can replace it. @JHedden and I spoke briefly about this one last night. I am on site now, and if you would like it changed today, I can do that. However, it is the Friday before All Hands, so I would prefer to change it on my return unless you think this is urgent.

Change 572072 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] nova: depool cloudvirt1022 and cloudvirt1014

https://gerrit.wikimedia.org/r/572072

Change 572072 merged by Andrew Bogott:
[operations/puppet@production] nova: depool cloudvirt1022 and cloudvirt1014

https://gerrit.wikimedia.org/r/572072

This host is now drained and ready for maintenance.

Change 575094 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] nova: add cloudvirt1014 to scheduler pool

https://gerrit.wikimedia.org/r/575094

Change 575094 merged by Jhedden:
[operations/puppet@production] nova: add cloudvirt1014 to scheduler pool

https://gerrit.wikimedia.org/r/575094