Degraded RAID on cloudcephosd1018
Closed, Resolved, Public

Assigned To

	• Cmjohnson

Authored By

	ops-monitoring-bot
	Jun 29 2021, 7:33 PM

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host cloudcephosd1018. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CHECK_NRPE: Error - Could not connect to 10.64.20.15. Check system logs on 10.64.20.15

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Failed to execute '['/usr/lib/nagios/plugins/check_nrpe', '-4', '-H', 'cloudcephosd1018', '-c', 'get_raid_status_md']': RETCODE: 2
STDOUT:
b'CHECK_NRPE: Error - Could not connect to 10.64.20.15: Connection reset by peer\n'
STDERR:
None
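The transcript above shows the status collection itself failing (NRPE could not connect) rather than the remote check reporting an md state. As a minimal sketch, assuming a handler wants to keep those two cases apart, the result could be classified as below. The function name `classify_nrpe_result` and the `transport-failure` label are hypothetical, this is not the actual `get-raid-status-md` code, and the string match relies only on the error message quoted in this task.

```python
# Standard Nagios plugin exit codes.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def classify_nrpe_result(retcode, stdout):
    """Separate NRPE transport failures from results produced by the remote check.

    Hypothetical helper: the message match is an assumption based on the
    output quoted in this task, not on a documented check_nrpe contract.
    """
    text = stdout.decode("utf-8", errors="replace").strip()
    if text.startswith("CHECK_NRPE:") and "Could not connect" in text:
        # The remote command never ran, so nothing is known about the RAID.
        return "transport-failure"
    return NAGIOS_STATES.get(retcode, "UNKNOWN")


result = classify_nrpe_result(
    2,
    b"CHECK_NRPE: Error - Could not connect to 10.64.20.15: Connection reset by peer\n",
)
print(result)  # prints transport-failure
```

With the values from this task (RETCODE 2 and the "Connection reset by peer" message), the sketch reports a transport failure, which says nothing about the RAID itself.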

Related Objects

Status	Assigned	Task
		Unknown Object (Task)
Resolved	dcaro	T285858 Install the new ceph osd machines cloudcephosd10(1[6-9]\|20) using cookbooks
Resolved	RobH	T274945 (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet
Resolved	• Cmjohnson	T285799 Degraded RAID on cloudcephosd1018

Event Timeline

ops-monitoring-bot created this task. Jun 29 2021, 7:33 PM

Peachey88 added a project: cloud-services-team. Jun 29 2021, 9:04 PM

Restricted Application edited projects, added cloud-services-team (Kanban); removed cloud-services-team. Jun 29 2021, 9:04 PM

wiki_willy assigned this task to • Cmjohnson. Jun 29 2021, 10:41 PM

dcaro subscribed. Jun 30 2021, 9:34 AM

dcaro added a parent task: T274945: (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet. Jun 30 2021, 10:15 AM

Added the relation to the other task to keep track, but please rearrange it to fit whatever workflow you prefer (just commenting, inverting the parent-child relation, opening another task, ...).

@Cmjohnson any updates on this?

As far as I can see, everything is OK on that machine:

root@cloudcephosd1018:~# cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sda2[0] sdb2[1]
      234005504 blocks super 1.2 [2/2] [UU]
      bitmap: 1/2 pages [4KB], 65536KB chunk

unused devices: <none>

The icinga link shows green too.
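In the mdstat output above, the `[2/2] [UU]` summary is what marks md0 as healthy: both expected members are active. A minimal Python sketch of the same check follows. The function name `degraded_arrays` is hypothetical, and the parsing assumes the usual `/proc/mdstat` layout shown in this comment; it is not the logic of the Icinga check itself.

```python
import re

# Array header line, e.g. "md0 : active raid1 sda2[0] sdb2[1]".
ARRAY_RE = re.compile(r"^(md\d+) :")
# Member summary, e.g. "[2/2] [UU]": expected/active counts, then one flag per member.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return the names of md arrays whose summary shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected, active, flags = status.groups()
            # "_" marks a missing or failed member; "U" marks one that is up.
            if int(active) < int(expected) or "_" in flags:
                degraded.append(current)
            current = None
    return degraded


# Output copied from this task.
SAMPLE = """\
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sda2[0] sdb2[1]
      234005504 blocks super 1.2 [2/2] [UU]
      bitmap: 1/2 pages [4KB], 65536KB chunk

unused devices: <none>
"""

print(degraded_arrays(SAMPLE))  # prints []
```

An array with a failed member would instead show something like `[2/1] [U_]` and would be returned by the function.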

Let me know if you need to do anything or if we can go on and use the server.

Thanks!

@Cmjohnson ping

@dcaro, sorry for the late response; I was out all month. No, there is nothing left to do: it appears to be working fine now. If it breaks again, please reopen the task.
