Test RAID monitoring on new RAID PERC 755 controllers
Closed, Resolved · Public

Assigned To

Authored By

	• Marostegui
	Dec 13 2022, 8:51 AM

Description

We are getting a new RAID controller, the PERC 755, which requires perccli64 instead of megacli as the tool for interacting with the RAID/BBU.
To make sure our automatic RAID monitoring and Phabricator task creation work correctly, we need to pull a disk from one of the new hosts and check that the alert and the task behave as expected.

We can use db1206 as a testing host in January.

Once we are ready to do this, we need to ping eqiad DCOps. I am not tagging them for now, until we are ready on our end to arrange a date and time.

Details

	Subject	Repo	Branch	Lines +/-
	raid_handler: Use universal_newlines	operations/puppet	production	+1 -1
	perccli: Print human-readable topology information on disk failure	operations/puppet	production	+11 -0


Related Objects

Status	Assigned	Task
		Unknown Object (Task)
Resolved	• Marostegui	T324181 Test new PERC 755 controller on DB hosts
Resolved	• MoritzMuehlenhoff	T325046 Test RAID monitoring on new RAID PERC 755 controllers
Declined	None	T327902 Degraded RAID on db1206
Invalid	None	T328135 Degraded RAID on db1206

Event Timeline

• Marostegui created this task. Dec 13 2022, 8:51 AM

Restricted Application added a subscriber: Aklapper. Dec 13 2022, 8:51 AM

Stalling until we are back from end of year holidays and production freeze.

• Marostegui moved this task from Blocked to In progress on the DBA board. Dec 14 2022, 7:36 AM

• Marostegui moved this task from In progress to Blocked on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2023-01-10T22:09:42Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1206 T325046', diff saved to https://phabricator.wikimedia.org/P42980 and previous config saved to /var/cache/conftool/dbconfig/20230110-220942-marostegui.json

@Jclark-ctr we want to test that the RAID monitoring works fine. Can you pull out a hard disk from db1206 and leave it out until we get the degraded RAID task created? (I can let you know when that happens).
The host is depooled, so you can go ahead and do it whenever you want.

Maintenance_bot added a project: SRE. Jan 10 2023, 10:29 PM

@Jclark-ctr could you provide a rough timeline on when we could expect this to happen? Thanks!

• Marostegui mentioned this in T327859: Switch s1 sanitarium master from db1206 to db1196. Jan 25 2023, 7:44 AM

Talked to John about it, we'll try to get it done this week :)

Pulled drive will advise when it can be reinserted

• Marostegui added a subtask: T327902: Degraded RAID on db1206. Jan 25 2023, 1:36 PM

@Volans @MoritzMuehlenhoff so the task about the degraded RAID gets created correctly (T327902). It would be nice to get the usual output where you get the disk that failed like in the previous tasks with the old controller.
Is that something we can fix to work with this new one?

• Marostegui closed subtask T327902: Degraded RAID on db1206 as Declined. Jan 25 2023, 1:38 PM

• Marostegui mentioned this in T327902: Degraded RAID on db1206.

In T325046#8557233, @Marostegui wrote:

@Volans @MoritzMuehlenhoff so the task about the degraded RAID gets created correctly (T327902). It would be nice to get the usual output where you get the disk that failed like in the previous tasks with the old controller.
Is that something we can fix to work with this new one?

Yeah, I'll look into that in the next days.

Thanks Moritz, do you need the disk to be left out?

We chatted on IRC and we are leaving the disk in a failed state for now until @MoritzMuehlenhoff is done with his tests.

In T325046#8557292, @Marostegui wrote:

Thanks Moritz, do you need the disk to be left out?

Yeah, let's keep it for a few days, so that I can use it to test the additional output for the alert.

Change 883600 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] perccli: Print human-readable topology information on disk failure

https://gerrit.wikimedia.org/r/883600

gerritbot added a project: Patch-For-Review. Jan 25 2023, 4:20 PM

Change 883600 merged by Muehlenhoff:

[operations/puppet@production] perccli: Print human-readable topology information on disk failure

https://gerrit.wikimedia.org/r/883600

Maintenance_bot removed a project: Patch-For-Review. Jan 26 2023, 11:31 AM

@Jclark-ctr can you add the disk back?

Drive 2 reinserted.

I see it rebuilding. I will ping you once the alert recovers so we can pull it out again:

perccli64 /c0 show rebuildrate
CLI Version = 007.1910.0000.0000 Oct 08, 2021
Operating system = Linux 5.10.0-19-amd64
Controller = 0
Status = Success
Description = None


Controller Properties :
=====================

------------------
Ctrl_Prop   Value
------------------
Rebuildrate 30%
------------------

I pasted the wrong command above:

root@db1206:~# perccli64 /c0/e252/s2 show rebuild
CLI Version = 007.1910.0000.0000 Oct 08, 2021
Operating system = Linux 5.10.0-19-amd64
Controller = 0
Status = Success
Description = Show Drive Rebuild Status Succeeded.


------------------------------------------------------
Drive-ID    Progress% Status      Estimated Time Left
------------------------------------------------------
/c0/e252/s2        77 In progress 11 Minutes
------------------------------------------------------
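For anyone who wants to track rebuild progress from a script rather than by eye, the table above can be parsed with a small helper. This is only a sketch: `parse_rebuild` and the regular expression are hypothetical and not part of perccli or our plugins, and they only match rows shaped like the one pasted above (other status strings were not checked).

```python
import re

# Table portion of the `perccli64 /c0/e252/s2 show rebuild` output pasted above.
SAMPLE = """\
------------------------------------------------------
Drive-ID    Progress% Status      Estimated Time Left
------------------------------------------------------
/c0/e252/s2        77 In progress 11 Minutes
------------------------------------------------------
"""

# Hypothetical pattern: drive path, integer progress, free-text status,
# then "<number> <unit>" as the estimated time left.
ROW = re.compile(r"^(/c\d+/e\d+/s\d+)\s+(\d+)\s+(.+?)\s+(\d+\s+\w+)\s*$")


def parse_rebuild(text):
    """Return one dict per drive row found in `show rebuild` output."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows.append(
                {
                    "drive": match.group(1),
                    "progress": int(match.group(2)),
                    "status": match.group(3),
                    "eta": match.group(4),
                }
            )
    return rows


if __name__ == "__main__":
    for row in parse_rebuild(SAMPLE):
        print(row["drive"], row["progress"], row["status"], row["eta"])
```

Separator and header lines are skipped simply because they do not match the pattern, so the helper can be fed the whole command output.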

RAID is now back in optimal status, waiting for Icinga to recover before pulling the disk out again

VD LIST :
=======

--------------------------------------------------------------
DG/VD TYPE   State Access Consist Cache Cac sCC     Size Name
--------------------------------------------------------------
0/239 RAID10 Optl  RW     Yes     RWBD  -   OFF 8.729 TB
--------------------------------------------------------------

root@db1206:~#  sudo /usr/local/lib/nagios/plugins/get-raid-status-perccli
communication: 0 OK | controller: 0 OK | physical_disk: 0 OK | virtual_disk: 0 OK | bbu: 0 OK | enclosure: 0 OK
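The one-line summary printed by the plugin is easy to check from a script as well. A minimal sketch, assuming the line always has the `name: value STATE | ...` shape shown above; `parse_status_line` is a hypothetical helper, and the meaning of the numeric value is not stated in this task, so it is kept as an opaque integer:

```python
# Summary line copied from the plugin output above.
SAMPLE = (
    "communication: 0 OK | controller: 0 OK | physical_disk: 0 OK | "
    "virtual_disk: 0 OK | bbu: 0 OK | enclosure: 0 OK"
)


def parse_status_line(line):
    """Split 'name: value STATE | ...' into {name: (value, state)}."""
    result = {}
    for part in line.split("|"):
        name, _, rest = part.strip().partition(":")
        value, state = rest.split()
        result[name] = (int(value), state)
    return result


def all_ok(line):
    """True when every component in the summary line reports OK."""
    return all(state == "OK" for _, state in parse_status_line(line).values())


if __name__ == "__main__":
    print(all_ok(SAMPLE))
```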

@Jclark-ctr whenever you can, pull the disk out again. Thank you

Drive pulled again

Thank you!

Drive has been reinserted.

The task gets generated fine, but it is still a bit unreadable, as shown in T328135.
Leaving this task open until @MoritzMuehlenhoff takes a look at that output.

• Marostegui added a subtask: T328135: Degraded RAID on db1206. Jan 30 2023, 6:58 AM

• Marostegui reassigned this task from Jclark-ctr to • MoritzMuehlenhoff. Feb 3 2023, 7:31 AM

Change 888212 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] raid_handler: Use universal_newlines

https://gerrit.wikimedia.org/r/888212

gerritbot added a project: Patch-For-Review. Feb 10 2023, 12:57 PM

Change 888212 merged by Muehlenhoff:

[operations/puppet@production] raid_handler: Use universal_newlines

https://gerrit.wikimedia.org/r/888212
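For context on the patch title: in Python's `subprocess` module, `universal_newlines=True` makes `check_output` return decoded `str` instead of `bytes`. The sketch below only illustrates that difference with a stand-in command; it is an assumption that the unreadable task text came from bytes being formatted into the description, since the task does not spell out what the patch changed.

```python
import subprocess
import sys

# Stand-in for the RAID status command: a child Python process printing two lines.
CMD = [sys.executable, "-c", "print('line one'); print('line two')"]

# Without universal_newlines the result is bytes; formatting it into a
# string shows escaped newlines instead of real line breaks.
raw = subprocess.check_output(CMD)

# With universal_newlines=True the result is decoded text with real newlines.
text = subprocess.check_output(CMD, universal_newlines=True)

if __name__ == "__main__":
    print(repr(raw))
    print(text)
```

On Python 3.7 and later, `text=True` is an alias for the same behaviour.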

@Jclark-ctr can you pull the disk out again for another test? Thanks

Maintenance_bot removed a project: Patch-For-Review. Feb 13 2023, 8:31 AM

@Marostegui removed drive

The auto-generated task looks good now: T329522
@Jclark-ctr can you insert the disk again whenever you've got time? Thanks!

I will close this task once the disk is back and the RAID is back to Optimal. Thanks @MoritzMuehlenhoff for all the help

John has inserted the disk back and it is rebuilding:

root@db1206:~# perccli64 /c0/e252/s2 show rebuild
CLI Version = 007.1910.0000.0000 Oct 08, 2021
Operating system = Linux 5.10.0-19-amd64
Controller = 0
Status = Success
Description = Show Drive Rebuild Status Succeeded.


------------------------------------------------------
Drive-ID    Progress% Status      Estimated Time Left
------------------------------------------------------
/c0/e252/s2        12 In progress 42 Minutes
------------------------------------------------------


root@db1206:~#

I'll close this ticket once the RAID is back to optimal.
Thanks for all the help @Jclark-ctr

Thanks everyone!

Maintenance_bot moved this task from In progress to Done on the DBA board. Feb 14 2023, 7:15 AM

• MoritzMuehlenhoff mentioned this in T315608: icinga raid monitoring inoperable for H750 controllers. Jun 21 2023, 12:16 PM
