Test RAID monitoring on new RAID PERC 755 controllers
Open, Medium, Public

Description

We are getting a new RAID controller, the PERC 755, which requires perccli64 instead of megacli as the tool for interacting with the RAID/BBU.
To make sure our automatic RAID monitoring and Phabricator task creation work with it, we need to pull a disk out of one of the new hosts and check that both the alert and the task are generated as expected.

We can use db1206 as a testing host in January.

Once we are ready to do this, we need to ping eqiad DCOps. I am not tagging them until we are ready on our end to arrange a date and time.

Event Timeline

Marostegui changed the task status from Open to Stalled. Dec 13 2022, 8:52 AM
Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to Blocked on the DBA board.

Stalling until we are back from the end-of-year holidays and the production freeze is over.

Marostegui moved this task from In progress to Blocked on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2023-01-10T22:09:42Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1206 T325046', diff saved to https://phabricator.wikimedia.org/P42980 and previous config saved to /var/cache/conftool/dbconfig/20230110-220942-marostegui.json

Marostegui changed the task status from Stalled to Open. Tue, Jan 10, 10:10 PM
Marostegui reassigned this task from Marostegui to Jclark-ctr.
Marostegui added a project: ops-eqiad.
Marostegui moved this task from Blocked to In progress on the DBA board.
Marostegui added a subscriber: Jclark-ctr.

@Jclark-ctr we want to test that the RAID monitoring works fine. Can you pull out a hard disk from db1206 and leave it out until we get the degraded RAID task created? (I can let you know when that happens).
The host is depooled, so you can go ahead and do it whenever you want.

@Jclark-ctr could you provide a rough timeline on when we could expect this to happen? Thanks!

Talked to John about it, we'll try to get it done this week :)

Drive pulled; advise when it can be reinserted.

@Volans @MoritzMuehlenhoff the task for the degraded RAID gets created correctly (T327902). It would be nice to also get the usual output showing which disk failed, as in the previous tasks with the old controller.
Is that something we can fix to work with this new one?

Yeah, I'll look into that in the next few days.

Thanks Moritz, do you need the disk to be left out?

We chatted on IRC and we are leaving the disk in a failed state for now until @MoritzMuehlenhoff is done with his tests.

Yeah, let's keep it for a few days, so that I can use it to test the additional output for the alert.

Change 883600 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] perccli: Print human-readable topology information on disk failure

https://gerrit.wikimedia.org/r/883600

Change 883600 merged by Muehlenhoff:

[operations/puppet@production] perccli: Print human-readable topology information on disk failure

https://gerrit.wikimedia.org/r/883600

I see it rebuilding. I will ping you once the alert recovers so we can pull it out again:

perccli64 /c0 show rebuildrate
CLI Version = 007.1910.0000.0000 Oct 08, 2021
Operating system = Linux 5.10.0-19-amd64
Controller = 0
Status = Success
Description = None


Controller Properties :
=====================

------------------
Ctrl_Prop   Value
------------------
Rebuildrate 30%
------------------

I pasted the wrong command above:

root@db1206:~# perccli64 /c0/e252/s2 show rebuild
CLI Version = 007.1910.0000.0000 Oct 08, 2021
Operating system = Linux 5.10.0-19-amd64
Controller = 0
Status = Success
Description = Show Drive Rebuild Status Succeeded.


------------------------------------------------------
Drive-ID    Progress% Status      Estimated Time Left
------------------------------------------------------
/c0/e252/s2        77 In progress 11 Minutes
------------------------------------------------------
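For anyone who wants to watch the rebuild from a script rather than by eye, here is a minimal Python sketch that picks the data rows out of output shaped like the paste above. It is not part of the Puppet change; `parse_rebuild_rows`, `ROW` and `SAMPLE` are hypothetical names, and it only assumes the row layout shown here (drive ID, numeric progress, then free text), so other row shapes are simply skipped.

```python
import re

# Matches a data row such as "/c0/e252/s2        77 In progress 11 Minutes".
# Assumption: only rows with a numeric progress value are of interest.
ROW = re.compile(r"^(?P<drive>/c\d+/e\d+/s\d+)\s+(?P<progress>\d+)\s+(?P<detail>\S.*)$")

# Table copied from the paste above.
SAMPLE = """\
------------------------------------------------------
Drive-ID    Progress% Status      Estimated Time Left
------------------------------------------------------
/c0/e252/s2        77 In progress 11 Minutes
------------------------------------------------------
"""


def parse_rebuild_rows(output):
    """Return one dict per drive row found in the rebuild status table."""
    rows = []
    for line in output.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append(
                {
                    "drive": match.group("drive"),
                    "progress": int(match.group("progress")),
                    # Status and time left are kept together as raw text,
                    # since both columns can contain spaces.
                    "detail": match.group("detail"),
                }
            )
    return rows


for row in parse_rebuild_rows(SAMPLE):
    print(f"{row['drive']}: {row['progress']}% ({row['detail']})")
```

The status and estimated-time columns are deliberately left unsplit, because both contain spaces and the thread shows only one example row.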

RAID is now back in optimal status. Waiting for Icinga to recover before pulling the disk out again.

VD LIST :
=======

--------------------------------------------------------------
DG/VD TYPE   State Access Consist Cache Cac sCC     Size Name
--------------------------------------------------------------
0/239 RAID10 Optl  RW     Yes     RWBD  -   OFF 8.729 TB
--------------------------------------------------------------
root@db1206:~#  sudo /usr/local/lib/nagios/plugins/get-raid-status-perccli
communication: 0 OK | controller: 0 OK | physical_disk: 0 OK | virtual_disk: 0 OK | bbu: 0 OK | enclosure: 0 OK
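The one-line summary above is easy to check mechanically. A minimal Python sketch, assuming the `name: number STATE | ...` layout seen in this paste; `parse_summary` and `all_ok` are hypothetical helpers, not functions of the plugin, and the meaning of the number is not stated in this task, so it is only carried along.

```python
# Summary line copied from the get-raid-status-perccli output above.
SAMPLE = (
    "communication: 0 OK | controller: 0 OK | physical_disk: 0 OK | "
    "virtual_disk: 0 OK | bbu: 0 OK | enclosure: 0 OK"
)


def parse_summary(line):
    """Map each subsystem name to a (number, state) tuple.

    Raises ValueError if a field does not look like 'name: <int> <state>'.
    """
    result = {}
    for field in line.split("|"):
        name, _, value = field.partition(":")
        number, _, state = value.strip().partition(" ")
        result[name.strip()] = (int(number), state.strip())
    return result


def all_ok(summary):
    """True when every subsystem reports the state 'OK'."""
    return all(state == "OK" for _, state in summary.values())


summary = parse_summary(SAMPLE)
print(sorted(summary))
print("all OK:", all_ok(summary))
```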

@Jclark-ctr whenever you can, pull the disk out again. Thank you

The task gets generated fine, but the output is still a bit unreadable, as shown in T328135.
Leaving this task open until @MoritzMuehlenhoff takes a look at that output.