
Add RAID monitoring for HP servers
Closed, Resolved (Public)

Description

We currently do not monitor RAID status for HP servers. Seeing how we have a lot of HP servers nowadays, we're running blind and this is a major problem.

HP ships .debs of their monitoring tools (hpssacli, previously known as hpacucli) in their apt repository. We'll need to import them into our repository (with reprepro updates) and add an Icinga check that uses them.

We could either adjust check-raid.py to handle it (also see T84050) or use a separate plugin. Debian's DSA team has a monitoring/nagios check for hpacucli that was pretty good last time I tried it; we could potentially use that.

Event Timeline

faidon raised the priority of this task from to High.
faidon updated the task description.
faidon subscribed.

Making this a blocker for the codfw rollout, since the MySQL HP servers in Dallas have a hard dependency on it.

AFAIK, hpacucli is non-free. This is the basic, free, Debian-included option to do that:

$ cciss_vol_status --verbose /dev/sg0
Controller: Smart Array P420i
  Board ID: 0x3354103c
  Logical drives: 1
  Running firmware: 6.00
  ROM firmware: 6.00
/dev/sda: (Smart Array P420i) RAID 1 Volume 0 status: OK. 
  Physical drives: 12
         connector 1I box 1 bay 1                 HP      EF0600FARNA                          6SL9JV750000N5160B08     HPD6 OK
         connector 1I box 1 bay 2                 HP      EF0600FARNA                          6SL9M6YZ0000N5214MTD     HPD6 OK
         connector 1I box 1 bay 3                 HP      EF0600FARNA                          6SL9LD870000N5214MNX     HPD6 OK
         connector 1I box 1 bay 4                 HP      EF0600FARNA                          6SL9LCS70000N5214T7U     HPD6 OK
         connector 1I box 1 bay 5                 HP      EF0600FARNA                          6SL9LCYR0000N5214Q2P     HPD6 OK
         connector 1I box 1 bay 6                 HP      EF0600FARNA                          6SL9LMD90000N52020XE     HPD6 OK
         connector 1I box 1 bay 7                 HP      EF0600FARNA                          6SL9LD1J0000N5214EJA     HPD6 OK
         connector 1I box 1 bay 8                 HP      EF0600FARNA                          6SL9M6XS0000N5214MT0     HPD6 OK
         connector 1I box 1 bay 9                 HP      EF0600FARNA                          6SL9LCP50000N5214DLQ     HPD6 OK
         connector 1I box 1 bay 10                 HP      EF0600FARNA                          6SL9LS7B0000N52024WH     HPD6 OK
         connector 1I box 1 bay 11                 HP      EF0600FARNA                          6SL9LTF20000N5206WXN     HPD6 OK
         connector 1I box 1 bay 12                 HP      EF0600FARNA                          6SL9CK0B0000N5206T0F     HPD6 OK
/dev/sg0: (Smart Array P420i) Enclosure Gen8 ServBP 12+2 (S/N: FZ4ABP5984) on Bus 0, Physical Port 1I status: OK.
/dev/sg0(Smart Array P420i:0): Non-Volatile Cache status:
                   Cache configured: Yes
                  Read cache memory: 81 MiB
                 Write cache memory: 735 MiB
                Write cache enabled: Yes
   Flash backed cache present
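
For reference, output in this shape could be turned into an Icinga/Nagios result roughly as follows. This is only a sketch, not the check that was eventually deployed: the function name `evaluate` is made up, the sample is a shortened copy of the output above with whitespace collapsed, and it assumes that any state other than "OK" (on a "status:" line or as the last field of a "connector" line) should be treated as a problem. The exit codes follow the usual plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```python
import re

# Standard Nagios/Icinga plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3

# Volume and enclosure lines: "/dev/sda: (...) ... status: OK."
STATUS_RE = re.compile(r"^(/dev/\S+):.*\bstatus: (.+?)\.?\s*$")

# Shortened copy of the output pasted above (whitespace collapsed).
SAMPLE = """\
/dev/sda: (Smart Array P420i) RAID 1 Volume 0 status: OK.
  connector 1I box 1 bay 1 HP EF0600FARNA 6SL9JV750000N5160B08 HPD6 OK
  connector 1I box 1 bay 2 HP EF0600FARNA 6SL9M6YZ0000N5214MTD HPD6 OK
/dev/sg0: (Smart Array P420i) Enclosure Gen8 ServBP 12+2 (S/N: FZ4ABP5984) on Bus 0, Physical Port 1I status: OK.
/dev/sg0(Smart Array P420i:0): Non-Volatile Cache status:
"""


def evaluate(output):
    """Return (exit_code, message) for `cciss_vol_status --verbose` output.

    Assumption: anything other than "OK" is reported as CRITICAL.
    """
    problems = []
    statuses = 0
    drives = 0
    for line in output.splitlines():
        stripped = line.strip()
        match = STATUS_RE.match(stripped)
        if match:
            statuses += 1
            if match.group(2) != "OK":
                problems.append("%s is %s" % (match.group(1), match.group(2)))
        elif stripped.startswith("connector "):
            # Physical drive line; the last field is taken as the drive state.
            fields = stripped.split()
            drives += 1
            if fields[-1] != "OK":
                problems.append("%s is %s" % (" ".join(fields[:6]), fields[-1]))
    if problems:
        return CRITICAL, "CRITICAL: " + "; ".join(problems)
    if statuses == 0:
        # Nothing recognisable was parsed, so do not claim the array is healthy.
        return UNKNOWN, "UNKNOWN: no status lines found"
    return OK, "OK: %d status lines, %d physical drives" % (statuses, drives)


if __name__ == "__main__":
    code, message = evaluate(SAMPLE)
    print(code, message)
```

The cache "status:" header line carries no state of its own, so the sketch deliberately ignores it.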

cciss-vol-status is GPL:

$ cat /usr/share/doc/cciss-vol-status/copyright | grep GPL
Public License version 2 can be found in `/usr/share/common-licenses/GPL-2'.
and is licensed under the GPL version 2 or (at your option) any later version,

On salt:

salt -E 'db20(3[5-9]|[4567][0-9]).codfw.wmnet' cmd.run 'cciss_vol_status --verbose /dev/sg0 | grep /dev/sda'
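
To see which hosts that expression selects, it can be tried against a list of names in Python. This is a sketch under assumptions: the candidate list (db2001 through db2099) is hypothetical, and `fullmatch` is used here for simplicity, which may not be exactly how Salt anchors the expression. Note that the dots in the original pattern are unescaped, so they match any character; an escaped variant is shown for comparison.

```python
import re

# Pattern copied verbatim from the salt -E command above.
PATTERN = re.compile(r"db20(3[5-9]|[4567][0-9]).codfw.wmnet")
# Same pattern with the dots escaped so they only match a literal ".".
STRICT = re.compile(r"db20(3[5-9]|[4567][0-9])\.codfw\.wmnet")

# Hypothetical candidate host names, for illustration only.
candidates = ["db%d.codfw.wmnet" % n for n in range(2001, 2100)]
matched = [host for host in candidates if PATTERN.fullmatch(host)]

if __name__ == "__main__":
    print(len(matched), matched[0], matched[-1])
    # The unescaped dots also accept names that are not real FQDNs.
    print(bool(PATTERN.fullmatch("db2035xcodfwxwmnet")))
    print(bool(STRICT.fullmatch("db2035xcodfwxwmnet")))
```

With this candidate list the pattern covers db2035 through db2079.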

Change 267262 had a related patch set uploaded (by Faidon Liambotis):
reprepro: add HP's MCP repository to updates

https://gerrit.wikimedia.org/r/267262

"AFAIK, hpacucli is non-free. This is the basic, free, Debian-included option to do that:"

That's not actually a big problem: most of the RAID tools we use are non-free as well (e.g. megacli), and we have a thirdparty section in our repository just for this. It's a compromise we unfortunately have to make.

I've pushed a patch to add the HP tools to our repository; check_raid should be adjusted next (or a separate check should be used, see T84050).

Change 267262 merged by Filippo Giunchedi:
reprepro: add HP's MCP repository to updates

https://gerrit.wikimedia.org/r/267262

Also, cciss_vol_status seems to be limited to the HP DL380 Gen8; on a DL380 Gen9 (e.g. ms-be2020) it doesn't work:

$ sudo cciss_vol_status /dev/sg0
cciss_vol_status: Warning: unknown controller type 0x21cb103c
cciss_vol_status: /dev/sg0: Unknown controller, board_id = 0x21cb103c
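
A check built on cciss_vol_status would need to tell this case apart from a healthy or degraded array. A minimal sketch, assuming the message format shown above is stable; the function names are made up for illustration, and 3 is the conventional Nagios/Icinga UNKNOWN exit code:

```python
import re

UNKNOWN = 3  # conventional Nagios/Icinga exit code for "UNKNOWN"

# Matches the "Unknown controller, board_id = 0x..." message shown above.
BOARD_ID_RE = re.compile(r"Unknown controller, board_id = (0x[0-9a-fA-F]+)")

# The two lines printed on ms-be2020, copied from above.
SAMPLE = (
    "cciss_vol_status: Warning: unknown controller type 0x21cb103c\n"
    "cciss_vol_status: /dev/sg0: Unknown controller, board_id = 0x21cb103c\n"
)


def unsupported_board_id(output):
    """Return the board ID if the tool reported an unknown controller, else None."""
    match = BOARD_ID_RE.search(output)
    return match.group(1) if match else None


def classify_unsupported(output):
    """Return (UNKNOWN, message) for an unsupported controller, else None."""
    board_id = unsupported_board_id(output)
    if board_id is None:
        return None
    return UNKNOWN, "UNKNOWN: cciss_vol_status does not support controller %s" % board_id


if __name__ == "__main__":
    print(classify_unsupported(SAMPLE))
```

Reporting UNKNOWN rather than OK or CRITICAL keeps an unsupported controller from being mistaken for either a healthy or a failed array.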

Removing patch-for-review since the patch has been merged; the ticket is still open.

Moving to @faidon since he mentioned he was working on it.

@faidon, would you agree to add hpssacli to the list of installed packages in the meantime, so that we can at least run checks manually or through salt when needed?

I saw that in modules/base/manifests/monitoring/host.pp we install a bunch of these tools for different vendors; hpssacli could be added directly there for now. Later we could make it cleaner so that only the proper tool is installed on each host.

Thoughts?

Change 290717 had a related patch set uploaded (by Volans):
Monitoring: Install vendor specific RAID tool

https://gerrit.wikimedia.org/r/290717

Change 290987 had a related patch set uploaded (by Faidon Liambotis):
raid: add HP's RAID tool to the list

https://gerrit.wikimedia.org/r/290987

Change 291014 had a related patch set uploaded (by Faidon Liambotis):
raid: add monitoring for HP controllers

https://gerrit.wikimedia.org/r/291014

Change 290987 abandoned by Faidon Liambotis:
raid: add HP's RAID tool to the list

Reason:
Done properly by folding into https://gerrit.wikimedia.org/r/#/c/291014/

https://gerrit.wikimedia.org/r/290987

Change 290717 abandoned by Volans:
Monitoring: Install vendor specific RAID tool

https://gerrit.wikimedia.org/r/290717

Change 291014 merged by Faidon Liambotis:
raid: add monitoring for HP controllers

https://gerrit.wikimedia.org/r/291014

It took a while but this is finally done. We now have 123 RAID checks for HP systems.