Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool
Closed, ResolvedPublic

Description

We just received a bunch of new Gen10 hosts (T220572) (https://netbox.wikimedia.org/dcim/devices/?manufacturer_id=6&device_type_id=74) and we will have some more coming.
We realised that hpssacli wasn't working:

root@db2102:~# hpssacli controller all show config

Error: No controllers detected. Possible causes:
 - The driver for the installed controller(s) is not loaded.
 - On LINUX, the scsi_generic (sg) driver module is not loaded.
 See the README file for more details

root@db2102:~# lsmod | grep sg
sg 32768 0
ipmi_msghandler 49152 2 ipmi_devintf,ipmi_si
scsi_mod 225280 5 smartpqi,sd_mod,ses,scsi_transport_sas,sg

After lots of digging (from T220572#5104134 through T220572#5106204), we feared it was a kernel or hardware issue, until @MoritzMuehlenhoff found out that HP has renamed the tool to ssacli:

root@db2102:~# ssacli controller all show config

HPE Smart Array P408i-a SR Gen10 in Slot 0 (Embedded)  (sn: PEYHC0DRHBZ75K)



   Internal Drive Cage at Port 1I, Box 1, OK



   Internal Drive Cage at Port 2I, Box 1, OK


   Port Name: 1I (Mixed)

   Port Name: 2I (Mixed)

   Array A (Solid State SATA, Unused Space: 0  MB)

      logicaldrive 1 (3.49 TB, RAID 1+0adm, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA SSD, 1.9 TB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA SSD, 1.9 TB, OK)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SATA SSD, 1.9 TB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SATA SSD, 1.9 TB, OK)
      physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SATA SSD, 1.9 TB, OK)
      physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SATA SSD, 1.9 TB, OK)

@MoritzMuehlenhoff is taking care of the repo to get ssacli installed, but I guess we also need to modify the RAID alert handler scripts so that they support ssacli.

Event Timeline

Marostegui renamed this task from Fix RAID handler alert to work with Gen10 hosts to Fix RAID handler alert to work with Gen10 hosts and ssacli tool. Apr 12 2019, 6:37 AM
Marostegui added a project: SRE.
Marostegui added a subscriber: CDanis.

Change 503261 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Sync ssacli from the HPE repository

https://gerrit.wikimedia.org/r/503261

Marostegui renamed this task from Fix RAID handler alert to work with Gen10 hosts and ssacli tool to Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool. Apr 12 2019, 6:57 AM

Change 503261 merged by Muehlenhoff:
[operations/puppet@production] Sync ssacli from the HPE repository

https://gerrit.wikimedia.org/r/503261

Mentioned in SAL (#wikimedia-operations) [2019-04-12T07:04:45Z] <moritzm> synced ssacli to thirdparty/hwraid components for jessie/stretch T220787

Mentioned in SAL (#wikimedia-operations) [2019-04-12T07:12:24Z] <marostegui> Manually install ssacli on db2[097|098|099|100|101|102] T220787 T220572

We need to extend the "raid" fact in modules/raid/lib/facter/raid.rb to also detect the Gen10 controller and then return a custom fact (e.g. "ssa"). modules/raid/manifests/init.pp can then be updated in a subsequent step to automatically install the ssacli tool on the Smart Array Gen10 RAID systems.

The Icinga check can probably be adapted by simply making the CLI tool name a variable (after checking that the syntax is still backwards compatible).

Change 503264 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Update the source distro for the HPE thirdparty suite

https://gerrit.wikimedia.org/r/503264

Change 503264 merged by Muehlenhoff:
[operations/puppet@production] Update the source distro for the HPE thirdparty suite

https://gerrit.wikimedia.org/r/503264

Mentioned in SAL (#wikimedia-operations) [2019-04-12T08:02:25Z] <moritzm> updated ssacli in thirdparty/hwraid component for stretch to 3.30-13.0 T220787

In addition to T220787#5106275, off the top of my head I think we also need to:

  • check whether the DSA script we're using to alert on HP RAID (modules/raid/files/dsa-check-hpssacli) has been updated upstream (Debian), and either update our copy or patch it and send the patch upstream (cc @faidon)
  • adapt modules/raid/files/get-raid-status-hpssacli.sh to detect which executable is available and act accordingly, assuming both accept the same options. If not, we need to adapt the script to handle the two different executables.
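The executable detection suggested in the second bullet could look roughly like the sketch below. It is only an illustration: the function name find_raid_cli is hypothetical, it is not taken from get-raid-status-hpssacli.sh, and it assumes both tools accept the same arguments, which still needs to be verified.

```shell
#!/bin/sh
# Hypothetical helper (not from the puppet repo): print the name of the first
# Smart Array CLI found in PATH. ssacli, the renamed tool, is preferred and
# hpssacli is the fallback for hosts that only have the old package.
find_raid_cli() {
    for tool in ssacli hpssacli; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            return 0
        fi
    done
    # Neither tool is installed.
    return 1
}
```

A caller could then run something like cli=$(find_raid_cli) && "$cli" controller all show config, so the rest of the script would not need to know which of the two tools is installed.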

In addition to T220787#5106275, off the top of my head I think we also need to:

  • check whether the DSA script we're using to alert on HP RAID (modules/raid/files/dsa-check-hpssacli) has been updated upstream (Debian), and either update our copy or patch it and send the patch upstream (cc @faidon)

The current DSA version only covers hpssacli: https://salsa.debian.org/dsa-team/mirror/dsa-nagios/blob/master/dsa-nagios-checks/checks/dsa-check-hpssacli

This is slightly off-topic and out of scope for this task, but there is some overlap between the SMART checks and the RAID (Megacli/HP) ones. I am mentioning it in case it is worth deprecating one of the two as a fix, or fixing both at the same time if that turns out to be necessary.

Change 503332 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] ssacli: update raid fact to detect Gen10 devices

https://gerrit.wikimedia.org/r/503332

Change 503333 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] raid: refactor structure

https://gerrit.wikimedia.org/r/503333

Change 503334 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] raid: add ssacli class

https://gerrit.wikimedia.org/r/503334

I have created a series of changes starting with 503332, which adds ssacli to the raids array fact if a "Smart Storage PQI 12G SAS/PCIe 3" device is detected. This is based on the PCI ID [1].

root@db2102:~# grep 9005028f /proc/bus/pci/devices
5c00    9005028f        20              e6c00004                       0                     0                       0                    8001                     0                       0                    8000                     0                       0                       0                   100                       0                       0      smartpqi
root@db2102:~# lspci -nn | grep SCSI
5c:00.0 Serial Attached SCSI controller [0107]: Adaptec Device [9005:028f] (rev 01)

The other two changes are there if you want them:

  • 503333 - refactors a bit so the main raid class is a bit more readable
  • 503334 - adds basic class for the ssacli controllers

[1] http://pci-ids.ucw.cz/v2.2/pci.ids
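For illustration, the PCI ID lookup described above can be sketched in shell as follows. This is not the merged fact code (the fact itself lives in modules/raid/lib/facter/raid.rb); the function name has_pci_id and its optional second argument are assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a device with the given vendor+device ID
# (8 hex digits, e.g. 9005028f) is listed in a /proc/bus/pci/devices-style
# file. The second argument is optional and exists mainly to ease testing.
has_pci_id() {
    pci_id=$1
    devices_file=${2:-/proc/bus/pci/devices}
    # Field 2 of each line holds the vendor ID and device ID concatenated.
    awk -v id="$pci_id" '
        tolower($2) == tolower(id) { found = 1 }
        END { exit !found }
    ' "$devices_file"
}
```

On db2102, has_pci_id 9005028f should succeed given the grep output above, and a wrapper could then report ssacli as the tool to use. Matching on the second field only avoids false positives from the same hex string appearing in an address column.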

Change 503332 merged by Jbond:
[operations/puppet@production] ssacli: update raid fact to detect Gen10 devices

https://gerrit.wikimedia.org/r/503332

Change 503333 merged by Jbond:
[operations/puppet@production] raid: refactor structure

https://gerrit.wikimedia.org/r/503333

colewhite triaged this task as Medium priority. Apr 16 2019, 6:03 PM

Change 504554 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] RAID: stop processing fact if device found via pci id

https://gerrit.wikimedia.org/r/504554

Change 504554 merged by Jbond:
[operations/puppet@production] RAID: stop processing fact if device found via pci id

https://gerrit.wikimedia.org/r/504554

Change 504586 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] RAID: replace hpssacli with sscli

https://gerrit.wikimedia.org/r/504586

Change 503334 merged by Jbond:
[operations/puppet@production] raid: add ssacli class

https://gerrit.wikimedia.org/r/503334

Change 505760 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] RAID: replace hpssacli with sscli

https://gerrit.wikimedia.org/r/505760

RAID checks now appear to be working with the new ssacli tool. The latest CR (https://gerrit.wikimedia.org/r/505760) would move all cards currently identified as hpsa to the new ssacli tool as well.

Change 504586 abandoned by Jbond:
RAID: replace hpssacli with sscli

Reason:
replaced with 505760

https://gerrit.wikimedia.org/r/504586

Change 505760 abandoned by Jbond:
RAID: replace hpssacli with sscli

Reason:
superseded by alternate change

https://gerrit.wikimedia.org/r/505760

Change 516724 had a related patch set uploaded (by Jbond; owner: Faidon Liambotis):
[operations/puppet@production] dsa-check-hpssacli: import latest version from DSA

https://gerrit.wikimedia.org/r/516724

Change 516725 had a related patch set uploaded (by Jbond; owner: Faidon Liambotis):
[operations/puppet@production] dsa-check-hpssacli: refactor for speed/efficiency

https://gerrit.wikimedia.org/r/516725

Change 516726 had a related patch set uploaded (by Jbond; owner: Faidon Liambotis):
[operations/puppet@production] dsa-check-hpssacli: make compatible with ssacli

https://gerrit.wikimedia.org/r/516726

Change 516724 merged by Filippo Giunchedi:
[operations/puppet@production] dsa-check-hpssacli: import latest version from DSA

https://gerrit.wikimedia.org/r/516724

Change 516725 merged by Filippo Giunchedi:
[operations/puppet@production] dsa-check-hpssacli: refactor for speed/efficiency

https://gerrit.wikimedia.org/r/516725

Change 516726 merged by Filippo Giunchedi:
[operations/puppet@production] dsa-check-hpssacli: make compatible with ssacli

https://gerrit.wikimedia.org/r/516726

fgiunchedi claimed this task.
fgiunchedi subscribed.

AFAICT this is good to resolve; please feel free to reopen if that's not the case.