
Broken disk on ms-be2026
Closed, Resolved · Public

Description

ms-be2026 has a disk failure, which doesn't seem to have been handled gracefully by mdadm. Also, our tool that creates tasks for broken disks seems not to have triggered here.
The server is up and e.g. the Icinga check for SSH claims to be fine, but an actual login fails with an Input/output error.

When logging in over the serial console, it can be seen that the host is emitting the following error every second:

[14332545.636171] sd 0:1:0:1: rejecting I/O to offline device

Followed by occasional errors like:

[14332545.665788] EXT4-fs error (device md0): __ext4_get_inode_loc:4363: inode #526249: block 2097306: comm systemd: unable to read itable block
[14332545.732880] EXT4-fs (md0): previous I/O error to superblock detected

Event Timeline

I can ssh into it via cumin.

The MD raid status is this:

$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sda1[0](F) sdb1[1]
      58559488 blocks super 1.2 [2/1] [_U]

md1 : active (auto-read-only) raid1 sda2[0] sdb2[1]
      976320 blocks super 1.2 [2/2] [UU]

unused devices: <none>
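
For reference, the degraded state is visible from the "[_U]" marker above (one member of two missing). A minimal sketch, assuming one only wants to flag degraded arrays by parsing /proc/mdstat (this is not the actual MD RAID check we run):

import re

def degraded_md_arrays(mdstat_text):
    """Return md device names whose status bitmap contains a '_' (a failed
    or missing member), e.g. the '[_U]' shown for md0 above."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        match = re.match(r'^(md\d+)\s*:', line)
        if match:
            current = match.group(1)
        elif current and re.search(r'\[[U_]*_[U_]*\]', line):
            degraded.append(current)
            current = None
    return degraded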

While the HP raid is quite weird:

$ /usr/local/lib/nagios/plugins/get-raid-status-hpssacli

Error: The specified device does not have any logical drives.

And the exit code in this case is 0, which must be fixed. I'll investigate this a bit more.

The reasons why the automatic task was not created are:

  • the HP raid is not alarming on Icinga due to the above issue
  • at the time the alarm for the MD device passed from SOFT to HARD, it was not reporting the current RAID status but the usual connection issue that we so often have on the ms hosts. In those cases the raid handler skips creating the task to avoid a lot of false positives. See the log below for context (and the simplified sketch of the skip logic right after it):
2019-04-01 20:07:46 [DEBUG] raid_handler::main: RAID Handler called with args: Namespace(datacenter='codfw', debug=True, host_address='ms-be2026', message='CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer', message_remain='', raid_type='md', service_attempts=3, service_description='MD RAID', service_state='CRITICAL', service_state_type='HARD', skip_nrpe=False)
2019-04-01 20:07:46 [INFO] raid_handler::main: Skipping RAID Handler execution for host 'ms-be2026' and RAID type 'md', skip string 'Connection reset by peer' detected in 'CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer'
2019-04-01 20:09:06 [DEBUG] raid_handler::main: RAID Handler called with args: Namespace(datacenter='codfw', debug=True, host_address='ms-be2026', message='CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.', message_remain='', raid_type='md', service_attempts=3, service_description='MD RAID', service_state='UNKNOWN', service_state_type='HARD', skip_nrpe=False)
2019-04-01 20:09:06 [DEBUG] raid_handler::main: Nothing to do, exiting
2019-04-01 20:10:22 [DEBUG] raid_handler::main: RAID Handler called with args: Namespace(datacenter='codfw', debug=True, host_address='ms-be2026', message='CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer', message_remain='', raid_type='md', service_attempts=3, service_description='MD RAID', service_state='CRITICAL', service_state_type='HARD', skip_nrpe=False)
2019-04-01 20:10:22 [INFO] raid_handler::main: Skipping RAID Handler execution for host 'ms-be2026' and RAID type 'md', skip string 'Connection reset by peer' detected in 'CHECK_NRPE: Error - Could not connect to 10.192.48.62: Connection reset by peer'
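
For context, a simplified sketch of the skip logic as it appears from the logs above (not the actual raid_handler code; the skip-string list here is illustrative):

# Illustrative only: simplified from the behaviour visible in the logs above.
SKIP_STRINGS = ('Connection reset by peer',)  # hypothetical list of NRPE error markers

def should_skip(icinga_message):
    """Return the matching skip string if the Icinga message looks like an
    NRPE connectivity problem rather than a real RAID status, else None."""
    for skip in SKIP_STRINGS:
        if skip in icinga_message:
            return skip
    return None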

So the dsa-check-hpssacli check is happily returning a 0 exit code and this output:

OK: Slot 0: no logical drives --- Slot 0: no drives

Given that IIRC we add the HP raid check only on the hosts that have it, we might consider patching this imported script to fail in the case where there is a controller but it has no drives configured (neither logical nor physical?).
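
What I have in mind, sketched in Python purely for illustration (the parsed 'controllers' structure below is hypothetical, not how the imported script is actually written):

def check_hp_raid(controllers):
    """controllers: hypothetical mapping of slot -> {'logical': N, 'physical': M}."""
    if not controllers:
        return 0, 'OK: no RAID controller present'
    for slot, drives in sorted(controllers.items()):
        if drives['logical'] == 0 and drives['physical'] == 0:
            # A controller that reports no drives at all most likely lost its
            # configuration (as happened here), so alert instead of returning OK.
            return 2, 'CRITICAL: Slot %s: controller present but no drives configured' % slot
    return 0, 'OK'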

For the get script that the raid handler uses to gather the output, I'm sending a patch to make it exit with the correct exit status, FWIW.
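
The general pattern of that fix, as a minimal Python sketch (not the actual patch; the command invocation is illustrative): run the underlying command and propagate its exit status instead of always exiting 0.

import subprocess
import sys

# Illustrative invocation only; the real script wraps the HP utility differently.
proc = subprocess.run(['hpssacli', 'controller', 'all', 'show', 'config'],
                      capture_output=True, text=True)
sys.stdout.write(proc.stdout)
sys.stderr.write(proc.stderr)
sys.exit(proc.returncode)  # exit with the tool's status instead of always 0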

Change 500684 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] RAID: hpssacli exit with correct code

https://gerrit.wikimedia.org/r/500684

Mentioned in SAL (#wikimedia-operations) [2019-04-02T13:24:37Z] <volans> reboot ms-be2026 to see if that fixes the controller - T219854

After the reboot the host is back up and running; all seems good so far. Keeping this open for a bit to see if it holds.

Forgot to mention that during the reboot it printed:

Slot 3 Port 1 : Smart Array P840 Controller - (4096 MB, V3.56) 14 Logical
Drive(s) - Operation Failed
 - 1719-Slot 3 Drive Array - A controller failure event occurred prior
   to this power-up.  (Previous lock up code = 0x13) Action: Install the
   latest controller firmware. If the problem persists, replace the
   controller.

@Papaul is this something we've already seen and for which there is a firmware upgrade that could help avoid it?

@Volans the server is running on old firmware.

HPE Smart Storage Battery 1 Firmware 1.1 Embedded
iLO 2.40 Dec 02 2015 System Board
Intelligent Platform Abstraction Data 20.3 System Board
Intelligent Provisioning N/A System Board
Power Management Controller Firmware 1.0.9 System Board
Power Management Controller FW Bootloader 1.0 System Board
Redundant System ROM P89 v2.00 (12/27/2015) System Board
Server Platform Services (SPS) Firmware 3.1.3.21.0 System Board
Smart Array P840 Controller 3.56 Slot 3
System Programmable Logic Device Version 0x34 System Board
System ROM P89 v2.00 (12/27/2015) System Board

There are some firmware upgrades available for this system. We can coordinate when I am on site tomorrow to upgrade the firmware.

Mentioned in SAL (#wikimedia-operations) [2019-04-03T15:18:12Z] <volans> shutdown ms-be2026 for firmware upgrade - T219854

Firmware versions after the upgrade:

HP FlexFabric 10Gb 2port 534FLR-SFP+ Adapter 7.17.19 Embedded
HPE Smart Storage Battery 1 Firmware 1.1 Embedded
iLO 2.60 May 23 2018 System Board
Intelligent Platform Abstraction Data 25.13 System Board
Intelligent Provisioning N/A System Board
Power Management Controller Firmware 1.0.9 System Board
Power Management Controller FW Bootloader 1.0 System Board
Redundant System ROM P89 v2.00 (12/27/2015) System Board
Server Platform Services (SPS) Firmware 3.1.3.21.0 System Board
Smart Array P840 Controller 6.60 Slot 3
System Programmable Logic Device Version 0x34 System Board
System ROM P89 v2.60 (05/21/2018) System Board

Change 500684 merged by Volans:
[operations/puppet@production] RAID: hpssacli exit with correct code

https://gerrit.wikimedia.org/r/500684

@fgiunchedi what are your thoughts on T219854#5076968? That's the last remaining part of this task, I guess.

So the dsa-check-hpssacli check is happily returning a 0 exit code and this output:

OK: Slot 0: no logical drives --- Slot 0: no drives

Given that IIRC we add the HP raid check only on the hosts that have it, we might consider patching this imported script to fail in the case where there is a controller but it has no drives configured (neither logical nor physical?).

Agreed, seems like a sensible thing to do, also upstream would be interested I think.

fgiunchedi triaged this task as Medium priority. Apr 9 2019, 8:37 AM

It appears that check_raid also behaves in this manner, so it could have been desired for some reason in the past?

if numLD == 0:
    print 'OK: no disks configured for RAID'
    return 0

https://github.com/wikimedia/puppet/blob/production/modules/raid/files/check-raid.py#L202-L204

I think we have some exception hosts that have 2 controllers but only one is in use, but I'm not 100% sure.

This task covers many things; AFAICT the original hardware issues are fixed and https://gerrit.wikimedia.org/r/500684 is merged, so is this good to close?

wiki_willy assigned this task to Papaul.
wiki_willy added a subscriber: wiki_willy.

Looks like things are resolved here, so I'm going to resolve the task, but feel free to reopen if there's still something that needs to be completed.

Thanks,
Willy