Degraded RAID on ms-be1016
Closed, Resolved
Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host ms-be1016. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_md
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md1 : active (auto-read-only) raid1 sdb2[1] sda2[0]
      976320 blocks super 1.2 [2/2] [UU]
      	resync=PENDING
      
md0 : active raid1 sda1[0](F) sdb1[1]
      58559488 blocks super 1.2 [2/1] [_U]
      
unused devices: <none>
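
For illustration, a snapshot like the one above could be checked for degraded arrays roughly as follows. This is only a sketch in Python and not the code of get_raid_status_md; the function name parse_mdstat, the regular expressions, and the returned dictionary layout are all assumptions made for this example. It flags an array as degraded when the active count in the "[expected/active]" field is lower than the expected count, and lists members marked "(F)".

```python
import re

# Snapshot copied from the ticket (whitespace simplified).
SAMPLE = """\
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md1 : active (auto-read-only) raid1 sdb2[1] sda2[0]
      976320 blocks super 1.2 [2/2] [UU]
      resync=PENDING

md0 : active raid1 sda1[0](F) sdb1[1]
      58559488 blocks super 1.2 [2/1] [_U]

unused devices: <none>
"""

# Array header line, e.g. "md0 : active raid1 sda1[0](F) sdb1[1]"
ARRAY_RE = re.compile(r"^(md\d+) : (\S+)")
# One member device, e.g. "sda1[0](F)"; the single-letter flag is optional
MEMBER_RE = re.compile(r"(\w+)\[\d+\](?:\((\w)\))?")
# Member status field, e.g. "[2/1] [_U]" (expected/active, then per-slot map)
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")


def parse_mdstat(text):
    """Return {array_name: {"members", "failed", "degraded"}} (hypothetical layout)."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            members = MEMBER_RE.findall(line)
            current = {
                "members": [dev for dev, _ in members],
                "failed": [dev for dev, flag in members if flag == "F"],
                "degraded": False,
            }
            arrays[header.group(1)] = current
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            expected, active, _ = status.groups()
            # Fewer active members than expected means the array is degraded.
            current["degraded"] = int(active) < int(expected)
    return arrays


if __name__ == "__main__":
    for name, info in parse_mdstat(SAMPLE).items():
        state = "degraded" if info["degraded"] else "ok"
        print(name, state, "failed:", info["failed"])
```

On a live host the same function could be fed the contents of /proc/mdstat instead of the embedded sample.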

Event Timeline

Dzahn subscribed.

The server is unusable; SSH to it results in:

-bash: /usr/bin/tput: Input/output error
-bash: cannot create temp file for here-document: Read-only file system
-bash: %6+1: syntax error: operand expected (error token is "%6+1")
Connection to ms-be1016.eqiad.wmnet closed.

Dzahn, can you assign a priority for this ticket? Is 'normal' appropriate for Swift backend hosts?

Dzahn triaged this task as High priority. Jan 15 2019, 9:05 PM
Dzahn added a subscriber: fgiunchedi.

That's a good question. It looks like ms-be 17, 18 and 19 can still handle the load, but this is not just a regular degraded RAID: the whole server is read-only. To err on the side of caution, I say High, and I'm adding @fgiunchedi to it. He can tell us if we can lower it back to Normal.

Mentioned in SAL (#wikimedia-operations) [2019-01-16T09:52:59Z] <godog> upgrade controller firmware on ms-be1016 - T213856

fgiunchedi claimed this task.

Looks like the RAID controller freaked out; a reboot "fixed" it. I've upgraded the firmware too: https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/HP_DL3N0#RAID_controller_firmware_upgrade