Degraded RAID on mw2279
Closed, Resolved, Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host mw2279. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sda1[0] sdb1[1](F)
      234298368 blocks super 1.2 [2/1] [U_]
      bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
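
For readers unfamiliar with the mdstat format, the snapshot above can be read mechanically: `(F)` after a member marks it as failed, `[2/1]` means 2 devices expected and 1 active, and `[U_]` shows which slot is down. The following is a minimal sketch of that reading, not the actual get-raid-status-md plugin; the function name `degraded_arrays` and the returned dictionary layout are assumptions made for illustration.

```python
import re

# Hypothetical helper for illustration only; this is NOT the code of
# /usr/local/lib/nagios/plugins/get-raid-status-md.
HEADER = re.compile(r"^(md\S+)\s*:\s*(.*)$")      # e.g. "md0 : active raid1 sda1[0] sdb1[1](F)"
FAILED = re.compile(r"(\S+)\[\d+\]\(F\)")         # members flagged with (F)
STATUS = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")  # e.g. "[2/1] [U_]"


def degraded_arrays(mdstat_text):
    """Return a dict of degraded md arrays found in /proc/mdstat-style text."""
    degraded = {}
    name, failed = None, []
    for line in mdstat_text.splitlines():
        header = HEADER.match(line)
        if header:
            # Start of a new array: remember its name and any (F) members.
            name = header.group(1)
            failed = FAILED.findall(header.group(2))
            continue
        status = STATUS.search(line)
        if name and status:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected or failed:
                degraded[name] = {
                    "expected": expected,
                    "active": active,
                    "failed": failed,
                }
            name = None  # ignore later lines (bitmap etc.) for this array
    return degraded
```

Fed the snapshot above, this sketch would report md0 with 2 devices expected, 1 active and sdb1 as the failed member, which matches the "Active: 1, Failed: 1" summary in the alert.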

Event Timeline

[Tue Oct  6 06:28:23 2020] ata2.00: failed command: READ FPDMA QUEUED
[Tue Oct  6 06:28:23 2020] ata2.00: cmd 60/80:00:00:a9:f7/00:00:03:00:00/40 tag 0 ncq dma 65536 in
                                    res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[Tue Oct  6 06:28:23 2020] ata2.00: status: { DRDY }

<snip>

[Tue Oct  6 06:29:35 2020] ata2: hard resetting link
[Tue Oct  6 06:29:35 2020] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Tue Oct  6 06:29:35 2020] ata2.00: configured for UDMA/133
[Tue Oct  6 06:29:35 2020] ata2: EH complete
[Tue Oct  6 06:29:40 2020] Process accounting resumed
[Tue Oct  6 06:30:11 2020] ata2: limiting SATA link speed to 3.0 Gbps
[Tue Oct  6 06:30:11 2020] ata2.00: exception Emask 0x0 SAct 0x7ff20009 SErr 0x0 action 0x6 frozen
[Tue Oct  6 06:30:11 2020] ata2.00: failed command: WRITE FPDMA QUEUED
[Tue Oct  6 06:30:11 2020] ata2.00: cmd 61/02:00:10:08:00/00:00:00:00:00/40 tag 0 ncq dma 1024 out
                                    res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[Tue Oct  6 06:30:11 2020] ata2.00: status: { DRDY }

[Tue Oct  6 06:30:11 2020] ata2: hard resetting link
[Tue Oct  6 06:30:17 2020] ata2: link is slow to respond, please be patient (ready=0)
[Tue Oct  6 06:30:21 2020] ata2: COMRESET failed (errno=-16)

 <snip>

[Tue Oct  6 06:30:37 2020] ata2: link is slow to respond, please be patient (ready=0)
[Tue Oct  6 06:31:07 2020] ata2: COMRESET failed (errno=-16)
[Tue Oct  6 06:31:07 2020] ata2: limiting SATA link speed to 1.5 Gbps
[Tue Oct  6 06:31:07 2020] ata2: hard resetting link
[Tue Oct  6 06:31:12 2020] ata2: COMRESET failed (errno=-16)
[Tue Oct  6 06:31:12 2020] ata2: reset failed, giving up
[Tue Oct  6 06:31:12 2020] ata2.00: disabled
[Tue Oct  6 06:31:12 2020] ata2: EH complete
[Tue Oct  6 06:31:12 2020] sd 1:0:0:0: [sdb] tag#29 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Tue Oct  6 06:31:12 2020] sd 1:0:0:0: [sdb] tag#18 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Tue Oct  6 06:31:12 2020] sd 1:0:0:0: [sdb] tag#18 CDB: Write(10) 2a 00 0a 9f f8 d0 00 00 10 00
[Tue Oct  6 06:31:12 2020] blk_update_request: I/O error, dev sdb, sector 178256080
[Tue Oct  6 06:31:12 2020] sd 1:0:0:0: [sdb] tag#29 CDB: Write(10) 2a 00 00 00 08 10 00 00 02 00
[Tue Oct  6 06:31:12 2020] blk_update_request: I/O error, dev sdb, sector 2064
[Tue Oct  6 06:31:12 2020] md: super_written gets error=-5
[Tue Oct  6 06:31:12 2020] md/raid1:md0: Disk failure on sdb1, disabling device.
                           md/raid1:md0: Operation continuing on 1 devices.
[Tue Oct  6 06:31:12 2020] sd 1:0:0:0: [sdb] tag#30 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Tue Oct  6 06:31:12 2020] sd 1:0:0:0: [sdb] tag#30 CDB: Read(10) 28 00 03 f7 a9 00 00 00 80 00
[Tue Oct  6 06:31:12 2020] blk_update_request: I/O error, dev sdb, sector 66562304

<snip>
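
The key line in the excerpt above is the md/raid1 "Disk failure on sdb1" message; the ATA resets and I/O errors before it are the lead-up. As a rough sketch, assuming kernel log lines in the same bracketed-timestamp format as shown here, that event could be pulled out of a longer log like this. The name `md_disk_failures` and the field names are hypothetical.

```python
import re

# Hypothetical pattern for illustration; it only matches the message form
# seen in this task: "md/raid1:md0: Disk failure on sdb1, disabling device."
DISK_FAILURE = re.compile(
    r"^\[(?P<when>[^\]]+)\]\s+md/(?P<level>\w+):(?P<array>md\d+): "
    r"Disk failure on (?P<member>\S+), disabling device\."
)


def md_disk_failures(kernel_log_lines):
    """Return one dict per md 'Disk failure' line found in the given lines."""
    events = []
    for line in kernel_log_lines:
        match = DISK_FAILURE.match(line)
        if match:
            events.append(match.groupdict())
    return events
```

Lines that do not match, such as the ata2 reset messages, are skipped, so the result for this excerpt would be a single event for sdb1 on md0.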

Mentioned in SAL (#wikimedia-operations) [2020-10-06T10:48:07Z] <effie> set mw2279.codfw.wmnet as inactive T264698

According to netbox, this server is still under warranty.

@Papaul the server is set as inactive. Let me know if you need anything from me, thank you!

Papaul triaged this task as Medium priority. Oct 13 2020, 12:42 PM

@jijiki I will request a disk replacement

21:08 <+icinga-wm> PROBLEM - Ensure local MW versions match expected deployment on mw2279 is CRITICAL: CRITICAL: 956 mismatched wikiversions 
                   https://wikitech.wikimedia.org/wiki/Application_servers
21:09 < mutante> interesting. is that being used as a test host right now?


21:12 <+icinga-wm> ACKNOWLEDGEMENT - Ensure local MW versions match expected deployment on mw2279 is CRITICAL: CRITICAL: 956 mismatched wikiversions daniel_zahn 
                   https://phabricator.wikimedia.org/T264698 https://wikitech.wikimedia.org/wiki/Application_servers

[cumin1001:~] $ sudo -i cookbook sre.hosts.downtime -r T264698 -H 24 mw2279.codfw.wmnet
START - Cookbook sre.hosts.downtime
Downtiming 1 hosts and all their services for 1 day, 0:00:00: mw2279.codfw.wmnet

Create Dispatch: Success
You have successfully submitted request SR1039679642.

@jijiki disk replaced

Return tracking information

Script wmf-auto-reimage was launched by jiji on cumin2001.codfw.wmnet for hosts:

mw2279.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202010161152_jiji_22778_mw2279_codfw_wmnet.log.

Completed auto-reimage of hosts:

['mw2279.codfw.wmnet']

and were ALL successful.