
hw troubleshooting: ssh unreachable for wdqs2007.codfw.wmnet
Closed, ResolvedPublicRequest

Description

Host => wdqs2007.codfw.wmnet
(Purchased 2019-12-12 so should be in warranty)

  - If this is anything other than a hard drive issue, please depool the machine (and confirm that it has been depooled) so we can work on it. If that's not possible, please provide a time frame for us to take the machine down.
  - Put the system into a failed state in Netbox: https://netbox.wikimedia.org/dcim/devices/2396/
  - Urgency: High. We don't maintain a lot of capacity in codfw wdqs (only 4 hosts), so having one offline is pretty painful.
  - The host keeps becoming ssh unreachable. It was found unreachable, a power cycle brought ssh back, and a few hours later it dropped out of ssh reachability again.
  - Assign the correct project tag and an appropriate owner (based on the above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

wdqs2007 login: root

[11508.787406] ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[11508.794552] ata10.00: failed command: WRITE DMA EXT
[11508.799454] ata10.00: cmd 35/00:01:59:b5:8d/00:00:14:00:00/e0 tag 21 dma 512 out
[11508.799454] res 40/00:42:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
[11508.814226] ata10.00: status: { DRDY }
[11509.130995] print_req_error: I/O error, dev sdh, sector 344831321
[11539.519042] ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[11539.526192] ata10.00: failed command: WRITE DMA EXT
[11539.531092] ata10.00: cmd 35/00:01:5a:b5:8d/00:00:14:00:00/e0 tag 16 dma 512 out
[11539.531092] res 40/00:42:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
[11539.545861] ata10.00: status: { DRDY }
[11570.238495] ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[11570.245650] ata10.00: failed command: WRITE DMA EXT
[11570.250543] ata10.00: cmd 35/00:01:5a:b5:8d/00:00:14:00:00/e0 tag 30 dma 512 out
[11570.250543] res 40/00:f4:7a:b4:8d/00:00:14:00:00/40 Emask 0x4 (timeout)
[11570.265319] ata10.00: status: { DRDY }
[11600.946039] ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[11600.953191] ata10.00: failed command: WRITE DMA EXT
[11600.958085] ata10.00: cmd 35/00:01:5a:b5:8d/00:00:14:00:00/e0 tag 19 dma 512 out
[11600.958085] res 40/00:9c:cd:b4:8d/00:00:14:00:00/40 Emask 0x4 (timeout)
[11600.972856] ata10.00: status: { DRDY }
[11601.310083] ata10.00: exception Emask 0x10 SAct 0x0 SErr 0x400100 action 0x6 frozen
[11601.317748] ata10.00: irq_stat 0x08000000, interface fatal error
[11601.323762] ata10: SError: { UnrecovData Handshk }
[11601.328561] ata10.00: failed command: WRITE DMA EXT
[11601.333448] ata10.00: cmd 35/00:01:5a:b5:8d/00:00:14:00:00/e0 tag 8 dma 512 out
[11601.333448] res 50/00:42:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
[11601.348731] ata10.00: status: { DRDY }
[11633.713554] ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[11633.720700] ata10.00: failed command: WRITE DMA EXT
[11633.725604] ata10.00: cmd 35/00:01:5a:b5:8d/00:00:14:00:00/e0 tag 18 dma 512 out
[11633.725604] res 40/00:94:74:b4:8d/00:00:14:00:00/40 Emask 0x4 (timeout)
[11633.740368] ata10.00: status: { DRDY }
[11664.445099] ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[11664.452251] ata10.00: failed command: WRITE DMA EXT
[11664.457152] ata10.00: cmd 35/00:01:5a:b5:8d/00:00:14:00:00/e0 tag 17 dma 512 out
[11664.457152] res 40/00:8c:af:6b:8d/00:00:14:00:00/40 Emask 0x4 (timeout)
[11664.471924] ata10.00: status: { DRDY }
[11664.788700] print_req_error: I/O error, dev sdh, sector 344831322
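
These repeated ata10 write timeouts ending in I/O errors on /dev/sdh usually point at a failing drive or a bad SATA link. A quick way to corroborate that from the OS, assuming smartmontools is installed and the drive still responds, is to check SMART health and the ATA error counters (a minimal sketch, not necessarily what was run here):

# Overall SMART health verdict for the suspect drive
sudo smartctl -H /dev/sdh

# Full identity, attributes, and ATA error log; watch for reallocated/pending
# sectors and interface CRC errors
sudo smartctl -x /dev/sdh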

racadm>>racadm getsel
Record:      1
Date/Time:   12/12/2019 11:05:12
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   02/20/2020 15:56:09
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   02/20/2020 15:56:11
Source:      system
Severity:    Critical
Description: The power input for power supply 1 is lost.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   02/20/2020 15:56:15
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   02/20/2020 15:56:16
Source:      system
Severity:    Ok
Description: The input power for power supply 1 has been restored.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   02/20/2020 15:56:49
Source:      system
Severity:    Critical
Description: The power input for power supply 2 is lost.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   02/20/2020 15:56:55
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   02/20/2020 15:57:00
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   02/20/2020 15:57:04
Source:      system
Severity:    Ok
Description: The input power for power supply 2 has been restored.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   02/20/2020 16:01:19
Source:      system
Severity:    Critical
Description: The power input for power supply 2 is lost.
-------------------------------------------------------------------------------
Record:      11
Date/Time:   02/20/2020 16:01:24
Source:      system
Severity:    Ok
Description: The input power for power supply 2 has been restored.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   02/20/2020 16:02:55
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
Record:      13
Date/Time:   02/20/2020 16:02:56
Source:      system
Severity:    Critical
Description: The power input for power supply 1 is lost.
-------------------------------------------------------------------------------
Record:      14
Date/Time:   02/20/2020 16:03:00
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      15
Date/Time:   02/20/2020 16:03:01
Source:      system
Severity:    Ok
Description: The input power for power supply 1 has been restored.
-------------------------------------------------------------------------------
Record:      16
Date/Time:   02/11/2021 17:33:57
Source:      system
Severity:    Critical
Description: The power input for power supply 1 is lost.
-------------------------------------------------------------------------------
Record:      17
Date/Time:   02/11/2021 17:34:02
Source:      system
Severity:    Ok
Description: The input power for power supply 1 has been restored.
-------------------------------------------------------------------------------
Record:      18
Date/Time:   02/11/2021 17:34:15
Source:      system
Severity:    Critical
Description: The power input for power supply 2 is lost.
-------------------------------------------------------------------------------
Record:      19
Date/Time:   02/11/2021 17:34:16
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
Record:      20
Date/Time:   02/11/2021 17:34:25
Source:      system
Severity:    Ok
Description: The input power for power supply 2 has been restored.
-------------------------------------------------------------------------------
Record:      21
Date/Time:   02/11/2021 17:34:26
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------

Event Timeline

Drained power from the host; the host is back online. Please change the status in Netbox when the server is back up in service.

Thanks

@Papaul The host became ssh unreachable again. It looks like there might be something wrong with the underlying hardware. As such I've reopened this ticket.

I just power-cycled wdqs2007, so I'll keep an eye on it in case this was a one-off event, but I'd be surprised if there weren't something wrong with the underlying hardware.

@RKemper Will try to upgrade the firmware on it tomorrow or Friday

Before
BIOS: 2.4.8
IDRAC: 4.00

After
BIOS: 2.9.3
IDRAC: 4.22

@RKemper firmware upgrade complete on the host. Resolving the task for now; if we still have the same issue, please feel free to re-open.

T282068 is another duplicate of this, right?

Mentioned in SAL (#wikimedia-operations) [2021-05-12T19:05:20Z] <ryankemper> T280382 T281437 sudo -i wmf-auto-reimage-host -p T280382 wdqs2007.codfw.wmnet on ryankemper@cumin2001 tmux session wdqs_reimage

RKemper reopened this task as Open. Edited May 13 2021, 11:17 PM

TL;DR: disk:3 serial # 193023009C1B is giving us issues


@Papaul This host is still in a bad state. It looks like one of the RAID drives is failing:

ryankemper@wdqs2007:~$ sudo /sbin/mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Wed May 12 19:12:32 2021
        Raid Level : raid10
        Array Size : 3749068800 (3575.39 GiB 3839.05 GB)
     Used Dev Size : 937267200 (893.85 GiB 959.76 GB)
      Raid Devices : 8
     Total Devices : 8
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Thu May 13 23:13:34 2021
             State : active, degraded
    Active Devices : 7
   Working Devices : 7
    Failed Devices : 1
     Spare Devices : 0

            Layout : near=2
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : wdqs2007:0  (local to host wdqs2007)
              UUID : 32d23424:f0969b81:4f5462b8:124da1bd
            Events : 109351

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync set-A   /dev/sda2
       1       8       18        1      active sync set-B   /dev/sdb2
       2       8       34        2      active sync set-A   /dev/sdc2
       3       8       50        3      active sync set-B   /dev/sdd2
       4       8       66        4      active sync set-A   /dev/sde2
       5       8       82        5      active sync set-B   /dev/sdf2
       6       8       98        6      active sync set-A   /dev/sdg2
       -       0        0        7      removed

       7       8      114        -      faulty   /dev/sdh2

so /dev/sdh2 is faulty

This explains why I see 1 failure here:

ryankemper@wdqs2007:~$ sudo /usr/local/lib/nagios/plugins/check_raid
CRITICAL: State: degraded, Active: 7, Working: 7, Failed: 1, Spare: 0
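
Before the physical swap, it's generally cleanest to make sure md has fully dropped the failed member. A minimal sketch (not necessarily what was run here, and assuming /dev/sdh2 is the faulty member as shown above):

# Mark the member failed (a no-op if md already flagged it), then remove it
sudo mdadm --manage /dev/md0 --fail /dev/sdh2
sudo mdadm --manage /dev/md0 --remove /dev/sdh2

# Re-check: the array should still run degraded on 7 of 8 devices
sudo mdadm --detail /dev/md0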

Now to find the actual physical disk:

ryankemper@wdqs2007:~$ sudo lshw -class disk
[snip]
  *-disk:3
       description: ATA Disk
       product: MTFDDAK960TDN
       physical id: 3
       bus info: scsi@9:0.0.0
       logical name: /dev/sdh
       version: F003
       serial: 193023009C1B
       size: 894GiB (960GB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=c6a91063-1050-4c76-9777-d62cd262302f logicalsectorsize=512 sectorsize=4096
[snip]

So the physical disk with serial number 193023009C1B is the one that needs to be looked at / possibly replaced.
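
For cross-checking, the same device-to-serial mapping can also be pulled without lshw, assuming lsblk and smartmontools are available on the host:

# One line per physical disk: device name, model, serial, size
lsblk -d -o NAME,MODEL,SERIAL,SIZE

# Or query the suspect device directly
sudo smartctl -i /dev/sdh | grep -i serial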

Hi @RKemper - Papaul will be back on the 24th. Would you be able to hold off until then? If not, we can submit the RMA request, then possibly use CyrusOne to do the drive swap. Thanks, Willy

@wiki_willy I heard Papaul is out for a couple weeks so see the above comment https://phabricator.wikimedia.org/T281437#7086866

We've got a fairly small amount of wdqs nodes per datacenter, so in the interest of reliability I think it's worth submitting the RMA and using CyrusOne to do the swap.

I've set up SR1059928911 (self-dispatch) for a part and a Dell technician to swap the SSD.

The Dell tech should contact me to schedule the on-site; when they do, I'll enter a CyrusOne Escort Visitor ticket with those details and the Dell tech can go swap the SSD. (Basically treating this like I would a swap in esams.)

RobH added a subscriber: Papaul.

Unisys tech Scott Jones will be going onsite tomorrow via ticket 1962822 (cyrusone escort) to swap out the SSD and return the defective part.

RobH raised the priority of this task from Medium to High. May 18 2021, 6:27 PM
RobH moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-codfw board.
RobH removed subscribers: Papaul, ops-monitoring-bot.

The slot # of 'sdh' was assumed to be slot 8, because the lshw -class disk output lists a 'scsi bus id' that is really just the backplane numbering for each disk. They ranged from "bus info: scsi@2:0.0.0" to "bus info: scsi@9:0.0.0" for /dev/sda through /dev/sdh, so I had the remote tech pull slot 8 (the highest scsi id in the range), and it was indeed /dev/sdh with the defective SSD serial number.

I'm guessing the SFF 10-bay chassis accounts for the missing scsi@0 and scsi@1 ids, but I'm not sure!
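
A less guess-based way to confirm which bay a device sits in, assuming the drives hang off standard AHCI/SAS ports and (for the LED trick) that the backplane supports enclosure management, would be something like:

# Persistent by-path symlinks show which ATA/SAS port each block device is on
ls -l /dev/disk/by-path/ | grep -v part

# Blink the bay LED for the suspect drive, then turn it off again
# (ledctl is from the ledmon package; not all SATA backplanes support this)
sudo ledctl locate=/dev/sdh
sudo ledctl locate_off=/dev/sdh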

The defective SSD has been swapped out with a new one, but the software raid doesn't appear to auto-rebuild, so I invoked it with the following:

# copy the partition table of sda to sdh (sdh was replaced)
sudo sgdisk -R /dev/sdh /dev/sda

# create a random GUID for sdh
sudo sgdisk -G /dev/sdh

# audit the output of both disks to ensure they now match
sudo sgdisk -p /dev/sda
sudo sgdisk -p /dev/sdh

# add the new SSD back into the array
sudo mdadm --manage /dev/md0 --add /dev/sdh2

# check the status
robh@wdqs2007:~$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md0 : active raid10 sdh2[8] sda2[0] sde2[4] sdf2[5] sdg2[6] sdc2[2] sdd2[3] sdb2[1]
      3749068800 blocks super 1.2 512K chunks 2 near-copies [8/7] [UUUUUUU_]
      [>....................]  recovery =  0.0% (132736/937267200) finish=470.6min speed=33184K/sec
      bitmap: 28/28 pages [112KB], 65536KB chunk

unused devices: <none>

So this will take a while to rebuild (roughly an 11 hour estimate), and I wouldn't put it back into 'staged' until it finishes. I'll keep this assigned to me and check on it again later.
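
If the rebuild speed shown above (~23-33 MB/s) turns out to be the bottleneck, one knob worth a look, assuming stock kernel defaults, is the md resync throttle:

# Current floor/ceiling for resync throughput, in KB/s
cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max

# Temporarily raise the floor so the rebuild isn't throttled behind normal I/O
# (resets on reboot; the trade-off is more impact on foreground load)
echo 100000 | sudo tee /proc/sys/dev/raid/speed_limit_min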

RobH removed a subscriber: 1.
robh@wdqs2007:~$ cat /proc/mdstat 
Personalities : [linear] [multipath] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
md0 : active raid10 sdh2[8] sda2[0] sde2[4] sdf2[5] sdg2[6] sdc2[2] sdd2[3] sdb2[1]
      3749068800 blocks super 1.2 512K chunks 2 near-copies [8/7] [UUUUUUU_]
      [====>................]  recovery = 20.9% (196460288/937267200) finish=537.0min speed=22990K/sec
      bitmap: 28/28 pages [112KB], 65536KB chunk

unused devices: <none>

wdqs2007 raid fully rebuilt and system is online.

I set it to staged in Netbox; when it's added back into full service by @RKemper, it should be changed to active.

Mentioned in SAL (#wikimedia-operations) [2021-06-02T21:10:36Z] <ryankemper> T280382 T281437 sudo -i wmf-auto-reimage-host -p T280382 --new wdqs2007.codfw.wmnet on ryankemper@cumin2002 tmux session wdqs_reimage

Re-image completed successfully, and I went ahead and set the host to Active in Netbox. This is done.