
Replace wtp1043's sda
Closed, Resolved · Public

Description

On wtp1043, we have an alert related to a SMART failure, but upon inspection, the SMART checks pass if we add the -T permissive option:

wtp1043:~# smartctl -H /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

Short INQUIRY response, skip product id
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
wtp1043:~# smartctl -T permissive -H /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

Short INQUIRY response, skip product id
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

... but now I see errors in the kernel logs trying to write to sda. The disk needs to be swapped out with the server powered off. No reimaging is needed, just:

  • poweroff
  • swap the disk
  • poweron
  • handoff to services-ops for validation and repooling.
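The two smartctl runs in the description give different answers for the same disk, so it can help to classify the output rather than trust a single "OK". Below is a minimal sketch that assumes the message strings are exactly those seen in the transcripts on this task (other smartctl versions or drive types may word them differently). The function name and the verdict labels are hypothetical, chosen only for illustration.

```python
from typing import Tuple

# Health lines as they appear in the transcripts on this task, mapped to the
# value that means "healthy". Assumption: other builds may word these differently.
HEALTH_PREFIXES = {
    "SMART Health Status:": "OK",
    "SMART overall-health self-assessment test result:": "PASSED",
}


def classify_smart_health(output: str) -> Tuple[str, bool]:
    """Classify `smartctl -H` text.

    Returns (verdict, suspect): verdict is one of 'ok', 'failed',
    'command-error', or 'unknown'; suspect is True when the drive gave a
    short INQUIRY response, which makes an 'ok' less trustworthy.
    """
    verdict = "unknown"
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("A mandatory SMART command failed"):
            verdict = "command-error"
            break
        for prefix, good in HEALTH_PREFIXES.items():
            if line.startswith(prefix):
                value = line[len(prefix):].strip()
                verdict = "ok" if value == good else "failed"
    suspect = "Short INQUIRY response" in output
    return verdict, suspect
```

With this classification, the first run above would count as a command error, and the permissive run as an "ok" that is still flagged as suspect, which matches the kernel write errors reported afterwards.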

Event Timeline

Joe created this task. · Jun 11 2018, 10:55 AM
Restricted Application added a subscriber: Aklapper. · Jun 11 2018, 10:55 AM
Joe renamed this task from SMART checks fail on wtp1043's sda to Replace wtp1043's sda. · Jun 18 2018, 10:50 AM
Joe triaged this task as Medium priority.
Joe updated the task description.
Joe edited projects, added ops-eqiad, DC-Ops; removed observability.

Mentioned in SAL (#wikimedia-operations) [2018-06-18T10:52:15Z] <_joe_> removing wtp1043 from all pybal configuration until the disk is replaced T196886

Joe updated the task description. · Jun 18 2018, 10:53 AM
Vvjjkkii renamed this task from Replace wtp1043's sda to 99aaaaaaaa. · Jul 1 2018, 1:05 AM
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
jrbs renamed this task from 99aaaaaaaa to Replace wtp1043's sda. · Jul 1 2018, 7:17 AM
jrbs lowered the priority of this task from High to Medium.
jrbs updated the task description.
jrbs added a subscriber: Aklapper.
faidon added a subscriber: faidon. · Jul 18 2018, 11:40 AM

What's going on with this?

disk ordered

You have successfully submitted request SR977155354.

Cmjohnson assigned this task to RobH. · Jul 25 2018, 12:45 PM

@RobH Can you help with this by reaching out to our Dell rep? The repair request was denied because the service tag shows as not belonging to our organization (ST is 5MCLDH2). Could you also have them verify all the service tags from that WTP batch?

The message I received:
Work Order: SR977155354 Denial Notes
We are unable to proceed with your request as the Service Tag is not enrolled in the TechDirect program. This service tag is registered to another party and not linked to your organization. This may be caused by an incorrectly entered service tag number. Please check the tag and create a request using the correct tag.

@RobH have you had a chance to check this?

@RobH, @Cmjohnson, this has been open for two months now -- why is this taking such a long time to resolve?

RobH added a comment. · Aug 16 2018, 3:01 PM

I did not check this; I just didn't notice it was assigned to me. TechDirect doesn't work; was normal support attempted? I've emailed our team and CCed Chris.

Dell Team,

We're experiencing a warranty support issue with server 5MCLDH2. We leased this system back on 2017-03-07 via T155645/Farnam and Dell.

We recently tried to call in a warranty support call, and warranty support is telling us that this system is NOT assigned to WMF, and thus they won't assist us with warranty support.

Can you advise why this system is not assigned to us, and what can be fixed so we can receive warranty support for this broken system?

Thanks in advance,

RobH reassigned this task from RobH to Cmjohnson. · Aug 16 2018, 6:53 PM

Dell fixed the ownership info for us, you can put in requests for support and parts now.

Another dispatch was created: SR978583381

Cmjohnson closed this task as Resolved. · Aug 30 2018, 4:51 PM
faidon reopened this task as Open. · Sep 10 2018, 9:37 AM

We're still getting RAID alerts about this host.

Volans added a subscriber: Volans. · Sep 11 2018, 9:55 AM

Please revert https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/459607 once this has been fixed and the host is back in the pool.

Cmjohnson closed this task as Resolved. · Sep 12 2018, 4:13 PM

I do not see any RAID alerts in Icinga, so I am resolving this.

Dzahn added a subscriber: Dzahn. · Sep 12 2018, 4:18 PM

What about the revert? Was it done? It doesn't look like it. Creating it.

Dzahn reopened this task as Open. · Sep 12 2018, 4:20 PM
RobH added a comment. · Sep 13 2018, 4:27 PM

I'm not sure why this is still pending repair after all this time. In checking on the system, I can see it has both SDA and SDB present. SDA is marked as failed across both md0 and md1.

root@wtp1043:/dev# mdadm -D /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Fri Jun 30 17:26:49 2017
     Raid Level : raid1
     Array Size : 48794624 (46.53 GiB 49.97 GB)
  Used Dev Size : 48794624 (46.53 GiB 49.97 GB)
   Raid Devices : 2
  Total Devices : 1
    Persistence : Superblock is persistent

    Update Time : Thu Sep 13 16:25:43 2018
          State : clean, degraded 
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           Name : wtp1043:0  (local to host wtp1043)
           UUID : dc41e061:a120a483:0ebe54b1:ee160af6
         Events : 3730543

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8       17        1      active sync   /dev/sdb1
root@wtp1043:/dev# mdadm -D /dev/md1
/dev/md1:
        Version : 1.2
  Creation Time : Fri Jun 30 17:26:49 2017
     Raid Level : raid1
     Array Size : 927802368 (884.82 GiB 950.07 GB)
  Used Dev Size : 927802368 (884.82 GiB 950.07 GB)
   Raid Devices : 2
  Total Devices : 1
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Fri Jun 30 17:36:10 2017
          State : clean, degraded, resyncing (PENDING) 
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           Name : wtp1043:1  (local to host wtp1043)
           UUID : 0af5e6ba:cf2bf119:1690de31:6ee3eed8
         Events : 3

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8       18        1      active sync   /dev/sdb2
root@wtp1043:/dev#

However, fdisk -l doesn't show SDA?

Disk /dev/sdb: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xa8d4cc9f

Device     Boot    Start        End    Sectors  Size Id Type
/dev/sdb1           2048   97656831   97654784 46.6G fd Linux raid autodetect
/dev/sdb2       97656832 1953523711 1855866880  885G fd Linux raid autodetect


Disk /dev/md0: 46.5 GiB, 49965694976 bytes, 97589248 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/md1: 884.8 GiB, 950069624832 bytes, 1855604736 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/mapper/wtp1043--vg-swap: 952 MiB, 998244352 bytes, 1949696 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/mapper/wtp1043--vg-_placeholder: 883.9 GiB, 949070331904 bytes, 1853652992 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
root@wtp1043:/dev#

Was the SDA replaced in this system?

lshw lists /dev/sda but reports no vendor, product, or size for it:

root@wtp1043:/dev# lshw -class disk
  *-disk                    
       description: SCSI Disk
       physical id: 0.0.0
       bus info: scsi@0:0.0.0
       logical name: /dev/sda
       configuration: logicalsectorsize=512 sectorsize=512
  *-disk
       description: ATA Disk
       product: ST1000NX0423
       vendor: Seagate
       physical id: 0.0.0
       bus info: scsi@1:0.0.0
       logical name: /dev/sdb
       version: NA03
       serial: W4705RJ4
       size: 931GiB (1TB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 logicalsectorsize=512 sectorsize=512 signature=a8d4cc9f
root@wtp1043:/dev#
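In the lshw output above, the /dev/sda entry carries no product, vendor, serial, or size, while /dev/sdb has all of them. A rough sketch of how such "present but unidentified" disks could be flagged from the text of `lshw -class disk` follows. It assumes the indented "key: value" layout shown in this task, and the function name is hypothetical.

```python
from typing import Dict, List


def disks_missing_identity(lshw_output: str) -> List[str]:
    """Return logical names of disks that lack a 'product' or 'size' field."""
    disks: List[Dict[str, str]] = []
    current = None
    for line in lshw_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("*-disk"):
            # A new disk section starts; collect its fields in a fresh dict.
            current = {}
            disks.append(current)
        elif current is not None and ": " in stripped:
            key, _, value = stripped.partition(": ")
            current[key] = value
    return [
        d.get("logical name", "?")
        for d in disks
        if "product" not in d or "size" not in d
    ]
```

Applied to the output pasted above, this would report only /dev/sda, which is consistent with fdisk -l not listing that disk.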

Dell denied my request because they say the h/w log doesn't show a disk failure:

We are unable to proceed with this request as the provided log does not reflect any Hard Drive failure. Please verify the service tag and submit the request for correct tag.

I also see this in the output pasted above:

Failed Devices : 0
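A "Failed Devices : 0" line does not by itself mean the array is healthy: in the mdadm -D output pasted earlier, the first member is listed as removed, so the array is degraded without any device being counted as failed. The sketch below compares "Raid Devices" with "Active Devices" and reads the "State" line. It assumes the "Key : value" layout from the transcripts on this task, and the function names are hypothetical.

```python
from typing import Dict, List


def parse_mdadm_detail(output: str) -> Dict[str, str]:
    """Collect the 'Key : value' lines of `mdadm -D` text into a dict."""
    fields: Dict[str, str] = {}
    for line in output.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def array_problems(output: str) -> List[str]:
    """List reasons an md array needs attention (empty list if none found)."""
    fields = parse_mdadm_detail(output)
    raid = int(fields["Raid Devices"])
    active = int(fields["Active Devices"])
    failed = int(fields["Failed Devices"])
    states = [s.strip() for s in fields["State"].split(",")]
    problems = []
    if "degraded" in states:
        problems.append("array state is degraded")
    if active < raid:
        # A removed member lowers the active count without raising 'failed'.
        problems.append(f"{raid - active} of {raid} members not active")
    if failed > 0:
        problems.append(f"{failed} member(s) marked failed")
    return problems
```

For the md0 output in this task, the check would report a degraded state and one of two members not active, even though the failed count is zero.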

RobH added a comment. · Edited · Sep 13 2018, 5:03 PM

Summary:

SDA was never swapped out for RMA.

We're reimaging this in an attempt to troubleshoot the fact it did not see SDA in the OS.

Rebooted, and @Cmjohnson swapped SDA and SDB. The defective SDA disk is now located in SDB slot.

Checked bios, it sees both 1TB SATA disks just fine in both slots. Proceeding with reimage.

If SDB fails, then we know it's a bad disk. If SDA fails, it's a bad slot (as sda now has the non-failed sdb disk in it).
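The swap test above boils down to a small decision table. Here is a sketch of that reasoning, assuming the suspect disk now sits in the sdb slot and the known-good disk in the sda slot, as described in this comment. The function name and return labels are hypothetical.

```python
from typing import Optional


def diagnose_after_swap(device_with_errors: Optional[str]) -> str:
    """Interpret which device shows errors after the two disks traded slots."""
    if device_with_errors is None:
        # Nothing has failed yet, so the test is inconclusive; keep watching.
        return "not reproduced"
    if device_with_errors == "sdb":
        # The errors followed the disk into its new slot.
        return "bad disk"
    if device_with_errors == "sda":
        # The errors stayed with the slot despite a known-good disk.
        return "bad slot"
    raise ValueError(f"unexpected device: {device_with_errors}")
```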

Ok, reimage has completed and OS is running with puppet run already done.

No errors logged so far; leaving this open and stalled, checking for errors regularly. Since eqiad is depooled, it's not under heavy load. Assigning to Moritz to place back in service when we move back over.

Dzahn added a comment. · Sep 13 2018, 5:46 PM

Please merge this once it's been confirmed the host is OK again:

https://gerrit.wikimedia.org/r/#/c/operations/cookbooks/+/460049/

Mentioned in SAL (#wikimedia-operations) [2018-09-17T07:46:18Z] <moritzm> repooled wtp1043 (T196886)

@mortzm Is it safe to say that this can be resolved? Thanks!

Chris

Dzahn added a comment. · Oct 2 2018, 7:51 PM

It's green again now.

https://gerrit.wikimedia.org/r/#/c/operations/cookbooks/+/460049/

Has been merged.

<moritzm> repooled wtp1043

[wtp1043:~] $ pool
Pooling all services on wtp1043.eqiad.wmnet

Dzahn closed this task as Resolved. · Oct 2 2018, 7:53 PM
Dzahn removed MoritzMuehlenhoff as the assignee of this task.
Dzahn added a subscriber: MoritzMuehlenhoff.
[wtp1043:~] $ sudo smartctl -H /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-8-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED