Check for errors on wdqs1009 disks
Closed, Resolved · Public · 2 Estimated Story Points

Description

According to https://github.com/blazegraph/database/issues/52#issuecomment-280864022 checksum errors can be due to disk failures. There are no errors reported on wdqs1009 disks but there are perhaps other tools that may uncover errors present on disks.
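
For reference, a quick first pass could use smartmontools and the kernel log; this is a sketch, assuming smartctl is installed and the disks appear as /dev/sda through /dev/sdd:

# Overall SMART health verdict for each disk
for d in /dev/sd{a,b,c,d}; do sudo smartctl -H "$d"; done

# Raw SMART attributes (reallocated sectors, program/erase failures, etc.)
sudo smartctl -A /dev/sdb

# Kernel-level I/O or medium errors logged since boot
sudo dmesg -T | grep -iE 'i/o error|medium error|ata[0-9]'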

Event Timeline

CBogen set the point value for this task to 2. (Sep 21 2020, 5:29 PM)

A historical Unix program, shipped with many distros, for writing patterns to a disk and reading them back to verify correctness:

https://wiki.archlinux.org/index.php/Badblocks
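
As a sketch of the modes badblocks offers (the device name is illustrative; the write modes destroy all data on the device):

# Read-only scan, safe on a disk holding data
sudo badblocks -sv /dev/sdb

# Non-destructive read-write test (slower, preserves existing data)
sudo badblocks -svn /dev/sdb

# Destructive write-mode test, as used below: writes 0xaa/0x55/0xff/0x00 patterns and reads them back
sudo badblocks -svw -o badblocks.txt /dev/sdb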

I'll run it for a while after checking with David that we're OK nuking this system (badblocks is destructive). We'll need to reimage the server afterward, but that should not be an issue.

Mentioned in SAL (#wikimedia-operations) [2020-09-29T07:55:02Z] <gehel> badblocks check on wdqs1009 - T263125

Running check via `sudo badblocks -v -o badblocks -w -f /dev/mapper/wdqs1009--vg-data &> badblocks.log`. This is still going through the RAID layer, but should exercise the disks enough to get some confidence that they work.
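
To sanity-check which physical disks actually sit under that logical volume (a sketch; the exact md/LVM stacking on wdqs1009 is assumed, not verified here):

# Show the full block device stack: sdX -> mdX -> LVM -> wdqs1009--vg-data
lsblk -o NAME,TYPE,SIZE,MOUNTPOINT

# Software RAID membership and sync state
cat /proc/mdstat

# Which physical volumes back the volume group
sudo pvs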

One error found by badblocks at block 65:

From block 0 to 1464897535
Testing with pattern 0xaa: done                                                 
Reading and comparing: done                                                 
Testing with pattern 0x55: done                                                 
Reading and comparing: done                                                 
Testing with pattern 0xff: done                                                 
Reading and comparing: done                                                 
Testing with pattern 0x00: done                                                 
Reading and comparing: done                                                 
Pass completed, 1 bad blocks found. (0/0/1 errors)
Gehel added projects: DC-Ops, ops-eqiad.
Gehel added a subscriber: wiki_willy.

smart-data-dump shows a few errors for sdb:

# HELP device_smart_program_fail_cnt_total SMART attribute program_fail_cnt_total
# TYPE device_smart_program_fail_cnt_total gauge
device_smart_program_fail_cnt_total{device="sda"} 0.0
device_smart_program_fail_cnt_total{device="sdb"} 1.0
device_smart_program_fail_cnt_total{device="sdc"} 0.0
device_smart_program_fail_cnt_total{device="sdd"} 0.0
# HELP device_smart_used_rsvd_blk_cnt_tot SMART attribute used_rsvd_blk_cnt_tot
# TYPE device_smart_used_rsvd_blk_cnt_tot gauge
device_smart_used_rsvd_blk_cnt_tot{device="sda"} 0.0
device_smart_used_rsvd_blk_cnt_tot{device="sdb"} 2.0
device_smart_used_rsvd_blk_cnt_tot{device="sdc"} 0.0
device_smart_used_rsvd_blk_cnt_tot{device="sdd"} 0.0
# HELP device_smart_reallocated_sector_ct SMART attribute reallocated_sector_ct
# TYPE device_smart_reallocated_sector_ct gauge
device_smart_reallocated_sector_ct{device="sda"} 0.0
device_smart_reallocated_sector_ct{device="sdb"} 4.0
device_smart_reallocated_sector_ct{device="sdc"} 0.0
device_smart_reallocated_sector_ct{device="sdd"} 0.0
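
To double-check those counters directly on the host (a sketch; SMART attribute names vary by SSD vendor, so the grep pattern is a guess):

# Dump raw SMART attributes for the suspect disk and pick out the counters flagged above
sudo smartctl -A /dev/sdb | grep -iE 'program_fail|used_rsvd|reallocated_sector'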

This server was purchased on April 25, 2018 and should still be under warranty.

@wiki_willy: Is this enough suspicion to replace the SSD under warranty?

In the meantime, I'll reimage that server and we'll see how it behaves.

Script wmf-auto-reimage was launched by gehel on cumin1001.eqiad.wmnet for hosts:

wdqs1009.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009300853_gehel_3298_wdqs1009_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['wdqs1009.eqiad.wmnet']

and were ALL successful.

wiki_willy added a subscriber: RobH.

Moving over to @RobH and @Cmjohnson, so they can pull TSR reports for a replacement part.

I've pulled the TSR and submitted SR1038301301 for dispatching a replacement SSD to eqiad. This is now over to Chris for receipt, installation, and return of the defective SSD.

Since the failing SSD is currently working, we need to attempt to securely erase it before sending it back to Dell. The directions for using hdparm are documented on https://wikitech.wikimedia.org/wiki/Dc-operations/Securely_Erasing_Media and I've checked that wdqs1009 has hdparm installed within the OS. I'm not sure how well a system will handle hdparm secure erase on 1 of 4 disks in a sw raid10, but this seems like an excellent opportunity to find out! I'm willing to remotely assist as needed!
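
For reference, the generic ATA secure-erase sequence with hdparm looks roughly like this (a sketch, not the wikitech runbook itself; the device name and the throwaway password 'p' are placeholders, and the drive must not be in a frozen security state):

# Confirm the drive supports security erase and is not frozen
sudo hdparm -I /dev/sdb | grep -iE 'frozen|erase'

# Set a temporary user password, then issue the erase (destroys all data on /dev/sdb)
sudo hdparm --user-master u --security-set-pass p /dev/sdb
sudo hdparm --user-master u --security-erase p /dev/sdb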

I would recommend this system be depooled during the SSD swap, in case the secure erase (which only lasts 3 minutes) causes undue load.

@Gehel: Are there directions for [de/re]pooling of wdqs servers we can follow for this within DC-Ops?

IRC update from my chat with @Gehel:

wdqs1009 is not in production, but a non-user-facing test server. We can do our hdparm testing of secure erasure without affecting users. If we nuke the server, it can simply be reimaged.

When Chris is ready to swap the SSD, I'll want to log in, use mdadm to pull sdb out of the sw raid10, then hdparm secure erase it. Chris can then swap the disk and I can attempt a sw raid rebuild. I'll document my steps and commands, so we can attempt to automate this in the future.
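
A rough sketch of that sequence (the md device and partition names are assumptions; the actual array layout on wdqs1009 may differ):

# Find which md array(s) sdb belongs to
cat /proc/mdstat

# Mark the member failed and pull it out of the array
sudo mdadm /dev/md0 --fail /dev/sdb1
sudo mdadm /dev/md0 --remove /dev/sdb1

# ... hdparm secure erase and physical swap happen here ...

# After the new SSD is in: copy the partition layout from a healthy disk, then re-add
sudo sfdisk -d /dev/sda | sudo sfdisk /dev/sdb
sudo mdadm /dev/md0 --add /dev/sdb1

# Watch the rebuild
watch cat /proc/mdstat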

Note that it is trivial to reimage this server, so feel free to nuke it instead of rebuilding the raid if it's easier (or use this as a learning opportunity to automate it next time).

@RobH @Gehel the SSD has been replaced.

Script wmf-auto-reimage was launched by gehel on cumin1001.eqiad.wmnet for hosts:

wdqs1009.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202010080750_gehel_11476_wdqs1009_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['wdqs1009.eqiad.wmnet']

and were ALL successful.