Check for errors on wdqs1009 disks
Closed, Resolved · Public · 2 Estimated Story Points

Description

According to https://github.com/blazegraph/database/issues/52#issuecomment-280864022 checksum errors can be due to disk failures. There are no errors reported on wdqs1009 disks but there are perhaps other tools that may uncover errors present on disks.
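
For reference, a quick first pass could use smartmontools and the kernel log; this is a sketch, assuming smartctl is installed and the disks appear as /dev/sda through /dev/sdd:

# Overall SMART health verdict for each disk
for d in /dev/sd{a,b,c,d}; do sudo smartctl -H "$d"; done

# Raw SMART attributes (reallocated sectors, program/erase failures, etc.)
sudo smartctl -A /dev/sdb

# Kernel-level I/O or medium errors logged since boot
sudo dmesg -T | grep -iE 'i/o error|medium error|ata[0-9]'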

Event Timeline

CBogen set the point value for this task to 2. (Sep 21 2020, 5:29 PM)

A historical Unix program, shipped with many distros, for writing patterns to a disk and reading them back to verify correctness:

https://wiki.archlinux.org/index.php/Badblocks
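
As a sketch of the modes badblocks offers (the device name is illustrative; the write modes destroy all data on the device):

# Read-only scan, safe on a disk holding data
sudo badblocks -sv /dev/sdb

# Non-destructive read-write test (slower, preserves existing data)
sudo badblocks -svn /dev/sdb

# Destructive write-mode test, as used below: writes 0xaa/0x55/0xff/0x00 patterns and reads them back
sudo badblocks -svw -o badblocks.txt /dev/sdb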

I'll run it for a while after checking with David that we're OK nuking this system (badblocks is destructive). We'll need to reimage the server afterward, but that should not be an issue.

Mentioned in SAL (#wikimedia-operations) [2020-09-29T07:55:02Z] <gehel> badblocks check on wdqs1009 - T263125

Running check via `sudo badblocks -v -o badblocks -w -f /dev/mapper/wdqs1009--vg-data &> badblocks.log`. This is still going through the RAID layer, but should exercise the disks enough to get some confidence that they work.
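
To sanity-check which physical disks actually sit under that logical volume (a sketch; the exact md/LVM stacking on wdqs1009 is assumed, not verified here):

# Show the full block device stack: sdX -> mdX -> LVM -> wdqs1009--vg-data
lsblk -o NAME,TYPE,SIZE,MOUNTPOINT

# Software RAID membership and sync state
cat /proc/mdstat

# Which physical volumes back the volume group
sudo pvs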

One error found by badblocks at block 65:

From block 0 to 1464897535
Testing with pattern 0xaa: done                                                 
Reading and comparing: done                                                 
Testing with pattern 0x55: done                                                 
Reading and comparing: done                                                 
Testing with pattern 0xff: done                                                 
Reading and comparing: done                                                 
Testing with pattern 0x00: done                                                 
Reading and comparing: done                                                 
Pass completed, 1 bad blocks found. (0/0/1 errors)
Gehel added projects: DC-Ops, ops-eqiad.
Gehel added a subscriber: wiki_willy.

smart-data-dump shows a few errors for sdb:

# HELP device_smart_program_fail_cnt_total SMART attribute program_fail_cnt_total
# TYPE device_smart_program_fail_cnt_total gauge
device_smart_program_fail_cnt_total{device="sda"} 0.0
device_smart_program_fail_cnt_total{device="sdb"} 1.0
device_smart_program_fail_cnt_total{device="sdc"} 0.0
device_smart_program_fail_cnt_total{device="sdd"} 0.0
# HELP device_smart_used_rsvd_blk_cnt_tot SMART attribute used_rsvd_blk_cnt_tot
# TYPE device_smart_used_rsvd_blk_cnt_tot gauge
device_smart_used_rsvd_blk_cnt_tot{device="sda"} 0.0
device_smart_used_rsvd_blk_cnt_tot{device="sdb"} 2.0
device_smart_used_rsvd_blk_cnt_tot{device="sdc"} 0.0
device_smart_used_rsvd_blk_cnt_tot{device="sdd"} 0.0
# HELP device_smart_reallocated_sector_ct SMART attribute reallocated_sector_ct
# TYPE device_smart_reallocated_sector_ct gauge
device_smart_reallocated_sector_ct{device="sda"} 0.0
device_smart_reallocated_sector_ct{device="sdb"} 4.0
device_smart_reallocated_sector_ct{device="sdc"} 0.0
device_smart_reallocated_sector_ct{device="sdd"} 0.0
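
To double-check those counters directly on the host (a sketch; SMART attribute names vary by SSD vendor, so the grep pattern is a guess):

# Dump raw SMART attributes for the suspect disk and pick out the counters flagged above
sudo smartctl -A /dev/sdb | grep -iE 'program_fail|used_rsvd|reallocated_sector'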

This server was purchased on April 25, 2018 and should still be under warranty.

@wiki_willy: Is this enough suspicion to replace the SSD under warranty?

In the meantime, I'll reimage that server and we'll see how it behaves.

Script wmf-auto-reimage was launched by gehel on cumin1001.eqiad.wmnet for hosts:

wdqs1009.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202009300853_gehel_3298_wdqs1009_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['wdqs1009.eqiad.wmnet']

and were ALL successful.

wiki_willy added a subscriber: RobH.

Moving over to @RobH and @Cmjohnson, so they can pull TSR reports for a replacement part.

I've pulled the TSR and submitted SR1038301301 for dispatching a replacement SSD to eqiad. This is now over to Chris for receipt, installation, and return of the defective SSD.

Since the failing SSD is currently working, we need to attempt to securely erase it before sending it back to Dell. The directions for using hdparm are documented on https://wikitech.wikimedia.org/wiki/Dc-operations/Securely_Erasing_Media and I've checked that wdqs1009 has hdparm installed within the OS. I'm not sure how well a system will handle hdparm secure erase on 1 of 4 disks in a sw raid10, but this seems like an excellent opportunity to find out! I'm willing to remotely assist as needed!
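
For reference, the generic ATA secure-erase sequence with hdparm looks roughly like this (a sketch, not the wikitech runbook itself; the device name and the throwaway password 'p' are placeholders, and the drive must not be in a frozen security state):

# Confirm the drive supports security erase and is not frozen
sudo hdparm -I /dev/sdb | grep -iE 'frozen|erase'

# Set a temporary user password, then issue the erase (destroys all data on /dev/sdb)
sudo hdparm --user-master u --security-set-pass p /dev/sdb
sudo hdparm --user-master u --security-erase p /dev/sdb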

I would recommend this system be depooled during the SSD swap, in case the secure erase (which only lasts 3 minutes) causes undue load.

@Gehel: Are there directions for [de/re]pooling of wdqs servers we can follow for this within DC-Ops?

IRC update from my chat with @Gehel:

wdqs1009 is not in production, but a non-user-facing test server. We can do our hdparm testing of secure erasure without affecting users. If we nuke the server, it can simply be reimaged.

When Chris is ready to swap the SSD, I'll want to log in, use mdadm to pull sdb out of the sw raid10, then hdparm secure erase it. Chris can then swap the disk and I can attempt a sw raid rebuild. I'll document my steps and commands, so we can attempt to automate this in the future.
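
A rough sketch of that sequence (the md device and partition names are assumptions; the actual array layout on wdqs1009 may differ):

# Find which md array(s) sdb belongs to
cat /proc/mdstat

# Mark the member failed and pull it out of the array
sudo mdadm /dev/md0 --fail /dev/sdb1
sudo mdadm /dev/md0 --remove /dev/sdb1

# ... hdparm secure erase and physical swap happen here ...

# After the new SSD is in: copy the partition layout from a healthy disk, then re-add
sudo sfdisk -d /dev/sda | sudo sfdisk /dev/sdb
sudo mdadm /dev/md0 --add /dev/sdb1

# Watch the rebuild
watch cat /proc/mdstat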

Note that it is trivial to reimage this server, so feel free to nuke it instead of rebuilding the raid if it's easier (or use this as a learning opportunity to automate it next time).

@RobH @Gehel the SSD has been replaced.

Script wmf-auto-reimage was launched by gehel on cumin1001.eqiad.wmnet for hosts:

wdqs1009.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202010080750_gehel_11476_wdqs1009_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['wdqs1009.eqiad.wmnet']

and were ALL successful.