potential disk issue on wdqs1010
Closed, Resolved · Public · 2 Estimated Story Points

Assigned To

Authored By

	Gehel
	Feb 15 2021, 1:46 PM

Description

The data reload on wdqs1010 (T267927) failed with journal corruption, which could indicate a disk error. SMART data does not show any failed IO. Further investigation with badblocks (as done in T263125) would make sense.

AC:

we know if the disks are good or not
ticket is created to replace the disks if needed

Details

	Subject	Repo	Branch	Lines +/-
	reimage: raid0.default_layout=2 for all installers	operations/puppet	production	+16 -57


Related Objects

Mentioned In: T267927: Reload wikidata journal from fresh dumps
Mentioned Here: T263125: Check for errors on wdqs1009 disks
T267927: Reload wikidata journal from fresh dumps

Event Timeline

Gehel created this task. · Feb 15 2021, 1:46 PM

Restricted Application added a project: Wikidata. · Feb 15 2021, 1:46 PM

Restricted Application added a subscriber: Aklapper.

Gehel mentioned this in T267927: Reload wikidata journal from fresh dumps. · Feb 15 2021, 1:46 PM

Gehel moved this task from Incoming to Current work on the Wikidata-Query-Service board. · Feb 15 2021, 4:08 PM

Gehel added a project: Discovery-Search (Current work).

Gehel updated the task description. · Feb 15 2021, 4:34 PM

Gehel set the point value for this task to 2.

Gehel moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board.

Gehel claimed this task. · Feb 17 2021, 1:41 PM

Gehel moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board.

Running sudo badblocks -v -o badblocks -w -f /dev/mapper/vg0-srv &> badblocks.log on wdqs1010
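For reference, the file written by -o lists one bad block number per line. A minimal sketch of turning those numbers into byte offsets on the tested device follows. The helper name is hypothetical, and the 1024-byte block size is an assumption based on the badblocks default, since the command above passes no -b option.

```python
from typing import Iterable, List, Tuple

# Assumption: no -b option was given, so the badblocks default of 1024 bytes applies.
ASSUMED_BLOCK_SIZE = 1024


def bad_block_offsets(
    lines: Iterable[str], block_size: int = ASSUMED_BLOCK_SIZE
) -> List[Tuple[int, int]]:
    """Return (block_number, byte_offset) for each entry in a `badblocks -o` file."""
    result = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        block = int(line)
        result.append((block, block * block_size))
    return result


# In-memory example; in practice the lines would come from the -o output file.
print(bad_block_offsets(["65\n"]))  # [(65, 66560)]
```

The offset is relative to the device given on the command line (here the LVM volume), not to any physical disk underneath it.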

Block 65 seems to be bad, and the kernel is logging a number of errors similar to:

[702641.004691] EXT4-fs error (device dm-2): ext4_map_blocks:614: inode #38928414: block 155721773: comm git: lblock 0 mapped to illegal pblock 155721773 (length 1)
[702641.021024] EXT4-fs warning (device dm-2): htree_dirblock_to_tree:995: inode #38928414: lblock 0: comm git: error -117 reading directory block
[704406.562031] EXT4-fs error (device dm-2): ext4_map_blocks:614: inode #38993923: block 155722274: comm git: lblock 0 mapped to illegal pblock 155722274 (length 1)
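To get an overview of which inodes and blocks are affected, the messages above can be summarised with a small parser. This is only a sketch: the regular expression is written against the three lines quoted in this ticket and may not match other EXT4 message variants, and the names Ext4Event and parse_ext4_events are hypothetical.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Pattern derived from the log lines quoted in this ticket only.
EXT4_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] EXT4-fs (?P<level>error|warning) "
    r"\(device (?P<device>[^)]+)\): (?P<func>\w+):\d+: "
    r"inode #(?P<inode>\d+):(?: block (?P<block>\d+):)?"
)


class Ext4Event(NamedTuple):
    timestamp: float
    level: str
    device: str
    function: str
    inode: int
    block: Optional[int]  # None when the message carries no "block N:" field


def parse_ext4_events(lines: Iterable[str]) -> List[Ext4Event]:
    """Extract EXT4-fs error/warning events from kernel log lines."""
    events = []
    for line in lines:
        m = EXT4_RE.search(line)
        if m is None:
            continue  # not an EXT4-fs message in the expected shape
        block = m.group("block")
        events.append(
            Ext4Event(
                timestamp=float(m.group("ts")),
                level=m.group("level"),
                device=m.group("device"),
                function=m.group("func"),
                inode=int(m.group("inode")),
                block=int(block) if block is not None else None,
            )
        )
    return events


SAMPLE = [
    "[702641.004691] EXT4-fs error (device dm-2): ext4_map_blocks:614: inode #38928414: block 155721773: comm git: lblock 0 mapped to illegal pblock 155721773 (length 1)",
    "[702641.021024] EXT4-fs warning (device dm-2): htree_dirblock_to_tree:995: inode #38928414: lblock 0: comm git: error -117 reading directory block",
    "[704406.562031] EXT4-fs error (device dm-2): ext4_map_blocks:614: inode #38993923: block 155722274: comm git: lblock 0 mapped to illegal pblock 155722274 (length 1)",
]

events = parse_ext4_events(SAMPLE)
print(sorted({e.inode for e in events}))  # [38928414, 38993923]
```

Feeding it the full kernel log instead of SAMPLE would show whether the errors cluster on a few inodes or are spread across the filesystem.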

SMART does not report any errors.

Now let's try to find out which disk is problematic.

After discussion with @Volans:

We see no errors at the hardware level. Badblocks does find a bad block, but going through mdraid adds enough levels of indirection that this might not be hardware related. The filesystem is corrupted.
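To illustrate one of those levels of indirection, here is a rough sketch of how a byte offset on a striped array maps to a member disk. Everything in it is an assumption for illustration: the ticket does not state the array's level, chunk size or member count, the function name is hypothetical, and the arithmetic covers only a plain single-zone stripe with equal-size members. It ignores the LVM extent mapping and the md data offset, both of which sit between /dev/mapper/vg0-srv and the physical disks.

```python
from typing import Tuple


def raid0_locate(offset: int, chunk_size: int, n_disks: int) -> Tuple[int, int]:
    """Map a byte offset on a simple striped array to (member index, offset on member).

    Assumes equal-size members and a single striping zone; not a model of
    any specific host's configuration.
    """
    if offset < 0 or chunk_size <= 0 or n_disks <= 0:
        raise ValueError("offset must be >= 0; chunk_size and n_disks must be > 0")
    chunk_index, within_chunk = divmod(offset, chunk_size)
    stripe, disk = divmod(chunk_index, n_disks)
    return disk, stripe * chunk_size + within_chunk


# Hypothetical geometry: 512 KiB chunks across 4 members.
CHUNK = 512 * 1024
print(raid0_locate(5 * CHUNK + 7, CHUNK, 4))  # (1, 524295)
```

Even in this simplified model, a block number reported against the logical volume says nothing about a physical disk until every layer's offsets are known, which is why a single bad block is hard to attribute to hardware.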

At this point, we can try blaming the error on cosmic rays, reimage to get into a sane configuration, and hope it does not happen again. There is no certainty that this will resolve the issue for good, but it seems a good compromise that avoids spending too much time on further investigation at the moment.

Next step: reimage and copy data from another server (note that wdqs2008 is running an initial import at the moment).

Script wmf-auto-reimage was launched by gehel on cumin1001.eqiad.wmnet for hosts:

wdqs1010.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102191316_gehel_15559_wdqs1010_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['wdqs1010.eqiad.wmnet']

and were ALL successful.

Scheduled a week of downtime; we're not ready for the data import yet (see T267927).

Change 697832 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] reimage: raid0.default_layout=2 for all installers

https://gerrit.wikimedia.org/r/697832

gerritbot added a project: Patch-For-Review. · Jun 2 2021, 5:11 PM

Change 697832 merged by Ryan Kemper:

[operations/puppet@production] reimage: raid0.default_layout=2 for all installers

https://gerrit.wikimedia.org/r/697832

Maintenance_bot removed a project: Patch-For-Review. · Jun 2 2021, 9:10 PM

Gehel reassigned this task from Gehel to RKemper. · Jun 7 2021, 1:29 PM

RKemper moved this task from In Progress to Needs Reporting on the Discovery-Search (Current work) board. · Jun 14 2021, 3:32 PM

Gehel closed this task as Resolved. · Jun 21 2021, 11:39 AM
